Rob's Ramblings

Monday, 17 September 2018

Some contemplating on frame storage formats and clashes therein

I recently posted a little bit on how I now store contributed videotex (teletext and viewdata) frames within a database, so as to make accessing them far easier on the application side.

To do this, I had to decide on exactly how to store the visible content of the frame.  Everything else is easy; I created a secondary table holding key=>value pairs, which means it is very easily expandable, and any application needing particular data can go look for its own, and not be confused by anything extra.

So. The frame content itself.  I didn't get much help looking at existing storage formats, as I've got at least 17 types documented, and others I know about.  I may however have been influenced somewhat by them.

When you think about a viewdata frame, or a teletext page, you automatically see the 23-25 lines by 40 columns of static image.  Almost every frame you will find that has been saved out by a terminal emulator, or teletext capture, will consist of those 920, 960 or 1000 bytes of data, perhaps with some meta-data accompanying, sometimes not.  I think that every third-party viewdata host that I have so far encountered also stored its pages so.  Individual characters each took up a single byte, as per their ASCII character code, and colour and control characters were also stored as a single byte.  For teletext, this uses the non-display codes below the space, as there is no concept of cursor movements, carriage returns, etc, on a teletext screen, which is what these values are used for in a serial-terminal based service.

Prestel, and viewdata generally, is however serial.  Frames are sent to the user as ASCII characters, but the colour and control codes are sent as command sequences:  Escape then a capital letter.  So, what might be stored in a teletext page as "<01>RED<02>GREEN<07>WHITE" would be sent to a viewdata terminal as "<ESC>ARED<ESC>BGREEN<ESC>GWHITE".  Short lines would be terminated by a carriage return and linefeed, so reducing the need to send the whole 40 characters.
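As a rough illustration of that translation, here is a minimal Python sketch.  The function name is my own invention, and it assumes trailing spaces on a row can always be trimmed, which would not be safe on a row that has a background colour set.

```python
ESC = 0x1B

def row_to_serial(row: bytes, width: int = 40) -> bytes:
    """Expand one stored 40-column row into the form sent down the line.

    Assumption: control codes are stored teletext-style as 0x00-0x1F,
    and trailing spaces carry no visible information.
    """
    trimmed = row[:width].rstrip(b" ")
    out = bytearray()
    for b in trimmed:
        if b < 0x20:
            # Colour/control code: send Escape, then the code plus 64
            out += bytes([ESC, b + 0x40])
        else:
            out.append(b)
    if len(trimmed) < width:
        # Short line: finish with CR/LF rather than padding to 40 columns
        out += b"\r\n"
    return bytes(out)
```

Feeding it the stored row from the example above, padded to 40 columns, gives the Escape sequences followed by a CR/LF; a completely full row is sent unchanged with no CR/LF, since the cursor wraps on its own.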

Now.. Prestel itself is known to have stored the frame data exactly as it would be sent to the user.  There was a hard limit of 920 bytes available to the editor to use, and colour codes, etc, each took up two of them.  This made creating complicated graphical pages somewhat difficult, as too many colour changes could quickly eat up all the allocation.  (Response frames were even worse; you only got 716 bytes to play with!)  This is probably why all third party viewdata servers stored their page as the 22x40 character full image, with the control codes stored as per teletext.  Doing this allowed for much more colour and graphic rich content than was possible on Prestel itself - the conversion was done on transmission.  The actual codes stored varied - some systems used 7 bit data throughout, some used top-bit set letters to indicate that the letter needed the escape sending before it, some used 7 bits for visible characters and top-bit set control codes (codes in the range 128-159), and at least one had everything with the top bit set!
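To get a feel for how quickly the allocation goes, here is a small Python sketch that estimates the as-sent size of a frame held as a fixed matrix.  The helper name is hypothetical, and I am assuming the CR/LF on short lines counted towards the limit, which seems likely given the data was stored as sent but is not something I can confirm.

```python
PRESTEL_LIMIT = 920  # editor's byte allowance, as quoted above

def as_sent_size(rows, width: int = 40) -> int:
    """Estimate the bytes needed to store a frame in as-transmitted form."""
    total = 0
    for row in rows:
        trimmed = row[:width].rstrip(b" ")
        controls = sum(1 for b in trimmed if b < 0x20)
        # Visible characters cost one byte, control codes cost two (Esc + letter)
        total += len(trimmed) + controls
        if len(trimmed) < width:
            total += 2  # CR + LF ending a short line (assumed to count)
    return total

# A 22-row page that changes colour before every character
busy_page = [b"\x01X" * 20] * 22
print(as_sent_size(busy_page), as_sent_size(busy_page) <= PRESTEL_LIMIT)
```

A page like that fits comfortably in an 880-byte matrix, but on this estimate it would need 1320 bytes on Prestel itself.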

So fast forward 30 years, and I'm writing code to handle saved viewdata pages and display them on this new-fangled World Wide Web thing.  There is zero support for viewdata and teletext format images, so we have to roll our own, converting saved pages in any number of formats into PNG or GIF (to account for flashing characters) images that a web browser can display.

As an intermediate stage, I have to pull that 22-24x40 matrix of characters out, before plotting them onto a graphics image for sending to the viewer.  This intermediate block of characters I called an "internal" format, and was 7-bit clean, so codes below space for the colour codes, and the rest visible.

For nearly ten years this worked fine, and this internal, intermediate format, was the format used when I created the page database.

It is only this week I hit a problem with this, and it is down to a peculiarity with how Prestel stores Response Frames.  (And, I assume, other frames that are not simple static pages.)

A response frame contains a number of fields that are defined by the editor when they create it, and are either filled automatically by the Prestel server when it displays the page, or can contain text or data to be entered by the user.  When the user hits # on the last field, they are given the option to send (or not) the page to the IP.  It is then delivered to their mailbox in a filled-in state.

When defining a response frame in the standard Prestel online editor, a field is specified by typing a short sequence: e.g. Ctrl-L n 30 Ctrl-L will create a field 30 characters long containing the subscriber's name - on pressing the second Ctrl-L the system will display 30 "n"s in the required position.  The same procedure is repeated for any other field you request.  What gets stored in the Prestel database is a single Ctrl-L and 30 "n"s.

When you retrieve a page from Prestel using the "Bulk" Online editor, it is sent exactly as stored, so you get the Ctrl-L and sequence of letters alongside the Escape'd colour codes and CR/LFs for short lines.  Uploading a replacement frame you specify the layout in the same manner.

Those of you familiar with the standard ASCII control codes will recognise that Ctrl-L is also known as "Clear Screen", and is a character that is usually sent before sending the frame content.  This is probably why it was used for this purpose - finding it in the middle of the frame content would not make sense, so it was re-purposed as a flag for start-of-field.  Obviously this is never actually sent to the user, but is replaced by a space when viewing on a terminal.

Now ...

I have two small databases in my possession that were pulled back down from Prestel at some point, and these include a number of Response Frames.

When I converted the data to my "internal" format to load them into the database, this normalised the control codes to 7-bit data, filling the lower 32 codes of the character table.  On displaying, these codes were sent as <Esc><code + 64>, thus recreating the colour sequences.

When it comes to a <ctrl l>, however, this was never stored in the database - the normalisation routine ignored it.  However, even if it had been saved, on recall, it would have been translated into an <Esc>L, the sequence to end double-height text.

So, to summarise, the normalisation I did, in most cases, lost the start-of-field character because it wasn't expected in a frame.  And if it did make it through, it would be indistinguishable from the "Single Height" code, and as that was allowed anywhere in a response frame, it couldn't be deduced from context.
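The clash is easy to demonstrate.  This is a Python sketch of a normaliser along the lines described, not my actual routine: it folds Escape sequences down to single control codes and, unlike the real thing, lets a raw Ctrl-L through untouched.  CR/LF handling is left out for brevity.

```python
ESC = 0x1B
CTRL_L = 0x0C

def to_internal(sent: bytes) -> bytes:
    """Fold as-sent viewdata into 7-bit 'internal' form (illustrative only)."""
    out = bytearray()
    i = 0
    while i < len(sent):
        b = sent[i]
        if b == ESC and i + 1 < len(sent):
            # <Esc><letter> becomes a single code in the range 0-31
            out.append(sent[i + 1] & 0x1F)
            i += 2
        else:
            out.append(b)
            i += 1
    return bytes(out)

field = to_internal(bytes([CTRL_L]) + b"nnnn")    # start-of-field marker
single_height = to_internal(b"\x1bL" + b"nnnn")   # <Esc>L, end double height
print(field == single_height)
```

Both inputs come out as the same bytes, so once the data is in the database there is no way to tell which one was meant.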

I never noticed, because there were so few frames affected, and there was no need to process the fields the code indicated, anyway!

This last month, however, I've been working on a viewdata host program that will run on a modern server, and which I could use to recreate the look and feel of using the original Prestel service.  I've been testing this using an actual Prestel terminal, and it's been great fun!  It's only when I stumbled across one of these response frames, and decided to support them, that I discovered this problem!

Looking into how other file formats solved this, it seems that at least one of them uses <Esc> itself as the field indicator.  If stored in the database like that, then when expanded on recall it would translate into a code sequence that is unused in viewdata, so it is a suitable alternative.  I will translate the affected pages, eventually!
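For what it's worth, here is how that might look on the output side, as a Python sketch.  The names are made up, and swapping the marker for a space simply follows what Prestel did with Ctrl-L; I have not settled on the final behaviour yet.

```python
ESC = 0x1B
FIELD_MARK = 0x1B  # assumption: a stored Escape flags the start of a field

def to_terminal(row: bytes) -> bytes:
    """Expand a stored row for sending, hiding the field markers."""
    out = bytearray()
    for b in row:
        if b == FIELD_MARK:
            out.append(0x20)               # the user just sees a space
        elif b < 0x20:
            out += bytes([ESC, b + 0x40])  # ordinary colour/control code
        else:
            out.append(b)
    return bytes(out)

def field_starts(row: bytes):
    """Return the column of every field marker in a stored row."""
    return [col for col, b in enumerate(row) if b == FIELD_MARK]
```

Code 12 is now free to mean single height and nothing else, and the host can use `field_starts` to find out where input is expected.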


So, a decision taken about 10 years ago came back to bite me this week. And it's all to do with 25 year old data in a file format determined 40 years ago that everyone else decided needed to be done differently.

Well done for making it this far!


As an aside .. Prestel added support for "Dynamic frames" which were basically frames that could contain cursor movement characters.  This meant you could go back and change things after you had already drawn them.  This was easy for them, as they stored data in an as-transmitted form anyway.  It's not so easy for host software that expects its frames to be stored in a fixed matrix!  I'll be working on this, once I find some original examples....



Friday, 14 September 2018

The Videotex Database - submit your pages now!

When I started viewdata.org.uk (and teletext.org.uk), I just uploaded the pages and databases I had as-is, and had my scripts deal with them on an as-accessed basis.  This is because I wanted to preserve the data as much as possible - any translation to a new format (such as JPEGs) would inevitably lose data, as well as context.

As time has moved on, and as the variety of data formats I have had to deal with has proliferated, this has increasingly become somewhat unwieldy. I decided, therefore, to try and rationalise things somewhat.

Each of the various file formats I was dealing with had different properties. Each had strengths, and each had weaknesses.  I could not decide on a single common format to try and convert files into.

Rather than create a new "perfect" file format, I decided therefore to store the frames within a database.  By having a primary table for the page content and certain static data, and a separate table for meta-data, any particular properties a particular file format had could be accommodated.
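As a sketch of that two-table idea, here it is in Python with SQLite.  The table and column names are invented for the example and are not the real schema.

```python
import sqlite3

def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Create the (hypothetical) frame and frame_meta tables if missing."""
    db = sqlite3.connect(path)
    db.executescript("""
        CREATE TABLE IF NOT EXISTS frame (
            id       INTEGER PRIMARY KEY,
            service  TEXT NOT NULL,
            page     TEXT NOT NULL,
            captured TEXT,
            content  BLOB NOT NULL          -- the fixed character matrix
        );
        CREATE TABLE IF NOT EXISTS frame_meta (
            frame_id INTEGER NOT NULL REFERENCES frame(id),
            key      TEXT NOT NULL,
            value    TEXT,
            PRIMARY KEY (frame_id, key)
        );
    """)
    return db

def add_frame(db, service, page, content, captured=None, **meta):
    """Store one frame plus whatever key=>value pairs its source format had."""
    cur = db.execute(
        "INSERT INTO frame (service, page, captured, content) VALUES (?, ?, ?, ?)",
        (service, page, captured, content),
    )
    frame_id = cur.lastrowid
    db.executemany(
        "INSERT INTO frame_meta (frame_id, key, value) VALUES (?, ?, ?)",
        [(frame_id, k, str(v)) for k, v in meta.items()],
    )
    return frame_id

def get_meta(db, frame_id, key, default=None):
    """Fetch one metadata value; anything the caller doesn't ask for is ignored."""
    row = db.execute(
        "SELECT value FROM frame_meta WHERE frame_id = ? AND key = ?",
        (frame_id, key),
    ).fetchone()
    return row[0] if row else default
```

An application that only cares about, say, a routing table can ask for that key and never see the properties other file formats brought along.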

Once the data is held within a standardised database, of course, it makes it much easier to access it and use it from many different applications.  The first, and most obvious, is the ability to search across the entire database for key words or phrases. This is implemented on the front page of the database.

The main in-browser viewer for the saved pages implements a timeline function, where you can see how a given page has changed over time.  See, for example, the CEEFAX news headlines.

And of course, for viewdata pages, one can implement a dial-up host, so 1980s terminals can connect directly into the service and browse it exactly as they did at the time.  (This is mostly done, just pending further tidying up!)

Currently the database contains page data I have collected myself or already been sent. However I am aware that there is a vast amount more out there.  Jason Robertson has been amazing at rescuing teletext pages off old video tapes, and I know of at least one previous Prestel IP that has a massive archive of pages still extant, albeit sat on very old hardware.  I've got part of The Gnome At Home, and I know the rest still exists.

This week's task (one of the various "I'll do something" for Retrochallenge 2018/09) was to create a page for viewers to directly submit their pages to the database.  This is now complete!  It actually places the data into a queue, after briefly validating it, so it can be checked and added later.  I would welcome any contributions, anything from a single frame to a complete service backup!  If you need help, feel free to drop me a line.



Monday, 3 September 2018

A Viewdata Host

One of my aims when setting up viewdata.org.uk was to create a means by which readers could experience connecting to a viewdata service, and also to use such to present what saved pages we had in an appropriate context.

Sadly, there was nothing available that I could find that would allow me to run an actual host, and although I had some success firing up my old BBC Micro based viewdata BBS, this didn't last long due to multiple hardware failures.

Back up to today, and, as I mentioned yesterday, John Newcombe has written, and is running, his own viewdata host called TELSTAR.   I've discussed some things with John, and had been hoping to blag a copy of the software, but it seems that it's not quite what I am looking for.

Now, I have been building up a database of frames - this is yet another unfinished project - over at db.viewdata.org.uk.   This database is what I want to use as the source of the data for a host system.
Although it's mostly got teletext loaded up, I do have a complete copy of the PC Plus demo of Micronet loaded up, which can act as a starting point.

So, what to do?  Well it's obvious, write my own host software.  I've been putting this off for years, but, it's #retrochallenge time, and I do want to achieve something...

A few hours last night got the bare bones sorted out, and a bit of time debugging, and we're at a point where I can dial in and navigate between pages!  Woo!

Whereas John has been concentrating on content for his viewdata host, I'm going to be working on making mine more of a "Prestel Emulator"; it should feel as close to the original as possible.  I've a lot to do, obviously, but not bad for a few hours' work.





Saturday, 1 September 2018

Modem Emulation - an RC2018/09 prologue

Most of you will know by now that I'm really into preserving the memory of Prestel and Viewdata systems generally.  I run www.viewdata.org.uk which, while a bit long in the tooth, is going to get a massive update "soon" ...   But today I'm going to talk about hardware.

Some time back, I fired up my old viewdata BBS "Ringworld" - this operated on a collection of BBC Micros - one per connected user - and an Acorn A5000 acting as fileserver.  I connected these to the internet using a motley selection of modems, ATA telephony adapters, and serial terminal adapters.

The long and short of it was that, for a user dialling in, the call was answered by the exact same modem it always had been, now connected to a SIP ATA - the digital data was transformed to analogue, before being turned back to digital by the modem.  This always seemed like a poor idea to me. What would be better is if some bit of software answered that digitised telephone call, looked at the whistles and warbles, and turned it directly into a sequence of ASCII bytes for delivery to a telnet port.

I had found a program called iaxmodem that allowed an asterisk based PBX to emulate a modem, but it was focused on faxing, and I just couldn't get it to work with the V23 dial-up I wanted.  But it was close.   I spent the next few years, off and on, searching for changes to that, or SIP based alternatives, with no luck.

In the meantime, John Newcombe decided to write his own viewdata host service, called Telstar, in Python, and that can be accessed via a raw socket (like telnet, but without the features!).  There's not a lot of software out there that can both talk Viewdata display protocols and connect to a socket, however.  Richard Russell wrote an example viewdata client that could do it, and you can connect from BeebEm if you load up a suitable comms package and set the RS423 IP parameters.

There have also been a couple of projects to produce a "WiFi Modem" that, basically, looks like a hayes-compatible modem that you connect to via RS232, but it in turn connects to your WiFi, and onwards to a telnet port out on the internet.  This is great for things like BBC Micros, Commodore 64s, etc., where you can just swap out your period modem for this new device.   Not so good for dedicated terminals, or e.g. the ZX Spectrum VTX5000 where the modems are built in.

Then, out of the blue, an old friend, Darren Storer, posted on the BBC Micro Facebook group (I think it was there..) that he'd set up a dial-up number for Telstar, and could people test it.  It took me a week or two to get there, but I pulled out a terminal, dialled the number ... and it didn't work.  Not at all.  I did, however, find out the software he was using ...  asterisk-Softmodem.  This was exactly the sort of project I'd been looking for all those years.  But, it didn't work for him.

I pulled the code and had a look, and could see nothing wrong.  So, firing up an asterisk server, and installing it, I tried to debug.  The first issue was my terminal was not locking onto the carrier, so I added a t(-10) to increase the volume, and that sorted that!

Next problem, I wasn't getting much data on screen - many characters were just missing!   This was somewhat easy to diagnose, as I had an inkling after seeing how you configured asterisk to use softmodem - you specified the number of data bits, being between 5 and 8.  The example had it as 8. Now Prestel, and of course the terminals, all used 7 bits with even parity. What I was seeing was the terminal being sent 8-bit data, and of course interpreting that as most of the characters having an incorrect parity bit, and ignoring those!
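To show why only some characters vanished, here is a little Python sketch of 7-bit even parity.  It is purely illustrative and is not the code from the softmodem project.

```python
def add_even_parity(byte7: int) -> int:
    """Set the top bit so the 8 bits on the wire hold an even number of 1s."""
    b = byte7 & 0x7F
    ones = bin(b).count("1")
    return (b | 0x80) if ones % 2 else b

def check_even_parity(byte8: int):
    """Return the 7-bit character, or None if the parity is wrong."""
    if bin(byte8 & 0xFF).count("1") % 2:
        return None  # a real terminal would ignore this byte
    return byte8 & 0x7F

# Plain 8-bit data leaves the top bit clear, so only characters that
# already have an even number of 1 bits get through the terminal's check.
for ch in "ABC":
    print(ch, check_even_parity(ord(ch)), hex(add_even_parity(ord(ch))))
```

"A" and "B" each have two bits set and get through as they are, while "C" has three and is thrown away unless the parity bit is added, which matches the patchy screens I was seeing.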

Now, I can set a software terminal to 8-bit data, but not the terminal - there is very little you can configure as a user on these things.  Because the project had no support for parity it looked like a dead end, but that wasn't going to stop me - I'd waited years to find this, and wasn't going to give up now!

Delving into the code, it actually turned out to be a nice simple and straightforward bit of programming.  Adding parity support turned out to be fairly easy... I've published the modifications to my own github fork and submitted a pull request to send them back to the original author.

So now, I can dial into Telstar, CCL4, or anywhere I want to set up a number for!

If you want to try it, the number for Telstar is 0333 340 3311 (from outside UK, +44 333 340 3311). (Calls cost the same as a call to an 01 or 02 number and are included in any inclusive minutes you may have. Calls are free for A&A customers.)

I can't guarantee that number will stay up, and it may not work from time to time if I'm tweaking things, but if it turns out useful to you, please let me know in the comments below!



Friday, 16 March 2018

Archiving my Filing with tesseract

I hate filing.  We get lots of paperwork that needs keeping, and it's a pain in the neck.  I've got boxes of the stuff, and every time I need something it's a major task to find the item we want.

So, the plan is to digitise it all.  We have a network connected photocopier that will also act as a sheet-fed scanner, saving as PDF files directly onto a network share.  That's the first step, scan things.

But what to do next.  A pile of random image-within-a-PDFs isn't much use, not without being sorted into, at least, some sort of order.

I could just browse the folder, and drag-and-drop the files into the relevant folders, but that's a lot of work and time consuming.  I'm a great believer in "let the computer do the work", so I threw together a little script to do the job.  Here we go:


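#!/bin/ksh

# Section 1: refuse to start if another copy of this script is still running.
thisFILE="$(whence "${0}")"
progName="${0##*/}"
myPID="$$"

# fuser lists every PID that has the script file open, including our own.
FUSERout=$(fuser "${thisFILE}" 2>/dev/null)
typeset -i numProc=$(echo "${FUSERout}" | nawk '{print NF}')
if [[ "${numProc}" -gt 1 ]]; then
    echo "${progName}: another instance(s) of [${thisFILE}] is currently still running \
[$(echo ${FUSERout} | sed -e 's/  */ /g')] - exiting THIS [${myPID}] run."
    exit 1
fi

# BASH_SOURCE is not set under ksh, so find the script's folder from the path above.
DIR="$( cd "$( dirname "${thisFILE}" )" && pwd )"
IN="/TheDisc/DocumentArchive/Scans/new/"
OCRD="/TheDisc/DocumentArchive/Scans/ocrd"
BASE="/TheDisc/DocumentArchive"

# Section 2: OCR everything in the incoming folder (skipped if any argument is given).
if [ -z "$1" ]
then
    # Stop here if the folder is missing, rather than deleting files somewhere else.
    cd "$IN" || exit 1

    # Images: tesseract writes the searchable PDF to $OCRD/<name>.pdf
    for i in *.jpg; do
        if [ -e "$i" ]
        then
            echo "Processing $i"
            tesseract "$i" "$OCRD/$i" pdf
            if [ $? -eq 0 ]
            then
                mv "$i" "$OCRD/$i"
            else
                mv "$i" "failed/"
            fi
        else
            break    # the glob matched nothing
        fi
    done

    # PDFs: pdfsandwich adds a text layer and writes the result into $OCRD
    for i in *.pdf; do
        if [ -e "$i" ]
        then
            pdfsandwich "$i" -o "$OCRD/$i"
            if [ $? -eq 0 ]
            then
                rm "$i"
                chgrp users "$OCRD/$i"
                chmod g+rw "$OCRD/$i"
            else
                mv "$i" "failed/"
            fi
        else
            break    # the glob matched nothing
        fi
    done
fi

# Section 3: file the OCR'd PDFs according to movematrix.txt.
# CRs are dropped and backslashes become slashes before each line is read.
cat "$DIR/movematrix.txt" | tr -d "\r" | sed 's/\\/\//g' | while read -r STR
do
    if [[ -n $STR ]]
    then
        srch=${STR%:*}    # text before the colon: what to search for
        dest=${STR#*:}    # text after the colon: folder relative to $BASE

        if [[ -n $srch ]]
        then
            echo "Scanning for $srch"

            # First match per file is enough; keep only the filename column
            pdfgrep -i -r -H -m 1 "$srch" "$OCRD" | cut -d: -f1 | while read -r line
            do
                echo "Moving $line to $dest"
                mv --backup=existing --suffix=.dupe "$line" "$BASE/$dest"
            done
        fi
    fi
done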

This can be broken down into three sections.  The first just makes sure that this is the only instance of the script running - it's intended to be run from a cron task, but it can take some time, and I quickly found out that if I allow another instance to fire up before the previous one finishes, then you can very quickly bring your server to its knees!  I can't honestly remember where I got this bit of code from; somewhere on the 'net!  Stackoverflow, probably!

The next section scans through the incoming scans folder and OCRs them!  If it's an image file, it uses tesseract-ocr to do the job, creating a nice new PDF at the end.  If it's already a PDF, then we use pdfsandwich, which handles all the image extraction, OCR (using tesseract-ocr) and re-compilation with the text layer.

Finally, load up a "what-goes-where" matrix file, and use pdfgrep to scan all those nice new PDFs to find known matches and move the files off to where they should go.

movematrix.txt controls all this part.  It's a simple file format :

National Savings:Bank/NS&I
TSB Bank plc:Bank/TSB
Bank of Scotland plc:Bank/Halifax BoS
Dental Department|Dentist:Medical/Dental
npower:Utilities/npower
TV LICENSING:Utilities\TV Licensing

Basically, it's <search string>:<folder to place matches>, one entry per line. Blank lines are ignored.

You can use multiple search strings, as in the dentist example, separated by |, or indeed any other search parameter syntax allowed by pdfgrep.  I'd recommend actually using something like an account number or other unique reference that will allow you to identify correspondence more accurately.  But it does the searches in sequence, so if you get "false positives" for some search terms, move them to the end so that others get a chance to catch the documents first.

By default, the script does a case-insensitive search, and the consequent move does backup sequencing, so you won't lose anything if a file of that name already exists. I also swap about all / and \ so that you can paste in (relative) paths in a Microsoft  format, as shown above, and it'll cope.  Similarly, we ignore CRs and blank lines, so you can safely edit the file using a Windows editor such as Notepad, and we won't get all messed up.
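If you wanted to check a matrix file before letting the script loose on it, the same tidying rules can be sketched in Python.  This is not part of the script, the function name is made up, and unlike the shell version it simply skips any line without a colon.

```python
def parse_matrix(text: str):
    """Turn movematrix.txt content into (search, destination) pairs."""
    rules = []
    # Drop CRs so files saved by Windows editors behave the same
    for line in text.replace("\r", "").split("\n"):
        line = line.replace("\\", "/")   # accept Microsoft-style paths
        if not line.strip() or ":" not in line:
            continue                     # blank or malformed line
        search = line.rsplit(":", 1)[0]  # everything before the last colon
        dest = line.split(":", 1)[1]     # everything after the first colon
        if search:
            rules.append((search, dest))
    return rules

sample = "TV LICENSING:Utilities\\TV Licensing\r\n\r\nDental Department|Dentist:Medical/Dental\r\n"
for search, dest in parse_matrix(sample):
    print(search, "->", dest)
```

The rules come back in file order, which matters because the searches run in sequence.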

Just run the script fairly regularly via cron, and it'll do all your filing for you!

And of course, if it misses a file, you can still drag-and-drop it manually!  Or edit the matrix file to add a search term.


Dependencies - ksh, simply for the first "don't run me twice" bit.  tesseract-ocr, pdfsandwich, pdfgrep.


Tuesday, 30 May 2017

Trying hard



It's six weeks since my dad died.  It's been .... difficult.

Then the suicide bomber at Manchester Arena last week.  Targeting kids, and the parents waiting to pick them up.  Kids!  One casualty was only 8!!

Saffie Rose Roussos
My daughter is 10.  Had her music tastes been different, it's possible we would have been there - she reports one of the kids in her class actually was at the concert; school sent round a letter saying several pupils had been there, but all were safely accounted for.  It's only a couple of miles away, I saw some of the ambulances rushing in that night!

So that hit me hard.  I spent the next few days feeling stunned. Shell-shocked, I guess.  Fighting back tears all the time.  Just like most people in the area!  I've been glued to the TV News, glued to Twitter, had local radio on in the car and kitchen... watching the local paper websites..  It does seem that the Police have a handle on things, which is gratifying, and they are making progress tracking down all those who have connections with the bomber.

To try and recover, I've tried throwing myself at various unrelated projects. Here are some of them -


  • Finishing the stuff I started for Retrochallenge..
  • Cataloguing and Imaging the vast number of BBC Micro floppies I have
  • Adding BBC Micro SSD/DSD/ADx image support to TC4Shell
  • Getting a VPN client working on my Sophos UTM 9 firewall
  • Planning content for the new viewdata.org.uk (but not yet writing any..)
  • and watching some TV!  (Well, netflix..) Normally I hardly ever watch TV..


Each works for a bit, then I get distracted, interrupted, or just dispirited.  So I swap to something else. Right now, I don't feel up to doing any of them..  I've tried ignoring the news this evening, as there's not been anything new anyway.. but that's not helped either.

Strangely, Mr Biffo wrote a piece this week that resonated with me. It's one of the reasons I'm writing this... I was hoping it would help.

I'm not much of a sharer.. this is new to me.  So, I'm going to leave this here. If anyone wants to jump in and help with any of those projects, feel free to get in touch.  Apart from watching Telly - the Mrs is happy to do that with me!

Stay safe, people.  Love you all.



Sunday, 23 April 2017

End of an Era


Those of you who follow this blog, or my social media accounts, will know that I don't share too much that is personal on-line.  Some people do too much of that - I don't need to check Facebook to see what you had for breakfast, and your tweets about how you met your mates for lunch are just white noise.  So, I don't do it.

There will be exceptions, of course, and this is one.

Given it is the Easter holidays at school, we decided to grab a few days away with our little one.  We don't generally go far, so just stuck with Pontins at Prestatyn.  Monday to Friday, just a little break.

Tuesday afternoon, I get the telephone call I never wanted to get.  It was my sister, Jenny: she had picked mum up from shopping in town and took her home, and they had found my dad dead at the foot of their stairs.  He had fallen while attempting to carry a folding bookshelf back upstairs.

Obviously I dashed up there immediately - they live in Wigan - and spent the evening with Mum and Jenny, and the police!  Any unexpected death needs investigating, apparently.  His body was collected eventually, and will be dealt with by the Coroner.  I had to return to Wales, but have been in constant contact with Mum and Jenny ever since.  Obviously they are both distraught.

This put a bit of a damper on the holiday, to put it mildly.  Obviously I am going to be exceptionally busy too, helping with Mum, dealing with all the funeral, estate, paperwork and so on.  I am pretty sure all my hobbies are going on a back burner for now.  So, that's it for Retrochallenge this year...


Dad was John O'Donnell.  He was a world-renowned aeromodeller, having been involved in the hobby for over 70 years, and having held many records.  He contributed to the model press regularly and frequently with articles and model plans.  He was a keen photographer, and ran a professional photography business doing commercial and wedding photography for a time.  As a mathematician he worked in the aeronautical and chemical industries, before finding a home as a lecturer (in statistics.)  He enjoyed serious Science Fiction, and was becoming recognised as intensely knowledgeable on the subject. He was intensely organised, and pretty much everything he did is recorded and filed away neatly.  We intend to publish as much as we can.

We have created a memorial site for him at jod.org.uk - please feel free to visit.
