Rob's Ramblings

Friday, 16 March 2018

Archiving my Filing with tesseract

I hate filing.  We get lots of paperwork that needs keeping, and it's a pain in the neck.  I've got boxes of the stuff, and every time I need something it's a major task to find the item I want.

So, the plan is to digitise it all.  We have a network connected photocopier that will also act as a sheet-fed scanner, saving as PDF files directly onto a network share.  That's the first step, scan things.

But what to do next?  A pile of random image-within-a-PDFs isn't much use, not without being sorted into, at least, some sort of order.

I could just browse the folder, and drag-and-drop the files into the relevant folders, but that's a lot of work and time consuming.  I'm a great believer in "let the computer do the work", so I threw together a little script to do the job.  Here we go:


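#!/bin/ksh
# ksh is assumed here: the single-instance check below relies on whence and typeset.

# Incoming scans, OCR output and the root of the filing tree.
# The original assignments are not shown in this listing, so set these to suit.
: "${IN:?set IN to the incoming scans folder}"
: "${OCRD:?set OCRD to the folder for OCRed PDFs}"
: "${BASE:?set BASE to the root of the filing folders}"

thisFILE="$(whence ${0})"
progName="${0##*/}"
myPID=$$

# Section 1: bail out if another instance of this script is still running
FUSERout=$(fuser ${thisFILE} 2>/dev/null)
typeset -i numProc=$(echo "${FUSERout}" | nawk '{print NF}')
if [[ "${numProc}" -gt 1 ]]; then
   echo "${progName}: another instance(s) of [${thisFILE}] is currently still running\
        [$(echo ${FUSERout} | sed -e 's/  */ /g')] - exiting THIS [${myPID}] run."
   exit 1
fi

# BASH_SOURCE is only set under bash, so fall back to thisFILE elsewhere
DIR="$( cd "$( dirname "${BASH_SOURCE[0]:-$thisFILE}" )" && pwd )"

# Section 2: OCR everything in the incoming folder (skipped if an argument is given)
if [ -z "$1" ]
then
 cd "$IN" || exit 1

 # Images: tesseract writes the new PDF to "$OCRD/$i.pdf"
 for i in *.jpg; do
  if [ -e "$i" ]
  then
   echo "Processing $i"
   tesseract "$i" "$OCRD/$i" pdf
   if [ $? -eq 0 ]
   then
    mv "$i" "$OCRD/$i"
   else
    mv "$i" "failed/"
   fi
  fi
 done

 # PDFs: pdfsandwich adds a text layer to the scanned pages
 for i in *.pdf; do
  if [ -e "$i" ]
  then
   pdfsandwich "$i" -o "$OCRD/$i"
   if [ $? -eq 0 ]
   then
    rm "$i"
    chgrp users "$OCRD/$i"
    chmod g+rw "$OCRD/$i"
   else
    mv "$i" "failed/"
   fi
  fi
 done
fi

# Section 3: file the OCRed PDFs according to movematrix.txt
cat "$DIR/movematrix.txt" | tr -d "\r" | sed 's/\\/\//g' | while read STR
do
 if [[ $STR ]]
 then
  # Split "<search string>:<folder>" at the first colon
  # (these two lines are reconstructed from the file format described below)
  srch="${STR%%:*}"
  dest="${STR#*:}"
  if [[ $srch ]]
  then
   echo "Scanning for $srch";
   pdfgrep -i -r -H -m 1  "$srch" "$OCRD" | cut -d: -f1 | while read line
   do
       echo "Moving $line to $dest"
       mv --backup=existing --suffix=.dupe "$line" "$BASE/$dest"
   done
  fi
 fi
done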

This can be broken down into three sections.  The first just makes sure that this is the only instance of the script running - it's intended to be run from a cron task, but it can take some time, and I quickly found out that if I allow another instance to fire up before the previous one finishes, then you can very quickly bring your server to its knees!  I can't honestly remember where I got this bit of code from; somewhere on the 'net!  Stackoverflow, probably!
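If you don't fancy the ksh and fuser dance, the same "don't run me twice" idea can be sketched with a lock directory.  This is not what my script does, just a minimal alternative in plain sh; the lock path and the function names (acquire_lock, release_lock, run_once) are made up for the example.

```shell
#!/bin/sh
# Hypothetical lock location; any path on a local filesystem would do.
LOCKDIR="${TMPDIR:-/tmp}/filing-ocr.lock"

acquire_lock() {
    # mkdir is atomic: it succeeds for exactly one caller
    # and fails if the directory already exists.
    mkdir "$LOCKDIR" 2>/dev/null
}

release_lock() {
    rmdir "$LOCKDIR" 2>/dev/null
}

# Run the given command only if no other run holds the lock.
run_once() {
    if ! acquire_lock; then
        echo "another run is still in progress - skipping" >&2
        return 1
    fi
    "$@"                # the real work (OCR and filing) would go here
    status=$?
    release_lock
    return $status
}

run_once echo "lock acquired, safe to start OCR"
```

The catch with this approach is that a run which gets killed part-way leaves the lock directory behind, and you'd have to remove it by hand before the next run will start.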

The next section scans through the incoming scans folder and OCRs them!  If it's an image file, it uses tesseract-ocr to do the job, creating a nice new PDF at the end.  If it's already a PDF, then we use pdfsandwich, which handles all the image extraction, OCR (using tesseract-ocr) and re-compilation with the text layer.
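One detail worth pointing out in those loops is the [ -e "$i" ] test.  When a folder has no matching files, the shell leaves the pattern itself (*.jpg) in place, so without the test the loop would try to OCR a file literally called "*.jpg".  A small sketch of the effect, using a throwaway temporary folder and a made-up list_scans function:

```shell
#!/bin/sh
# Print the names of the .jpg files in a folder, one per line.
list_scans() {
    dir=$1
    for f in "$dir"/*.jpg; do
        # With no matches the glob stays as literal text, so skip it
        [ -e "$f" ] || continue
        printf '%s\n' "${f##*/}"
    done
}

demo=$(mktemp -d)
list_scans "$demo"            # empty folder: prints nothing
touch "$demo/letter.jpg"
list_scans "$demo"            # prints: letter.jpg
rm -r "$demo"
```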

Finally, load up a "what-goes-where" matrix file, and use pdfgrep to scan all those nice new PDFs to find known matches and move the files off to where they should go.

movematrix.txt controls all this part.  It's a simple file format:

National Savings:Bank/NS&I
TSB Bank plc:Bank/TSB
Bank of Scotland plc:Bank/Halifax BoS
Dental Department|Dentist:Medical/Dental
TV LICENSING:Utilities\TV Licensing

Basically, it's <search string>:<folder to place matches>, one entry per line. Blank lines are ignored.
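Splitting an entry into its two halves only needs a couple of shell parameter expansions.  A minimal sketch, assuming the split happens at the first colon; parse_entry is a name invented for the example:

```shell
#!/bin/sh
# Split "<search string>:<folder>" into $srch and $dest.
parse_entry() {
    entry=$1
    srch=${entry%%:*}   # everything before the first colon
    dest=${entry#*:}    # everything after the first colon
}

parse_entry "Dental Department|Dentist:Medical/Dental"
echo "search: $srch"    # search: Dental Department|Dentist
echo "folder: $dest"    # folder: Medical/Dental
```

Splitting at the first colon means the search string itself can't contain one, but folder names can.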

You can use multiple search strings, as in the dentist example, separated by |, or indeed any other search parameter syntax allowed by pdfgrep.  I'd recommend actually using something like an account number or other unique reference that will allow you to identify correspondence more accurately.  But it does the searches in sequence, so if you get "false positives" for some search terms, move them to the end so that others get a chance to catch the documents first.
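To see why the order matters, here's a rough sketch of "first match wins".  It uses grep -E on a line of plain text as a stand-in for pdfgrep on a real PDF, and the first_match function and the "Bank/Unsorted" catch-all rule are invented for the example.

```shell
#!/bin/sh
# Print the destination of the first rule whose pattern matches the text.
first_match() {
    text=$1
    rules=$2
    printf '%s\n' "$rules" | while IFS= read -r rule; do
        [ -n "$rule" ] || continue          # blank lines are ignored
        pattern=${rule%%:*}
        dest=${rule#*:}
        # Case-insensitive extended regex, so a|b alternation works
        if printf '%s\n' "$text" | grep -E -i -q -- "$pattern"; then
            printf '%s\n' "$dest"
            break                           # earlier rules win
        fi
    done
}

rules='TSB Bank plc:Bank/TSB
Bank:Bank/Unsorted'

first_match "Statement from TSB Bank plc" "$rules"   # prints: Bank/TSB
```

Swap the two rules round and the same letter would land in Bank/Unsorted instead, which is why the vague terms belong at the bottom of the file.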

By default, the script does a case-insensitive search, and the consequent move keeps a backup: if a file of that name already exists at the destination, the existing one is renamed with a .dupe suffix rather than being overwritten. I also swap about all / and \ so that you can paste in (relative) paths in a Microsoft format, as shown above, and it'll cope.  Similarly, we ignore CRs and blank lines, so you can safely edit the file using a Windows editor such as Notepad, and we won't get all messed up.
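The clean-up of Windows-style lines is just the tr and sed pair from the script.  Here it is on its own, wrapped in a function (normalise is a made-up name) and fed the TV Licensing entry with a CR on the end:

```shell
#!/bin/sh
# Strip carriage returns and turn backslashes into forward slashes.
normalise() {
    tr -d '\r' | sed 's/\\/\//g'
}

# In the printf format, \\ is one backslash and \r\n is a Windows line ending
printf 'TV LICENSING:Utilities\\TV Licensing\r\n' | normalise
# prints: TV LICENSING:Utilities/TV Licensing
```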

Just run the script fairly regularly via cron, and it'll do all your filing for you!

And of course, if it misses a file, you can still drag-and-drop it manually!  Or edit the matrix file to add a search term.

Dependencies - ksh, simply for the first "don't run me twice" bit.  tesseract-ocr, pdfsandwich, pdfgrep.


Tuesday, 30 May 2017

Trying hard

It's six weeks since my dad died.  It's been .... difficult.

Then the suicide bomber at Manchester Arena last week.  Targeting kids, and the parents waiting to pick them up.  Kids!  One casualty was only 8!!

Saffie Rose Roussos
My daughter is 10.  Had her music tastes been different, it's possible we would have been there - she reports one of the kids in her class actually was at the concert; school sent round a letter saying several pupils had been there, but all were safely accounted for.  It's only a couple of miles away, I saw some of the ambulances rushing in that night!

So that hit me hard.  I spent the next few days feeling stunned. Shell-shocked, I guess.  Fighting back tears all the time.  Just like most people in the area!  I've been glued to the TV News, glued to Twitter, had local radio on in the car and kitchen... watching the local paper websites...  It does seem that the Police have a handle on things, which is gratifying, and they are making progress tracking down all those who have connections with the bomber.

To try and recover, I've tried throwing myself at various unrelated projects. Here are some of them -

  • Finishing the stuff I started for Retrochallenge..
  • Cataloguing and Imaging the vast number of BBC Micro floppies I have
  • Adding BBC Micro SSD/DSD/ADx image support to TC4Shell
  • Getting a VPN client working on my Sophos UTM 9 firewall
  • Planning content for the new (but not yet writing any..)
  • and watching some TV!  (Well, Netflix..) Normally I hardly ever watch TV..

Each works for a bit, then I get distracted, interrupted, or just dispirited.  So I swap to something else. Right now, I don't feel up to doing any of them..  I've tried ignoring the news this evening, as there's not been anything new anyway.. but that's not helped either.

Strangely, Mr Biffo wrote a piece this week that resonated with me. It's one of the reasons I'm writing this... I was hoping it would help.

I'm not much of a sharer.. this is new to me.  So, I'm going to leave this here. If anyone wants to jump in and help with any of those projects, feel free to get in touch.  Apart from watching Telly - the Mrs is happy to do that with me!

Stay safe, people.  Love you all.


Sunday, 23 April 2017

End of an Era

Those of you who follow this blog, or my social media accounts, will know that I don't share too much that is personal on-line.  Some people do too much of that - I don't need to check Facebook to see what you had for breakfast, and your tweets about how you met your mates for lunch are just white noise.  So, I don't do it.

There will be exceptions, of course, and this is one.

Given it is the Easter holidays at school, we decided to grab a few days away with our little one.  We don't generally go far, so just stuck with Pontins at Prestatyn.  Monday to Friday, just a little break.

Tuesday afternoon, I get the telephone call I never wanted to get.  It was my sister, Jenny: she had picked mum up from shopping in town and took her home, and they had found my dad dead at the foot of their stairs.  He had fallen while attempting to carry a folding bookshelf back upstairs.

Obviously I dashed up there immediately - they live in Wigan - and spent the evening with Mum and Jenny, and the police!  Any unexpected death needs investigating, apparently.  His body was collected eventually, and will be dealt with by the Coroner.  I had to return to Wales, but have been in constant contact with Mum and Jenny ever since.  Obviously they are both distraught.

This put a bit of a damper on the holiday, to put it mildly.  Obviously I am going to be exceptionally busy too, helping with Mum, dealing with all the funeral, estate, paperwork and so on.  I am pretty sure all my hobbies are going on a back burner for now.  So, that's it for Retrochallenge this year...

Dad was John O'Donnell.  He was a world-renowned aeromodeller, having been involved in the hobby for over 70 years, and having held many records.  He contributed to the model press regularly and frequently with articles and model plans.  He was a keen photographer, and ran a professional photography business doing commercial and wedding photography for a time.  As a mathematician he worked in the aeronautical and chemical industries, before finding a home as a lecturer (in statistics).  He enjoyed serious Science Fiction, and was becoming recognised as intensely knowledgeable on the subject. He was intensely organised, and pretty much everything he did is recorded and filed away neatly.  We intend to publish as much as we can.

We have created a memorial site for him at - please feel free to visit.


Friday, 14 April 2017

Retrochallenge Day 14

A little more work today.

The teletext viewer javascript currently expects to find the entire teletext service held within the html of the web page as encoded links.  This is great for a static archive, as it places very little load on the webserver.

I had created a quick bit of php that could construct the html page when loaded, which was the source of the pages I linked to last time.  This is all well and good, but does not help with the sort of interactive services I would like to use the viewdata version for.

So today's work has been adding functionality to request pages from a server, thus allowing for ever-changing pages to be served up instead.  This involved not only modifying the JavaScript, but writing a server in php to deliver the requested pages.   This also brought out some previously unnoticed limitations in my viewdataviewer class, which I was using to parse the stored data serverside.

So, the class has had some fixes added, server written, and viewer updated.  You can play with it here, although there isn't actually anything dynamic on it at the moment.  Subpages aren't quite working right yet, but most of the rest is!

(I did break the first demo, btw, in case you were reading yesterday's post today, when I fixed a "bug" in the class file, but hadn't removed the work-around in the html generator!  It's fixed, now!)


Thursday, 13 April 2017

Retrochallenge day 13

Blimey, where are the days going...

OK. I've taken the hacked-about teletext-editor-that-acts-as-a-viewer, and split off the hacks.  Then I modified (the latest version of) the teletext editor so that it exposes a bit more of its internals, as the viewer needs that access...

It took a bit of trial and error, but seems to work.  Code is up on github and a demo is (temporarily) here. I added a touch of code to allow direct linking to specific pages while I was at it!

Now to look at doing the viewdata browser that I was supposed to be doing in the first place!


Monday, 10 April 2017

Retrochallenge Day 2.. er... 8 .. 10

Blimey, has it been a week already?

OK.  I've not done much coding since last time, but I have been reading code and daydreaming planning out my next move.

Now Javascript is not my strongest language.  I can read it, and modify it, but actually writing new code is a bit of a challenge.  Part of the rationale behind this task was to get myself a bit more familiar with this hideously back to front language...

The teletext browser I used is the one created by Adam Dawes, based on Simon Rawles' editor, and grabbed from Jason's captures.  The modifications add a pile of new functions, and truncate and/or redirect others.

As the original editor has moved on somewhat since this was done, it seems logical that, if I want to do more mods to it, then I should base my code on the latest version.  If I can do it in such a way that I do not need to actually modify the editor, just call it, then that would be best.   In PHP I would, assuming it was a class, extend the class in a new file and override the relevant functions.  So... How to do this in Javascript...

I tried using prototypes ... but hit the problem that the editor is written with lots of private variables and functions, which the new functions in the viewer refer to.  Using the existing editor as the viewer's prototype doesn't work because it cannot access the private variables.  Drat.

<days pass>

After spending more time than I ever expected looking at javascript objects, inheritances, etc., I have decided not to commit myself to ever having to do anything major in this language!!

Sticking with Javascript, I think the best approach at this point would, after all, be to fork and modify it to separate out the actual display part from the editor part, that way I can provide for a viewer, indeed, different viewers...  Might even be a mod Simon would like...

Sigh.  Bloody Javascript.

The other option would be to go back to my own viewdata viewer class, which runs serverside to create the images.  I understand this, but I was hoping not to have to do this, as it makes updating the "screen" with the page number being keyed dependent on the server, rather than being local.

So, ten days in, and all I've achieved is discovering that what I thought would be a simple task is much more complicated than I thought it would be.


Saturday, 1 April 2017

Retrochallenge: Day 1

OK. First day, and I have to do something... whether I can keep this up is another matter....

I had a look at the code used for browsing Jason's teletext captures.   These use a modified version of the teletext editor (the 'viewer'), driven by an html page consisting of a mass of links!  The viewer grabs all these, displays the first one, then accepts key-presses to get the next page number, as per a teletext page.  Plus it allows up/down arrow shenanigans to skip through.

Teletext, as you should know, shares the exact same display format as Viewdata, namely 24 (or 25) lines of 40 characters of primary colour text and simple block graphics.  As control codes take up a space on the line, this makes it harder than you might think to do multicoloured images..

So, I've got a pile of dumps of teletext pages over at so the obvious thing to do is use one of those, see if I can use the viewer just as it is.  That way I have a starting point, and can begin to understand the code and decide on the particular direction I want to go.

I've got a php class in working-but-incomplete state that allows me to manipulate viewdata and teletext pages.  A quick bit of code to load one of the teletext archives and then spit out each page as a link took me significantly less than an hour, and only 14 lines of new code!

So...  from this,  to this.    I think that's a positive step.

Today has, however, shown up a lot of features currently missing in vv.class that I need to add in, particularly to deal with the viewdata side of things.  That is partly the whole point of this exercise, though: to get an idea of what I need to do next!
