Rob's Ramblings

Friday 16 March 2018

Archiving my Filing with tesseract

I hate filing.  We get lots of paperwork that needs keeping, and it's a pain in the neck.  I've got boxes of the stuff, and every time I need something it's a major task to find the item I want.

So, the plan is to digitise it all.  We have a network connected photocopier that will also act as a sheet-fed scanner, saving as PDF files directly onto a network share.  That's the first step, scan things.

But what to do next?  A pile of random image-within-a-PDF files isn't much use unless it's sorted into at least some sort of order.

I could just browse the folder and drag-and-drop the files into the relevant folders, but that's a lot of time-consuming work.  I'm a great believer in "let the computer do the work", so I threw together a little script to do the job.  Here we go:


#!/bin/ksh

thisFILE="$(whence ${0})"
progName="${0##*/}"

myPID="$$"

# fuser lists every PID that has this script file open; more than one
# means an earlier run is still going, so bail out.
FUSERout=$(fuser "${thisFILE}" 2>/dev/null)
typeset -i numProc=$(echo "${FUSERout}" | awk '{print NF}')
if [[ "${numProc}" -gt 1 ]]; then
    echo "${progName}: another instance(s) of [${thisFILE}] is currently still running" \
         "[$(echo ${FUSERout})] - exiting THIS [${myPID}] run."
    exit 1
fi


# BASH_SOURCE is not set under ksh, so find the script's folder from $0
DIR="$( cd "$( dirname "$0" )" && pwd )"
IN="/TheDisc/DocumentArchive/Scans/new/"
OCRD="/TheDisc/DocumentArchive/Scans/ocrd"
BASE="/TheDisc/DocumentArchive"


if [ -z "$1" ]
then

 cd "$IN" || exit 1

 for i in *.jpg; do

  if [ -e "$i" ]  
  then
   echo "Processing $i"
   # tesseract adds .pdf itself, so this writes $OCRD/$i.pdf
   tesseract "$i" "$OCRD/$i" pdf
   if [ $? -eq 0 ]
   then
    mv "$i" "$OCRD/$i"
   else
    mv "$i" "failed/"
   fi
  else
   break
  fi
 done;


 for i in *.pdf; do

  if [ -e "$i" ]
  then
   pdfsandwich "$i" -o "$OCRD/$i"
   if [ $? -eq 0 ]
   then
    rm "$i"
    chgrp users "$OCRD/$i"
    chmod g+rw "$OCRD/$i"
   else
    mv "$i" "failed/"
   fi
  else
   break
  fi

 done

fi



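# Strip CRs, turn \ into /, and still pick up a last line that has no trailing newline
tr -d "\r" < "$DIR/movematrix.txt" | sed 's/\\/\//g' | while read -r STR || [[ -n $STR ]]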
do
 if [[ $STR ]]
 then
  
  srch=${STR%:*}
  dest=${STR#*:}

  if [[ $srch ]]
  then
   echo "Scanning for $srch";
 
   pdfgrep -i -r -H -m 1  "$srch" "$OCRD" | cut -d: -f1 | while read line
   do
       echo "Moving $line to $dest"
       mv --backup=existing --suffix=.dupe "$line" "$BASE/$dest"
   done
  fi
 fi
done

This can be broken down into three sections.  The first just makes sure that this is the only instance of the script running.  It's intended to be run from a cron task, but it can take some time, and I quickly found out that if another instance fires up before the previous one finishes, you can very quickly bring your server to its knees!  I can't honestly remember where I got this bit of code from; somewhere on the 'net!  Stack Overflow, probably!
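
If you'd rather not lean on fuser, here's a rough sketch of a different guard.  To be clear, this is not what the script above does: it's the common "lock directory" trick, which works because mkdir fails if the directory already exists.  The function names and the lock path are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helpers, not part of the original script.

# Succeeds only for the first caller: mkdir refuses to create a
# directory that is already there, so a second run gets a failure.
acquire_lock() {
    mkdir "$1" 2>/dev/null
}

# Removes the lock directory again so the next run can start.
release_lock() {
    rmdir "$1" 2>/dev/null
}

# Typical use at the top of a cron script (the path is an assumption):
#   LOCK="/tmp/filing.lock"
#   acquire_lock "$LOCK" || { echo "already running"; exit 1; }
#   trap 'release_lock "$LOCK"' EXIT
```

The catch with this approach is that a crashed run can leave the lock behind, which is one reason the fuser check is attractive: it only counts processes that really are still alive.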

The next section works through the incoming scans folder and OCRs each file!  If it's an image file, it uses tesseract-ocr to do the job, creating a nice new PDF at the end.  If it's already a PDF, then we use pdfsandwich, which handles all the image extraction, OCR (using tesseract-ocr) and re-compilation with the text layer.
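
One detail that's easy to miss is where the searchable PDF ends up.  tesseract tacks .pdf onto whatever output name you give it, while pdfsandwich writes exactly the name passed with -o.  A small sketch of that naming rule, with a made-up helper name (ocr_output_path) that doesn't appear in the script:

```shell
#!/bin/sh
# Hypothetical helper: print the path of the searchable PDF that the
# OCR step would produce for a given scan, following the script's naming.
ocr_output_path() {
    file="$1"
    outdir="$2"
    case "$file" in
        *.jpg) printf '%s\n' "$outdir/$file.pdf" ;;  # tesseract appends .pdf to the output base
        *.pdf) printf '%s\n' "$outdir/$file" ;;      # pdfsandwich -o names the file exactly
        *)     return 1 ;;                           # anything else is not handled
    esac
}
```

So scan1.jpg turns into scan1.jpg.pdf in the ocrd folder, sitting next to the original image once that's been moved across.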

Finally, load up a "what-goes-where" matrix file, and use pdfgrep to scan all those nice new PDFs to find known matches and move the files off to where they should go.

movematrix.txt controls all this part.  It's a simple file format :

National Savings:Bank/NS&I
TSB Bank plc:Bank/TSB
Bank of Scotland plc:Bank/Halifax BoS
Dental Department|Dentist:Medical/Dental
npower:Utilities/npower
TV LICENSING:Utilities\TV Licensing

Basically, it's <search string>:<folder to place matches>, one entry per line. Blank lines are ignored.
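
For the curious, the split on the colon is done with two parameter expansions.  Here's the same thing as a stand-alone sketch; split_entry is just a name I've made up to wrap it:

```shell
#!/bin/sh
# Hypothetical wrapper around the two expansions the script uses.
# Prints the search string on the first line and the folder on the second.
split_entry() {
    entry="$1"
    [ -n "$entry" ] || return 1   # blank entries are skipped
    srch="${entry%:*}"            # everything before the LAST colon
    dest="${entry#*:}"            # everything after the FIRST colon
    printf '%s\n%s\n' "$srch" "$dest"
}

# split_entry 'npower:Utilities/npower' prints "npower" then "Utilities/npower"
```

Note that the two expansions only agree when there's exactly one colon on the line, so it's safest to keep colons out of both the search string and the folder name.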

You can use multiple search strings, as in the dentist example, separated by |, or indeed any other search parameter syntax allowed by pdfgrep.  I'd recommend actually using something like an account number or other unique reference that will allow you to identify correspondence more accurately.  But it does the searches in sequence, so if you get "false positives" for some search terms, move them to the end so that others get a chance to catch the documents first.
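
To show why the order matters, here's a toy version of the matching.  It's only a sketch: it uses grep -E -i on a line of text as a stand-in for pdfgrep -i on a real PDF, and first_match is a made-up name.  In the real script the "first one wins" effect comes from the file being moved out of the ocrd folder, so later entries never see it.

```shell
#!/bin/sh
# Hypothetical helper: read matrix entries on stdin and print the folder
# of the first entry whose search string matches the given text.
first_match() {
    text="$1"   # stand-in for the OCR text of one document
    while IFS= read -r entry || [ -n "$entry" ]; do
        [ -n "$entry" ] || continue          # ignore blank lines
        srch="${entry%:*}"
        dest="${entry#*:}"
        # -E gives extended regex, so | separates alternatives; -i ignores case
        if printf '%s\n' "$text" | grep -E -i -q -- "$srch"; then
            printf '%s\n' "$dest"
            return 0
        fi
    done
    return 1
}
```

Feed it the dentist and npower entries in that order, and a letter mentioning both "dentist" and "npower" lands in Medical/Dental, because that entry is tried first.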

By default, the script does a case-insensitive search, and the consequent move keeps a backup (with a .dupe suffix) of any existing file of the same name, so a name clash doesn't silently overwrite the earlier document. I also convert every \ to / so that you can paste in (relative) paths in a Microsoft format, as shown above, and it'll cope.  Similarly, we ignore CRs and blank lines, so you can safely edit the file using a Windows editor such as Notepad, and we won't get all messed up.
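
The tidying-up of a Windows-edited matrix file boils down to a tr and a sed.  A minimal sketch, with a made-up function name (normalise_matrix), that also drops the blank lines up front rather than skipping them later as the script does:

```shell
#!/bin/sh
# Hypothetical filter: clean up matrix lines arriving on stdin.
normalise_matrix() {
    # 1. delete carriage returns left by Windows line endings
    # 2. turn every backslash into a forward slash
    # 3. drop lines that are empty or only whitespace
    tr -d '\r' | sed -e 's/\\/\//g' -e '/^[[:space:]]*$/d'
}

# Example: printf 'TV LICENSING:Utilities\\TV Licensing\r\n' | normalise_matrix
# prints: TV LICENSING:Utilities/TV Licensing
```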

Just run the script fairly regularly via cron, and it'll do all your filing for you!

And of course, if it misses a file, you can still drag-and-drop it manually!  Or edit the matrix file to add a search term.


Dependencies - ksh, simply for the first "don't run me twice" bit.  tesseract-ocr, pdfsandwich, pdfgrep.
