Archiving my Filing with tesseract
I hate filing. We get lots of paperwork that needs keeping, and it's a pain in the neck. I've got boxes of the stuff, and every time I need something it's a major task to find it.
So, the plan is to digitise it all. We have a network-connected photocopier that will also act as a sheet-fed scanner, saving PDF files directly onto a network share. That's the first step: scan things.
But what to do next? A pile of random image-within-a-PDFs isn't much use, not without being sorted into, at least, some sort of order.
I could just browse the folder and drag-and-drop the files into the relevant folders, but that's a lot of work and very time-consuming. I'm a great believer in "let the computer do the work", so I threw together a little script to do the job. Here we go:
#!/bin/ksh

# --- Section 1: refuse to start if another instance is still running ---
thisFILE="$(whence ${0})"
progName="${0##*/}"
myPID="$$"
# fuser lists the PIDs that currently have this script file open
FUSERout=$(fuser ${thisFILE} 2>/dev/null)
typeset -i numProc=$(echo "${FUSERout}" | nawk '{print NF}')
if [[ "${numProc}" -gt 1 ]]; then
    echo "${progName}: another instance(s) of [${thisFILE}] is currently still running\
 [$(echo ${FUSERout} | sed -e 's/  */ /g')] - exiting THIS [${myPID}] run."
    exit 1
fi

# --- Section 2: OCR everything in the incoming folder ---
# BASH_SOURCE is not set under ksh, so work out the script folder from thisFILE
DIR="$( cd "$( dirname "${thisFILE}" )" && pwd )"
IN="/TheDisc/DocumentArchive/Scans/new/"
OCRD="/TheDisc/DocumentArchive/Scans/ocrd"
BASE="/TheDisc/DocumentArchive"

# Passing any argument skips the OCR pass and only re-runs the filing step
if [ -z "$1" ]
then
    cd "$IN" || exit 1
    for i in *.jpg; do
        if [ -e "$i" ]
        then
            echo "Processing $i"
            # tesseract adds .pdf to the output name itself
            tesseract "$i" "$OCRD/$i" pdf
            if [ $? -eq 0 ]
            then
                mv "$i" "$OCRD/$i"
            else
                mv "$i" "failed/"
            fi
        else
            break
        fi
    done
    for i in *.pdf; do
        if [ -e "$i" ]
        then
            pdfsandwich "$i" -o "$OCRD/$i"
            if [ $? -eq 0 ]
            then
                rm "$i"
                chgrp users "$OCRD/$i"
                chmod g+rw "$OCRD/$i"
            else
                mv "$i" "failed/"
            fi
        else
            break
        fi
    done
fi

# --- Section 3: file the OCR'd PDFs according to movematrix.txt ---
# Strip CRs and swap \ for / so the file can be edited on Windows
cat "$DIR/movematrix.txt" | tr -d "\r" | sed 's/\\/\//g' | while read STR
do
    if [[ $STR ]]
    then
        srch=${STR%:*}
        dest=${STR#*:}
        if [[ $srch ]]
        then
            echo "Scanning for $srch"
            pdfgrep -i -r -H -m 1 "$srch" "$OCRD" | cut -d: -f1 | while read line
            do
                echo "Moving $line to $dest"
                mv --backup=existing --suffix=.dupe "$line" "$BASE/$dest"
            done
        fi
    fi
done
This can be broken down into three sections. The first just makes sure that this is the only instance of the script running: it uses fuser to count the processes that have the script file open, and bails out if there is more than one. The script is intended to be run from a cron task, but it can take some time, and I quickly found out that if you allow another instance to fire up before the previous one finishes, you can very quickly bring your server to its knees! I can't honestly remember where I got this bit of code from; somewhere on the 'net! Stack Overflow, probably!
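If fuser isn't to hand on your box, a lock directory can do much the same job. To be clear, this is a sketch of a different technique from the one in my script: it relies on mkdir failing when the directory already exists, so only one process can "win". The lock path and the two function names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical lock location; pick somewhere that suits your system.
LOCKDIR="${TMPDIR:-/tmp}/filing-script.lock"

# Try to take the lock. mkdir fails if the directory already exists,
# so only one caller can succeed at a time.
acquire_lock() {
    mkdir "$LOCKDIR" 2>/dev/null
}

# Give the lock back.
release_lock() {
    rmdir "$LOCKDIR" 2>/dev/null
}

if acquire_lock; then
    # Remove the lock again whenever the script exits normally
    trap release_lock EXIT
    echo "Got the lock, carrying on"
    # ... the OCR and filing steps would go here ...
else
    echo "Another instance appears to be running - exiting"
    exit 1
fi
```

The catch is that a hard kill or a power cut leaves the directory behind, and you would have to remove it by hand before the next run. The fuser check doesn't have that problem, which is one reason I've stuck with it.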
The next section scans through the incoming scans folder and OCRs them! If it's an image file, it uses tesseract-ocr to do the job, creating a nice new PDF at the end. If it's already a PDF, then we use pdfsandwich, which handles all the image extraction, OCR (using tesseract-ocr) and re-compilation with the text layer.
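To see what that pass would do without actually running any OCR, something like the following dry run might help. It's only a sketch: plan_ocr is a made-up helper, nothing is converted, and it just prints which tool would handle each file and where the result would land. Note that tesseract adds .pdf to the output name, so scan.jpg ends up as scan.jpg.pdf.

```shell
#!/bin/sh
# Dry run: print what would be done to each file in a folder, without
# calling tesseract or pdfsandwich. plan_ocr is a made-up helper name.
plan_ocr() {
    in_dir=$1
    out_dir=$2
    for f in "$in_dir"/*.jpg "$in_dir"/*.pdf; do
        # An unmatched pattern is left as literal text, so skip it
        [ -e "$f" ] || continue
        name=${f##*/}
        case "$name" in
            *.jpg) echo "tesseract $name -> $out_dir/$name.pdf" ;;
            *.pdf) echo "pdfsandwich $name -> $out_dir/$name" ;;
        esac
    done
}

# Try it on a throwaway folder
demo=$(mktemp -d)
touch "$demo/letter.jpg" "$demo/bill.pdf"
plan_ocr "$demo" "/TheDisc/DocumentArchive/Scans/ocrd"
rm -r "$demo"
```

The [ -e ] test is the same guard the real script uses: if there are no .jpg files at all, the shell hands the loop the literal text *.jpg, and we don't want to feed that to tesseract.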
Finally, load up a "what-goes-where" matrix file, and use pdfgrep to scan all those nice new PDFs to find known matches and move the files off to where they should go.
movematrix.txt controls all this part. It's a simple file format:
National Savings:Bank/NS&I
TSB Bank plc:Bank/TSB
Bank of Scotland plc:Bank/Halifax BoS
Dental Department|Dentist:Medical/Dental
npower:Utilities/npower
TV LICENSING:Utilities\TV Licensing
Basically, it's <search string>:<folder to place matches>, one entry per line. Blank lines are ignored.
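The split itself is done with two parameter expansions, the same ones the script uses. Here's a small standalone sketch of how one entry comes apart (the variable name entry is just for this demo):

```shell
#!/bin/sh
# One line from movematrix.txt
entry='Dental Department|Dentist:Medical/Dental'

# Everything before the last colon is the search string...
srch=${entry%:*}
# ...and everything after the first colon is the destination folder.
dest=${entry#*:}

echo "search: $srch"
echo "folder: $dest"
```

One consequence of splitting this way is that a line with more than one colon comes apart oddly, so it's best to keep colons out of both the search string and the folder name.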
You can use multiple search strings, as in the dentist example, separated by |, or indeed any other search parameter syntax allowed by pdfgrep. I'd recommend actually using something like an account number or other unique reference that will allow you to identify correspondence more accurately. But it does the searches in sequence, so if you get "false positives" for some search terms, move them to the end so that others get a chance to catch the documents first.
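If you want to try out a search string before adding it to the matrix, you can get a rough idea with plain grep. This is only a stand-in: it assumes pdfgrep treats the pattern as an extended regular expression, as grep -E does (which I believe is its default), and the sample text and the matches helper are made up for the demo.

```shell
#!/bin/sh
# Stand-in for the text layer of an OCR'd letter (made up for this demo)
text='Your next appointment with the DENTIST is on Tuesday.'

# Return success if the pattern matches the text, ignoring case.
# grep -E is used here only as a stand-in for pdfgrep.
matches() {
    printf '%s\n' "$2" | grep -E -i -q "$1"
}

if matches 'Dental Department|Dentist' "$text"; then
    echo "Would file under Medical/Dental"
fi
```

Either side of the | is enough for a match, and the -i is why DENTIST in capitals still counts.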
By default, the script does a case-insensitive search, and the consequent move keeps a backup: if a file of that name already exists in the destination, the existing one is renamed with a .dupe suffix rather than being overwritten. (With a simple suffix like this, a third file of the same name would replace the earlier .dupe copy, so it's worth keeping an eye out for them.) I also swap every \ for a / so that you can paste in (relative) paths in a Microsoft format, as shown above, and it'll cope. Similarly, we ignore CRs and blank lines, so you can safely edit the file using a Windows editor such as Notepad, and we won't get all messed up.
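The clean-up is just the tr and sed steps from the script. Pulled out on their own, they look something like this (normalise is a made-up name for the sketch):

```shell
#!/bin/sh
# Clean up one raw line the way the script's tr and sed steps do:
# drop carriage returns, then turn every backslash into a forward slash.
normalise() {
    printf '%s\n' "$1" | tr -d '\r' | sed 's/\\/\//g'
}

# A line as Notepad might save it: backslash in the path, CR on the end
raw=$(printf 'TV LICENSING:Utilities\\TV Licensing\r')
normalise "$raw"
```

That should come out as TV LICENSING:Utilities/TV Licensing, ready to be split at the colon.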
Just run the script fairly regularly via cron, and it'll do all your filing for you!
And of course, if it misses a file, you can still drag-and-drop it manually! Or edit the matrix file to add a search term.
Dependencies - ksh (simply for the first "don't run me twice" bit, which also uses fuser and nawk), tesseract-ocr, pdfsandwich and pdfgrep.