Rob's Ramblings

Wednesday 12 June 2024

IPv6 on a Sophos UTM

This has been bugging me for a while... I accidentally clicked on a long-forgotten tab in my browser that was set to test-ipv6.com, which was failing, so I decided I might as well look into it again.

I am currently with BT Internet as an ISP.  Their router admits that it has an IPv6 address.  I don't really use the router, though: WiFi is OFF, and all incoming connections are forwarded to the single device connected to it (set as DMZ host in Advanced>Firewall>Configuration), a box running Sophos UTM as my main firewall/router.  This is IPv6 capable, but it's been a bit confusing as to how to set it up. I could never find a simple "set these things and it should work" guide, so having managed to get it working, here's the simple guide I wish I'd been able to find in the first place.

Ok. Obviously, under Interfaces & Routing > IPv6 > General, make sure IPv6 is actually enabled. 

On the Interfaces & Routing > Interfaces tab, for your upstream connection, make sure you've got both Dynamic IPv6 and IPv6 Default Gateway switched on.

Now, for the interface for your LAN, make sure Dynamic IPv6 is OFF, and set yourself a fixed IPv6 address.  Unless you need to make all the machines on your LAN visible to the world, you should pick a private (unique local) range for this.  This post details how you can pick this, but TL;DR, use fdxx:xxxx:xxxx:yyyy:zzzz:zzzz:zzzz:zzzz, where xxx... is ten random hexadecimal digits and yyyy is a network number, which will usually be 0001. The zzz... is the number that identifies the individual device on the network.  I used 0000:0000:0000:0001 for the UTM, which compresses down to ::1.
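If you'd rather not invent the ten hex digits by hand, the address layout above can be sketched in a few lines of Python. This is only an illustration, not something the UTM needs: the function names are made up for the example, and the defaults assume network number 0001 and device ::1 as used above.

```python
import ipaddress
import secrets


def random_ula_prefix() -> ipaddress.IPv6Network:
    """Return a random fdxx:xxxx:xxxx::/48 private prefix."""
    global_id = secrets.token_bytes(5)          # the ten random hex digits
    packed = b"\xfd" + global_id + bytes(10)    # pad out to 16 bytes
    return ipaddress.IPv6Network((ipaddress.IPv6Address(packed), 48))


def device_address(prefix: ipaddress.IPv6Network,
                   network: int = 0x0001,
                   device: int = 0x1) -> ipaddress.IPv6Address:
    """Combine the prefix, the network number (yyyy) and the device number (zzzz...)."""
    return ipaddress.IPv6Address(
        int(prefix.network_address) | (network << 64) | device
    )


if __name__ == "__main__":
    prefix = random_ula_prefix()
    print("LAN prefix: ", prefix)
    print("UTM address:", device_address(prefix))
```

Run it once, note the prefix down, and reuse the same prefix for every fixed address on that LAN.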


Under IPv6 > Prefix Advertisement, create a record for the LAN interface, and enable Stateless Integrated Server. The DNS Entry here was picked up from that advertised by the BT router, and will be their server, but you can put any valid addresses in here.


Under IPv6 > Interfaces > Multipath Rules, create a new entry to route everything out via the uplink interfaces.  This seems to be the equivalent of setting a default gateway. Until I did this, I could ping the UTM, but not get any further.


You can obviously customise this as you see fit!  But after doing this, it all seems to work!!






Wednesday 16 November 2022

Adventures in the Fediverse

As many of you reading this close to the date of publication will know, Twitter is in meltdown.  Elon Musk has bought it, taking it into private ownership, and has been implementing his own brand of reform.  Much has been written elsewhere, so I'll not go into details, but the trashing of the verified status system, and the sacking of, or resignations by, almost all senior staff, trust & privacy, moderators and developers, have all caused chaos.  As such, anybody with any sense has been looking for an alternative to Twitter.

This is where the fediverse, principally championed by Mastodon, has stepped up.  It is a distributed, federated collection of servers, all of which can talk to each other, operated not by any one individual or company, but each instance by its own users.  Anybody can set up a server and join in.  So I did.  This is my tale.

I won't detail all the problems I had doing this.  I'm merely going to indicate what worked. This is why you're here, after all!

I'm installing on a virtual server that actually runs on a physical machine at home. This is behind a normal home broadband connection, with a Sophos UTM firewall between them.  The ISP-supplied router has the UTM as its DMZ host, so all incoming connections are passed directly to that.  The UTM has reverse DNAT set up to forward ports 80 and 443 to a virtual machine which runs Nginx Proxy Manager in a docker instance.

The domain name is handled by namecheap.com, and in their DNS, I set hostname "@" as "A+Dynamic DNS Record".  The UTM supports their flavour of Dynamic DNS directly, so it is set to update this should my visible external IP change.

Mastodon was installed on a new VM freshly installed using Ubuntu 20.04 and updated with apt update && apt upgrade.  (I tried first using the latest version, 22.04, but failed multiple times. YMMV but the docs say 20.04, so I went back to that and succeeded.) I followed this installation guide on the official github repository.  This almost worked; almost at the end I got a complaint that Ruby 3.0.4 was required - the guide tells you to install 3.0.3, so I went back and reinstalled ruby, and then continued.  

Do follow the directions to install nginx on your mastodon server, as it's used to handle static content as well as passing things to and from the mastodon code. However, DO NOT INSTALL CERTBOT/LetsEncrypt.

THIS is the /etc/nginx/sites-enabled/mastodon  file I ended up using:

map $http_upgrade $connection_upgrade {
  default upgrade;
  ''      close;
}

server {
  listen 80;
  listen [::]:80;
  server_name irrelevant.me.uk  192.168.200.109  "";

  keepalive_timeout    70;
  sendfile             on;
  client_max_body_size 0;

  root /home/mastodon/live/public;

  gzip on;
  gzip_disable "msie6";
  gzip_vary on;
  gzip_proxied any;
  gzip_comp_level 6;
  gzip_buffers 16 8k;
  gzip_http_version 1.1;
  gzip_types text/plain text/css application/json application/javascript text/xml application/xml application/xml+rss text/javascript;

  add_header Strict-Transport-Security "max-age=31536000";

  location / {
    try_files $uri @proxy;
  }

  location ~ ^/(emoji|packs|system/accounts/avatars|system/media_attachments/files) {
    add_header Cache-Control "public, max-age=31536000, immutable";
    try_files $uri @proxy;
  }

  location @proxy {
    proxy_set_header Host irrelevant.me.uk;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto https;
    proxy_set_header Proxy "";
    proxy_pass_header Server;

    proxy_pass http://localhost:3000;
    proxy_buffering off;
    proxy_redirect off;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection $connection_upgrade;

    tcp_nodelay on;
  }

  location /api/v1/streaming {
    proxy_set_header Host irrelevant.me.uk;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto https;
    proxy_set_header Proxy "";

    proxy_pass http://localhost:4000;
    proxy_buffering off;
    proxy_redirect off;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection $connection_upgrade;

    tcp_nodelay on;
  }

  error_page 500 501 502 503 504 /500.html;
}

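In server_name, I set the official name of the instance, the IP address, and, because all else failed, "" to try to make it accept anything. (Strictly, "" only matches requests that arrive with no Host header at all; adding default_server to the listen lines is the usual way to make a server block the catch-all.)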

The rest is based on code in a github post I've now lost the link to, sorry! Note that there is no SSL in here any more.

Note also that I had to hard-code the hostname in the proxy_set_header Host directives.  This means that whatever address you use in the URL to connect to the server, it passes the correct hostname through to mastodon.  This was the critical bit that stopped me getting 403 errors on any access - it seems that mastodon is locked tight to the domain name you set when you install it.

Check you can connect locally via http://ip_of_your_server/ and get the front page!


In Nginx Proxy Manager, I set nothing unusual.  Set the domain name you want to serve, point it at http://ip_of_your_server:80, set Force SSL, and allow NPM to deal with Let's Encrypt to get your SSL set up.

Your instance should be accessible at the domain name you have set up!

There are some things I want to tidy up.  I've not tested the streaming part yet - I'm seeing 502 errors in the logs, but I'm not sure why yet, or what it is that's using it. Also, every line in the log seems to be coming from the ip of NPM, so there's more to set up here to identify the original IP.  However, this is the first time I've used nginx, so it's all a bit of a learning curve. For now, I'm just glad I can actually read and post things!

Monday 17 September 2018

Some contemplating on frame storage formats and clashes therein

I recently posted a little bit on how I now store contributed videotex (teletext and viewdata) frames within a database, so as to make accessing them far easier on the application side.

To do this, I had to decide on exactly how to store the visible content of the frame.  Everything else is easy; I created a secondary table holding key=>value pairs, which means it is very easily expandable, and any application needing particular data can go and look for its own, and not be confused by anything extra.

So. The frame content itself.  I didn't get much help looking at existing storage formats, as I've got at least 17 types documented, and others I know about.  I may however have been influenced somewhat by them.

When you think about a viewdata frame, or a teletext page, you automatically see the 23-25 lines by 40 columns of static image.  Almost every frame you will find that has been saved out by a terminal emulator, or by a teletext capture, will consist of those 920, 960 or 1000 bytes of data, perhaps with some meta-data accompanying, sometimes not.  I think that every third-party viewdata host that I have so far encountered also stored its pages so.  Individual characters took up a single byte as per their ASCII character code, and colour and control characters were also stored as a single byte.  For teletext, this uses the non-display codes below the space, as there is no concept of cursor movements, carriage returns, etc, on a teletext screen, which is what these values are used for in a serial-terminal based service.

Prestel, and viewdata generally, is however serial.  Frames are sent to the user as ASCII characters, but the colour and control codes are sent as command sequences:  Escape then a capital letter.  So, what might be stored in a teletext page as "<01>RED<02>GREEN<07>WHITE" would be sent to a viewdata terminal as "<ESC>ARED<ESC>BGREEN<ESC>GWHITE".  Short lines would be terminated by a carriage return and linefeed, so reducing the need to send the whole 40 characters.
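As a rough illustration of that conversion, here is a minimal Python sketch that turns one stored teletext-style row into the bytes that would go down the line. The function name is invented for the example, and it assumes a full 40-character row simply wraps, so only short lines get the CR/LF.

```python
ESC = 0x1B


def to_viewdata(row: bytes, width: int = 40) -> bytes:
    """Turn one stored teletext-style row into viewdata serial bytes."""
    # Trailing spaces need not be sent: a short line ends with CR LF instead.
    trimmed = row.rstrip(b" ")
    out = bytearray()
    for code in trimmed:
        if code < 0x20:
            # Control code: send Escape, then the code plus 64,
            # so 0x01 (red) becomes ESC 'A' and 0x02 (green) ESC 'B'.
            out += bytes([ESC, code + 0x40])
        else:
            out.append(code)
    if len(trimmed) < width:
        out += b"\r\n"
    return bytes(out)


if __name__ == "__main__":
    print(to_viewdata(b"\x01RED\x02GREEN\x07WHITE"))
```

Fed the example above, it produces Escape-A, Escape-B and Escape-G in place of the three single-byte colour codes.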

Now.. Prestel itself is known to have stored the frame data exactly as it would be sent to the user.  There was a hard limit of 920 bytes available to the editor to use, and colour codes, etc, took up two of them.  This made creating complicated graphical pages somewhat difficult, as too many colour changes could quickly eat up all the allocation.  (Response frames were even worse; you only got 716 bytes to play with!)  This is probably why all third party viewdata servers stored their pages as the 22x40 character full image, with the control codes stored as per teletext.  Doing this allowed for much more colour and graphic rich content than was possible on Prestel itself - the conversion was done on transmission.  The actual codes stored varied - some systems used 7 bit data throughout, some used top-bit-set letters to indicate that the letter needed the escape sending before it, some used 7 bits for visible characters and top-bit-set control codes (codes in the range 128-159), and at least one had everything with the top bit set!
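To get a feel for how quickly colour changes eat into that allocation, here is a small Python sketch that counts what a frame would cost if stored as transmitted. It is a back-of-envelope model only: the names are invented, and it assumes two bytes per control code and two bytes of CR/LF per short line, which may not match Prestel's own accounting exactly.

```python
from typing import Iterable

PRESTEL_LIMIT = 920  # bytes available to the editor, per the figure above


def transmitted_size(rows: Iterable[bytes], width: int = 40) -> int:
    """Count the bytes a frame would need if stored as transmitted."""
    total = 0
    for row in rows:
        trimmed = row.rstrip(b" ")
        for code in trimmed:
            # A control code goes out as Escape + letter, so it costs two bytes.
            total += 2 if code < 0x20 else 1
        if len(trimmed) < width:
            total += 2  # CR LF ends a short line
    return total


def fits_prestel(rows: Iterable[bytes]) -> bool:
    """True if the frame stays within the assumed 920-byte allocation."""
    return transmitted_size(rows) <= PRESTEL_LIMIT
```

Under those assumptions a 22x40 page of plain text fits, while the same matrix with a colour code in every cell would need nearly twice the limit.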

So fast forward 30 years, and I'm writing code to handle saved viewdata pages and display them on this new-fangled World Wide Web thing.  There is zero support for viewdata and teletext format images, so we have to roll our own, converting saved pages in any number of formats into PNG or GIF (to account for flashing characters) images that a web browser can display.

As an intermediate stage, I have to pull that 22-24x40 matrix of characters out, before plotting them onto a graphics image for sending to the viewer.  This intermediate block of characters I called an "internal" format, and was 7-bit clean, so codes below space for the colour codes, and the rest visible.

For nearly ten years this worked fine, and this internal, intermediate format, was the format used when I created the page database.

It is only this week I hit a problem with this, and it is down to a peculiarity with how Prestel stores Response Frames.  (And, I assume, other frames that are not simple static pages.)

A response frame contains a number of fields that are defined by the editor when they create it, and are either filled automatically by the Prestel server when it displays the page, or can contain text or data to be entered by the user.  When the user hits # on the last field, they are given the option to send (or not) the page to the IP (Information Provider).  It is then delivered to their mailbox in a filled-in state.

When defining a response frame in the standard Prestel online editor, a field is specified by typing, e.g., Ctrl-L n 30 Ctrl-L, which will create a field of 30 characters length containing the subscriber's name - on pressing the second Ctrl-L the system will display 30 "n"s in the required position.  The same procedure is repeated for any other field you request.  What gets stored in the Prestel database is a single Ctrl-L and 30 "n"s.
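Picking those fields back out of the stored data is straightforward in principle. The Python sketch below is a guess at how it could be done rather than a description of any real host: the function name is invented, and it assumes a field runs for as long as the letter after the Ctrl-L keeps repeating.

```python
from typing import List, Tuple

CTRL_L = 0x0C  # start-of-field marker in the stored frame data


def find_fields(stored: bytes) -> List[Tuple[int, str, int]]:
    """Return (offset, field letter, length) for each Ctrl-L field found."""
    fields = []
    i = 0
    while i < len(stored):
        if stored[i] == CTRL_L and i + 1 < len(stored):
            letter = stored[i + 1]
            j = i + 1
            # The field lasts for as long as the same letter repeats.
            while j < len(stored) and stored[j] == letter:
                j += 1
            fields.append((i, chr(letter), j - (i + 1)))
            i = j
        else:
            i += 1
    return fields
```

For the example above, a frame holding Ctrl-L and 30 "n"s would come back as a single field of type "n" and length 30.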

When you retrieve a page from Prestel using the "Bulk" Online editor, it is sent exactly as stored, so you get the Ctrl-L and sequence of letters alongside the Escape'd colour codes and CR/LFs for short lines.  Uploading a replacement frame you specify the layout in the same manner.

Those of you familiar with the standard ASCII control codes will recognise that Ctrl-L is also known as "Clear Screen", and is a character that is usually sent before sending the frame content.  This is probably why it was used for this purpose - finding it in the middle of the frame content would not make sense, so it was re-purposed as a flag for start-of-field.  Obviously this is never actually sent to the user, but is replaced by a space when viewing on a terminal.

Now ...

I have two small databases in my possession that were pulled back down from Prestel at some point, and these include a number of Response Frames.

When I converted the data to my "internal" format to load them into the database, this normalised the control codes to 7-bit data, filling the lower 32 codes of the table.  On displaying, these codes were sent as <Esc><code + 64>, thus recreating the colour sequences.

When it comes to a <Ctrl-L>, however, this was never stored in the database - the normalisation routine ignored it.  However, even if it had been saved, on recall, it would have been translated into an <Esc>L, the sequence to end double-height text.

So, to summarise, the normalisation I did, in most cases, lost the start-of-field character because it wasn't expected in a frame.  And if it did make it through, it would be indistinguishable from the "Single Height" code, and as that was allowed anywhere in a response frame, it couldn't be deduced from context.
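The clash is easy to see once written down. In this Python sketch (names invented for the illustration, not taken from my actual code) the start-of-field marker and the single-height code are the same byte, so after expansion they are the same two characters on the wire.

```python
ESC = 0x1B
CTRL_L = 0x0C          # Prestel's start-of-field marker
SINGLE_HEIGHT = 0x0C   # teletext-style "normal height" control code


def expand(internal: bytes) -> bytes:
    """Expand 7-bit internal codes to the <Esc><code + 64> serial form."""
    out = bytearray()
    for code in internal:
        if code < 0x20:
            out += bytes([ESC, code + 0x40])
        else:
            out.append(code)
    return bytes(out)


if __name__ == "__main__":
    field = bytes([CTRL_L]) + b"nnn"
    text = bytes([SINGLE_HEIGHT]) + b"nnn"
    # Both lines print the same bytes: Escape, 'L', then the letters.
    print(expand(field))
    print(expand(text))
```

Since 12 + 64 is 76, the letter "L", nothing downstream can tell which of the two meanings was intended.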

I never noticed, because there were so few frames affected, and there was no need to process the fields the code indicated, anyway!

This last month, however, I've been working on a viewdata host program that will run on a modern server, and which I could use to recreate the look and feel of using the original Prestel service.  I've been testing this using an actual Prestel terminal, and it's been great fun!  It's only when I stumbled across one of these response frames, and decided to support them, that I discovered this problem!

Looking into how other file formats solved this, it seems that at least one of them uses <Esc> itself as the field indicator.  If stored in the database like that, when expanded on recall this would translate into a code sequence that is unused in viewdata, so it is a suitable alternative.  I will translate the affected pages, eventually!


So, a decision taken about 10 years ago came back to bite me this week. And it's all to do with 25 year old data in a file format determined 40 years ago that everyone else decided needed to be done differently.

Well done for making it this far!


As an aside .. Prestel added support for "Dynamic frames" which were basically frames that could contain cursor movement characters.  This meant you could go back and change things after you had already drawn them.  This was easy for them, as they stored data in an as-transmitted form anyway.  It's not so easy for host software that expects its frames to be stored in a fixed matrix!  I'll be working on this, once I find some original examples....



Friday 14 September 2018

The Videotex Database - submit your pages now!

When I started viewdata.org.uk (and teletext.org.uk), I just uploaded the pages and databases I had as-is, and had my scripts deal with them on an as-accessed basis.  This is because I wanted to preserve the data as much as possible - any translation to a new format (such as JPEGs) would inevitably lose data, as well as context.

As time has moved on, and as the variety of data formats I have had to deal with has proliferated, this has increasingly become somewhat unwieldy. I decided, therefore, to try and rationalise things somewhat.

Each of the various file formats I was dealing with had different properties. Each had strengths, and each had weaknesses.  I could not decide on a single common format to try and convert files into.

Rather than create a new "perfect" file format, I decided therefore to store the frames within a database.  By having a primary table for the page content and certain static data, and a separate table for meta-data, any particular properties a particular file format had could be accommodated.
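To make the two-table idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is not the real schema: the table names, column names and helper functions are all invented for the illustration, and the actual database need not be SQLite at all.

```python
import sqlite3

SCHEMA = """
CREATE TABLE frame (
    id       INTEGER PRIMARY KEY,
    service  TEXT NOT NULL,
    page     TEXT NOT NULL,
    content  BLOB NOT NULL
);
CREATE TABLE frame_meta (
    frame_id INTEGER NOT NULL REFERENCES frame(id),
    key      TEXT NOT NULL,
    value    TEXT,
    PRIMARY KEY (frame_id, key)
);
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a database with the two hypothetical tables."""
    db = sqlite3.connect(path)
    db.executescript(SCHEMA)
    return db


def add_frame(db, service, page, content, **meta):
    """Store the page content, plus whatever format-specific extras exist."""
    cur = db.execute(
        "INSERT INTO frame (service, page, content) VALUES (?, ?, ?)",
        (service, page, content),
    )
    frame_id = cur.lastrowid
    db.executemany(
        "INSERT INTO frame_meta (frame_id, key, value) VALUES (?, ?, ?)",
        [(frame_id, key, str(value)) for key, value in meta.items()],
    )
    db.commit()
    return frame_id


def get_meta(db, frame_id, key, default=None):
    """Look up one key; anything an application doesn't ask for is ignored."""
    row = db.execute(
        "SELECT value FROM frame_meta WHERE frame_id = ? AND key = ?",
        (frame_id, key),
    ).fetchone()
    return row[0] if row else default
```

The point of the key/value table is that a property only one file format has becomes just another row, with no schema change needed.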

Once the data is held within a standardised database, of course, it makes it much easier to access it and use it from many different applications.  The first, and most obvious, is the ability to search across the entire database for key words or phrases. This is implemented on the front page of the database.

The main in-browser viewer for the saved pages implements a timeline function, where you can see how a given page has changed over time.  See, for example, the CEEFAX news headlines.

And of course, for viewdata pages, one can implement a dial-up host, so 1980s terminals can connect directly into the service and browse it exactly as they did at the time.  (This is mostly done, just pending further tidying up!)

Currently the database contains page data I have collected myself or have already been sent. However, I am aware that there is a vast amount more out there.  Jason Robertson has been amazing at rescuing teletext pages off old video tapes, and I know of at least one previous Prestel IP that has a massive archive of pages still extant, albeit sat on very old hardware.  I've got part of The Gnome At Home, and I know the rest still exists.

This week's task (one of the various "I'll do something" for Retrochallenge 2018/09) was to create a page for viewers to directly submit their pages to the database.  This is now complete!  It actually places the data into a queue, after briefly validating it, so it can be checked and added later.  I would welcome any contributions, anything from a single frame to a complete service backup!  If you need help, feel free to drop me a line.



Monday 3 September 2018

A Viewdata Host

One of my aims when setting up viewdata.org.uk was to create a means by which readers could experience connecting to a viewdata service, and also to use such to present what saved pages we had in an appropriate context.

Sadly, there was nothing available that I could find that would allow me to run an actual host, and although I had some success firing up my old BBC Micro based viewdata BBS, this didn't last long due to multiple hardware failures.

Back up to today, and, as I mentioned yesterday, John Newcombe has written, and is running, his own viewdata host called TELSTAR.   I've discussed some things with John, and had been hoping to blag a copy of the software, but it seems that it's not quite what I am looking for.

Now, I have been building up a database of frames - this is yet another unfinished project - over at db.viewdata.org.uk.   This database is what I want to use as the source of the data for a host system.
Although it's mostly got teletext loaded up, I do have a complete copy of the PC Plus demo of Micronet loaded up, which can act as a starting point.

So, what to do?  Well it's obvious, write my own host software.  I've been putting this off for years, but, it's #retrochallenge time, and I do want to achieve something...

A few hours last night got the bare bones sorted out, and a bit of time debugging, and we're at a point where I can dial in and navigate between pages!  Woo!

Whereas John has been concentrating on content for his viewdata host, I'm going to be working on making mine more of a "Prestel Emulator"; it should feel as close to the original as possible.  I've a lot to do, obviously, but not bad for a few hours' work.





Saturday 1 September 2018

Modem Emulation - an RC2018/09 prologue

Most of you will know by now that I'm really into preserving the memory of Prestel and Viewdata systems generally.  I run www.viewdata.org.uk which, while a bit long in the tooth, is going to get a massive update "soon" ...   But today I'm going to talk about hardware.

Some time back, I fired up my old viewdata BBS "Ringworld" - this operated on a collection of BBC Micros - one per connected user - and an Acorn A5000 acting as fileserver.  I connected these to the internet using a motley selection of modems, ATA telephony adapters, and serial terminal adapters.

The upshot was, for a user dialling in, the call was answered by the exact same modem it always had been, connected to a SIP ATA - the digital data was transformed to analogue, before being turned back to digital by the modem.  This always seemed like a poor idea to me. What would be better is if some bit of software answered that digitised telephone call, looked at the whistles and warbles, and turned it directly into a sequence of ASCII bytes for delivery to a telnet port.

I had found a program called iaxmodem that allowed an Asterisk-based PBX to emulate a modem, but it was focused on faxing, and I just couldn't get it to work with the V23 dial-up I wanted.  But it was close.   I spent the next few years, off and on, searching for changes to that, or SIP based alternatives, with no luck.

In the meantime, John Newcombe decided to write his own viewdata host service, called Telstar, in Python, and that can be accessed via a raw socket (like telnet, but without the features!).  There's not a lot of software out there that can both talk Viewdata display protocols and connect to a socket, however.  Richard Russell wrote an example viewdata client that could do it, and you can connect from BeebEm if you load up a suitable comms package and set the RS423 IP parameters.

There have also been a couple of projects to produce a "WiFi Modem" that, basically, looks like a hayes-compatible modem that you connect to via RS232, but it in turn connects to your WiFi, and onwards to a telnet port out on the internet.  This is great for things like BBC Micros, Commodore 64s, etc., where you can just swap out your period modem for this new device.   Not so good for dedicated terminals, or e.g. the ZX Spectrum VTX5000 where the modems are built in.

Then, out of the blue, an old friend, Darren Storer, posted on the BBC Micro Facebook group (I think it was there..) that he'd set up a dial-up number for Telstar, and could people test it.  It took me a week or two to get there, but I pulled out a terminal, dialled the number ... and it didn't work.  Not at all.  I did, however, find out the software he was using ...  asterisk-Softmodem.  This was exactly the sort of project I'd been looking for all those years.  But it didn't work for him.

I pulled the code and had a look, and could see nothing wrong.  So, firing up an asterisk server, and installing it, I tried to debug.  The first issue was my terminal was not locking onto the carrier, so I added a t(-10) to increase the volume, and that sorted that!

Next problem, I wasn't getting much data on screen - many characters were just missing!   This was somewhat easy to diagnose, as I had an inkling after seeing how you configured asterisk to use softmodem - you specified the number of data bits, being between 5 and 8.  The example had it as 8. Now Prestel, and of course the terminals, all used 7 bits with even parity. What I was seeing was the terminal being sent 8-bit data, and of course interpreting that as most of the characters having an incorrect parity bit, and ignoring those!
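The arithmetic behind those missing characters can be sketched in a few lines of Python. This is purely illustrative and is not the code from the softmodem fork: the function names are invented, and it simply shows even parity on 7-bit data.

```python
def add_even_parity(code: int) -> int:
    """Return the 8-bit value for a 7-bit character, with even parity in bit 7."""
    code &= 0x7F
    ones = bin(code).count("1")
    # Set the top bit only when needed to make the count of 1 bits even.
    return code | 0x80 if ones % 2 else code


def check_even_parity(byte: int) -> bool:
    """True if the 8 received bits contain an even number of ones."""
    return bin(byte & 0xFF).count("1") % 2 == 0


if __name__ == "__main__":
    for ch in "AC":
        raw = ord(ch)
        print(ch, hex(raw), "raw passes check:", check_even_parity(raw),
              "with parity:", hex(add_even_parity(raw)))
```

A character such as "A" (two bits set) happens to pass the check even when sent raw, whereas "C" (three bits set) fails unless the parity bit is added, which fits the pattern of only some characters appearing on screen.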

Now, I can set a software terminal to 8-bit data, but not the terminal - there is very little you can configure as a user on these things.  Because the project had no support for parity it looked like a dead end, but that wasn't going to stop me - I'd waited years to find this, and wasn't going to give up now!

Delving into the code, it actually turned out to be a nice simple and straightforward bit of programming.  Adding parity support turned out to be fairly easy... I've published the modifications to my own github fork and submitted a pull request to send them back to the original author.

So now, I can dial into Telstar, CCL4, or anywhere I want to set up a number for!

If you want to try it, the number for Telstar is 0333 340 3311 (from outside UK, +44 333 340 3311). (Calls cost the same as an 01 or 02 number and are included in any inclusive minutes you may have. Calls are free for A&A customers.)

I can't guarantee that number will stay up, and it may not work from time to time if I'm tweaking things, but if it turns out useful to you, please let me know in the comments below!



Friday 16 March 2018

Archiving my Filing with tesseract

I hate filing.  We get lots of paperwork that needs keeping, and it's a pain in the neck.  I've got boxes of the stuff, and every time I need something it's a major task to find the item we want.

So, the plan is to digitise it all.  We have a network connected photocopier that will also act as a sheet-fed scanner, saving as PDF files directly onto a network share.  That's the first step, scan things.

But what to do next?  A pile of random image-within-a-PDFs isn't much use, not without being sorted into, at least, some sort of order.

I could just browse the folder, and drag-and-drop the files into the relevant folders, but that's a lot of work and time consuming.  I'm a great believer in "let the computer do the work", so I threw together a little script to do the job.  Here we go:


#!/bin/ksh

# Single-instance guard: fuser lists every process that has this script open,
# so more than one PID means an earlier run has not finished yet.
thisFILE="$(whence "${0}")"
progName="${0##*/}"
myPID="$$"

FUSERout=$(fuser "${thisFILE}" 2>/dev/null)
typeset -i numProc=$(echo "${FUSERout}" | nawk '{print NF}')
if [[ "${numProc}" -gt 1 ]]; then
   echo "${progName}: another instance(s) of [${thisFILE}] is currently still running" \
        "[$(echo ${FUSERout} | sed -e 's/  */ /g')] - exiting THIS [${myPID}] run."
   exit 1
fi


# BASH_SOURCE is not set under ksh, so use the script path found by whence above
DIR="$( cd "$( dirname "${thisFILE}" )" && pwd )"
IN="/TheDisc/DocumentArchive/Scans/new/"
OCRD="/TheDisc/DocumentArchive/Scans/ocrd"
BASE="/TheDisc/DocumentArchive"


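if [ -z "$1" ]
then

  # Stop here rather than OCR whatever directory we happen to be in
  cd "$IN" || exit 1

  # Images: tesseract writes the searchable PDF into $OCRD,
  # then the original image is moved there too
  for i in *.jpg; do
    if [ -e "$i" ]
    then
      echo "Processing $i"
      tesseract "$i" "$OCRD/$i" pdf
      if [ $? -eq 0 ]
      then
        mv "$i" "$OCRD/$i"
      else
        mv "$i" "failed/"
      fi
    else
      break
    fi
  done

  # PDFs: pdfsandwich adds the text layer and writes the result into $OCRD
  for i in *.pdf; do
    if [ -e "$i" ]
    then
      pdfsandwich "$i" -o "$OCRD/$i"
      if [ $? -eq 0 ]
      then
        rm "$i"
        chgrp users "$OCRD/$i"
        chmod g+rw "$OCRD/$i"
      else
        mv "$i" "failed/"
      fi
    else
      break
    fi
  done

fi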



cat "$DIR/movematrix.txt" | tr -d "\r" | sed 's/\\/\//g' | while read STR
do
 if [[ $STR ]]
 then
  
  srch=${STR%:*}
  dest=${STR#*:}

  if [[ $srch ]]
  then
   echo "Scanning for $srch";
 
   pdfgrep -i -r -H -m 1  "$srch" "$OCRD" | cut -d: -f1 | while read line
   do
       echo "Moving $line to $dest"
       mv --backup=existing --suffix=.dupe "$line" "$BASE/$dest"
   done
  fi
 fi
done

This can be broken down into three sections.  The first just makes sure that this is the only instance of the script running - it's intended to be run from a cron task, but it can take some time, and I quickly found out that if I allow another instance to fire up before the previous one finishes, then you can very quickly bring your server to its knees!  I can't honestly remember where I got this bit of code from; somewhere on the 'net!  Stackoverflow, probably!

The next section scans through the incoming scans folder and OCRs them!  If it's an image file, it uses tesseract-ocr to do the job, creating a nice new PDF at the end.  If it's already a PDF, then we use pdfsandwich, which handles all the image extraction, OCR (using tesseract-ocr) and re-compilation with the text layer.

Finally, load up a "what-goes-where" matrix file, and use pdfgrep to scan all those nice new PDFs to find known matches and move the files off to where they should go.

movematrix.txt controls all this part.  It's a simple file format:

National Savings:Bank/NS&I
TSB Bank plc:Bank/TSB
Bank of Scotland plc:Bank/Halifax BoS
Dental Department|Dentist:Medical/Dental
npower:Utilities/npower
TV LICENSING:Utilities\TV Licensing

Basically, it's <search string>:<folder to place matches>, one entry per line. Blank lines are ignored.

You can use multiple search strings, as in the dentist example, separated by |, or indeed any other search parameter syntax allowed by pdfgrep.  I'd recommend actually using something like an account number or other unique reference that will allow you to identify correspondence more accurately.  But it does the searches in sequence, so if you get "false positives" for some search terms, move them to the end so that others get a chance to catch the documents first.

By default, the script does a case-insensitive search, and the consequent move does backup sequencing, so you won't lose anything if a file of that name already exists. I also swap about all / and \ so that you can paste in (relative) paths in a Microsoft format, as shown above, and it'll cope.  Similarly, we ignore CRs and blank lines, so you can safely edit the file using a Windows editor such as Notepad, and we won't get all messed up.
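If you want to sanity-check a matrix file before letting cron loose on it, the same parsing rules can be sketched in Python. This is a stand-alone illustration, not part of the script: the function name is invented, and it assumes exactly one colon per line, which is the only case where the shell version's two splits agree.

```python
from typing import Iterable, List, Tuple


def parse_matrix(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Parse movematrix.txt lines into (search string, destination) pairs."""
    rules = []
    for raw in lines:
        line = raw.replace("\r", "").strip()   # cope with Windows line endings
        line = line.replace("\\", "/")         # accept Microsoft-style paths
        if not line:
            continue                           # blank lines are ignored
        search, sep, dest = line.partition(":")
        if sep and search:
            rules.append((search, dest))
    return rules


if __name__ == "__main__":
    sample = [
        "National Savings:Bank/NS&I\r\n",
        "\r\n",
        "TV LICENSING:Utilities\\TV Licensing\r\n",
    ]
    for search, dest in parse_matrix(sample):
        print(f"{search!r} -> {dest!r}")
```

Printing the pairs makes it easy to spot a line where a stray colon or a missing destination would send files somewhere unexpected.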

Just run the script fairly regularly via cron, and it'll do all your filing for you!

And of course, if it misses a file, you can still drag-and-drop it manually!  Or edit the matrix file to add a search term.


Dependencies - ksh, simply for the first "don't run me twice" bit.  tesseract-ocr, pdfsandwich, pdfgrep.
