Scanning with the Digital Anarchists

🖊️ 🔖 code books 💬 0

Scanning with the Digital Archivists

Noisebridge. I’ve only been there twice now and it’s already become one of my favorite places to hang out in San Francisco. Noisebridge is a unique place where the anarchistic ideals of my youth and my current love of technology exist in tandem. Not only is there amazing hacking going down but I’ve also found myself once again doing things like trash talking Crimethinc and comparing dumpster diving stories. Ah, it feels good (and smells bad!).

Depending on the kind of “hacker” you are you will either love or hate this place. Are you (A) the kind of hacker that loves technology and learning as an end in of itself? Or (B) the kind of hacker that would do questionable things in the back room of a VC’s office to secure funding for your snapchat for cats app? In this case B stands for don’t Bother.

One of the cooler projects to emerge from the chaos is the work being done on a DIY book scanner. The Digital Archivists meet every Thursday in the space and hack away at it. I got to join them this week and had a stab at getting some OCR (Optical Character Recognition) software working so that we can break some copyright law convert images of pages into actual text.

Tesseract is some amazing software that takes images as input and spits out text files. In fact the software is so simple (at least by default) and effective that converting an actual .tiff of a page to a text file is as simple as:

$tesseract page0001.tiff page0001.txt

Considering Tesseract is doing all the hard work, all I had to do was write a simple shell script to wrap it and convert entire directories of images to text.

As dorky as it may seem I find it something very satisfying about seeing a physical book go from ink on paper to bytes in memory. Pretty dorky actually. Goodnight.