Saturday, May 15, 2010

The Process

I'm somewhat unique in what I do: while sites like Hyperwar and the Historic Naval Ships Association post documents from archives, I spend more time on formatting with HTML. The downside is slower "production," but I believe it makes some of the documents easier to read and makes it easier to "get" where one page transitions to another. All in all, I think we have a good ecosystem of postings with good variety and flavor. In fact, I've made arrangements that if I get hit by a bus or something similar, the documents I've posted will wind up on Hyperwar in some form, so there should be good coverage over the years.

I thought I'd write a bit about how I post a document. I could simply post JPGs or convert to PDF, but the disadvantage is that the text isn't indexable, and for people with slower connections it can take an awfully long time to load 10 pages of large JPGs.

So I still hand-write my pages with the thought of trimming as much as possible out of the code to keep it fast-loading while at the same time preserving the formatting of the original. I've been aided in the last five years by the progression of OCR software, and the picture below shows a little bit of my workflow:



The left screen has the OCR program and the right screen has Homesite, my web page software of choice for a scarily long time now. I started with Notepad back in 1995, and at some point soon after switched to a freeware program called DerekWare, which had buttons for dropping in pieces of code and was a little more friendly for web design. It had one limitation I discovered after a bit: it couldn't handle pages larger than 21k. That was fine for a while, but today some of my documents would blow it out of the water (the USS San Francisco Guadalcanal Damage Report is just about 100k in text and code alone, and the 1941 US Navy Fleet listing is over 125k). I purchased Homesite before the millennium and upgraded to version 5 maybe a year or two after... it's worked fine in Vista and Windows 7, so I'll probably be using it for another five years at least.

The current OCR program is ABBYY FineReader, which came bundled with my Microtek scanner. It's a definite improvement over the OmniPage software I started with, but it's also newer, although it's at least two years old itself now. It does fairly well as long as the scans weren't of onionskin copies that were a couple of generations away from the original... in those cases it's faster to re-type things from scratch, and for the most part I tend to avoid those projects now. There are a couple of quirks; it consistently reads "ltr." as "Itr," and sometimes "planes" as "pianos," which can lead to some fun mental images when you read about pianos strafing or dropping depth charges. Because some of these are subtle differences in characters, I always do two passes of proofreading: one in Homesite's code view for the obvious stuff, and then another in the actual web browser to find things that are subtle, such as Os when I need 0s. Below is an example of what I consider an OK OCR pass:

As was suggested in reference (c}^ th® Bureau has mad© an evaluation of the gasoline capacity and steaming radius of the several carriers in service. These data are furnished below and are bfised on information available to the Bureau o« fuel oil and gasoline consumption for the second and third quarters of the fiscal year 1942,


Each paragraph takes maybe a minute or two to proofread and format. Document headers and things with more formatting-per-text take longer, of course.
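
As an illustration only (this is not part of the workflow described above), a short Python script could flag likely misreads before the two manual passes. The word list, the symbol set, and the function name flag_suspects are all assumptions made up for this sketch, based on the quirks mentioned earlier; it only points at suspicious words and does not correct anything.

```python
import re

# Hypothetical list of known misreads, taken from the quirks mentioned above.
KNOWN_MISREADS = {
    "Itr": "ltr",
    "pianos": "planes",
}

# Symbols that would be unusual in typed wartime correspondence (an assumption).
ODD_CHARS = re.compile(r"[®©«»^{}]")

# A letter O sitting directly next to a digit, e.g. "194O".
LETTER_O_IN_NUMBER = re.compile(r"\d[Oo]|[Oo]\d")


def flag_suspects(text):
    """Return (token, reason) pairs that deserve a second look."""
    suspects = []
    for token in text.split():
        # Ignore surrounding punctuation when checking the misread list.
        bare = token.strip('.,;:()"')
        if bare in KNOWN_MISREADS:
            suspects.append((token, "known misread, maybe " + KNOWN_MISREADS[bare]))
        elif ODD_CHARS.search(token):
            suspects.append((token, "unexpected symbol"))
        elif LETTER_O_IN_NUMBER.search(token):
            suspects.append((token, "letter O next to a digit"))
    return suspects


if __name__ == "__main__":
    sample = "th® Bureau has mad© an evaluation of pianos in 194O"
    for token, reason in flag_suspects(sample):
        print(token, "-", reason)
```

A check like this would miss plain-letter misreads such as "bfised" in the sample above, so it could only supplement the proofreading passes, not replace them.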

I test the documents in IE8, Firefox, SeaMonkey (a Firefox derivative that's my browser and e-mail client of choice) and recently Chrome. There are some differences in how they handle spacing, so I don't sweat a space off here and there between the browsers, but what you see probably reproduces 90% of the formatting of the original.
