Comments on Marian's Roots and Rambles: Not All PDFs are Alike

Over the winter, I created a PDF from the scanned ...

2012-03-27T11:25:23.047-04:00

Over the winter, I created a PDF from the scanned images of a 1940-era typewritten document of a cemetery survey. The document that I scanned was not the original - it was probably a copy of a copy of a copy.

After scanning, I spent many hours cleaning up the images via Adobe's Photoshop Elements (removing lots of stray spots and removing the black edges on the pages).

From there, I assembled the images into a PDF, and added bookmarks. The OCR produced a somewhat searchable PDF (but not fully searchable), as typewriters inherently produce imperfect type - remember backspacing and re-typing, creating blobs of letters? Or when some letters weren't aligned horizontally with the rest of the line - they floated above the line? OCR can only interpret what can be clearly seen.

I then printed the cleaned up images and ran them through a different OCR program, which produced better results, but it still wasn't fully searchable.

To make it fully searchable, I bit the bullet and created a Word document based on the OCR results. This required many hours of proofing every character against the original, as well as fighting with formatting issues in Word.

During proofing, I discovered two dozen discrepancies in the original. For example, some names had a different middle initial in the index, so which spelling was correct - the name in the document or the one in the index? And where did the error get introduced - when the info was written down or when it was typed up?

And what was the goal? Was it to accurately reproduce the original document and leave it at that, or go beyond that and fix the mistakes that were made in 1940 (and hopefully not introduce errors of my own)?

In hindsight, it would've been quicker to hire an accurate typist (thereby skipping OCR) and then proof that version.

Better yet, it would've been even quicker to have simply left it as a somewhat searchable PDF.

Having done the Epsom NH Cemeteries, and also usin...

2012-03-25T22:36:01.615-04:00

Having done the Epsom NH Cemeteries, and also using the typewritten 1940s version as a guide (no OCR, just an old typewritten manuscript), I found the same type of errors in reading the stones and missing stones. Since there was no OCR of the document, the errors were in the original readings.

I am blind, so I have quite a bit of experience wi...

2012-03-16T08:22:14.711-04:00

I am blind, so I have quite a bit of experience with OCR. It is the only way I have access to printed materials. It is definitely much more accurate than it was even a couple of years ago, but it is still not 100% accurate. This is especially true when it comes to numbers. It is essential that someone proof the OCRed document. That said, I truly appreciate all those who have OCRed the images, and especially those who have proofed them. Otherwise, I have to get someone with eyes that work for me to get anything from a PDF, as screen readers are unable to read images.

Marian -- Cemetery transcriptions are, in an of th...

2012-03-15T21:40:45.083-04:00

Marian -- Cemetery transcriptions are, in and of themselves, a troubled breed. Your original transcription likely contained errors, and the recreation to an online form likely contained even more errors, even if OCR was never used in the process.

Many online transcriptions were created by someone studying bad photographs. Others were created in the cemetery but were scribbled by hand, and errors were made again when the pages were typed. Add to these difficulties Microsoft Excel's tendency to convert "years only" entries from the 1800s into some arbitrary 1904 date (including a month and day!), based on some whim known only to Microsoft, and we are all in big trouble.
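One plausible explanation for the 1904 dates described above (an assumption, not something the comment confirms): if a bare year such as 1850 lands in a cell formatted as a date, Excel reads it as a day count in its default 1900 date system rather than as a year. A minimal Python sketch of that arithmetic follows; `excel_serial_to_date` is a hypothetical helper name, not part of any library.

```python
from datetime import date, timedelta


def excel_serial_to_date(serial: int) -> date:
    """Convert an Excel 1900-date-system serial number to a calendar date.

    Assumption: serial >= 61. Excel counts a nonexistent 1900-02-29,
    so the usual workaround is to count from 1899-12-30.
    """
    return date(1899, 12, 30) + timedelta(days=serial)


# A "year only" entry treated as a day count:
print(excel_serial_to_date(1800))  # 1904-12-04
print(excel_serial_to_date(1850))  # 1905-01-23
```

Under this reading, years 1800 through 1827 would display as December 1904 dates and later years as dates in early 1905. Formatting the column as text before entering the years avoids the conversion.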

We recorded one cemetery directly into the computer while at the cemetery. We made a digital voice recording as we read the stones aloud and listened to the recording as we proofed the file. We then compared our new transcription to one done in 1982 and returned to the cemetery to see the original stone if we found discrepancies. Now, as I am matching the tombstone photos to the transcription, I'm still finding some errors and some stones totally missed! (The photographer missed some too...but I'm hoping we didn't miss the same stone!)

Our current process is to record the transcription at the cemetery and then to type the transcription at home. A second person compares the new transcription to an earlier transcribed version and double checks any differences against the tombstone photos to confirm which transcription is correct. It is ugly and it's time consuming but we owe it to the Genealogy Community to give them the best information we can. We are all only human and we will make mistakes in spite of our best efforts. Posting a photo of every stone will go a long way toward ensuring our researchers an accurate record.
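The second-person comparison described above could be partly automated before anyone goes back to the photos. The following is only a sketch under stated assumptions: each transcription is held as a dictionary keyed by a stone identifier, and the identifiers, names, and the function `compare_transcriptions` are all invented for illustration.

```python
def compare_transcriptions(new, old):
    """Compare two transcriptions keyed by a (hypothetical) stone ID.

    Returns three items:
      - stones present in both whose text differs, as {id: (new, old)}
      - stone IDs found only in the new transcription
      - stone IDs found only in the old transcription
    """
    differences = {
        key: (new[key], old[key])
        for key in sorted(new.keys() & old.keys())
        if new[key] != old[key]
    }
    only_in_new = sorted(new.keys() - old.keys())
    only_in_old = sorted(old.keys() - new.keys())
    return differences, only_in_new, only_in_old


# Invented sample data, for illustration only.
new_reading = {
    "A-01": "John Q. Smith, d. 1851",
    "A-02": "Mary Smith, d. 1860",
    "A-04": "Infant Jones, d. 1872",
}
old_reading = {
    "A-01": "John O. Smith, d. 1851",
    "A-02": "Mary Smith, d. 1860",
    "A-03": "Eli Brown, d. 1849",
}

differences, only_in_new, only_in_old = compare_transcriptions(new_reading, old_reading)
print(differences)   # stones to check against the photos
print(only_in_new)   # stones the earlier transcription missed
print(only_in_old)   # stones the new transcription missed
```

A script like this only flags where the two versions disagree. Deciding which reading is correct still takes a person looking at the stone or its photo.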

I applaud your efforts to photograph your historic cemetery!

Marian, your post today resonated with me for thre...

2012-03-15T20:43:14.328-04:00

Marian, your post today resonated with me for three reasons:

-I've been the recipient of data from such transcribed cemetery listings and always had that can't-put-your-finger-on-it queasy feeling it might not be reliable

-I've been to some old, old cemeteries where the transcription says that yep, this is the stone, while my eyes tell me something entirely different about that nearly illegible marker

-Both I and my daughter have recently been wanting to upload cemetery pictures to online resources such as Find A Grave (she and a friend are considering doing some traveling to old country graveyards to capture info on out-of-the-way records). Reading what you wrote makes me wonder if it wouldn't hurt, even in those places already having existent documentation, to have a double-check and see what might turn up needing correction.

Yes, genealogists are particularly picky. They want to get it right. And document it while they're at it. So why not go back and check out those discrepancies? And while we're at it, let everyone else know the correct entries. Whether the source of the discrepancy was OCR incompetencies or "operator error," I am not too concerned with--just as long as the end result of the recorded information is correct.

I heard it said long ago when workplaces still had "typists" that everyone needed to have a second set of eyes review their work. Typists or genealogists, we all could use a second proof-reader from time to time.

Personally, I've found that documents processe...

2012-03-15T13:52:05.966-04:00

Personally, I've found that documents processed through OCR software and NOT reviewed may be useful for building indexes, but that's about all. In my experience, the output of any OCR process must be reviewed for errors. Depending on the "training" of the software and the fonts in the documents, you may be lucky enough not to have any errors, but do you want to take that risk?

I wonder if it is the 'non-Adobe' pdf soft...

2012-03-15T13:34:51.506-04:00

I wonder if it is the 'non-Adobe' pdf software that may be the problem? I would think you would get the best pdf from Adobe Acrobat.
Linda

When I transcribed old Town Meeting records for Me...

2012-03-15T11:05:24.470-04:00

When I transcribed old Town Meeting records for Methuen, I did it all in MS Word. When the person in charge put them online, she decided to convert them to PDF format. I couldn't believe it when I saw my work online. Nothing was aligned anymore, etc. That has never been my experience when converting to PDF format, so my take on something like this is that the person converting any original files needs to 1. really know what they are doing and 2. verify that nothing has been changed in the conversion.

You are not alone...I am also working on an invent...

2012-03-15T10:19:06.862-04:00

You are not alone...I am also working on an inventory of a local historic cemetery in conjunction with a genealogical sketch of the town from that period.

On the flip side, I am in possession of a very detailed (and, so far, accurate) transcription project done pre-computers, typed on index cards and cataloged on onion skin! My dilemma is how to digitize the collection without the OCR errors. One consideration is to retype the whole kaboodle on a spreadsheet, risking my own transcription errors in the process. The other, which I have an inkling is the better choice, is to copy and scan the collection along with digital photos of the stones. (And I will upload to BillionGraves.com as I go.)

Please let us know how you progress. :)

Fortunately, there is now technology to OCR an ima...

2012-03-15T10:18:45.785-04:00

Fortunately, there is now technology to OCR an image of printed or typed text, so you can have it both ways, sort of. If the PDF is an image, you can make it searchable using a service like Evernote. At least then, you can check the accuracy of the original PDF.