Thursday, March 15, 2012

Not All PDFs are Alike

I've been working on photographing my local historic cemetery for the past seven years. I did it fairly casually for the first five years. I'd wander through the cemetery and randomly take photos. Eventually I realized this was not the most effective way to ensure that I got a photo of each stone.

My local library's website has a transcription of the old section of the cemetery that was created in 1900 by a local town historian. I used this as the basis for my cataloging. Last week I started cataloging in earnest, making sure I photographed all the stones I had missed. As I photographed I took the time to compare the information on the stone to the epitaph in the transcription.

Imagine my surprise to find errors in the transcription! There were cases where the transcription had the day, the month, the year, or all three wrong. A few gravestones weren't cataloged at all. And some entries in the transcription mixed up people with the same name.

I got to thinking about this dilemma. As a genealogist and historical researcher I want to be able to pinpoint the problem. The transcription on the website was typed and not an image of the original transcription. So I had to ask myself two questions: 1) Were the errors in the original transcription, or 2) were the errors introduced when it was converted to an html file?

I can imagine two scenarios. Either someone manually typed up the list and posted it online, or some kind of OCR software was used and the output was copied and pasted into the html file. Either way, errors could be introduced.

Last night I was at a meeting where there were some folks from the local historical society. I asked them if they knew the background behind the online file.  They didn't but they did have a pdf of the original 1900 transcription that they could share with me. Then the conversation got a bit sticky.

I have to stop here and say that genealogists have a reputation, for better or for worse, of being very precise. Honestly, it drives most non-genealogists crazy!  The folks at the historical society are very patient with me. They do their best to answer my questions without saying, "Are you crazy? What does this really matter anyway?"

So the big question I popped for them was whether their pdf file was images of the original transcription or an OCR translation. OCR, for those of you unfamiliar with it, is optical character recognition. The special software not only captures the information but turns the content into searchable text. This is a huge advantage over simple images of a publication which are not searchable. The problem with OCR scanning is that the conversion is not 100% perfect and errors are introduced into the text during the process.

They couldn't remember the answer to that off the top of their heads, but the gentleman who did the scanning said that what is important is when the scanning was done. More recent OCR software is much better than old OCR software. To me, that logically makes sense. But the genealogist inside of me is still wary and wants to know the exact likelihood of errors being introduced. I will never really have my answer without comparing the pdf file to the original publication.
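For the more technically inclined, here is one rough way to put a number on "likelihood of errors": proofread a sample page by hand, then count how many single-character fixes it takes to turn the OCR text into the proofed text. This is only a sketch. The function names and the sample epitaph are made up for illustration, and a rate measured on one page is an estimate for that page, not a guarantee for the whole document.

```python
def edit_distance(a: str, b: str) -> int:
    """Count the single-character insertions, deletions, and
    substitutions needed to turn string a into string b
    (the classic Levenshtein distance)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character
                            curr[j - 1] + 1,      # insert a character
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def character_error_rate(proofed: str, ocr: str) -> float:
    """Fraction of characters in the hand-proofed text that the
    OCR text gets wrong. Both arguments are hypothetical samples."""
    if not proofed:
        raise ValueError("the proofed sample must not be empty")
    return edit_distance(proofed, ocr) / len(proofed)


# Made-up example: the OCR misread a single digit in the year.
proofed_sample = "d. Mar. 15, 1812"
ocr_sample = "d. Mar. 15, 1872"
rate = character_error_rate(proofed_sample, ocr_sample)
print(f"{rate:.1%} of characters differ")
```

Notice that one wrong character out of sixteen looks like a small error rate, yet it moves a death date by sixty years. That is why even a "good" rate doesn't let us skip checking the dates.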

Do you ever think about that when reviewing a document? (I hope so!) How many generations removed is it from the source? In this case the online transcription is two generations away from the source. The first generation is the 1900 published transcription and the original source is the gravestones themselves.

When we are thinking about pdf files we have several things to consider as well.

1) Is the pdf an image of the original?

This can be a benefit for genealogists because there is no question that we are seeing the document as it was published. The disadvantage is that it is not searchable with a computer. Let's hope it's not a long document.

2) Was the OCR scan done recently or quite a while ago?

Without knowing the person who did the scanning this question could be impossible to answer. Should we trust old OCR scanned pdfs less than more recent ones? Yes, OCR scans to pdf do have the incredible advantage of being searchable but we must remember that the document is now one generation further removed from the original.

This is just some food for thought to get your morning started. I am no technical expert nor do I know the ins and outs of pdfs. If you are more technically advanced perhaps you could share some feedback in the comments.

In the meantime, I'm going to have to compare the pdf to the printed original or just work strictly from the original publication.
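If both versions can be typed or pasted into plain text with one entry per line, a computer can at least point out where they disagree, so that only those entries need checking by eye. Below is a minimal sketch using Python's built-in difflib module. The function name and the sample entries are invented for illustration, and it assumes the two lists keep the entries in roughly the same order.

```python
import difflib


def find_discrepancies(original_lines, online_lines):
    """Return the places where two transcriptions disagree.

    Each result is a tuple of (kind, lines_from_original, lines_from_online),
    where kind is 'replace', 'delete', or 'insert'.
    """
    matcher = difflib.SequenceMatcher(
        a=original_lines, b=online_lines, autojunk=False
    )
    report = []
    for kind, i1, i2, j1, j2 in matcher.get_opcodes():
        if kind != "equal":
            report.append((kind, original_lines[i1:i2], online_lines[j1:j2]))
    return report


# Invented entries: one year was changed and one entry was dropped online.
original = [
    "Smith, John, d. 3 May 1812",
    "Smith, Mary, d. 9 Jan 1840",
    "Jones, Amos, d. 1 Feb 1799",
]
online = [
    "Smith, John, d. 3 May 1842",
    "Jones, Amos, d. 1 Feb 1799",
]

for kind, from_original, from_online in find_discrepancies(original, online):
    print(kind)
    print("  original:", from_original)
    print("  online:  ", from_online)
```

A report like this only says that the two versions differ. It cannot say which one is right. For that, the photo of the stone is still the final word.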


11 comments:

  1. Fortunately, there is now technology to OCR an image of printed or typed text, so you can have it both ways, sort of. If the PDF is an image, you can make it searchable using a service like Evernote. At least then, you can check the accuracy of the original PDF.

  2. You are not alone...I am also working on an inventory of a local historic cemetery in conjunction with a genealogical sketch of the town from that period.

    On the flip side, I am in possession of a very detailed (and, so far, accurate) transcription project done pre-computers, typed on index cards and cataloged on onion skin! My dilemma is how to digitize the collection without the OCR errors. One consideration is to retype the whole kaboodle on a spreadsheet, risking my own transcription errors in the process. The other, which I have an inkling is the better choice, is to copy and scan the collection along with digital photos of the stones. (And I will upload to BillionGraves.com as I go.)

    Please let us know how you progress. :)

  3. When I transcribed old Town Meeting records for Methuen, I did it all in MSWord - when the person in charge put them online, she decided to convert to pdf format. I couldn't believe it when I saw my work online. Nothing was aligned anymore etc., etc. That has never been my experience when converting to pdf format so my take on something like this is that the person converting any original files needs to 1. really know what they are doing and 2. verify that nothing has been changed in the conversion.

  4. Editing PDFs is easy if you have the right software. I use Smart PDF Converter because it is the most accurate one, and it's very easy to use. http://www.pdftodocconverterpro.com

  5. I wonder if it is the 'non-Adobe' pdf software that may be the problem? I would think you would get the best pdf from Adobe Acrobat.
    Linda

  6. Personally, I've found that documents processed through OCR software and NOT reviewed may be useful for building indexes but that's about all. In my experience the resulting document of any OCR output must be reviewed for errors. Depending on the "training" of the software and the fonts in the documents you may be lucky to not have any errors but do you want to take that risk?

  7. Marian, your post today resonated with me for three reasons:

    -I've been the recipient of data from such transcribed cemetery listings and always had that can't-put-your-finger-on-it queasy feeling it might not be reliable

    -I've been to some old, old cemeteries where the transcription says that yep, this is the stone, while my eyes tell me something entirely different about that nearly illegible marker

    -Both I and my daughter have recently been wanting to upload cemetery pictures to online resources such as Find A Grave (she and a friend are considering doing some traveling to old country graveyards to capture info on out-of-the-way records). Reading what you wrote makes me wonder if it wouldn't hurt, even in those places already having existent documentation, to have a double-check and see what might turn up needing correction.

    Yes, genealogists are particularly picky. They want to get it right. And document it while they're at it. So why not go back and check out those discrepancies? And while we're at it, let everyone else know the correct entries. Whether the source of the discrepancy was OCR incompetencies or "operator error," I am not too concerned with--just as long as the end result of the recorded information is correct.

    I heard it said long ago when workplaces still had "typists" that everyone needed to have a second set of eyes review their work. Typists or genealogists, we all could use a second proof-reader from time to time.

  8. Marian -- Cemetery transcriptions are, in and of themselves, a troubled breed. Your original transcription likely contained errors and the recreation to an online form likely contained even more errors. Even if OCR was never used in the process.

    Many online transcriptions were created by someone studying bad photographs. Others were created in the cemetery but were scribbled by hand and errors made again when the pages were typed. Add into these difficulties, Microsoft Excel's tendency to convert "years only" entries from the 1800's into some arbitrary 1904 date (including a month and day!)... and based on some whim known only to Microsoft...and we are all in big trouble.

    We recorded one cemetery directly into the computer while at the cemetery. We made a digital voice recording as we read the stones aloud and listened to the recording as we proofed the file. We then compared our new transcription to one done in 1982 and returned to the cemetery to see the original stone if we found discrepancies. Now, as I am matching the tombstone photos to the transcription, I'm still finding some errors and some stones totally missed! (The photographer missed some too...but I'm hoping we didn't miss the same stone!)

    Our current process is to record the transcription at the cemetery and then to type the transcription at home. A second person compares the new transcription to an earlier transcribed version and double checks any differences against the tombstone photos to confirm which transcription is correct. It is ugly and it's time consuming but we owe it to the Genealogy Community to give them the best information we can. We are all only human and we will make mistakes in spite of our best efforts. Posting a photo of every stone will go a long way toward ensuring our researchers an accurate record.

    I applaud your efforts to photograph your historic cemetery!

  9. I am blind, so I have quite a bit of experience with OCR. It is the only way I have access to printed materials. It is definitely much more accurate than it was even a couple of years ago, but it is definitely not 100% accurate. This is especially true when it comes to numbers. It is essential that someone proof the OCRed document. That said, I truly appreciate all those who have OCRed the images, and especially those who have proofed them. Otherwise, I have to get someone with eyes that work for me to get anything from a PDF, as screen readers are unable to read images.

  10. Having done the Epsom NH Cemeteries, and also using the typewritten 1940's version as a guide (no OCR, just an old typewritten manuscript), I found the same type of errors in reading the stones and missing stones. Since there was no OCR of the document, the errors were in the original readings.

  11. Over the winter, I created a PDF from the scanned images of a 1940-era typewritten document of a cemetery survey. The document that I scanned was not the original - it was probably a copy of a copy of a copy.

    After scanning, I spent many hours cleaning up the images via Adobe's Photoshop Elements (removing lots of stray spots and removing the black edges on the pages).

    From there, I assembled the images into a PDF, and added bookmarks. The OCR produced a somewhat searchable PDF (but not fully searchable), as typewriters inherently produce imperfect type - remember backspacing and re-typing, creating blobs of letters? Or when some letters weren't aligned horizontally with the rest of the line - they floated above the line? OCR can only interpret what can be clearly seen.

    I then printed the cleaned up images and ran them through a different OCR program, which produced better results, but it still wasn't fully searchable.

    To make it fully searchable, I bit the bullet and created a Word document based on the OCR results. This required many hours of proofing every character against the original, as well as fighting with formatting issues in Word.

    During proofing, I discovered two dozen discrepancies in the original. For example, some names had a different middle initial in the index, so which spelling was correct - the name in the document or the one in the index? And where did the error get introduced - when the info was written down or when it was typed up?

    And what was the goal? Was it to accurately reproduce the original document and leave it at that, or go beyond that and fix the mistakes that were made in 1940 (and hopefully not introduce errors of my own)?

    In hindsight, it would've been quicker to hire an accurate typist (thereby skipping OCR) and then proof that version.

    Better yet, it would've been even quicker to have simply left it as a somewhat searchable PDF.
