Thursday, March 15, 2012
Not All PDFs are Alike
My local library's website has a transcription of the old section of the cemetery. The transcription was created in 1900 by a local town historian, and I used it as the basis for my cataloging. Last week I started cataloging in earnest, making sure I photographed all the stones I had missed. As I photographed, I took the time to compare the information on each stone to its entry in the transcription.
Imagine my surprise to find errors in the transcription! There were cases where the transcription had the day, the month, the year or all three wrong. A few gravestones weren't cataloged at all. And some entries in the transcription mixed up people with the same name.
I got to thinking about this dilemma. As a genealogist and historical researcher I want to be able to pinpoint the problem. The transcription on the website was typed text, not an image of the original transcription. So I had to ask myself two questions: 1) Were the errors in the original transcription, or 2) were they introduced when it was converted to an html file?
I can imagine two scenarios. Either someone manually typed up the list and posted it online, or some kind of OCR software was used and its output was copied and pasted into the html file. Either way, errors could have been introduced.
Last night I was at a meeting where there were some folks from the local historical society. I asked them if they knew the background behind the online file. They didn't but they did have a pdf of the original 1900 transcription that they could share with me. Then the conversation got a bit sticky.
I have to stop here and say that genealogists have a reputation, for better or for worse, of being very precise. Honestly, it drives most non-genealogists crazy! The folks at the historical society are very patient with me. They do their best to answer my questions without saying, "Are you crazy? What does this really matter anyway?"
So the big question I popped for them was whether their pdf file consisted of images of the original transcription or an OCR conversion. OCR, for those of you unfamiliar with it, is optical character recognition. The special software not only captures the page but also turns the content into searchable text. This is a huge advantage over simple images of a publication, which are not searchable. The problem with OCR scanning is that the conversion is not 100% perfect, and errors are introduced into the text during the process.
They couldn't remember the answer to that off the top of their heads, but the gentleman who did the scanning said that what is important is when the scanning was done. More recent OCR software is much better than old OCR software. To me, that logically makes sense. But the genealogist inside of me is still wary and wants to know the exact likelihood of errors being introduced. I will never really have my answer without comparing the pdf file to the original publication.
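For the more technically inclined, here is a rough sketch of how that kind of comparison could be automated once both versions have been typed or extracted into lists of lines. It uses Python's standard `difflib` module. The function name `compare_transcriptions` and the sample entries are made up purely for illustration; they are not taken from the actual cemetery transcription.

```python
import difflib


def compare_transcriptions(original_lines, copy_lines):
    """Compare two transcriptions line by line.

    Returns (similarity, differences): similarity is a ratio between 0 and 1,
    and differences lists every block of lines that does not match.
    """
    matcher = difflib.SequenceMatcher(None, original_lines, copy_lines, autojunk=False)
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # tag is "replace", "delete" or "insert"
            differences.append((tag, original_lines[i1:i2], copy_lines[j1:j2]))
    return matcher.ratio(), differences


# Invented sample entries, for illustration only.
original = [
    "John Smith, d. 12 Mar 1823",
    "Mary Smith, d. 4 Jun 1840",
    "Asa Brown, d. 30 Sep 1799",
]
copy = [
    "John Smith, d. 12 Mar 1828",  # year differs from the original
    "Mary Smith, d. 4 Jun 1840",
    # the Asa Brown entry is missing entirely
]

similarity, differences = compare_transcriptions(original, copy)
print(f"Similarity: {similarity:.2f}")
for tag, in_original, in_copy in differences:
    print(tag, in_original, in_copy)
```

A script like this only flags where two versions disagree. It cannot say which version is right, so every flagged entry still needs to be checked against the printed original or the stone itself.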
Do you ever think about that when reviewing a document? (I hope so!) How many generations removed is it from the source? In this case the online transcription is two generations away from the source. The first generation is the 1900 published transcription, and the original source is the gravestones themselves.
When we are thinking about pdf files we have several things to consider as well.
1) Is the pdf an image of the original?
This can be a benefit for genealogists because there is no question that we are seeing the document as it was published. The disadvantage is that it is not searchable with a computer. Let's hope it's not a long document.
2) Was the OCR scan done recently or quite a while ago?
Without knowing the person who did the scanning this question could be impossible to answer. Should we trust old OCR scanned pdfs less than more recent ones? Yes, OCR scans to pdf do have the incredible advantage of being searchable but we must remember that the document is now one generation further removed from the original.
This is just some food for thought to get your morning started. I am no technical expert nor do I know the ins and outs of pdfs. If you are more technically advanced perhaps you could share some feedback in the comments.
In the meantime, I'm going to have to compare the pdf to the printed original or just work strictly from the original publication.
Posted by Marian Pierre-Louis at 9:16 AM