16 September 2013

OCR and Genealogy Go Hand-in-Hand


Optical Character Recognition (OCR) is a big, though often hidden, part of our family history research.

Often, when we access digitized newspapers or any large digitized collection via search terms, it’s because of OCR.  OCR greatly enhances our ability to access the information contained in these collections, though it is less than perfect.  You might consider reading 8 Ways to Overcome OCR Errors when Searching Newspapers, http://www.theancestorhunt.com/1/post/2013/09/8-ways-to-overcome-ocr-errors-when-searching-newspapers.html, (Kenneth R Marks, The Ancestor Hunt) to increase your success when using a search engine based on OCR.
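To make the “less than perfect” part concrete, here is a minimal sketch (in Python, using only the standard library) of how one might look for a surname in OCR text even when the OCR has garbled a letter or two.  This is my own illustration and is not taken from the article above; the function name find_near_matches, the 0.75 threshold and the sample sentence are all made up for the example.

```python
import re
from difflib import SequenceMatcher


def find_near_matches(name, text, threshold=0.75):
    """Return (word, score) pairs for words in text that resemble name.

    The threshold is an assumed starting point, not a recommended value;
    lower it to catch more OCR variants, raise it to cut false hits.
    """
    target = name.lower()
    matches = []
    # Split the text into simple word tokens (letters, digits, apostrophes).
    for word in re.findall(r"[A-Za-z0-9']+", text):
        # ratio() is 1.0 for identical strings and falls toward 0.0
        # as the two strings share fewer characters in sequence.
        score = SequenceMatcher(None, target, word.lower()).ratio()
        if score >= threshold:
            matches.append((word, round(score, 2)))
    return matches


# Hypothetical OCR output in which "m" was misread as "rn" in one place.
sample = "Mr. J. Harnpton and Mrs. Hampton visited the Harrington farm on Tuesday."
print(find_near_matches("Hampton", sample))
```

With this sample, the search picks up both “Hampton” and the misread “Harnpton” while skipping “Harrington.”  An exact-match search would have found only the first.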

When our society needed to “recreate” old journals for our archive (an on-going process), we found that for some of the oldest editions we had NO electronic version available.  This necessitated scanning the journal pages, converting those pages to Portable Document Format (PDF), using OCR to “get at” the text, placing that text in a word-processing format and then recreating the journal.  We’ve done the same with a book that we were given the publishing rights to and yet, again, had no digital version of.  BTW, the original process was to scan the images to .tiff and then run OCR on them using software that came with my scanner (OmniPage SE).  I still think that OmniPage does a better job with OCR, but the Adobe process is so much faster that it mostly compensates for the difference.

And, you don’t have to own an Adobe product to do this.  My husband swears by PDF-XChange Viewer ($37.50 – which does way more than OCR!) and the article mentioned below by James Tanner talks about other options.

Recently I was involved with a project involving photographing a whole collection of private papers – most of which were typewritten documents.  As part of the project, I was requested to provide images and then a searchable PDF file.  This is how I learned that my Adobe Acrobat software has an option for “OCR Text Recognition.”  Amazing how one is always learning how to better use the tools they already have!  So, after taking the images, I then created a PDF file and then used the OCR Text Recognition option to create a “searchable” PDF document.  Isn’t that really neat?!?!

For this same project, I found out that the David M. Rubenstein Rare Book & Manuscript Library (Duke University) has a new scanner with a variety of output options.  It’s amazing. You can save as .jpg images, .pdf files and also as searchable .pdf files!  Now, I can scan each page of the document collection and then let the machine create a searchable .pdf file; I then walk away with my USB stick loaded with .jpg and .pdf files!  Unfortunately, I cannot spend all day monopolizing the machine, but it’s a gem for when you need to photograph books and/or create searchable .pdf files!  For small jobs, I won’t bother lugging my laptop, camera, tripod, cables, etc., when I know that I might not use them.  Still, it is a machine: it does break, and sometimes there is a queue to use it, so I’ll at least pack the usual accouterments in my car for back-up.

James Tanner (Genealogy’s Star) recently posted A Look at Optical Character Recognition (OCR) for Genealogists, which talks a bit about his use of OCR with his genealogy.  A very important point he mentions is that the quality of the OCR conversion is highly dependent on the quality of the original “image” and that OCR’s ability to handle hand-written documents is quite limited.

Earlier this year, Dick Eastman (EOGN) talked about an Android phone app that performs OCR in The Easy and Free Way to Perform OCR Conversions of Documents.  Given that for my recent project I have photographed over 2000 pages, I couldn’t and wouldn’t use my phone, but for small documents or small collections, it is a viable option.

Have you used OCR with your genealogy research?

How have you used it?

How might family historians use this technology in the future?






~~~~~~~~~~~~~~~~~~~~
copyright © National Genealogical Society, 3108 Columbia Pike, Suite 300, Arlington, Virginia 22204-4370. http://www.ngsgenealogy.org.
~~~~~~~~~~~~~~~~~~~~~
Want to learn more about interacting with the blog?  Please read Hyperlinks, Subscribing and Comments -- How to Interact with Upfront with NGS Blog posts!
~~~~~~~~~~~~~~~~~~~~~
NGS does not imply endorsement of any outside advertiser or other vendors appearing in this blog.
~~~~~~~~~~~~~~~~~~~~~ 
Republication of UpFront articles is permitted and encouraged for non-commercial purposes without express permission from NGS. Please drop us a note telling us where and when you are using the article. Express written permission is required if you wish to republish UpFront articles for commercial purposes. You may send a request for express written permission to UpFront@ngsgenealogy.org. All republished articles may not be edited or reworded and must contain the copyright statement found at the bottom of each UpFront article.
~~~~~~~~~~~~~~~~~~~~~

Suggestions for topics for future UpFront with NGS posts are always welcome. Please send any suggested topics to UpfrontNGS@mosaicrpm.com