UpFront with NGS: OCR

06 October 2014

Love-Hate Relationship with OCR -- So much digital content is now available, though accessing it can still be a challenge

When you search GenealogyBank, Newspapers.com, Internet Archive (digitized books), or any other large-scale digitization of published or written materials, your ability to search typically exists because OCR was used (see OCR and Genealogy Go Hand-in-Hand, September 2013, Upfront with NGS).

I am reminded of this daily when I search newspapers for information. I have adapted my search strategy to think of all the ways (an impossible task) that OCR might fail to read the name I seek or might misinterpret it. A recent favorite is a search for Abbott yielding results for About ... “close but no cigar!” Obviously, I love that I even have the option to search an entire newspaper collection at once, though there is some hate that it’s not a perfect system. How quickly we forget that in the past I would have had to locate the newspaper of interest and then manually scroll through each page on microfilm or flip each page of a physical copy!
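The Abbott/About near-miss can be made concrete with a small sketch. It counts how many single-character edits separate two words (the Levenshtein distance) and lists the words on a page that fall within a chosen threshold. This is only an illustration of why such results turn up; it does not describe how any of the sites named above actually match names, and the word list and the threshold of 2 are assumptions made up for the example.

```python
def edit_distance(a, b):
    """Levenshtein distance: the fewest single-character insertions,
    deletions, or substitutions needed to turn `a` into `b`."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def near_misses(surname, words, max_distance=2):
    """Return the words within `max_distance` edits of `surname`, ignoring case."""
    target = surname.lower()
    return [w for w in words if edit_distance(target, w.lower()) <= max_distance]


if __name__ == "__main__":
    # Hypothetical words as they might appear in OCR'd newspaper text.
    page_words = ["About", "Abbott", "Abbot", "cigar"]
    print(near_misses("Abbott", page_words))
```

Abbott and About are only two edits apart (drop one "b", change one "t" to "u"), so any search that tolerates small differences in order to catch OCR mistakes will also pull in the common word.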

It is hard to make automatically scanned and OCR’d files fully and accurately searchable. This is the subject of Making Scanned Content Accessible Using Full-text Search and OCR, a post on The Signal, the Library of Congress digital preservation blog. It is a guest post by Chris Adams from the Repository Development Center at the Library of Congress, the technical lead for the World Digital Library.

We live in an age of cheap bits: scanning objects en masse has never been easier, storage has never been cheaper and large-scale digitization has become routine for many organizations. This poses an interesting challenge: our capacity to generate scanned images has greatly outstripped our ability to generate the metadata needed to make those items discoverable. Most people use search engines to find the information they need but our terabytes of carefully produced and diligently preserved TIFF files are effectively invisible for text-based search.

I suggest you read this article to gain more appreciation for what goes into creating, with an entirely automated system, an archive where search results are tied to an original image that we human readers can view.

With our unlimited appetite for searchable digital material, it’s important to understand what goes into giving us the ability to search that material rather than browse it page by page in the hope of finding what we seek.

~~~~~~~~~~~~~~~~~~~~~

NGS does not imply endorsement of any outside advertiser or other vendors appearing in this blog. Any opinions expressed by guest authors are their own and do not necessarily reflect the view of NGS.

~~~~~~~~~~~~~~~~~~~~~

Republication of UpFront articles is permitted and encouraged for non-commercial purposes without express permission from NGS. Please drop us a note telling us where and when you are using the article. Express written permission is required if you wish to republish UpFront articles for commercial purposes. You may send a request for express written permission to UpFront@ngsgenealogy.org. All republished articles may not be edited or reworded and must contain the copyright statement found at the bottom of each UpFront article.

~~~~~~~~~~~~~~~~~~~~~

Think your friends, colleagues, or fellow genealogy researchers would find this blog post interesting? If so, please let them know that anyone can read past UpFront with NGS posts or subscribe!

~~~~~~~~~~~~~~~~~~~~~

Suggestions for topics for future UpFront with NGS posts are always welcome. Please send any suggested topics to UpfrontNGS@mosaicrpm.com

~~~~~~~~~~~~~~~~~~~~~

Unless indicated otherwise or clearly an NGS Public Relations piece, Upfront with NGS posts are written by Diane L Richard, editor, Upfront with NGS.

~~~~~~~~~~~~~~~~~~~~~

Want to learn more about interacting with the blog? Please read Hyperlinks, Subscribing and Comments -- How to Interact with Upfront with NGS Blog posts!

~~~~~~~~~~~~~~~~~~~~~

Follow NGS via Facebook, YouTube, Google+, Twitter

16 September 2013

OCR and Genealogy Go Hand-in-Hand

Optical Character Recognition (OCR) is a big, though often hidden, part of our family history research.

Often, when we access digitized newspapers or any large digitized collection via search terms, it’s because of OCR. OCR greatly enhances our ability to access the information contained in these collections, though it is less than perfect. You might consider reading 8 Ways to Overcome OCR Errors when Searching Newspapers, http://www.theancestorhunt.com/1/post/2013/09/8-ways-to-overcome-ocr-errors-when-searching-newspapers.html, (Kenneth R Marks, The Ancestor Hunt) to improve your success with OCR-based search engines.
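One general way to work around OCR errors is to search for the spellings the software might have produced instead of the correct one. The sketch below generates such candidates by swapping one confusable letter group at a time. The confusion table is a small illustrative assumption, not taken from the article above or from any OCR product; adjust it to the misreadings you actually see in your own results.

```python
# Illustrative (hypothetical) table of letter groups that OCR software
# can confuse in worn or smudged print; extend it for your own names.
CONFUSIONS = {
    "rn": ["m"],
    "m": ["rn"],
    "e": ["c"],
    "c": ["e"],
    "l": ["1", "i"],
    "h": ["b"],
    "b": ["h"],
}


def ocr_variants(name, confusions=CONFUSIONS):
    """Return lowercase spellings of `name` with exactly one confusable group swapped."""
    lowered = name.lower()
    variants = set()
    for group, replacements in confusions.items():
        start = lowered.find(group)
        while start != -1:
            for replacement in replacements:
                variants.add(
                    lowered[:start] + replacement + lowered[start + len(group):]
                )
            start = lowered.find(group, start + 1)
    variants.discard(lowered)  # never suggest the original spelling itself
    return sorted(variants)


if __name__ == "__main__":
    for surname in ["Abbott", "Warner"]:
        print(surname, "->", ocr_variants(surname))
```

Each variant can then be tried as its own search term. Swapping only one group at a time keeps the list short enough to work through by hand.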

When our society needed to “recreate” old journals for our archive (an ongoing process), we found that for some of the oldest editions we had NO electronic version available. This meant scanning the journal pages, converting those pages to Portable Document Format (PDF), using OCR to “get at” the text, placing that text in a word-processing format, and then recreating the journal. We’ve done the same with a book for which we were given the publishing rights but, again, had no digital version. BTW, the original process was to scan the images into .tiff and then convert them using the OCR software that came with my scanner (OmniPage SE). I still think that OmniPage does a better job with OCR, but the Adobe process is so much faster that it mostly compensates for the difference.

And, you don’t have to own an Adobe product to do this. My husband swears by PDF-XChange Viewer ($37.50 – which does way more than OCR!) and the article mentioned below by James Tanner talks about other options.

Recently I was involved in a project that required photographing a whole collection of private papers, most of which were typewritten documents. As part of the project, I was asked to provide images and then a searchable PDF file. This is how I learned that my Adobe Acrobat software has an option for “OCR Text Recognition.” Amazing how one is always learning how to better use the tools they already have! So, after taking the images, I created a PDF file and then used the OCR Text Recognition option to create a “searchable” PDF document. Isn’t that really neat?!?!

For this same project, I found out that the David M. Rubenstein Rare Book & Manuscript Library (Duke University) has a new scanner with a variety of output options. It’s amazing. You can save as .jpg images, .pdf files, and also as searchable .pdf files! Now, I can scan each page of the document collection and then let the machine create a searchable .pdf file; I then walk away with my USB stick loaded with .jpg and .pdf files! Unfortunately, I cannot spend all day monopolizing the machine, but it’s a gem for when you need to photograph books and/or create searchable .pdf files! For small jobs, I won’t bother lugging in my laptop, camera, tripod, cables, etc., when I know that I might not use them. Still, it is a machine: it does break, and sometimes there is a queue to use it, so I’ll at least pack the usual accouterments in my car as a back-up.

James Tanner (Genealogy’s Star) recently posted A Look at Optical Character Recognition (OCR) for Genealogists, which talks a bit about how he uses OCR in his genealogy work. A very important point he mentions is that the quality of the OCR conversion is highly dependent on the quality of the original “image” and that its ability to handle hand-written documents is quite limited.

Earlier this year, Dick Eastman (EOGN) talked about an Android phone app that performs OCR in The Easy and Free Way to Perform OCR Conversions of Documents. Given that I have photographed over 2000 pages for my recent project, I couldn’t and wouldn’t use my phone for it, but for small documents or small collections it is a viable option.

Have you used OCR with your genealogy research?

How have you used it?

How might family historians use this technology in the future?
