06 October 2014
Love Hate Relationship with OCR -- So much digital content now available though accessing can still be a challenge
When you search GenealogyBank, Newspapers.com, Internet Archive (digitized books), or any other large-scale digitization effort of published or written materials, your ability to search typically exists because OCR (optical character recognition) was used (see OCR and Genealogy Go Hand-in-Hand, September 2013, Upfront with NGS).
I am reminded of this daily when I search newspapers for information. I have adapted my search strategy to try to think of all the ways (an impossible task) that OCR might fail to read the name I seek or might misinterpret it. A favorite lately has been Abbott yielding search results for About ... “close but no cigar!” Obviously, I love that I even have the option to do wholesale searches of a newspaper collection, though there is some hate that it’s not a perfect system. How quickly we forget that in the past we would have had to locate the newspaper of interest and then manually scroll through each page on microfilm or flip through each page of a physical copy!
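For readers who keep their own transcriptions or downloaded OCR text, the idea of casting a wider net for a misread name can be sketched in a few lines of Python. This is only an illustration, not how any of the sites named above actually search: the function names, the sample word list, and the 0.7 threshold are all my own assumptions, and the similarity score comes from Python's standard `difflib` module.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity score between two words, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def near_matches(name: str, tokens: list[str], threshold: float = 0.7):
    """Return (token, score) pairs at or above the threshold, best first.

    The threshold is an assumed starting point: lower it to catch more
    OCR misreadings, raise it to cut down on false hits.
    """
    scored = [(token, similarity(name, token)) for token in tokens]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical words pulled from an OCR'd newspaper page.
    words = ["About", "Abbott", "Ahbott", "Smith"]
    for token, score in near_matches("Abbott", words):
        print(f"{token}: {score:.2f}")
```

A loose threshold like this one surfaces a plausible misreading such as "Ahbott", but it also lets "About" through, which is exactly the trade-off described above: the wider the net, the more near misses you have to sort through by eye.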
It is hard to make automatically scanned and OCR’d files fully and accurately searchable. A post on The Signal, the Library of Congress digital preservation blog, titled Making Scanned Content Accessible Using Full-text Search and OCR, discusses this challenge. It is a guest post by Chris Adams from the Repository Development Center at the Library of Congress, the technical lead for the World Digital Library. He writes:
We live in an age of cheap bits: scanning objects en masse has never been easier, storage has never been cheaper and large-scale digitization has become routine for many organizations. This poses an interesting challenge: our capacity to generate scanned images has greatly outstripped our ability to generate the metadata needed to make those items discoverable. Most people use search engines to find the information they need but our terabytes of carefully produced and diligently preserved TIFF files are effectively invisible for text-based search.
I suggest you read this article to gain more of an appreciation for what goes into creating, with an entirely automated system, an archive where search results are correlated to an original image that we human readers can view.
With our unlimited appetite for searchable digital material, it’s important to understand what goes into giving us the ability to search that material rather than browsing it page by page in the hope of finding what we seek.
copyright © National Genealogical Society, 3108 Columbia Pike, Suite 300, Arlington, Virginia 22204-4370. http://www.ngsgenealogy.org.
NGS does not imply endorsement of any outside advertiser or other vendors appearing in this blog. Any opinions expressed by guest authors are their own and do not necessarily reflect the view of NGS.
Republication of UpFront articles is permitted and encouraged for non-commercial purposes without express permission from NGS. Please drop us a note telling us where and when you are using the article. Express written permission is required if you wish to republish UpFront articles for commercial purposes. You may send a request for express written permission to UpFront@ngsgenealogy.org. All republished articles may not be edited or reworded and must contain the copyright statement found at the bottom of each UpFront article.
Suggestions for topics for future UpFront with NGS posts are always welcome. Please send any suggested topics to UpfrontNGS@mosaicrpm.com
Unless indicated otherwise or clearly an NGS Public Relations piece, Upfront with NGS posts are written by Diane L Richard, editor, Upfront with NGS.