When you search GenealogyBank, Newspapers.com, Internet Archive (digitized books), or any other large-scale digitization of published or written materials, the text is typically searchable because OCR (optical character recognition) was used; see OCR and Genealogy Go Hand-in-Hand (September 2013, Upfront with NGS).
I am reminded of this daily when I search newspapers for information.
I have adapted my search strategy to think of as many ways as I can
(thinking of all of them is impossible) that OCR might fail to read
the name I seek or might misread it. A recent favorite: a search for Abbott
yielding results for About ... “close but no cigar!”
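For the technically inclined, here is a minimal sketch of how that kind of "near miss" matching could be approximated on text you have already downloaded or transcribed. It is only an illustration, not how any of the sites above actually search: the function names, the sample word list, and the 0.7 threshold are all my own assumptions, and the only library used is Python's standard difflib module.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity score between two words, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_hits(name: str, ocr_words: list[str], threshold: float = 0.7) -> list[str]:
    """Return the OCR'd words that look enough like `name` to be worth a second look.

    The threshold is an arbitrary starting point (an assumption); lowering it
    catches more OCR misreads but also lets in more false hits.
    """
    return [word for word in ocr_words if similarity(name, word) >= threshold]


# Hypothetical words as they might come out of an OCR'd newspaper page.
sample_words = ["Abbott", "Abbotl", "Abbot", "Albert", "Smith"]

for word in fuzzy_hits("Abbott", sample_words):
    print(word, round(similarity("Abbott", word), 2))
```

The same looseness that catches a misread such as Abbotl is what lets an ordinary word like About slip into the results, which is why fuzzy searching always trades missed names against false hits.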
Obviously, I love that I even have the option to search an entire
newspaper collection at once, though it is frustrating that the system is
not perfect. How quickly we forget that in
the past we would have had to locate the newspaper of interest and then manually
scroll through each page on microfilm or flip each page of a physical copy!
It is hard to make automatically
scanned and OCR’d files fully and accurately searchable. A post on The Signal,
the Library of Congress digital preservation blog, titled Making
Scanned Content Accessible Using Full-text Search and OCR, explains why. It is a guest post by Chris Adams from the
Repository Development Center at the Library of Congress, the technical lead
for the World Digital Library. He writes:
“We live in an age of cheap bits: scanning objects en masse has never been easier, storage
has never been cheaper and large-scale digitization has become routine for many
organizations. This poses an interesting challenge: our capacity to generate
scanned images has greatly outstripped our ability to generate the metadata
needed to make those items discoverable. Most people use search engines to find
the information they need but our terabytes of carefully produced and
diligently preserved TIFF files are effectively invisible for text-based search.”
I suggest you read this article to
gain more of an appreciation for what goes into using an entirely automated
system to create an archive where search results are tied to an original
image that we human readers can view.
With our unlimited appetite for
searchable digital material, it’s important to understand what goes into
giving us the ability to search that material rather than browse it
page by page in the hope of finding what we seek.
~~~~~~~~~~~~~~~~~~~~
copyright © National
Genealogical Society, 3108 Columbia Pike, Suite 300, Arlington, Virginia
22204-4370. http://www.ngsgenealogy.org.
~~~~~~~~~~~~~~~~~~~~~
Unless indicated
otherwise or clearly an NGS Public Relations piece, Upfront with NGS posts are written by Diane L Richard, editor, Upfront with NGS.