Old Newspaper Research
  • Home
  • Blog
  • Newspaper Links
  • Tips
  • About
  • Contact
  • Privacy Policy
Newspaper Research Articles

7 Ways to Overcome OCR Errors when Searching Newspapers

7/1/2021

Everyone who searches newspapers online will, at some point, fail to find something. It happens very often.

I have often heard researchers say “I can’t find a single article about a person or an event, even though I have searched for hours!”

Setting the researcher’s skill aside as a factor, the biggest reason is that scanning one- and two-hundred-year-old newspapers, whether from paper or from microfilm, produces far less than optimal results.

More importantly, searching an index created by humans who read the source material and typed the entries is far superior to searching text that software produced by scanning and processing a dusty old newspaper. Yet the massive size of newspaper collections makes creating such an index by hand impractical. You must accept inferior results and set your expectations accordingly.

Please take a look at the following list of errors and anomalies; hopefully some of them will give you hints for overcoming them and actually finding what you are looking for. There are many others, but these are the ones that I have personally experienced:

  1. Words were often hyphenated at line breaks because of fixed-width type, and the result depended on the experience and skill of the typesetter. Hyphens are used less today but were a staple years ago. Keep this in mind when searching for a surname or any other long single word: try splitting the search term into two words at the point where a hyphen would normally have been used.
  2. If your search term contains an “h”, try substituting a “b”, since b’s and h’s look quite similar and can “confuse” the OCR process. As an example, searching the California Digital Newspaper Collection for one of my surnames, “Braunhart”, yields 1,507 results. Replacing the “h” with a “b” and searching for “Braunbart” yields 96 results for the SAME person. That is approximately another 6% (96 ÷ 1,507)!
  3. “c” and “e” are confused for the same reason as “h” and “b”, although I have not had as many difficulties with this pair.
  4. Likewise, lowercase m’s and n’s are often confused, and m’s are often converted to one of several combinations of letters. r’s and n’s can also be confused.
  5. The letter “i”, in lower as well as upper case, is often converted to a slash, an exclamation point, or the numeral 1, and vice versa.
  6. If the original newspaper is “dirty”, by which I mean there is excessive ink or the scan is dark, spaces are often scanned but not rendered as spaces. A variety of strange characters may also be picked up.
  7. If the newspaper was scanned and then processed directly with OCR, that is one pass. If the newspaper was first copied to microfilm, and the microfilm was then scanned and OCR’d, that is two passes. A two-pass operation can therefore produce lower-quality results. There isn’t much that you can do about it, but it is good to know.
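
The letter swaps in tips 2 through 5 can be generated automatically rather than typed by hand. The following is a minimal sketch, not a tool from any newspaper site: the `CONFUSIONS` table and the `ocr_variants` function are hypothetical names, the table contains only the pairs mentioned above, and the specific letter combinations listed for “m” (“rn” and “nn”) are assumptions chosen for illustration.

```python
from itertools import combinations, product

# Confusable characters, based only on the pairs described in the list above.
# The "rn" and "nn" replacements for "m" are assumed examples of the
# "several combinations of letters" an OCR engine might produce.
CONFUSIONS = {
    "h": ["b"],
    "b": ["h"],
    "c": ["e"],
    "e": ["c"],
    "m": ["n", "rn", "nn"],
    "n": ["m", "r"],
    "r": ["n"],
    "i": ["/", "!", "1"],
    "I": ["/", "!", "1"],
}


def ocr_variants(term, max_swaps=1):
    """Return spellings of `term` with up to `max_swaps` confusable characters replaced."""
    positions = [i for i, ch in enumerate(term) if ch in CONFUSIONS]
    results = set()
    for count in range(1, max_swaps + 1):
        for combo in combinations(positions, count):
            # Every replacement option for each chosen position.
            options = [CONFUSIONS[term[i]] for i in combo]
            for picks in product(*options):
                chars = list(term)
                for i, replacement in zip(combo, picks):
                    chars[i] = replacement
                results.add("".join(chars))
    results.discard(term)  # never report the original spelling as a variant
    return sorted(results)


if __name__ == "__main__":
    for variant in ocr_variants("Braunhart"):
        print(variant)
```

Each variant is then searched as a separate term. Keeping `max_swaps` at 1 or 2 keeps the list short enough to try by hand; some search engines may ignore punctuation such as “/” or “!”, so those variants will not work everywhere.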

So don’t be discouraged by a “lack of results” from online newspaper searches. You just need to “outsmart” OCR and try various combinations to get to those elusive ancestors. Be persistent.
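
One of those combinations is the two-word search suggested in tip 1 for words that may have been hyphenated at a line break. As a rough sketch, the hypothetical `hyphen_splits` function below lists every way to split a word into two parts. It does not know real hyphenation rules, and the `min_part` parameter (the shortest part allowed on either side) is an assumption for illustration.

```python
def hyphen_splits(word, min_part=2):
    """Return two-word search strings, one for each place the word could be split."""
    searches = []
    # Leave at least `min_part` characters on each side of the split.
    for i in range(min_part, len(word) - min_part + 1):
        searches.append(f"{word[:i]} {word[i:]}")
    return searches


if __name__ == "__main__":
    for search in hyphen_splits("Braunhart"):
        print(search)
```

Most of the splits will not match anything, so start with the ones that fall where a typesetter would most plausibly have broken the word, such as “Braun hart”.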

One crowdsourcing approach that I have personally used is correcting the text on the online newspaper site itself, such as the aforementioned California Digital Newspaper Collection. There, registered users can submit corrected text that is then incorporated into future searches, a kind of newspaper-related “pay it forward.” This capability is provided on that site and many others by the fine folks at DL Consulting, who created the software used by the California collection and many other online newspaper sites.

For many more details about scanning, OCR, and related subjects, please read an old but very informative article, Analysing and Improving OCR Accuracy in Large Scale Historic Newspaper Digitisation Programs, from the March/April 2009 issue of D-Lib Magazine.

Good luck – be persistent and have reasonable expectations.