5 Secrets for Searching Old Digitized Newspapers

5/11/2020

Everyone who has searched newspapers online has, at some point, failed to find something. It happens incredibly often. The stakes are high for genealogy researchers, for whom an article about an ancestor can make a huge difference in filling out a family tree, and for historians, who may uncover an extraordinary story that was previously unknown.

I have often heard researchers say "I can't find a single article about my subject, even though I have searched for hours!"

Laying competency aside as a factor, the biggest reason is that scanning and OCR (optical character recognition) of one- and two-hundred-year-old newspapers, whether from paper or from microfilm, produces far less than optimal results.

Here are 5 secrets to success:

1. Understand that you are trying to search against an imperfect database

Every collection of digitized newspapers has two parts. The first is the collection of all the scanned images. The second is the OCR text that contains all the "words" in the newspapers. In fact, the OCR text does NOT contain words or names - it really is just a string of letters and spaces. The key thing to know AND remember is that you are searching against a string of letters, not NECESSARILY a word or a person's name.

The OCR process is intended to convert the dots on the page to letters, numbers and symbols (such as punctuation). If the original newspaper is dirty, brittle, creased, etc., the OCR process, which is applied to the scanned image, may not convert the dots to the correct letters. There are a variety of strange characters and combinations of strange characters that may be picked up.

The bottom line is that the OCR text - a representation of the dots from the original pre-scanned newspaper page - is only as good as the quality of the original, the quality of the scanner, and the quality of the OCR software. That's a lot of "quality" that has to exist.

You must expect inferior results and set your expectations accordingly.

2. Dates and Location Matter

A digitized newspaper collection is made up of a number (sometimes a large number) of newspaper pages. Combined, the pages form a newspaper title's edition for a specific day (or part of a day, if there are morning and evening editions). Many days of editions, often from multiple titles, become a collection.

Let's say that the digitized collection of the "Herald News" has publication dates of November 12, 1887 through January 31, 1894, which represents all the newspapers in the collection.

No matter how hard you try you cannot make the collection give you results for your great grandfather's obituary if it was published in the Herald News on March 14, 1896.

A very common mistake that online newspaper researchers make is to ignore the dates of newspapers that are in the collection and the publication locations of that collection. Always check the collection dates and publication title and location prior to starting a search project.
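As a rough illustration of that check, here is a minimal Python sketch. The `COVERAGE` table and the `in_coverage` function are hypothetical names made up for this example; the dates are the "Herald News" ones from above. A real collection will list its own titles and date ranges, usually on a browse or "about" page.

```python
from datetime import date

# Hypothetical coverage table: title -> (first issue, last issue) in the collection.
# The dates are the "Herald News" example from the text above.
COVERAGE = {"Herald News": (date(1887, 11, 12), date(1894, 1, 31))}


def in_coverage(title, published, coverage=COVERAGE):
    """Return True if `published` falls inside the collection's range for `title`."""
    if title not in coverage:
        return False  # the title is not in this collection at all
    first, last = coverage[title]
    return first <= published <= last


# The 1896 obituary cannot be found here, no matter how clever the search string.
print(in_coverage("Herald News", date(1896, 3, 14)))
# An 1890 article is at least possible.
print(in_coverage("Herald News", date(1890, 6, 1)))
```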

3. Remember you are searching for a string of letters, NOT a Name or a Word

For example, if you are looking for a person named "Smith" (good luck with THAT common name, but I digress), try using different letter strings, such as "Smitb", or "sm1th", or "8mith" or combinations of those different letters/numbers.

If there is an "h" in your search term, try substituting a "b", since b's and h's look quite similar and the OCR text may contain the "b" rather than the desired "h" (e.g., an ink spot in just the right location may make the "h" look like a "b"). As a real world example, searching the California Digital Newspaper Collection for the surname "Braunhart" yields 1,507 results. Replacing the "h" with a "b", hence searching for "Braunbart", yields 96 results - for the SAME person. That is approximately another 6%!

For a similar reason, "c" and "e" are often confused.

Likewise, lower case m's and n's are often confused. The m's are often converted to several combinations of letters. Also r's and n's can be confused.

For a complete list, go to "The Best Way to Find More Pertinent Articles in Historical Newspaper Research".

To repeat - you may "wish" that when you enter a perfectly spelled name or word, that you will get positive results when searching against the OCR text - but actually it is the string of letters that appear in the text representation of the page AFTER the OCR process that you are searching against. So, outsmart the system by changing your search string of letters.
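If you want to build a list of alternate letter strings systematically, a minimal Python sketch follows. The `CONFUSIONS` table holds only the swaps mentioned in this post (it is an assumption, not a complete list), `ocr_variants` is a hypothetical helper name, and lowercasing the term assumes the site's search ignores case.

```python
from itertools import combinations, product

# Assumed look-alike table, limited to the swaps mentioned in this post.
CONFUSIONS = {
    "h": ["b"], "b": ["h"],
    "c": ["e"], "e": ["c"],
    "m": ["n"], "n": ["m", "r"],
    "r": ["n"],
    "i": ["1"], "s": ["8"],
}


def ocr_variants(term, max_swaps=1, confusions=CONFUSIONS):
    """Return the term plus variants with up to `max_swaps` letters swapped."""
    term = term.lower()  # assumption: the search engine is case-insensitive
    # Positions whose letter has at least one known look-alike.
    positions = [i for i, ch in enumerate(term) if ch in confusions]
    variants = {term}
    for k in range(1, max_swaps + 1):
        for chosen in combinations(positions, k):
            options = [confusions[term[i]] for i in chosen]
            for replacement in product(*options):
                chars = list(term)
                for i, new in zip(chosen, replacement):
                    chars[i] = new
                variants.add("".join(chars))
    return sorted(variants)


for candidate in ocr_variants("Braunhart"):
    print(candidate)
```

Keeping `max_swaps` at 1 or 2 is deliberate: the number of variants grows quickly, and most will return nothing. Search each candidate separately and keep the ones that produce hits.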

4. Remember to Take Hyphenated Words into Consideration

Hyphenated words were often used because of fixed width type as well as the experience and capability of the typesetter. Hyphens are less utilized today but were a staple years ago. Take that into consideration if you are searching for a surname or other search criteria with many letters in one word. Try splitting the search into two words where the hyphen may have been normally used.

In older newspapers, I have seen as much as 20% of a column made up of hyphenated words. So try splitting up someone's name. For example, MacDonald could have "Mac-" on one line, followed by "Donald" on the second line. Or "MacDon-" on one line and "ald" on the second line.

You will get many more positive results if you take hyphenation into consideration when formulating your search string.

Here is more about searching for hyphenated words - Use Hyphenated Search
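To list the possible break points for a long name, a minimal Python sketch follows. The function names are hypothetical, and the sketch simply tries every split that leaves at least two letters on each side; real typesetters usually broke words at syllable boundaries. Whether the hyphen survives in the OCR text varies by site, so both forms are generated.

```python
def hyphen_splits(name, min_part=2):
    """Return (head, tail) pairs for each place the name could break across lines."""
    return [
        (name[:i], name[i:])
        for i in range(min_part, len(name) - min_part + 1)
    ]


def hyphen_queries(name, min_part=2):
    """Return search strings for each split, with and without the hyphen."""
    queries = []
    for head, tail in hyphen_splits(name, min_part):
        queries.append(f"{head}- {tail}")  # hyphen kept in the OCR text
        queries.append(f"{head} {tail}")   # hyphen lost in the OCR text
    return queries


for query in hyphen_queries("MacDonald"):
    print(query)
```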

5. Use Crowdsourcing to Improve the OCR Textual Representation of the Page

More and more digitized collections are using crowdsourcing to improve the OCR text. In effect you are helping the next person by correcting OCR errors and changing the letter string to words or names that are spelled correctly.

More crowdsourcing to correct OCR errors and improve the text would help all of us search more successfully. An example is the reCAPTCHA processing that has been used by Google Books.

Another crowdsourcing example that I personally have used is that of correction on the actual online newspaper site, such as the aforementioned California Digital Newspaper Collection. In this example, registered users can provide edited text that is then incorporated into future searches. Kind of like a newspaper-related "pay it forward." This capability is provided on that site and many others from the fine folks at Elephind.com, who created the software used by the California collection as well as many other online newspaper sites. I have recently noticed more software that is incorporating this feature.

So if you help improve the text index and others do as well, everyone benefits!

So, in summary - you can win with online newspaper searching - you just need to be smart about the process.

Good luck - be persistent and have reasonable expectations.
