OCR of blackletter fonts ("Fraktur")
“Fraktur”, a sub-type of Gothic (blackletter) type, was widely used from the 18th to the middle of the 20th century in many European countries, including Germany, Austria, Switzerland, the Netherlands, Norway, Sweden, Finland, the Czech Republic, Estonia, and Lithuania. For German-language texts, Fraktur was used in around 80% of printed documents until 1941.
Many readers today can no longer easily read texts set in Fraktur. It was therefore important to develop an omnifont OCR engine that can read Fraktur without any training. Besides the specialized classifiers, the quality of OCR results depends strongly on the dictionaries used in the background. A good dictionary, as opposed to a plain list of words, increases the recognition rate significantly.
Five historical dictionaries have therefore been developed, reflecting the historical orthography of English, French, German, Italian, and Spanish. Each dictionary contains between 50,000 and 100,000 historical word stems and is intended to cover more than 90% of the words found in historical texts.
An SDK (software development kit) that includes the Fraktur classifiers and the historical dictionaries is available as a commercial product, marketed by ABBYY.
To give an impression of the capability of the Fraktur OCR, we have prepared three sample pages from typical books of the 19th and 20th centuries. The OCR results have not been corrected; mistakes are marked in red.
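The coverage figure mentioned above can be illustrated with a minimal Python sketch. It is not the method used to build or evaluate the actual dictionaries: the stem list, the sample sentence, and the `coverage` function are hypothetical, and matching a word by checking whether it starts with a known stem is a deliberately crude assumption.

```python
import re

def coverage(text, stems):
    """Fraction of word tokens in `text` that start with a known stem.

    Prefix matching is a simplification for illustration only; a real
    dictionary would apply proper morphological rules.
    """
    # Extract runs of letters (Unicode-aware), ignoring digits and punctuation.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    covered = sum(1 for t in tokens if any(t.startswith(s) for s in stems))
    return covered / len(tokens)

# Hypothetical toy stem list with historical German spellings.
stems = {"thür", "thal", "seyn", "und", "die", "im"}
sample = "Die Thüre im Thale und das Seyn"

print(f"Coverage: {coverage(sample, stems):.0%}")
```

In this toy example six of the seven words match a stem, so the reported coverage is 86%. A modern word list without entries such as "thür" or "seyn" would score lower on the same sentence, which is the gap the historical dictionaries are meant to close.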
Contact: Jupp Stoepetie
Results
Example 1: Page image from: Die österreichisch-ungarische Monarchie in Wort und Bild. Vol. 1. Wien 1887, p. 92.
Paper and print quality: very good.
300 dpi, 1 bit, book scanner
The page contains 2564 characters, 8 of which are incorrect. This is a recognition rate of 99.69%.
[doc] [tif] [gif]
Example 2: Page image from: Ernst Häckel: Natürliche Schöpfungsgeschichte. Berlin 1879, p. 6.
Paper and print quality: very good
400 dpi, 1 bit, document scanner with ADF
The page contains 1856 characters, 4 of which are incorrect. This is a recognition rate of 99.78%.
[doc] [tif] [gif]
Example 3: Page image from: Anonymous: Hofscheu und ländliches Heimweh. Eine Biographie. Hamburg: Heroldsche Buchhandlung 1818.
Paper and print quality: low quality
300 dpi, 1 bit, book scanner
The page contains 631 characters, 13 of which are incorrect. This is a recognition rate of 97.9%.
[doc] [tif] [gif]
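The recognition rates quoted for the three examples follow from the character counts as (total − incorrect) / total. A short Python sketch to reproduce them; the function name `recognition_rate` is our own choice for illustration and is not part of the SDK.

```python
def recognition_rate(total_chars, wrong_chars):
    """Share of correctly recognized characters, in percent."""
    if total_chars <= 0:
        raise ValueError("total_chars must be positive")
    return 100.0 * (total_chars - wrong_chars) / total_chars

# (total characters, incorrect characters) as reported for each sample page.
examples = {
    "Example 1": (2564, 8),
    "Example 2": (1856, 4),
    "Example 3": (631, 13),
}

for name, (total, wrong) in examples.items():
    print(f"{name}: {recognition_rate(total, wrong):.2f}%")
```

Rounded to two decimals this gives 99.69%, 99.78%, and 97.94%. A figure that is truncated rather than rounded can differ in the last digit.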