Revolutionizing Historical OCR post-correction with AI

mikahama

2 years ago

In this post, we will take a look at two different approaches for doing OCR post-correction with limited or no gold standard data. One approach works well with English text while the other one is more tailored towards morphologically rich languages such as Finnish.

Transforming Historical Document Analysis with AI: A New Horizon for Text Digitization and OCR post-correction

The digitization of historical texts, a process essential for preserving our global heritage and facilitating research, is often marred by errors introduced during Optical Character Recognition (OCR). These errors hinder downstream analysis such as search and text mining, so efficient and accurate correction methods are needed. Researchers Mika Hämäläinen and Simon Hengchen of the University of Helsinki have developed a fully automatic, unsupervised method for OCR post-correction that aims to make historical documents far easier to work with (Hämäläinen & Hengchen, 2019).

Their method leverages Neural Machine Translation (NMT) and word embeddings to correct OCR errors without the need for manually annotated training data. By automatically extracting parallel data (erroneous forms paired with their correct counterparts), they train a character-based sequence-to-sequence model that learns how to rewrite an erroneous form into its correct one. The approach is also accessible, as it relies on readily available tools and frameworks.
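To make the idea of automatically extracted parallel data more concrete, here is a minimal sketch. It is not the authors' implementation: it assumes that nearest neighbours from a word embedding model are already available as a plain dictionary, and the function names, the toy lexicon and the `max_distance` threshold are all hypothetical choices for illustration. A neighbour is kept as an OCR error candidate only if it is not a known word and lies within a small edit distance of a known word.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character from a
                            curr[j - 1] + 1,      # insert a character into a
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def extract_pairs(neighbours, lexicon, max_distance=3):
    """Return (erroneous, correct) pairs mined from embedding neighbours.

    neighbours: hypothetical mapping from a word to its nearest neighbours
                in an embedding space (assumed to be computed elsewhere).
    lexicon:    set of word forms treated as correctly spelled.
    """
    pairs = []
    for correct, candidates in neighbours.items():
        if correct not in lexicon:
            continue  # only known words can serve as correction targets
        for cand in candidates:
            if cand in lexicon:
                continue  # a valid word is a related word, not an OCR error
            if 0 < edit_distance(cand, correct) <= max_distance:
                pairs.append((cand, correct))
    return pairs


# Toy data, invented for this example.
neighbours = {
    "past": ["paft", "future", "pasl", "yesterday"],
    "friendship": ["friendlhip", "amity", "love"],
}
lexicon = {"past", "future", "yesterday", "friendship", "amity", "love"}

pairs = extract_pairs(neighbours, lexicon)
print(pairs)
```

Pairs collected this way could then serve as training examples for a character-level model, with the erroneous form as the source and the correct form as the target.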

Figure: The method learns to correct erroneous words

What’s more, this technology is not confined to the academic paper it was introduced in; it has been integrated into the Natas Python library, making it accessible for researchers, historians, and technologists worldwide. This integration into Natas ensures that the method can be easily applied to various historical texts, opening new doors for research and analysis in humanities and social sciences.

The implications of this research are profound. By improving the accuracy of OCR text, we can unlock the full potential of historical documents, making them more accessible for study and analysis. This could lead to new discoveries and insights into our past, further enriching our understanding of human history.

For anyone interested in the intersection of technology and history, the work of Hämäläinen and Hengchen represents a significant step forward. Their method not only addresses a longstanding challenge in digital humanities but also provides a scalable, efficient solution that could be applied to multiple languages and periods. The Natas Python library’s inclusion of their model ensures that this innovative approach will benefit a wide range of projects, making the past more accessible than ever before.

Revolutionizing OCR Correction with AI: A Leap Forward in Processing Finnish Historical Texts

In the digital humanities, preserving the integrity of historical documents through digitization is pivotal. However, the process often encounters a significant hurdle: errors introduced by Optical Character Recognition (OCR) technologies. The study “An Unsupervised method for OCR Post-Correction and Spelling Normalisation for Finnish” by Quan Duong, Mika Hämäläinen, and Simon Hengchen (2021) offers an unsupervised solution tailored to Finnish, a language known for its rich morphology.

This novel method relies on Neural Machine Translation (NMT) to automatically correct OCR errors and standardize spelling, without the need for manually annotated data. By creating parallel data from texts with OCR errors and their corrected versions, the researchers trained a model that significantly enhances text accuracy. This approach is not just a leap forward for Finnish texts but also offers a blueprint for other languages with limited NLP resources.
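As a rough illustration of what parallel data for a character-level NMT model can look like, the sketch below turns word pairs into space-separated character sequences, one source line and one target line per pair. This is an assumption about a common data layout for character-based sequence-to-sequence training, not a description of the exact format used in the paper. The helper names, the `_` placeholder for spaces and the two Finnish example pairs are made up for this example.

```python
def to_char_tokens(text: str, space_symbol: str = "_") -> str:
    """Split text into characters separated by spaces.

    Real spaces are replaced by space_symbol so they survive tokenization.
    """
    return " ".join(space_symbol if ch == " " else ch for ch in text)


def from_char_tokens(tokens: str, space_symbol: str = "_") -> str:
    """Invert to_char_tokens: join characters and restore real spaces."""
    return "".join(" " if t == space_symbol else t for t in tokens.split(" "))


def build_parallel_lines(pairs):
    """Turn (noisy, clean) pairs into aligned source and target lines."""
    source = [to_char_tokens(noisy) for noisy, _ in pairs]
    target = [to_char_tokens(clean) for _, clean in pairs]
    return source, target


# Invented pairs: a noisy or historical spelling and its modern form.
pairs = [("ystäwä", "ystävä"), ("caupungisa", "kaupungissa")]

source_lines, target_lines = build_parallel_lines(pairs)
for src, tgt in zip(source_lines, target_lines):
    print(f"{src}\t->\t{tgt}")
```

Because the model sees individual characters rather than whole words, it can generalize to inflected forms it has never seen as full words, which matters for a morphologically rich language.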

What makes this advancement particularly accessible is its availability in an easy-to-use format on GitHub. This gives researchers and historians direct access to current OCR correction techniques for refining their digitized texts. The work of Duong, Hämäläinen and Hengchen marks a significant step towards the broader application of NLP in preserving and exploring our historical heritage, and a useful reference point for future work in the digital humanities.

References

Mika Hämäläinen and Simon Hengchen. 2019. From the Paft to the Fiiture: a Fully Automatic NMT and Word Embeddings Method for OCR Post-Correction. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), pages 431–436, Varna, Bulgaria. INCOMA Ltd.

Quan Duong, Mika Hämäläinen, and Simon Hengchen. 2021. An Unsupervised method for OCR Post-Correction and Spelling Normalisation for Finnish. In Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa), pages 240–248, Reykjavik, Iceland (Online). Linköping University Electronic Press, Sweden.