Text Normalization Tricks Every Digital Humanist Should Know

Delving into historical texts with modern Natural Language Processing (NLP) techniques presents a formidable challenge, primarily due to the pervasive issue of non-standard spelling variation. Historically, the absence of standardized spelling rules meant that words were often spelled in multiple, unpredictable ways, reflecting the phonetic interpretations of authors across different regions and periods. In this post, we will take a look at two text normalization approaches to this problem, a task that shares similarities with OCR post-correction.

Text Normalization and Historical English

Imagine, if you will, a group of intrepid linguistic adventurers from the University of Helsinki, embarking on a quest through the dense forests of historical texts, armed with nothing but their wits and an array of computational tools. Their paper, Normalizing Early English Letters to Present-day English Spelling (Hämäläinen et al., 2018), is nothing short of a map to linguistic El Dorado, revealing the secrets of English correspondence stretching from the 15th to the 19th century.

With the elegance of a ballet dancer and the precision of a surgeon, these scholars slice through the Gordian knot of non-standard spellings with a multifaceted approach that includes machine translation (both neural and statistical), edit distance, and a rule-based finite-state transducer (FST). Each method brings its own strengths to the fore, promising a combined might that could unlock the very essence of historical sociolinguistics.
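The edit-distance component, for instance, can be pictured as matching each historical spelling against a modern lexicon and keeping the closest candidates. The following is only a minimal sketch of that idea; the toy lexicon and the distance threshold are my own illustrative assumptions, not the paper's actual setup:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance between two strings.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

# Toy stand-in for a present-day English word list.
MODERN_LEXICON = ["secret", "with", "which", "should", "know"]

def normalize(word: str, max_dist: int = 2) -> list[str]:
    # Return modern candidates within max_dist edits, nearest first.
    scored = sorted((levenshtein(word, w), w) for w in MODERN_LEXICON)
    return [w for d, w in scored if d <= max_dist]

print(normalize("seacret"))  # -> ['secret']
print(normalize("wiþ"))      # -> ['with']
```

In the paper, this kind of candidate generation is only one signal; combining it with the statistical and neural translation models is what gives the approach its strength.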

Text normalization makes it possible for modern NLP methods to parse historical texts

The results weave together the threads of different methodologies, creating a tapestry of normalized spellings that is as rich in detail as it is in historical significance. The paper doesn’t just talk the talk; it walks the walk, demonstrating a significant leap towards the preservation and understanding of our linguistic heritage.

In a world where the written word is so often taken for granted, this paper serves as a vivid reminder of the beauty and complexity of language evolution. It is a call to arms for linguists, historians, and technologists alike to further explore the depths of historical texts and uncover the stories they hold.

The normalization method is available in the Python library Natas.

Text Normalization and Lemmatization for Historical Finnish

In an electrifying leap for computational linguistics, researchers from the University of Helsinki have unveiled a groundbreaking approach to processing Old Literary Finnish texts, a domain hitherto encased in the formidable armor of historical and orthographic complexity. The work of Mika Hämäläinen, Niko Partanen, and Khalid Alnajjar, Lemmatization of Historical Old Literary Finnish Texts in Modern Orthography (2021), was presented at the prestigious TALN conference.

Imagine the literary heritage of Finland, encapsulated in texts from as early as the 16th century, now accessible and understandable with modern linguistic tools! This is not just research; it’s a time machine enabling a dialogue with history. Their neural network model achieves a staggering 96.3% accuracy on texts by Agricola, the Finnish literary pioneer, and a commendable 87.7% on out-of-domain contemporary texts.

Morphological complexity makes even Modern Finnish challenging to parse, let alone any historical variants
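To see why, consider how many surface forms a single Finnish noun can take once cases, possessive suffixes, and clitics are stacked together. The toy lookup table below, with a few hand-picked forms of talo ("house"), is purely illustrative; the paper's actual system is a character-level neural model, not a dictionary:

```python
# A handful of the many inflected surface forms of "talo" ("house").
# Real Finnish nouns can take thousands of forms once the ~15 cases
# are combined with possessive suffixes and clitics.
TALO_FORMS = {
    "talo": "talo",            # nominative singular
    "talon": "talo",           # genitive singular
    "talossa": "talo",         # inessive: "in the house"
    "talosta": "talo",         # elative: "out of the house"
    "taloihin": "talo",        # illative plural: "into the houses"
    "taloissammekin": "talo",  # "in our houses, too"
}

def lemmatize(token: str):
    # A dictionary lookup standing in for the paper's neural lemmatizer;
    # returns None for forms outside the toy table.
    return TALO_FORMS.get(token.lower())

print(lemmatize("taloissammekin"))  # -> talo
```

A lookup table like this obviously cannot scale to a whole language, let alone to historical spelling variants of each form, which is exactly why a character-level neural model is the more practical choice.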

But the excitement doesn’t stop at the results. The team’s commitment to open science — making their methods freely available on GitHub — is a testament to their dedication to advancing the field. They haven’t just built a bridge to the past; they’ve invited us all to cross it with them.

As a Digital Humanities scholar, it’s hard not to get swept up in the potential this opens up. The application of such advanced text normalization techniques on historical texts not only enriches our understanding of linguistic evolution but also democratizes access to cultural heritage. This work is a beacon for future research, promising a richer, more inclusive exploration of linguistic history.

References

Mika Hämäläinen, Tanja Säily, Jack Rueter, Jörg Tiedemann, and Eetu Mäkelä. 2018. Normalizing Early English Letters to Present-day English Spelling. In Proceedings of the Second Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, pages 87–96, Santa Fe, New Mexico. Association for Computational Linguistics.

Mika Hämäläinen, Niko Partanen, and Khalid Alnajjar. 2021. Lemmatization of Historical Old Literary Finnish Texts in Modern Orthography. In Actes de la 28e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale, pages 189–198, Lille, France. ATALA.