Uncovering the Secrets of Historical Finnish with NLP and DH

Linguistic Change in Old Literary Finnish

In the study “Linguistic Change and Historical Periodization of Old Literary Finnish,” Partanen et al. (2021a) employed NLP methods to explore how Old Literary Finnish evolved over the centuries. They developed a lemmatization model based on the texts of Mikael Agricola, a prominent figure in early Finnish literature, and used it to normalize and lemmatize an extensive corpus of historical Finnish. The model’s performance, measured through word error rate (WER) and an analysis of error types, served as a proxy for tracing linguistic innovation and change across decades. The errors made by the model turned out to be closely linked to linguistic changes and innovations, which provides insight into the evolving lexicon and orthography of Old Literary Finnish.
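To make the idea of WER as a proxy more concrete, here is a minimal sketch of how an error rate could be aggregated per decade. It assumes the common definition of WER (token-level edit distance divided by the number of reference tokens); the paper's exact computation may differ. The function names and the toy samples are hypothetical and are not taken from the study.

```python
from collections import defaultdict


def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer_by_decade(samples):
    """Aggregate WER per decade.

    samples: iterable of (year, reference_tokens, predicted_tokens).
    """
    errors = defaultdict(int)
    totals = defaultdict(int)
    for year, ref, hyp in samples:
        decade = year // 10 * 10
        errors[decade] += edit_distance(ref, hyp)
        totals[decade] += len(ref)
    return {d: errors[d] / totals[d] for d in sorted(totals) if totals[d]}


# Made-up toy data (not from the paper): gold lemmas vs. model output.
samples = [
    (1548, ["minä", "olla", "hyvä"], ["minä", "olla", "hyvä"]),
    (1642, ["hän", "tulla", "koti"], ["hän", "tulla", "kotiin"]),
    (1645, ["me", "nähdä"], ["me", "nähdä"]),
]

if __name__ == "__main__":
    print(wer_by_decade(samples))
```

Under this reading, a decade whose error rate rises relative to earlier decades is a candidate for closer inspection, since the model was built on Agricola's texts and should struggle more as the language drifts away from them.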

The study’s main findings highlight gradual linguistic differentiation and the introduction of new vocabulary and writing conventions over time. The authors noted a shift in the use of pronouns, the introduction of modern punctuation, and the expansion of the language’s domain to include new scientific and technical terms. These changes reflect not only the language’s natural evolution but also broader historical and cultural shifts within Finnish society.

Both Finland and Finnish have changed over time

This research contributes to the digital humanities by showing how NLP techniques can shed light on historical linguistics and cultural evolution. By publishing their word embeddings and data, the researchers have provided valuable resources for further studies of Old Literary Finnish, paving the way for more nuanced analyses of linguistic change over time.

A Deep Dive in Historical Uralic Languages à la M.A. Castrén

A second study (Partanen et al., 2021b) focuses on the rich linguistic and ethnographic materials of the 19th-century Finnish ethnographer and linguist Matthias Alexander Castrén. This research, led by a team from the University of Helsinki, applies state-of-the-art Natural Language Processing (NLP) methods to Castrén’s multilingual manuscripts, offering new insights into Northern Eurasian languages.

At the heart of this project lies the innovative use of text recognition technologies, particularly the application of the Transkribus platform. This advanced tool has been pivotal in digitizing handwritten and typed manuscripts, facilitating the extraction and analysis of textual data from Castrén’s extensive collection. By converting these historical documents into machine-readable formats, the researchers have laid a solid foundation for further computational analyses, setting new benchmarks for text recognition accuracy in diverse linguistic materials.

Uralic languages are named after the Ural mountains

This meticulous processing has not only preserved a valuable slice of linguistic heritage but also opened new avenues for scholarly exploration. The study underscores Castrén’s contribution to the documentation of nearly thirty languages, highlighting the linguistic diversity of Northern Eurasia in the 19th century. By digitizing Castrén’s work, the project offers a richer, more accessible context for understanding the historical and cultural dynamics of the region. The successful application of NLP techniques to historical documents illustrates the potential of digital tools to transform humanities research, enabling a deeper, more nuanced analysis of linguistic and ethnographic data.

This pioneering work not only enriches our understanding of Uralic and other Northern Eurasian languages but also serves as a model for future digital humanities projects. It demonstrates the immense potential of combining traditional humanities research with cutting-edge technology to unlock historical linguistic treasures.


Niko Partanen, Khalid Alnajjar, Mika Hämäläinen, and Jack Rueter. 2021a. Linguistic change and historical periodization of Old Literary Finnish. In Proceedings of the 2nd International Workshop on Computational Approaches to Historical Language Change 2021, pages 21–27, Online. Association for Computational Linguistics.

Niko Partanen, Jack Rueter, Khalid Alnajjar, and Mika Hämäläinen. 2021b. Processing M.A. Castrén’s Materials: Multilingual Historical Typed and Handwritten Manuscripts. In Proceedings of the Workshop on Natural Language Processing for Digital Humanities, pages 47–54, NIT Silchar, India. NLP Association of India (NLPAI).