NLP for Endangered Languages in the Era of Neural Networks

Neural networks have fundamentally changed the way we do NLP research. Nobody in their right mind would start writing rules or reach for statistical methods to solve NLP tasks anymore… unless, of course, they were doing NLP for endangered languages. Such languages have very few resources, which leads to the illusion that rule-based methods are the only way to do NLP for them. I will show that neural networks can be used in the context of endangered languages as well. One just has to be smart about it. 😉

Sami Cognates using Statistical and Neural Machine Translation

In a 2019 study (Hämäläinen & Rueter, 2019), researchers from the University of Helsinki used character-based neural machine translation (NMT) to find cognates between Skolt Sami and North Sami, two languages with scarce parallel data. The interesting part is how they got around the data scarcity. They trained their model on cognates linking North Sami to a broader set of Uralic languages and supplemented this with synthetic data generated through statistical machine translation (SMT). This combination enlarged the training data and improved the model’s ability to detect cognates.
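To make the character-based idea concrete, here is a minimal sketch of how word pairs might be prepared for a sequence-to-sequence toolkit, so that each character becomes one token. The function names and the toy strings are hypothetical placeholders (they are not real Sami words), and the study's actual preprocessing may well differ.

```python
import unicodedata


def to_char_sequence(word: str) -> str:
    """Split a single word into space-separated characters.

    NFC normalization keeps precomposed letters such as 'š' as one token.
    """
    return " ".join(unicodedata.normalize("NFC", word))


def from_char_sequence(sequence: str) -> str:
    """Join predicted character tokens back into a word."""
    return "".join(sequence.split(" "))


def build_training_pairs(cognate_pairs):
    """Turn (source_word, target_word) pairs into character-level pairs."""
    return [(to_char_sequence(s), to_char_sequence(t)) for s, t in cognate_pairs]


# Hypothetical placeholder strings standing in for cognate pairs.
toy_pairs = [("abab", "apap"), ("tiki", "diki")]
for source, target in build_training_pairs(toy_pairs):
    print(source, "->", target)
```

Treating cognate detection as translation at the character level means the model only has to learn regular sound and spelling correspondences, which is plausible to learn from a few thousand word pairs rather than millions of sentences.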

Sami languages, spoken in the northern parts of the Nordic countries, are endangered.

Their findings matter not just for the study of Sami languages but for computational work on endangered languages in general. The discovered cognates have been made publicly available in the Online Dictionary of Uralic Languages, where they can serve further linguistic studies and NLP applications. The study shows that neural models are usable even for endangered languages, both for language preservation and for discovering relations between languages.

Extending Rule-Based Morphology with LSTMs

A more recent study (Hämäläinen et al., 2021) tackles neural morphology for a wide spectrum of languages, ranging from widely spoken to critically endangered. The research, conducted by a team from the University of Helsinki, presents a method for generating large datasets for training neural models for morphological analysis, generation, and lemmatization.

The method addresses the perennial problem of data scarcity in NLP for endangered languages. Using existing rule-based finite-state transducers (FSTs), the researchers automatically extracted a large corpus of morphological data for 22 languages, 17 of which are considered endangered. This brings neural morphology to less-studied languages and shows that neural models can be trained across a diverse set of them.
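The extraction idea can be sketched with a toy stand-in for an FST. Everything below is an assumption made for illustration: the English lexicon, the two suffix rules, and the function names are hypothetical, and a real transducer encodes far richer morphology. The point is only that one enumeration of lemma and tag combinations yields training examples for all three tasks.

```python
from itertools import product

# Toy stand-in for a finite-state transducer: a lexicon of lemmas and a list
# of (tag string, suffix) rules. Hypothetical data, for illustration only.
LEXICON = ["cat", "dog"]
RULES = [("+N+Sg", ""), ("+N+Pl", "s")]


def extract_dataset(lexicon, rules):
    """Enumerate every lemma/tag combination and record its surface form."""
    rows = []
    for lemma, (tags, suffix) in product(lexicon, rules):
        rows.append({"surface": lemma + suffix, "lemma": lemma, "tags": tags})
    return rows


def to_task_examples(row):
    """Derive (input, output) pairs for the three tasks from one row."""
    analysis = row["lemma"] + row["tags"]
    return {
        "analysis": (row["surface"], analysis),
        "generation": (analysis, row["surface"]),
        "lemmatization": (row["surface"], row["lemma"]),
    }


for row in extract_dataset(LEXICON, RULES):
    print(to_task_examples(row))
```

Because the rules already exist, the size of the dataset is limited by the lexicon and the paradigms rather than by how much text has been annotated by hand.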

Neural networks can learn by reading rules

The neural models developed in this study are designed to work hand in hand with the existing FSTs: when an FST cannot handle a word form, the neural model can serve as a fallback. This pairing of traditional rule-based tools with neural models is a practical way to build language technology, especially for language preservation and revitalization.

What sets this research apart is not just the number and diversity of the languages covered but also its practical usefulness. The models have been made publicly available, so they can be used in further academic research and in real-world applications, which in turn supports the broader goal of safeguarding linguistic diversity.
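A fallback arrangement of this kind could look roughly like the sketch below. The two callables are hypothetical stand-ins: `fst_lookup` represents any rule-based analyzer that returns a possibly empty list of analyses, and `neural_analyze` represents a trained model that always returns a best guess. Neither is the API of a specific library.

```python
def analyze_with_fallback(word, fst_lookup, neural_analyze):
    """Prefer the rule-based analyses; use the neural model only when
    the FST returns nothing for the word."""
    analyses = fst_lookup(word)
    if analyses:
        return analyses, "fst"
    return [neural_analyze(word)], "neural"


# Toy stand-ins for illustration only.
TOY_FST = {"cats": ["cat+N+Pl"]}


def toy_fst_lookup(word):
    return TOY_FST.get(word, [])


def toy_neural_analyze(word):
    # A real model would predict lemma and tags; this stub just guesses.
    return word + "+N+Sg"


print(analyze_with_fallback("cats", toy_fst_lookup, toy_neural_analyze))
print(analyze_with_fallback("blorp", toy_fst_lookup, toy_neural_analyze))
```

Returning the source of each analysis alongside the result lets downstream tools treat neural guesses with more caution than rule-based output.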

Neural Machine Translation Between Erzya and Moksha Using Synthetic Data

Neural machine translation (NMT) has done much to break down language barriers, but it is much harder to apply to endangered languages such as Moksha and Erzya. Parallel corpora for these languages are very limited, which is a significant obstacle to building robust NMT systems. A recent study (Alnajjar et al., 2023) presents an approach that works around this data scarcity.

The study uses Apertium, an existing rule-based machine translation system, to generate synthetic data for training NMT models. The rule-based output of Apertium is used to create a parallel corpus for Moksha-Erzya translation, which bootstraps the NMT system with synthetic training data.
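In outline, building such a synthetic corpus amounts to running monolingual sentences through the rule-based translator and keeping the resulting pairs. The sketch below assumes a hypothetical `rule_based_translate` callable rather than a real Apertium binding, and the toy lexicon is made up. Apertium conventionally marks words it cannot translate with a leading asterisk; filtering on that marker mirrors the convention and is an assumption here, not a step confirmed by the study.

```python
def build_synthetic_corpus(source_sentences, rule_based_translate,
                           unknown_marker="*"):
    """Pair each source sentence with its rule-based translation.

    Sentences whose translation contains the unknown-word marker are
    skipped (an assumed quality filter, not taken from the study).
    """
    corpus = []
    for sentence in source_sentences:
        sentence = sentence.strip()
        if not sentence:
            continue
        translation = rule_based_translate(sentence)
        if unknown_marker and unknown_marker in translation:
            continue
        corpus.append((sentence, translation))
    return corpus


def toy_rule_based_translate(sentence):
    """Word-by-word toy translator with a made-up two-word lexicon."""
    lexicon = {"aa": "xx", "bb": "yy"}
    return " ".join(lexicon.get(w, "*" + w) for w in sentence.split())


print(build_synthetic_corpus(["aa bb", "  ", "aa cc"], toy_rule_based_translate))
```

The resulting list of sentence pairs has the same shape as a human-made parallel corpus, so it can be fed to an NMT training pipeline without further changes.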

Erzya and Moksha are endangered languages spoken in Mordovia, Russia.

The NMT system itself is built on the NLLB-200 model, fine-tuned specifically for the Moksha-to-Erzya translation task. Despite challenges such as limited vocabulary and idiomatic expressions, the fine-tuned model achieved a notable improvement in translation quality, as measured by BLEU scores. This offers a viable path toward more accurate translation for endangered languages.

The study underscores both the potential of neural models for machine translation of lesser-resourced languages and the importance of creative solutions to data scarcity. By combining the strengths of rule-based systems with those of neural models, it opens up new avenues for language accessibility and preservation.
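For readers unfamiliar with BLEU, the following is a simplified sketch of how the score is computed: the geometric mean of modified n-gram precisions, multiplied by a brevity penalty. It assumes whitespace tokenization, a single reference per sentence, and no smoothing. The evaluation setup of the study is not described here, so this should not be read as the authors' implementation.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Unsmoothed corpus-level BLEU with one reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngram_counts(h, n), ngram_counts(r, n)
            # Clip each n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    if min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives zero BLEU
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity * math.exp(log_precision)


print(corpus_bleu(["a b c d"], ["a b c d e"]))
```

Comparing this score before and after fine-tuning on the same test set is what an improvement "as evidenced by BLEU" refers to.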


Mika Hämäläinen and Jack Rueter. 2019. Finding Sami Cognates with a Character-Based NMT Approach. In Proceedings of the 3rd Workshop on the Use of Computational Methods in the Study of Endangered Languages Volume 1 (Papers), pages 39–45, Honolulu. Association for Computational Linguistics.

Mika Hämäläinen, Niko Partanen, Jack Rueter, and Khalid Alnajjar. 2021. Neural Morphology Dataset and Models for Multiple Languages, from the Large to the Endangered. In Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa), pages 166–177, Reykjavik, Iceland (Online). Linköping University Electronic Press, Sweden.

Khalid Alnajjar, Mika Hämäläinen, and Jack Rueter. 2023. Bootstrapping Moksha-Erzya Neural Machine Translation from Rule-Based Apertium. In Proceedings of the Joint 3rd International Conference on Natural Language Processing for Digital Humanities and 8th International Workshop on Computational Linguistics for Uralic Languages, pages 213–218, Tokyo, Japan. Association for Computational Linguistics.