How to Incorporate Multimodal ML in NLP Methods?

Multimodal machine learning (ML) is challenging to work with, and even more so when it is applied to natural language processing (NLP) research. In this post, we look at three papers that use multimodal ML for tasks related to figurative language: humor, metaphor, and sarcasm detection.

Detecting Humor in Friends

In the COLING paper When to Laugh and How Hard? A Multimodal Approach to Detecting Humor and Its Intensity (Alnajjar et al., 2022a), the researchers introduce an approach to humor detection that uses multimodal data from the TV show Friends. The study stands out for its data acquisition method: it uses the show’s prerecorded laughter as a marker for both the presence and the intensity of humor. The researchers developed a pipeline of two neural models: one detects whether an utterance is humorous, and the other assesses the humor’s intensity based on laughter duration.
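As a rough illustration of the two-model pipeline described above, here is a minimal Python sketch. It is not the authors’ implementation: the names `predict_humor`, `toy_detector`, and `toy_intensity` are hypothetical, the toy models look at text only, and the choice to run the intensity model only on utterances flagged as humorous is an assumption made for this example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class HumorPrediction:
    is_humorous: bool
    # Predicted laughter duration in seconds; None when not humorous.
    intensity: Optional[float]


def predict_humor(
    utterance: str,
    detect: Callable[[str], float],
    estimate_intensity: Callable[[str], float],
    threshold: float = 0.5,  # assumed decision threshold
) -> HumorPrediction:
    """Run detection first; estimate intensity only for humorous utterances."""
    probability = detect(utterance)
    if probability < threshold:
        return HumorPrediction(is_humorous=False, intensity=None)
    return HumorPrediction(is_humorous=True, intensity=estimate_intensity(utterance))


# Toy stand-ins for the two trained models (hypothetical, illustration only).
def toy_detector(utterance: str) -> float:
    return 0.9 if "!" in utterance else 0.1


def toy_intensity(utterance: str) -> float:
    return 0.5 * utterance.count("!")


print(predict_humor("We were on a break!", toy_detector, toy_intensity))
print(predict_humor("Hello.", toy_detector, toy_intensity))
```

In a real system, the two callables would wrap trained neural models that take text, audio, and video features as input.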

Even an AI will learn to laugh at Friends

The data construction process is particularly notable because the multimodal humor corpus is annotated automatically, with the prerecorded laughter serving as an implicit annotation. This approach addresses the scarcity of annotated multimodal datasets for humor and shows how existing media content can be reused for computational humor research. The findings could help improve automatic humor detection systems and deepen our understanding of humor’s multimodal nature.
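The idea of using laughter as an implicit label can be sketched in a few lines of Python. This is a simplified illustration, not the authors’ procedure: it assumes that subtitle timestamps and laughter intervals have already been extracted, and the `max_gap` rule (laughter must start within one second after the utterance ends) is a hypothetical heuristic chosen for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Utterance:
    text: str
    start: float  # seconds
    end: float    # seconds


def annotate_with_laughter(
    utterances: List[Utterance],
    laughter: List[Tuple[float, float]],
    max_gap: float = 1.0,  # assumed tolerance, in seconds
) -> List[Tuple[str, bool, float]]:
    """Label each utterance as (text, is_humorous, laughter_duration_seconds).

    An utterance counts as humorous when a laughter interval starts between
    its end and `max_gap` seconds later. The duration of that interval is
    used as a proxy for humor intensity.
    """
    labels = []
    for utt in utterances:
        duration = 0.0
        for laugh_start, laugh_end in laughter:
            if utt.end <= laugh_start <= utt.end + max_gap:
                duration = laugh_end - laugh_start
                break
        labels.append((utt.text, duration > 0.0, duration))
    return labels


utterances = [Utterance("Pivot!", 0.0, 1.0), Utterance("Okay.", 5.0, 6.0)]
laughter = [(1.5, 3.5)]  # one laughter burst lasting two seconds
labels = annotate_with_laughter(utterances, laughter)
print(labels)
```

Here the first utterance is labelled humorous with an intensity of two seconds, while the second gets no laughter and is labelled non-humorous.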

Hunting Metaphors with Multimodal ML

In an effort to improve natural language understanding (NLU), another study (Alnajjar et al., 2022b) introduced the first openly available multimodal metaphor-annotated corpus. The corpus consists of videos with audio and subtitles that have been annotated by experts, which opens the way for new metaphor detection methods. The researchers’ own method uses the textual content of these videos to detect metaphors. Although the text-based model outperformed the models that incorporated additional modalities, the study shows that video has potential for disambiguating metaphors, even if the current model’s sensitivity to subtle visual cues remains a challenge.

Metaphors are not easy for a computer

This research underscores the complexity of metaphor detection and the necessity of integrating multimodal data to capture the nuanced ways humans communicate figuratively. By opening up this corpus to the wider research community, the study invites further exploration into multimodal NLP, promising new insights into the subtleties of human language and cognition. The findings not only contribute a valuable resource to the field but also invite a reevaluation of how we approach the interpretation of figurative language in computational models.

Solving Sarcasm using Multimodal ML

In NLP, sarcasm detection is a formidable challenge, primarily because sarcasm is nuanced and depends on context, tone, and, often, visual cues. A study by Khalid Alnajjar and Mika Hämäläinen (2021) from the University of Helsinki marks a significant milestone in this area. Their paper, ¡Qué maravilla! Multimodal Sarcasm Detection in Spanish: a Dataset and a Baseline, introduces the first multimodal sarcasm dataset for Spanish and sets a baseline for future sarcasm detection research.

The core innovation of their approach lies in the integration of multiple modes of communication – text, audio, and video – to detect sarcasm. Traditional methods focus primarily on textual analysis, which is effective to a certain extent but often misses the subtleties conveyed through tone of voice and visual expressions. The researchers’ multimodal dataset contains sarcasm-annotated text aligned with the corresponding video and audio, offering a more holistic view of sarcasm in communication.
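One simple way to combine modalities is to concatenate their feature vectors before classification. The Python sketch below illustrates only that general idea. The post does not describe how the paper fuses modalities, so the `fuse` function, the feature values, and the vector sizes are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence

Features = Dict[str, List[float]]


def fuse(sample: Features, modalities: Sequence[str]) -> List[float]:
    """Concatenate the feature vectors of the selected modalities, in order."""
    fused: List[float] = []
    for name in modalities:
        fused.extend(sample[name])
    return fused


# Toy feature vectors for one utterance (hypothetical values and sizes).
sample = {
    "text": [0.2, 0.7, 0.1],
    "audio": [0.5, 0.3],
    "video": [0.9, 0.4, 0.6, 0.8],
}

# Three input configurations: text only, text plus audio, all three modalities.
text_only = fuse(sample, ["text"])
text_audio = fuse(sample, ["text", "audio"])
all_three = fuse(sample, ["text", "audio", "video"])

print(len(text_only), len(text_audio), len(all_three))
```

Each fused vector would then be passed to a classifier, which makes it straightforward to compare models trained on different subsets of modalities.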

Sarcasm is an inherently multimodal phenomenon

This dataset is unique not just for its multimodal nature but also for its coverage of two varieties of Spanish, Latin American and Peninsular, which broadens its applicability and relevance. The study’s findings are illuminating: text-only models achieved 89% accuracy in sarcasm detection, and adding audio raised the accuracy to 91.9%. The best performance was observed when all three modalities were combined, reaching 93.1% accuracy, a gain of 4.1 percentage points over text alone.
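To put these numbers in perspective, it helps to look at how many of the text-only model’s errors the extra modalities remove. The short Python calculation below uses only the accuracies quoted above. The relative error reduction is a standard way to read such results, not a figure reported in this post.

```python
# Accuracies quoted above, in percent.
accuracy = {
    "text": 89.0,
    "text+audio": 91.9,
    "text+audio+video": 93.1,
}


def relative_error_reduction(baseline_acc: float, new_acc: float) -> float:
    """Share of the baseline's errors removed by the new model, in percent."""
    baseline_err = 100.0 - baseline_acc
    new_err = 100.0 - new_acc
    return 100.0 * (baseline_err - new_err) / baseline_err


for name in ("text+audio", "text+audio+video"):
    gain = accuracy[name] - accuracy["text"]
    rer = relative_error_reduction(accuracy["text"], accuracy[name])
    print(f"{name}: +{gain:.1f} points, {rer:.1f}% fewer errors")
```

In other words, the text-only model has an error rate of 11%, and combining all three modalities brings it down to 6.9%, which removes roughly 37% of the errors.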


Khalid Alnajjar, Mika Hämäläinen, Jörg Tiedemann, Jorma Laaksonen, and Mikko Kurimo. 2022a. When to Laugh and How Hard? A Multimodal Approach to Detecting Humor and Its Intensity. In Proceedings of the 29th International Conference on Computational Linguistics, pages 6875–6886, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.

Khalid Alnajjar, Mika Hämäläinen, and Shuo Zhang. 2022b. Ring That Bell: A Corpus and Method for Multimodal Metaphor Detection in Videos. In Proceedings of the 3rd Workshop on Figurative Language Processing (FLP), pages 24–33, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.

Khalid Alnajjar and Mika Hämäläinen. 2021. ¡Qué maravilla! Multimodal Sarcasm Detection in Spanish: a Dataset and a Baseline. In Proceedings of the Third Workshop on Multimodal Artificial Intelligence, pages 63–68, Mexico City, Mexico. Association for Computational Linguistics.