AI moves towards contextual knowledge through multimodal learning

February 17, 2022

#AI22 - No. 6 of 10

#AI22 is a series of articles highlighting what we believe to be 10 developments that will be impacting AI this year.
This series is co-written by Dr. Johannes Otterbach, Dr. Rasmus Rothe and Henry Schröder.

---

A key difference between how humans and AI models process data is the comprehension of context. When a person sees a seemingly unrelated combination of text (e.g. “I love your smell”) and an image (e.g. a skunk), they recognize the meaning that arises from combining the two. Humans are very good at integrating numerous streams of input at once while filtering out what is unimportant. AI tools are not yet fully able to make these multimodal inferences, but model architectures are converging significantly as the formerly separate research communities start to overlap.

In these multimodal structures, models from different domains, such as computer vision and NLP or audio and image models, are trained together on shared datasets to learn a combined representation space. Combining models that already perform well in their own domains opens up new problem-solving approaches. A key strength of these approaches is early sensor fusion. Early fusion models combine the various input data types before classifying the content, which helps the system pick up contextual knowledge. Late fusion, in contrast, classifies each data type individually and fuses the results afterward. Beyond this, multimodal systems present two key advantages. Firstly, combining numerous inputs may reveal complementary information that individual systems miss. Secondly, the quality of the predictions can increase, because several sensors observing the same phenomenon are more likely to recognize differences than a single modality is.
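To make the difference between early and late fusion concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not how any particular production system works: the feature vectors, the random weights, and the function names (`early_fusion`, `late_fusion`, `linear`) are all hypothetical, and a simple linear classifier stands in for a real trained model.

```python
import math
import random


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(features, weights):
    """One score per class; `weights` holds one row of feature weights per class."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def early_fusion(image_feat, text_feat, joint_weights):
    """Concatenate the modalities first, then classify the joint vector once."""
    joint = image_feat + text_feat  # list concatenation
    return softmax(linear(joint, joint_weights))


def late_fusion(image_feat, text_feat, image_weights, text_weights):
    """Classify each modality separately, then average the two predictions."""
    p_image = softmax(linear(image_feat, image_weights))
    p_text = softmax(linear(text_feat, text_weights))
    return [(a + b) / 2 for a, b in zip(p_image, p_text)]


def random_weights(rng, n_classes, n_features):
    """Hypothetical stand-in for trained parameters."""
    return [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_classes)]


rng = random.Random(0)
n_classes = 3

# Made-up feature vectors, e.g. outputs of an image encoder and a text encoder.
image_feat = [rng.gauss(0, 1) for _ in range(8)]
text_feat = [rng.gauss(0, 1) for _ in range(5)]

p_early = early_fusion(image_feat, text_feat, random_weights(rng, n_classes, 13))
p_late = late_fusion(
    image_feat,
    text_feat,
    random_weights(rng, n_classes, 8),
    random_weights(rng, n_classes, 5),
)

print("early fusion:", p_early)
print("late fusion: ", p_late)
```

In the early-fusion path the classifier sees image and text features side by side. With a richer, non-linear model in place of this linear stand-in, that is what allows interactions between the modalities (the skunk picture changing the meaning of the sentence) to be learned. In the late-fusion path each classifier only ever sees its own modality, so such interactions have to be recovered from the separate predictions, if at all.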

While there has been strong development in academia, for example with CLIP, an OpenAI model that learns visual concepts from natural-language supervision, application in industry has been limited. The potential benefits and areas of application are nonetheless clear: from detecting multimodal hate speech on social media with a model introduced by Facebook, to gas detection by fusing gas sensors and thermal images, to language translation using text and images.

Although there has been significant progress in developing such models, their adoption in real-world applications is not yet complete. We expect an increasing shift from academia to business, with multimodality becoming more relevant and prevalent.