
When you wake up in the morning, you reorient yourself into the world in a variety of ways. Before you open your eyes, you might hear the ambient sounds in your room (unless a not-so-ambient sound was what woke you up in the first place). You might feel cozy under the covers, or cold because you kicked them off while you were sleeping. And once you open your eyes, you get a visual sense of what’s going on in your room. These sense recognitions, along with the moods they evoke, create a nuanced perception of the morning and set you up for the rest of your day.
How do multimodal gen AI models work?
Multimodal gen AI models work in a similar way. They mirror the brain’s ability to combine sensory inputs into a nuanced, holistic understanding of the world. These gen AI models’ ability to seamlessly perceive multiple inputs—and simultaneously generate output—allows them to interact with the world in innovative, transformative ways and represents a significant advancement in AI. By combining the strengths of different types of content (including text, images, audio, and video) from different sources, multimodal gen AI models can understand data more comprehensively, which enables them to handle more complex queries and produce fewer hallucinations (inaccurate or misleading outputs).
Today, enterprises that have deployed gen AI primarily use text-based large language models (LLMs). But a shift toward multimodal AI is underway, with the potential for a larger range of applications and more complex use cases. Multimodal gen AI models are well suited to the moment’s demands on business. As Internet of Things (IoT)–enabled devices collect more types and greater volumes of data than ever before, organizations can use multimodal AI models to process and integrate multisensory information, then deliver the increasingly personalized experiences that customers seek in retail, healthcare, and entertainment.
Multimodal gen AI models can also make technology more accessible to nontechnical users. Because the models can process multisensory inputs, users are able to interact with them by speaking, gesturing, or using an augmented reality or virtual reality controller. The ease of use also means that more people of varying abilities can reap the benefits that gen AI offers, such as increased productivity. And, finally, AI models in general are becoming less expensive and more powerful with each passing month. Not only is their performance improving, but the time it takes to generate results is decreasing—as is the number of unintended outputs or errors. What’s more, the cost of building these models is decreasing sharply. For example, researchers at Sony AI recently demonstrated that a model that cost $100,000 to train in 2022 can now be trained for less than $2,000.
The field of multimodal AI is evolving quickly, with new models and innovative use cases emerging almost every day, reshaping what’s possible with AI. In this Explainer, we’ll explore how multimodal gen AI models work, what they’re used for, and where the technology is headed next.
Learn more about QuantumBlack, AI by McKinsey.
What four steps do multimodal AI models use to process information?
Multimodal AI models typically consist of multiple neural networks, each tailored to process—or “encode”—one specific format, such as text, images, audio, or video. The outputs are then combined through various fusion techniques, and in the final step, a classifier translates the fused outputs into a prediction or decision. Here is more about each step, followed by a brief code sketch of the full pipeline:
- Data input and preprocessing. Data in different formats is gathered and preprocessed. Types of preprocessing include tokenizing text, resizing images, and converting audio to spectrograms.
- Feature encoding. Encoder tools within individual neural networks convert the data (such as a picture or a sentence) into machine-readable feature vectors or embeddings (typically represented by a series of numbers). Each modality is generally processed differently. For example, image pixels can be converted into feature vectors via CLIP (contrastive language–image pretraining), while text could be embedded using transformer architectures, such as those that power OpenAI’s GPT series.
- Fusion mechanisms. Encoded data from the different modalities is mapped into a shared space using various fusion mechanisms, which merge the embeddings from the different modalities into a single representation. The fusion step allows the model to dynamically focus on the parts of the data that are most relevant to the task and to understand the relationships between the different modalities, enabling cross-modal understanding.
- Generative modeling. The generative step converts the data fused in the previous step into actionable outputs. For example, in image captioning, the model might generate a sentence that describes the image. Different models use different techniques; some adopt autoregressive methods to predict the next element in a sequence, while others utilize generative adversarial networks (GANs) or variational autoencoders (VAEs) to create outputs.
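To make these four steps concrete, here is a minimal sketch in Python that pairs one image with one caption, assuming PyTorch and the openly available CLIP model from Hugging Face’s transformers library. The file path, the linear fusion layer, and the two-class prediction head are illustrative placeholders rather than a reference architecture.

```python
# Illustrative walk-through of the four steps (hypothetical setup).
import torch
import torch.nn as nn
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# 1) Data input and preprocessing: tokenize the text and resize/normalize the image.
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
inputs = processor(
    text=["a cat sitting on a windowsill"],
    images=Image.open("photo.jpg"),  # placeholder path
    return_tensors="pt",
    padding=True,
)

# 2) Feature encoding: separate encoders map each modality to embeddings.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
with torch.no_grad():
    image_emb = clip.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = clip.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )

# 3) Fusion: combine the two embeddings in a shared space. A simple
#    concatenation plus linear layer stands in for attention-based fusion.
fusion = nn.Linear(image_emb.shape[-1] + text_emb.shape[-1], 512)
fused = fusion(torch.cat([image_emb, text_emb], dim=-1))

# 4) Generative modeling / prediction: a head turns the fused representation
#    into an output; here, a toy two-class prediction.
head = nn.Linear(512, 2)
print(head(fused).softmax(dim=-1))
```

In a production model, the fusion and generation stages would typically use attention layers and an autoregressive decoder rather than the single linear layers shown here.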
How do multimodal models compare with text-only models?
LLMs are efficient and cost-effective for text-based applications. By contrast, multimodal models—which are about twice as expensive per token as LLMs—can handle more complex tasks by integrating multiple data types, such as text and images. What’s more, multimodal AI models are typically not significantly slower than text-only models.
How can organizations use multimodal gen AI models?
Organizations looking to implement multimodal gen AI can consider the following use cases:
- Accelerating creative processes in marketing and product design. Organizations can use multimodal AI models to design personalized marketing campaigns that seamlessly blend text, images, and video. On the product side, organizations can use multimodal AI to generate product prototypes.
- Reducing fraud in insurance claims. Multimodal models can reduce fraud in the insurance industry by cross-checking a diverse set of data sources, including customer statements, transaction logs, and claim supplements such as photos or videos. More efficient fraud detection can streamline the processing of claims for legitimate cases.
- Enhancing trend detection. By analyzing unstructured data from diverse sources, including social media posts, images, and videos, organizations can detect emerging trends and tailor their marketing strategies and products to resonate with local audiences.
- Transforming patient care. Multimodal AI can change patient care dramatically by enabling virtual assistants to communicate through text, speech, images, videos, and gestures, making interactions more intuitive, empathetic, and personalized.
- Providing real-time support in call centers and healthcare. Multimodal models can use low-latency voice processing to enable real-time assistance through call centers and medical-assistance platforms. In call centers, these models can listen to customer interactions, transcribe customers’ concerns, and provide instant recommendations that agents can relay on the spot. In medical settings, they can transcribe and analyze patient symptoms and then suggest next steps—all while maintaining seamless, natural conversations with the patients themselves. This capability enhances decision-making and patient satisfaction.
- Streamlining user interaction testing. Multimodal AI can revolutionize automated user interaction testing by simulating interactions across web browsers, applications, and games. By analyzing both code and visual data, this capability can autonomously verify accessibility standards, such as screen reader compatibility and color contrast, while also assessing the overall user experience.
By bringing together a diverse set of formats and data types, these models can produce information that empowers leaders and their companies to stay competitive and innovative. The companies that invest early in these use cases may need to address some new technical risks but may also gain an advantage by being first movers.
Learn more about QuantumBlack, AI by McKinsey.
How will organizations access and deploy multimodal AI?
The majority of organizations using multimodal AI are likely to be categorized as takers. This means they will deploy user-friendly applications that are built on pretrained models from third-party providers. Other organizations will want to customize out-of-the-box systems to improve performance in their specific use cases; these companies will be called shapers. Potential customizations include fine-tuning the model to reduce costs and improve performance on specific tasks, training the model on proprietary data, building scaffolding for continuous feedback and active learning, and adding guardrails to prevent unwanted responses and improve the model’s level of responsibility. A final category of companies will be makers, which tend to be technologically advanced organizations that train their models in-house. This training can cost up to millions of dollars and requires specialized technical expertise and access to sophisticated hardware.
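As a rough illustration of the shaper path, the following sketch keeps a pretrained multimodal encoder frozen and fine-tunes only a small task-specific head on an organization’s own labeled data. The model choice, the three-category label set, and the single training step are assumptions for demonstration, not a recommended configuration.

```python
# Hypothetical "shaper" setup: freeze a pretrained multimodal encoder and
# fine-tune a small classification head on proprietary image-and-text records.
import torch
import torch.nn as nn
from transformers import CLIPModel, CLIPProcessor

encoder = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
encoder.requires_grad_(False)  # keep the pretrained weights fixed

# Trainable head: concatenated image and text features -> three example categories.
feature_dim = encoder.config.projection_dim * 2
head = nn.Linear(feature_dim, 3)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def training_step(images, texts, labels):
    """One gradient update on a batch of proprietary examples."""
    inputs = processor(text=texts, images=images, return_tensors="pt", padding=True)
    with torch.no_grad():  # the frozen encoder needs no gradients
        image_feats = encoder.get_image_features(pixel_values=inputs["pixel_values"])
        text_feats = encoder.get_text_features(
            input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
        )
    logits = head(torch.cat([image_feats, text_feats], dim=-1))
    loss = loss_fn(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Freezing the encoder keeps training cheap and fast, which is often what separates a shaper’s budget from a maker’s.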

For organizations that strive to be makers, a robust and user-friendly multimodal application requires several critical factors: an intuitive user interface, a powerful backend infrastructure (including a multimodal search pipeline that’s capable of understanding relationships across different data types), efficient strategies to deploy the model, and stringent data cleaning, security, and privacy protocols to protect user information.
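To make the idea of a multimodal search pipeline concrete, here is a deliberately bare-bones sketch. It assumes that text, image, and audio items have already been embedded into one shared vector space by modality-specific encoders; the random vectors, item names, and embedding size are placeholders.

```python
# Toy multimodal retrieval over a shared embedding space (hypothetical data).
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM = 512  # illustrative embedding size

# A tiny index: each item records its modality and its precomputed embedding.
catalog = [
    {"id": "manual_p12.txt", "modality": "text", "vec": rng.normal(size=EMBED_DIM)},
    {"id": "defect_photo.jpg", "modality": "image", "vec": rng.normal(size=EMBED_DIM)},
    {"id": "support_call.wav", "modality": "audio", "vec": rng.normal(size=EMBED_DIM)},
]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def search(query_vec, top_k=2):
    """Rank every item, regardless of modality, by similarity to the query."""
    scored = [(cosine(query_vec, item["vec"]), item["id"], item["modality"])
              for item in catalog]
    return sorted(scored, reverse=True)[:top_k]

# In practice the query would be embedded by the same encoders as the catalog.
print(search(rng.normal(size=EMBED_DIM)))
```

A production pipeline would add an approximate-nearest-neighbor index, access controls, and the data cleaning and privacy protocols noted above.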
Developing multimodal model architectures presents significant challenges, particularly when it comes to alignment and colearning. Alignment ensures that the modalities are properly synchronized with each other—more specifically, that audio output aligns with the corresponding video or that speech output aligns with the corresponding text. Colearning allows models to recognize and utilize correlations across modalities without succumbing to negative transfer (where a model’s learning from one modality actually hinders its comprehension of another).
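One common way to encourage alignment is a contrastive objective in the spirit of CLIP: embeddings of matching image and text pairs are pulled together in the shared space while mismatched pairs are pushed apart. The sketch below is a generic version of that idea; the batch size, embedding size, and temperature are arbitrary illustration values.

```python
# Generic contrastive alignment loss between two modalities (illustrative values).
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(image_emb, text_emb, temperature=0.07):
    # Normalize so similarity becomes a cosine score.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarities between every image and every text in the batch.
    logits = image_emb @ text_emb.t() / temperature

    # Matching pairs sit on the diagonal: pull them together and push
    # mismatched pairs apart, in both directions.
    targets = torch.arange(logits.shape[0])
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Toy usage: a batch of eight paired image/text embeddings of size 512.
loss = contrastive_alignment_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```

Colearning is harder to reduce to a few lines; in practice it involves monitoring per-modality performance so that gradients from one modality do not degrade another (the negative transfer described above).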
What are real examples of organizations working with multimodal AI?
Life sciences companies are using multimodal AI to transform both drug discovery and clinical care delivery. Leading foundation models—a type of AI model trained on massive, general-purpose data sets—can accept a protein’s amino acid sequence (that is, the sequence of letters that represents the different molecules that make up the protein) as an input. The scientists behind AlphaFold, an AI system developed by Google DeepMind, were honored by the Nobel committee in 2024 for building a model that can predict the 3D structure of a protein in just a couple of minutes. In the past, this process would have taken several months and required expensive experimental methods, such as X-ray crystallography. Another example is ESM-3, which goes a step further than AlphaFold. It not only predicts a protein’s structure but also captures its functional and evolutionary information in a single, unified model. ESM-3 uses multimodal AI to learn simultaneously from sequences, structures, and biological annotations (similar to metadata), which enables the model to determine what a protein looks like, what it does, and how it evolved—all at once.
In clinical healthcare, single-modality foundation models have already outperformed clinical experts in certain tasks, such as mammography. The multimodal foundation models that are currently in development could simultaneously take into account an X-ray, mammogram, doctors’ notes, medical history, and genetic test results, generating a holistic picture of a patient’s risk of developing cancer rather than an isolated data point on their cancer risk matrix.
Learn more about QuantumBlack, AI by McKinsey.
What risks are associated with multimodal AI?
Multimodal AI carries the same risks and limitations as other gen AI applications, including bias, data privacy, and exposure to expanding AI regulations. For the overall use of gen AI, McKinsey recommends that organizations create a plan to implement AI quickly and safely.
Hallucination is the most common risk associated with gen AI. The consequences of hallucinations may be more severe in multimodal models than in their unimodal counterparts because an error in one modality could cause errors to cascade through the complex systems that generate the eventual output.
Other risks include the following:
- Data privacy and security become more complex with multimodal AI models that may handle multiple types of personal data. Cross-modality analysis also often involves highly sensitive information, such as physical movements or personal behaviors. This increases privacy concerns, particularly in surveillance contexts.
- Bias and fairness are significant concerns. Gen AI in general can inherit or amplify the biases present in training data of all types.
- The integration of diverse data sources poses technical challenges and requires the careful design of both system and model architectures to ensure accurate interpretation of the inputs.
- Regulatory compliance adds further complexity to the development and deployment of these systems, as AI and data usage laws are evolving across industries and regions. Images are a particularly sensitive format, due to copyright and intellectual property concerns.
How can organizations mitigate these risks?
To mitigate risk, leaders can consider the following strategies:
- Choose up-to-date models from trusted sources or platforms appropriate to the task.
- Keep a human in the loop for the model’s more sensitive tasks.
- Deploy gen AI in use cases where occasional inaccuracies are unlikely to cause harm or where the accuracy of the output can be easily verified, such as coding tasks within a controlled environment.
- Implement guardrails across the system to ensure safety for both the end user and the model (a simple example follows this list).
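As one simple illustration of the guardrail idea, the sketch below screens both the user’s prompt and the model’s draft response before anything reaches the end user. The blocked-term list and the stand-in generate callable are hypothetical; production systems typically layer policy models, classifiers, and human review on top of keyword checks like this one.

```python
# Minimal guardrail sketch (hypothetical policy): screen both input and output.
BLOCKED_TERMS = {"social security number", "credit card"}

def violates_policy(text: str) -> bool:
    lowered = text.lower()
    return any(term in lowered for term in BLOCKED_TERMS)

def guarded_generate(prompt: str, generate) -> str:
    """Wrap any generate(prompt) callable with simple input/output checks."""
    if violates_policy(prompt):
        return "Request declined: the prompt appears to contain sensitive data."
    response = generate(prompt)
    if violates_policy(response):
        return "Response withheld: the draft output failed a safety check."
    return response

# Toy usage with a stand-in model.
print(guarded_generate("Summarize this claim photo.", lambda p: "Summary: ..."))
```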
Learn more about QuantumBlack, AI by McKinsey. And check out multimodal AI-related job opportunities if you’re interested in working with McKinsey.
Articles referenced:
- “Scientific AI: Unlocking the next frontier of R&D productivity,” January 15, 2025, Alex Devereson, Chris Anagnostopoulos, David Champagne, Hugues Lavandier, Lieven Van der Veken, Thomas Devenyns, and Ulrich Weihe, with Alex Peluffo, Benji Lin, Jennifer Hou, and Maren Eckhoff
- “The state of AI in early 2024: Gen AI adoption spikes and starts to generate value,” May 30, 2024, Alex Singla, Alexander Sukharevsky, Lareina Yee, and Michael Chui, with Bryce Hall
- “Implementing generative AI with speed and safety,” March 13, 2024, Oliver Bevan, Michael Chui, Ida Kristensen, Brittany Presten, and Lareina Yee
- “AI-powered marketing and sales reach new heights with generative AI,” May 11, 2023, Richelle Deveau, Sonia Joseph Griffin, and Steve Reis
