We are used to dealing with multiple sensory inputs: the information from our senses of sight, hearing, taste, touch and smell. We combine this information to make decisions easily and unconsciously. By contrast, artificial intelligence (AI) chatbots based on large language models (LLMs) have mostly been restricted to a single mode of communication – text. Newer AI models can process inputs beyond text: they can also read images, audio and video, and combine these different inputs. Amongst other things, this allows us to describe something to an AI model not just in words but with images as well. For example, an insurance claims processing model could read a customer’s email making a car insurance claim alongside photos or videos of the vehicle damage being claimed.
A multimodal AI model will usually deploy multiple neural networks, each tailored to a specific format: one model for text, another for images, another for audio and so on. Each specialist model turns its input into a form the AI model understands, a mathematical representation called a vector. For images, there are particular kinds of vector representations called feature vectors, which capture characteristics of an image such as the average level of blueness, the shapes it contains, edge directions or textures. For video, the vectors describe the motion between frames or the sequence of frames; for sound, they represent acoustic elements such as tempo. Once the various inputs have been translated into vector representations, a separate part of the model combines them. This might involve, for example, synchronising lip movements in a video with spoken words in an audio file, or linking images with descriptive text. Finally, these combined inputs are turned into an output, for example generating text captions for a video or answering questions in a way that combines text with images.
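To illustrate the general shape of this design, the following is a minimal, purely illustrative sketch in Python (using PyTorch), not any vendor’s actual architecture. It shows three toy per-modality encoders mapping text, image and audio features into a shared vector space, a fusion layer combining them, and an output head; all layer sizes, names and dimensions are assumptions chosen for readability.

```python
# Minimal sketch of the "encoders + fusion + output" pattern described above.
# Real systems use transformers, CNNs or audio models, not single linear layers.
import torch
import torch.nn as nn

EMBED_DIM = 256  # size of the shared vector space (arbitrary choice)

class ToyMultimodalModel(nn.Module):
    def __init__(self):
        super().__init__()
        # One specialist encoder per modality, each producing a vector
        # in the same shared space.
        self.text_encoder = nn.Linear(512, EMBED_DIM)    # e.g. token features
        self.image_encoder = nn.Linear(2048, EMBED_DIM)  # e.g. pixel/patch features
        self.audio_encoder = nn.Linear(1024, EMBED_DIM)  # e.g. spectrogram features
        # Fusion step: combine the per-modality vectors into one representation.
        self.fusion = nn.Sequential(
            nn.Linear(EMBED_DIM * 3, EMBED_DIM),
            nn.ReLU(),
        )
        # Output head, e.g. scoring possible answer or caption tokens.
        self.output_head = nn.Linear(EMBED_DIM, 1000)

    def forward(self, text_feats, image_feats, audio_feats):
        t = self.text_encoder(text_feats)
        i = self.image_encoder(image_feats)
        a = self.audio_encoder(audio_feats)
        fused = self.fusion(torch.cat([t, i, a], dim=-1))
        return self.output_head(fused)

# Dummy tensors standing in for real text, image and audio features.
model = ToyMultimodalModel()
out = model(torch.randn(1, 512), torch.randn(1, 2048), torch.randn(1, 1024))
print(out.shape)  # torch.Size([1, 1000])
```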
Multimodal AI models have many potential applications. A medical AI model could communicate through images as well as text, taking in not only symptoms described by patients or doctors as text, but also patient records and medical images such as X-rays. Another example of multimodal AI is in autonomous vehicles, which need to combine input from cameras, LIDAR and radar for navigation. Customer service chatbots could help customers who have issues with a purchased product not just through a chat window, but also by examining photos of the problem and potentially reacting to verbal descriptions of the issue. Marketing systems that currently produce sentiment analysis for a brand by analysing social media posts could be extended to analysing discussions of the brand in videos uploaded to YouTube or similar platforms. Several popular AI models already have multimodal capabilities, including Google’s Gemini 2.5 Flash, Anthropic’s Claude 3.7 and OpenAI’s GPT-4o.
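To make this concrete, here is a rough sketch of what sending a mixed text-and-image request to one of these models can look like, using OpenAI’s Python SDK and GPT-4o as the example. The image URL is a placeholder, and the exact request format and parameters vary by vendor and may change across SDK versions.

```python
# Sketch of a multimodal request: a text question plus an image of car damage,
# in the style of the insurance claims example earlier in this article.
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "What damage can you see on this vehicle?"},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/car-damage.jpg"}},  # placeholder URL
            ],
        }
    ],
)
print(response.choices[0].message.content)
```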
This extra capability has a price: multimodal AI models are more expensive per token than text-based LLMs. All the limitations and issues that afflict text LLMs, such as hallucinations, bias and security concerns, also apply to multimodal AI. Indeed, the extra level of complexity in multimodal AI means that these issues may be amplified. For example, an error in one form of input can cascade through to the combined output, potentially causing more severe hallucinations. Multimodal AI models also present a larger attack surface for hackers, who can embed nefarious prompts in images, and the models may be more prone to generating harmful content. Work by security company Enkrypt AI in May 2025 found that multimodal models were forty times more likely to generate information about chemical weapons than traditional models, whose safeguards are supposed to prevent them from providing such sensitive information. This suggests that, at the very least, multimodal AI models need significantly more stringent guardrails and stress testing than traditional text-based LLMs.
Although it is still fairly early days for multimodal AI, there are clearly genuine opportunities for this newer kind of AI model, as outlined in this article. As in other areas of AI, the rapid pace of development means that capabilities are being released so quickly that not all the issues and risks have been fully considered and worked through. In particular, the fact that the guardrails of multimodal AI models are dramatically easier to circumvent than those of traditional text LLMs is a concern. These new abilities make AI models more intuitive to communicate with than before, but the additional power comes with additional responsibility: AI vendors need to double down on security and guardrails when it comes to multimodal AI models.







