One of the key aspects of intelligence is the ability to quickly learn how to perform a new task when given a brief instruction. For example, a child may recognize real animals at the zoo after looking at a few pictures of animals in a book, despite the differences between the two. But for a typical visual model to learn a new task, it must be trained on tens of thousands of examples specifically annotated for that task. If the goal is to count and identify animals in images, such as “three zebras,” one would need to collect thousands of images and annotate each with the quantity and species. This process is inefficient, expensive, and resource-intensive, requiring large amounts of annotated data and the training of a new model each time a new task is encountered. As part of DeepMind’s mission to solve intelligence, we investigated whether an alternative model could make this process easier and more efficient, given only limited task-specific information.
In our paper, published today as a preprint, we introduce Flamingo, a single visual language model (VLM) that sets a new state of the art in few-shot learning on a wide range of open-ended multimodal tasks. This means that Flamingo can solve many difficult problems with just a handful of task-specific examples (in a “few shots”), without any additional training. Flamingo’s simple interface makes this possible: it takes as input a prompt consisting of interleaved images, videos, and text, and then outputs the associated language.
Similar to the behavior of large language models (LLMs), which can handle a language task by processing a few examples of it given in their text prompt, Flamingo’s visual and textual interface can steer the model toward solving a multimodal task. Given a few example pairs of visual inputs and expected text responses composed into Flamingo’s prompt, the model can be asked a question about a new image or video and then generate an answer.
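To make this interface concrete, the sketch below shows how such an interleaved few-shot prompt could be assembled in Python. The `ImageSegment`/`TextSegment` structures, the `build_few_shot_prompt` helper, and the final `generate` call mentioned in the comments are hypothetical illustrations of the prompt format described above, not part of a released Flamingo API.

```python
# A minimal sketch of assembling a few-shot multimodal prompt for a Flamingo-style
# model. The data structures and helper here are hypothetical, for illustration only.
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class ImageSegment:
    path: str  # path or URL of an image (or a video clip)


@dataclass
class TextSegment:
    text: str


Segment = Union[ImageSegment, TextSegment]


def build_few_shot_prompt(support_examples: List[Tuple[str, str]],
                          query_image: str) -> List[Segment]:
    """Interleave (image, expected text) support pairs, then append the query image."""
    prompt: List[Segment] = []
    for image_path, expected_text in support_examples:
        prompt.append(ImageSegment(image_path))
        prompt.append(TextSegment(expected_text))
    # The query image comes last; the model is expected to continue with text.
    prompt.append(ImageSegment(query_image))
    return prompt


# Usage: four support examples, then a new image the model has never seen.
support = [
    ("zebra_1.jpg", "Output: Three zebras."),
    ("elephant_1.jpg", "Output: One elephant."),
    ("lion_1.jpg", "Output: Two lions."),
    ("giraffe_1.jpg", "Output: Four giraffes."),
]
prompt = build_few_shot_prompt(support, query_image="new_animal.jpg")
# A hypothetical model.generate(prompt) would then return the text continuation,
# e.g. "Output: Two flamingos." -- with no gradient updates involved.
```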
Across the 16 tasks we studied, Flamingo outperformed all previous few-shot learning approaches when given as few as four examples per task. In several cases, the same Flamingo model outperformed methods that were fine-tuned and optimized for each task independently and used far more task-specific data. This should allow non-experts to quickly and easily apply accurate visual language models to new tasks.
In practice, Flamingo fuses large language models with powerful visual representations, each separately pre-trained and frozen, by adding novel architectural components in between. It is then trained on a mixture of complementary large-scale multimodal data coming only from the web, without using any data annotated for machine learning purposes. Following this method, we start from Chinchilla, our recently introduced compute-optimal 70B parameter language model, and train the final Flamingo model, an 80B parameter VLM. Once this training is complete, Flamingo can be adapted directly to vision tasks via simple few-shot learning, without any additional task-specific tuning.
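As a rough illustration of this design, the PyTorch sketch below places a trainable, tanh-gated cross-attention block in front of a frozen language-model block, so that text activations can attend to visual features while the pre-trained weights stay untouched. The dimensions, layer counts, and class names are illustrative simplifications, not the actual 80B-parameter implementation.

```python
# A minimal sketch of the core idea: keep the pre-trained language model frozen and
# train only new cross-attention components inserted between its blocks.
import torch
import torch.nn as nn


class GatedCrossAttentionBlock(nn.Module):
    """Trainable block letting frozen LM activations attend to visual features.

    Tanh gates initialized at zero mean the model starts out behaving exactly like
    the original language model and gradually learns to make use of vision.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffw = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.attn_gate = nn.Parameter(torch.zeros(1))
        self.ffw_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_tokens: torch.Tensor, visual_tokens: torch.Tensor) -> torch.Tensor:
        attended, _ = self.attn(text_tokens, visual_tokens, visual_tokens)
        x = text_tokens + torch.tanh(self.attn_gate) * attended
        return x + torch.tanh(self.ffw_gate) * self.ffw(x)


class ToyFlamingoLayer(nn.Module):
    """One frozen LM block preceded by a trainable gated cross-attention block."""

    def __init__(self, dim: int):
        super().__init__()
        self.cross_attn = GatedCrossAttentionBlock(dim)   # trainable
        self.lm_block = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        for p in self.lm_block.parameters():              # frozen pre-trained weights
            p.requires_grad = False

    def forward(self, text_tokens: torch.Tensor, visual_tokens: torch.Tensor) -> torch.Tensor:
        return self.lm_block(self.cross_attn(text_tokens, visual_tokens))


# Usage with dummy tensors: batch of 2, 16 text tokens, 64 visual tokens.
dim = 512
layer = ToyFlamingoLayer(dim)
text = torch.randn(2, 16, dim)
vision = torch.randn(2, 64, dim)   # in practice, features from a frozen vision encoder
print(layer(text, vision).shape)   # torch.Size([2, 16, 512])
```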
We also tested the qualitative capabilities of Flamingo beyond our current benchmarks. As part of this process, we compared the model’s performance when captioning images related to gender and skin color, and ran the captions it generated through Google’s Perspective API, which evaluates the toxicity of text. While the initial results are positive, more research into assessing the ethical risks of multimodal systems is crucial, and we urge that these issues be evaluated and considered carefully before such systems are deployed in the real world.
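For readers who want to run a similar check on their own generated captions, the sketch below scores text with the Perspective API’s TOXICITY attribute. The request format follows the public Perspective API documentation; the API key and example captions are placeholders, and this is only one narrow slice of the broader evaluation described above.

```python
# A sketch of scoring generated captions for toxicity with Google's Perspective API.
import requests

PERSPECTIVE_URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"
API_KEY = "YOUR_API_KEY"  # placeholder: obtain a key via the Perspective API console


def toxicity_score(caption: str) -> float:
    """Return the Perspective API TOXICITY probability for a generated caption."""
    body = {
        "comment": {"text": caption},
        "languages": ["en"],
        "requestedAttributes": {"TOXICITY": {}},
    }
    response = requests.post(PERSPECTIVE_URL, params={"key": API_KEY}, json=body, timeout=10)
    response.raise_for_status()
    return response.json()["attributeScores"]["TOXICITY"]["summaryScore"]["value"]


# Usage: score each model-generated caption and flag any above a chosen threshold.
captions = ["A person smiling at the camera.", "A group of friends at the beach."]
flagged = [c for c in captions if toxicity_score(c) > 0.5]
```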
Multimodal capabilities are essential for important AI applications, such as helping visually impaired people with everyday visual challenges or improving the identification of hateful content on the web. Flamingo makes it possible to adapt to these examples and other tasks quickly and efficiently, on the fly, without modifying the model. Interestingly, the model also demonstrates out-of-the-box multimodal dialogue capabilities, as seen here.
Flamingo is an effective and efficient general-purpose family of models that can be applied to image and video understanding tasks with only a minimal number of task-specific examples. Models like Flamingo hold great promise to benefit society in practical ways, and we are continuing to improve their flexibility and capabilities so that they can be safely deployed for the benefit of all. Flamingo’s abilities pave the way toward richer interactions with learned visual language models, with better interpretability and exciting new applications, such as a visual assistant that helps people in their daily lives. We are very excited by the results so far.