In this post, we review the literature on Vision-Language Models applied to fine-grained images (documents).

## What are VLMs?

Vision-Language Models (VLMs), also known as "Multimodal Models" (with image and text as modalities), take as input an image (a natural image, a document page, or multiple pages for multi-page document models in our case) and a prompt (a question for QA, an instruction, or nothing for single-task models). These models process the inputs with a decoder (a language model) and return an output in natural language: the answer to a question for QA tasks, a response to an instruction for instruction-type inputs, a class for classification tasks, a JSON of extracted entities for entity extraction tasks, etc.

The architecture of VLMs includes an image encoder and a language model decoder, which takes both the image representation and the text input (such as a question for QA or an instruction for instructional tasks). A projection layer (also called the "Vision-Language Connector") is placed between the visual representation and the language model to convert the visual data into a format understandable by the LLM. This projection layer is trained with objectives such as cross-attention, masked language modeling, and image-text matching to link visual semantics with textual representations. A minimal code sketch of this pipeline is given below, after the figures.

This VLM structure is presented in this image from [Llava's paper](https://arxiv.org/pdf/2304.08485): ![](/literature_review/doc_ai/vlm/llava.png)

A concrete example is given in [Llava1.5's paper](https://arxiv.org/pdf/2310.03744): ![](/literature_review/doc_ai/vlm/llava1,5.png)

Many VLMs have been released recently. Here is a timeline of the main VLM releases, depicted in [LLaVA-Next's paper](https://arxiv.org/pdf/2406.16860):

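To make this pipeline concrete, here is a minimal sketch (in PyTorch) of the fully autoregressive VLM forward pass described above: a vision encoder produces patch features, a projection layer maps them into the LLM embedding space, and the projected visual tokens are concatenated with the text embeddings before being fed to the decoder. The module names, dimensions, and the `embed_tokens` / `inputs_embeds` interface are assumptions for illustration, not the implementation of any specific model.

```python
import torch
import torch.nn as nn

class MinimalVLM(nn.Module):
    """Illustrative LLaVA-style VLM: vision encoder -> projector -> LLM decoder."""
    def __init__(self, vision_encoder, llm, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g. a CLIP ViT, frozen or fine-tuned
        self.llm = llm                        # a decoder-only language model
        # "Vision-Language Connector": here a simple 2-layer MLP projection
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image, text_token_ids):
        # 1) Encode the image into a grid of patch features, e.g. (B, 576, vision_dim)
        patch_features = self.vision_encoder(image)
        # 2) Project the visual features into the LLM embedding space ("visual tokens")
        visual_tokens = self.projector(patch_features)       # (B, 576, llm_dim)
        # 3) Embed the text prompt and prepend the visual tokens (assumed HF-style API)
        text_embeds = self.llm.embed_tokens(text_token_ids)  # (B, T, llm_dim)
        inputs_embeds = torch.cat([visual_tokens, text_embeds], dim=1)
        # 4) The LLM autoregressively generates the answer from the joint sequence
        return self.llm(inputs_embeds=inputs_embeds)
```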
## What are Fine-Grained Images?

Fine-grained images (in this context, documents) are images that contain text and numerous details (graphs, charts with text, etc.). In such images, every detail (each character, each element) is crucial, so the images must be processed at high resolution, as opposed to natural images. An example of a document understanding task on fine-grained documents is presented in [Fox's paper](https://arxiv.org/pdf/2405.14295): ![](/literature_review/doc_ai/vlm/fine_grained_doc_ai_fox.png)

## Organisation of this blog post

We can classify Vision-Language Models handling fine-grained images into three categories: those using a low-grained vision model with an LLM as decoder (1.), those using both a fine-grained vision model and an LLM as decoder through various techniques (2.), and those using a fine-grained vision model but a small language decoder (3.).

## 1. Models using a low-grained vision model and an LLM as decoder

These models are classic vision-language models (not specialized for fine-grained images). To be capable of understanding fine-grained images, some of these classic vision-language models were trained on text-rich images.

Classic vision-language models work with a projection layer between the vision encoder and the LLM. Indeed, for the LLM to effectively understand and interact with the visual encoder's output, the representations generated by the encoder must be in a format or context that is understandable to the LLM: the visual data must be transformed into a representation that makes sense in the linguistic domain. That is why an intermediary (a projection matrix) is added between the visual encoder and the LLM decoder. This projection layer semantically aligns visual representations with textual representations, so that similar visual concepts are mapped to nearby regions of the LLM's feature space, facilitating the understanding and generation of language in relation to visual inputs.

### Can we take a pretrained vision encoder and LLM, or should we fine-tune them to construct a VLM? And which vision encoder / LLM should we use?

Traditional VLMs use a pretrained ViT as vision encoder: either CLIP-ViT-H/L ([InternLM-XComposer2-4KHD](https://arxiv.org/pdf/2404.06512), [VL-Mamba](https://arxiv.org/pdf/2403.13600), [Ferret-UI](https://arxiv.org/pdf/2404.05719), [Vary](https://arxiv.org/pdf/2312.06109), [Llava-HR](https://arxiv.org/pdf/2403.03003), [Llava-UHD](https://arxiv.org/pdf/2403.11703), [UReader](https://arxiv.org/pdf/2310.05126), [UniDoc](https://arxiv.org/pdf/2308.11592), [LLaVAR](https://arxiv.org/pdf/2306.17107), [mPLUG-Owl](https://arxiv.org/pdf/2304.14178), [Llava](https://arxiv.org/pdf/2304.08485), [LaRA](https://arxiv.org/pdf/2406.06730)), CLIP-ViT-BigG ([QwenVL](https://arxiv.org/pdf/2308.12966), [Monkey](https://arxiv.org/pdf/2311.06607)), EVA-CLIP ([BLIP2](https://arxiv.org/pdf/2301.12597), [MiniGPT4](https://arxiv.org/pdf/2304.10592), [CogAgent](https://arxiv.org/pdf/2312.08914)), SigLIP ([Tinychart](https://arxiv.org/pdf/2404.16635), [TextHawk](https://arxiv.org/pdf/2404.09204), [Idefics2](https://arxiv.org/pdf/2405.02246)), NFNet ([Flamingo](https://arxiv.org/pdf/2204.14198)), or a Swin Transformer ([DocPedia](https://arxiv.org/pdf/2311.11810), [DocParser](https://arxiv.org/pdf/2304.12484), [DONUT](https://arxiv.org/pdf/2111.15664), [Nougat](https://arxiv.org/pdf/2308.13418)).
Traditional VLMs use a pretrained LLM as decoder, usually with 7B or 13B parameters, like Llama1-2 ([UReader](https://arxiv.org/pdf/2310.05126), [mPLUG-DocOwl1.5](https://arxiv.org/pdf/2403.12895), [Llava-HR](https://arxiv.org/pdf/2403.03003), [mPLUG-PaperOwl](https://arxiv.org/pdf/2311.18248), [mPLUG-DocOwl](https://arxiv.org/pdf/2307.02499)), InternLM1-2 ([InternLM-XComposer2-4KHD](https://arxiv.org/pdf/2404.06512), [TextHawk](https://arxiv.org/pdf/2404.09204)), Vicuna ([Ferret-UI](https://arxiv.org/pdf/2404.05719), [Llava-UHD](https://arxiv.org/pdf/2403.11703), [DocPedia](https://arxiv.org/pdf/2311.11810), [LLaVAR](https://arxiv.org/pdf/2306.17107), [MiniGPT4](https://arxiv.org/pdf/2304.10592), [Llava](https://arxiv.org/pdf/2304.08485), [LaRA](https://arxiv.org/pdf/2406.06730)), Mistral ([Idefics2](https://arxiv.org/pdf/2405.02246)), Phi-2 ([Tinychart](https://arxiv.org/pdf/2404.16635)), OPT ([BLIP2](https://arxiv.org/pdf/2301.12597), [Vary](https://arxiv.org/pdf/2312.06109)), Qwen ([Monkey](https://arxiv.org/pdf/2311.06607), [TextMonkey](https://arxiv.org/pdf/2403.04473), [QwenVL](https://arxiv.org/pdf/2308.12966)) or FlanT5XXL ([BLIP2](https://arxiv.org/pdf/2301.12597)), typically in their chat / instruct versions. The paper [What matters when building vision-language models?](https://arxiv.org/pdf/2405.02246) has shown that, for a fixed number of parameters, the quality of the language model has a higher impact on the performance of the final VLM than the quality of the vision encoder.

### How does the projection layer work?

As shown in the paper [What matters when building vision-language models?](https://arxiv.org/pdf/2405.02246), there are two types of projection layers: (1) cross-attention architectures, in which the encoding of the image is injected at different layers within the language model by interleaving cross-attention blocks in which the text cross-attends to the image hidden states, and (2) fully autoregressive architectures, in which the encoding of the image is directly concatenated to the sequence of text embeddings, and the entire sequence is passed as input to the language model.

### Models with a projection layer defined as a cross-attention architecture

The cross-attention architecture (1) is exemplified by the Perceiver Resampler layer, implemented in [Flamingo](https://arxiv.org/pdf/2204.14198), [Kosmos 2.5](https://arxiv.org/pdf/2309.11419), and [Monkey](https://arxiv.org/pdf/2311.06607) (as a "shared resampler"), which uses cross-attention layers between text (queries) and visual tokens (keys and values), as depicted in the following figure from [Flamingo's paper](https://arxiv.org/pdf/2204.14198):

The Q-Former layer also belongs to this category; it is implemented in [BLIP-2](https://arxiv.org/pdf/2301.12597), [MiniGPT-4](https://arxiv.org/pdf/2304.10592) and [InstructDr](https://arxiv.org/pdf/2401.13313) (as a "Document-Former"). It extracts relevant features from the image representation with learnable queries and adds the input text representation through an attention mechanism. The result is then used as input to the LLM, as presented in the following figure from [BLIP-2's paper](https://arxiv.org/pdf/2301.12597): ![Q-Former](/literature_review/doc_ai/vlm/qformer.png). The Partial LoRA module also belongs to this category; it is implemented in [InternLM-XComposer2](https://arxiv.org/pdf/2401.16420) and [InternLM-XComposer2-4KHD](https://arxiv.org/pdf/2404.06512).

### Models with a projection layer defined as a fully autoregressive architecture

The fully autoregressive architecture (2) is exemplified by the linear projection layers (MLPs) of [Llava](https://arxiv.org/pdf/2304.08485), [LLaVAR](https://arxiv.org/pdf/2306.17107), [UniDoc](https://arxiv.org/pdf/2308.11592), [DocPedia](https://arxiv.org/pdf/2311.11810), [Vary](https://arxiv.org/pdf/2312.06109), [Tinychart](https://arxiv.org/pdf/2404.16635), [InternLM-XComposer2-4KHD](https://arxiv.org/pdf/2404.06512) and [Idefics2](https://arxiv.org/pdf/2405.02246), which linearly project the visual representation into another space; this transformed representation is then concatenated with the textual input and fed into the language model. It also appears in the Visual Abstractor layer, implemented in [mPLUG-Owl](https://arxiv.org/pdf/2304.14178), [mPLUG-DocOwl](https://arxiv.org/pdf/2307.02499), [mPLUG-PaperOwl](https://arxiv.org/pdf/2311.18248) and [UReader](https://arxiv.org/pdf/2310.05126), which concatenates a selection of image patches (obtained by adding learnable tokens that interact with the visual patches through a cross-attention mechanism) to the textual tokens, as well as in the H-Reducer layer, implemented in [mPLUG-DocOwl1.5](https://arxiv.org/pdf/2403.12895), which concatenates a reduced representation of the encoded image (obtained with convolutions) to the text representation; the concatenated result is then fed to the LLM.

The paper [What matters when building vision-language models?](https://arxiv.org/pdf/2405.02246) has shown that the cross-attention architecture (1) performs better than the fully autoregressive one (2) when the pretrained vision and language models are kept frozen. However, when fine-tuning them, the fully autoregressive architecture outperforms the cross-attention one, even though the latter has more parameters. Below, after a short illustrative sketch of a cross-attention-style connector, are some examples of classic vision-language models:
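As a counterpart to the MLP connector sketched earlier, here is a minimal sketch of a resampler/Q-Former-style connector: a small set of learnable query tokens cross-attends to the visual features and compresses them into a short, fixed-length sequence of tokens for the LLM. Dimensions, the number of queries, and the module structure are illustrative assumptions, not the exact Flamingo or BLIP-2 implementation (Flamingo additionally interleaves gated cross-attention blocks inside the language model).

```python
import torch
import torch.nn as nn

class MiniResampler(nn.Module):
    """Illustrative resampler: learnable queries cross-attend to visual features
    and compress them into a fixed-length sequence (here 64 tokens)."""
    def __init__(self, vision_dim=1024, hidden_dim=1024, num_queries=64, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, hidden_dim) * 0.02)
        self.vis_proj = nn.Linear(vision_dim, hidden_dim)  # align visual features
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.LayerNorm(hidden_dim),
            nn.Linear(hidden_dim, 4 * hidden_dim),
            nn.GELU(),
            nn.Linear(4 * hidden_dim, hidden_dim),
        )

    def forward(self, visual_features):                      # (B, num_patches, vision_dim)
        batch = visual_features.size(0)
        kv = self.vis_proj(visual_features)                  # keys/values from the image
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)  # learned latent queries
        out, _ = self.cross_attn(q, kv, kv)                  # (B, num_queries, hidden_dim)
        return out + self.ffn(out)                           # fixed-length visual tokens
```

Whatever connector is used, its output tokens are then consumed by the LLM, either through interleaved cross-attention blocks (cross-attention architecture) or by concatenation with the text embeddings (fully autoregressive architecture).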
Matryoshka Multimodal Models 2024-05-27 Mu Cai, Jianwei Yang, Jianfeng Gao, Yong Jae Lee
Large Multimodal Models (LMMs) such as LLaVA have shown strong performance in visual-linguistic reasoning. These models first embed images into a fixed large number of visual tokens and then feed them into a Large Language Model (LLM). However, this design causes an excessive number of tokens for dense visual scenarios such as high-resolution images and videos, leading to great inefficiency. While token pruning/merging methods do exist, they produce a single length output for each image and do not afford flexibility in trading off information density v.s. efficiency. Inspired by the concept of Matryoshka Dolls, we propose M3: Matryoshka Multimodal Models, which learns to represent visual content as nested sets of visual tokens that capture information across multiple coarse-to-fine granularities. Our approach offers several unique benefits for LMMs: (1) One can explicitly control the visual granularity per test instance during inference, e.g. , adjusting the number of tokens used to represent an image based on the anticipated complexity or simplicity of the content; (2) M3 provides a framework for analyzing the granularity needed for existing datasets, where we find that COCO-style benchmarks only need around ~9 visual tokens to obtain accuracy similar to that of using all 576 tokens; (3) Our approach provides a foundation to explore the best trade-off between performance and visual token length at sample level, where our investigation reveals that a large gap exists between the oracle upper bound and current fixed-scale representations.
# Introduction

Large Multimodal models (LMMs) [GPT4V_System_Card](https://cdn.openai.com/papers/GPTV_System_Card.pdf), [liu2023llava](http://arxiv.org/pdf/2402.11690v1), [zhu2023minigpt](http://arxiv.org/pdf/2402.17510v1), [liu2024llavanext](https://llava-vl.github.io/blog/2024-01-30-llava-next/), [liu2023improvedllava](http://arxiv.org/pdf/2310.19145v1), [wang2023cogvlm](https://arxiv.org/pdf/2311.03079), [Qwen-VL](http://arxiv.org/pdf/2308.12966v3) have shown strong performance in visual-linguistic understanding and reasoning. Models such as LLaVA [liu2023llava](http://arxiv.org/pdf/2402.11690v1), [liu2023improvedllava](http://arxiv.org/pdf/2310.19145v1), [liu2024llavanext](https://llava-vl.github.io/blog/2024-01-30-llava-next/) first embed the input image with a fixed number of visual tokens, and then feed them as prefix tokens to a Large Language Model (LLM) [Vicuna](https://vicuna.lmsys.org/), [llama-3](https://ai.meta.com/blog/meta-llama-3/) to reason about the input image. Similar model designs are borrowed in video LMMs [lin2023video](http://arxiv.org/pdf/2311.10122v2), [zhang2023video](http://arxiv.org/pdf/2311.12919v2), where each frame contributes a fixed number of tokens to form the final video representation. In reality, the number of visual tokens can be prohibitively large in the case of high-resolution images, and even more so for long videos. Existing works [lin2023video](http://arxiv.org/pdf/2311.10122v2), [liu2024llavanext](https://llava-vl.github.io/blog/2024-01-30-llava-next/), [zhang2024llavanextvideo](https://llava-vl.github.io/blog/2024-04-30-llava-next-video/), [geminiteam2024gemini](https://arxiv.org/pdf/2312.11805) mainly tackle this issue by increasing the input context length and consequently feeding a large number (~8k) of visual tokens into the LLM. This approach has a couple of significant drawbacks: (1) the extremely long context makes both training and inference inefficient; (2) an excessive number of visual tokens can actually *harm* the LMM’s performance, distracting it from attending to the relevant information, as we show in Sec. [sec:exp:video understanding]. Several recent works [bolya2023tome](None), [chen2024image-fastv](http://arxiv.org/pdf/2403.06764v2), [shang2024LLaVA-PruMerge](http://arxiv.org/pdf/2403.15388v5) use heuristics to prune and merge visual tokens to reduce the sequence length. However, they produce a single-length output and *do not afford control over the final sequence length*, which could be useful to trade information density versus efficiency while accounting for resource constraints in the deployment phase.
We enforce the coarser set of visual tokens $\mathbf{X}_{S_{i-1}}$ to be derived from the finer level of visual tokens $\mathbf{X}_{S_i}$. As a result, the granularity of Matryoshka visual tokens gradually changes in a controllable manner. The image is from the MSCOCO validation set.
Images and videos naturally exhibit a hierarchical structure from coarse to fine details, and our human visual system has evolved to recognize visual information in this coarse to fine manner, as shown by biologists and psychologists decades ago [harris2000coarse](http://arxiv.org/pdf/2208.13560v1), [hegde2008time](http://arxiv.org/pdf/2108.02839v1). Can we create a similar structure for LMMs, where within one suite of model weights, the visual content tokens are organized into different scales of granularities? Conceptually, our goal is to learn the visual tokens to have a nested structure, similar to the Matryoshka Doll [kusupati2022matryoshka](http://arxiv.org/pdf/2405.17430v1). Matryoshka Representation Learning (MRL) [kusupati2022matryoshka](http://arxiv.org/pdf/2405.17430v1) builds the Matryoshka mechanism over a neural network’s representation vector, where each of the segments with various feature dimensions is capable of handling tasks like classification or retrieval. However, for LMMs, the inefficiency mainly comes from the number of tokens. Thus, inspired by, but different from MRL, our work is motivated to build upon the *token length dimension*, so that we can flexibly adjust it.
Specifically, we propose $M^3$: Matryoshka Multimodal Models, which enforces an LMM to learn a hierarchy of visual representation granularities at the token sequence level, instead of the feature dimension level as in MRL [kusupati2022matryoshka](http://arxiv.org/pdf/2405.17430v1). With this representation, at inference time, the visual granularity can be *flexibly controlled* based on specific requirements, e.g., to account for the input image’s information density and efficiency constraints. Our training process is simple and straightforward. During training, we encode the image into $M$ sets of visual tokens from coarse to fine, $\mathbf{X} _{S_i}$, $i = 1, \cdots, M$, where the number of visual tokens gradually increases, $|\mathbf{X}_{S_{i-1}} | < |\mathbf{X}_{S_i}|$. And importantly, the visual tokens in a coarser level are derived from the visual tokens in a finer level, $\mathbf{X}_{S_{i-1}} \subset \mathbf{X}_{S_i}$, $\forall i$. In this way, the visual information in $[ {\mathbf{X}} _{S_1}, {\mathbf{X}} _{S_2}, \cdots, {\mathbf{X}} _{S_M}]$ gradually includes more fine-grained details. For example, given a natural image as shown in Figure 1, $\mathbf{X} _{S_1}$ includes high-level semantics such as the restaurant and girl, while $\mathbf{X} _{S_M}$ includes more details such as the Pepsi cup and white paper bag. All other training settings, such as the loss function and model architecture, are kept the same as LLaVA [liu2023llava](http://arxiv.org/pdf/2402.11690v1), [liu2023improvedllava](http://arxiv.org/pdf/2310.19145v1), [liu2024llavanext](https://llava-vl.github.io/blog/2024-01-30-llava-next/). Our approach, $M^3$, introduces several novel properties and benefits for LMMs. First, our approach can adaptively and efficiently represent visual content. Under *one suite of weights*, it generates multiple nested sets of visual tokens with different granularities in information density. This enables flexibility in the number of visual tokens used for any image during inference, enabling control over the best tradeoff between cost and performance based on the image or video content. For example, one can use all visual tokens for images with dense details and use just a few tokens for simpler images. This flexibility can be particularly significant when handling very long visual sequences, such as videos. For instance, given a fixed budget of 2880 visual tokens, a user could represent a video of 2880 frames each with one token or represent the same video by sampling 5 frames each with 576 tokens. Second, our approach can be used as a general framework to evaluate the visual complexity of vision-language datasets or benchmarks, i.e., which level of granularity is needed in order to perform the given task correctly. Surprisingly, we find that most benchmarks, especially those mainly crafted from natural scenes (such as COCO) [goyal2017vqav2](http://arxiv.org/pdf/1612.00837v3), [li2023pope](http://arxiv.org/pdf/2402.15721v1), [liu2023mmbench](http://arxiv.org/pdf/2005.12661v2), can be handled well with only $\sim9$ tokens per image. In contrast, dense visual perception tasks such as document understanding or OCR [singh2019textvqa](http://arxiv.org/pdf/1811.11903v1), [masry-etal-2022-chartqa](https://doi.org/10.18653/v1/2022.findings-acl.177) require a greater amount of tokens ($144-576$ tokens) per image to handle the task well. The detailed findings are presented in Sec. [sec:exp:Image Understanding].
Finally, our approach provides a foundation to tackle a critical task in LMMs: *How to use the least amount of visual tokens while answering the visual questions correctly?* Based on the model’s predictions on the test set, we find that compared to full visual tokens, the oracle can use far fewer tokens while performing much better. For example, under six common LMM benchmarks used in LLaVA-NeXT [liu2024llavanext](https://llava-vl.github.io/blog/2024-01-30-llava-next/), the oracle with the trained $M^3$ model can use as few as 8.9 visual tokens on average to achieve performance that is 8 percentage points better than LLaVA-NeXT, which uses 576 tokens per image grid. This indicates that there is a large room for improvement compared to the oracle upper bound, as we show in Sec. [sec:exp:Image Understanding]. To enable further research on adaptive LMMs that learn diverse information granularities, we publicly release our code and models.

# Related Work

**Large Multimodal Models.** Large Language Models (LLMs) like ChatGPT [chatgpt](https://openai.com/blog/chatgpt/), GPT-4 [gpt4](http://arxiv.org/pdf/2311.15732v2), and LLaMA [touvron2023LLaMA](touvron2023LLaMA) have demonstrated impressive reasoning and generalization capabilities for text. The landscape of LLMs has been significantly transformed by the recent introduction of models that also incorporate visual information, such as GPT-4V(ision) [GPT4V_System_Card](https://cdn.openai.com/papers/GPTV_System_Card.pdf). Building upon open-source LLMs [touvron2023LLaMA](touvron2023LLaMA), [Vicuna](https://vicuna.lmsys.org/), a plethora of multimodal models have made significant strides, spearheaded by models like LLaVA [liu2023llava](http://arxiv.org/pdf/2402.11690v1), [liu2023improvedllava](http://arxiv.org/pdf/2310.19145v1) and MiniGPT-4 [zhu2023minigpt](http://arxiv.org/pdf/2402.17510v1), which combine LLaMA’s [touvron2023LLaMA](touvron2023LLaMA) language capabilities with a CLIP [radford2021learning](http://arxiv.org/pdf/2404.19696v1) based image encoder. Recently, LMMs on more tasks and modalities have emerged, such as region level LMMs [cai2024vipllava](http://arxiv.org/pdf/2312.00784v2), [zhang2023gpt4roi](http://arxiv.org/pdf/2309.12109v1), [chen2023shikra](http://arxiv.org/pdf/2306.15195v2), [peng2023kosmos](http://arxiv.org/pdf/2305.16103v1), [zhang2023llavagrounding](https://arxiv.org/pdf/2312.02949), 3D LMMs [3dllm](http://arxiv.org/pdf/2403.09631v1), and video LMMs [lin2023video](http://arxiv.org/pdf/2311.10122v2), [zhang2023video](http://arxiv.org/pdf/2311.12919v2), [zhang2024llavanextvideo](https://llava-vl.github.io/blog/2024-04-30-llava-next-video/). However, existing LMMs typically represent the visual content with a large and fixed number of tokens, which makes it challenging to scale to very long visual sequences such as high-resolution images or long-form videos. In this work, we propose to adaptively and efficiently represent the visual content by learning multiple nested sets of visual tokens, providing flexibility in the number of visual tokens used for any image during inference. **Matryoshka Representation Learning.** Matryoshka Representation Learning (MRL) [kusupati2022matryoshka](http://arxiv.org/pdf/2405.17430v1) addresses the need for flexible representations that can adapt to multiple downstream tasks with varying computational resources. This approach, inspired by the nested nature of Matryoshka dolls, encodes information at different granularities within the same high-dimensional feature vector produced by a neural network.
The adaptability of MRL extends across different modalities, including vision (ResNet [he2016deep](http://arxiv.org/pdf/1608.05895v1), ViT [dosovitskiy2020vit](http://arxiv.org/pdf/2105.15075v2)), vision + language (ALIGN [jia2021scaling](http://arxiv.org/pdf/2102.05918v2)), and language (BERT [devlin2018bert](http://arxiv.org/pdf/1810.04805v2)), demonstrating its versatility and efficiency. Recent work [li20242d](http://arxiv.org/pdf/1804.10975v1) extends MRL to both the text embedding space and the Transformer layers space. Our approach is inspired by MRL, but instead of learning multiple nested embeddings for a high-dimensional feature vector, we learn *nested visual tokens along the token length dimension* for the visual input. We are the first to show that the idea of Matryoshka learning can enable explicit control over the visual granularity of the visual content that an LMM processes. **Token Reduction.** One of the main causes of inefficiency in recent LMMs is their large number of prefix visual tokens that are fed into the LLM [liu2023llava](http://arxiv.org/pdf/2402.11690v1), [zhu2023minigpt](http://arxiv.org/pdf/2402.17510v1). The quadratic complexity in Transformers [vaswani2017attention](http://arxiv.org/pdf/2107.08000v1) is the key issue in scaling the input sequence length for Transformers. Token reduction serves as an effective technique to reduce computational costs in Transformers. Sparse attention methods such as Linformer [wang2020linformer](https://arxiv.org/pdf/2006.04768) and ReFormer [kitaev2020reformer](https://openreview.net/forum?id=rkgNKkHtvB) conduct attention operations within local windows rather than the full context, thereby reducing the quadratic complexity of the vanilla attention operation. Another notable method is Token Merging (ToMe) [bolya2023tome](None), which utilizes full attention but gradually reduces the number of tokens in each transformer block by selecting the most representative tokens through bipartite matching for the Vision Transformer (ViT). A recent work [Haurum_2023_ICCVW](http://arxiv.org/pdf/2308.04657v1) further studies different families of token reduction methods for ViT. However, prior approaches produce a single length output per input image and do not offer multiple granularities over the reduced token sequence. Our approach instead learns a multi-granularity, coarse-to-fine token representation within the same model architecture and weights, enabling it to easily be adjusted to various computational or memory constraints.

# $M^3$: Matryoshka Multimodal Models [sec:approach]
Architecture of our proposed $M^3$. The visual features from CLIP are represented as several groups of coarse-to-fine visual tokens. At test time, users can explicitly control the granularity of the visual features.
Our goal is to learn a Large Multimodal Model (LMM) that represents visual content as nested sets of visual tokens capturing information across multiple coarse-to-fine granularities, so that one can explicitly control the visual granularity per test instance during inference. Here we introduce how we learn a Matryoshka doll-like token sequence. LMMs such as LLaVA [liu2023llava](http://arxiv.org/pdf/2402.11690v1) typically input a sequence of visual tokens as prefix tokens to the LLM for visual-linguistic reasoning. The visual encoder from pretrained vision-language models, such as CLIP [radford2021learning](http://arxiv.org/pdf/2404.19696v1) and SigLIP [zhai2023sigmoid](http://arxiv.org/pdf/2303.15343v4), is typically utilized to project the images into the set of visual tokens. In particular, the CLIP visual encoder represents an input image $\mathbf{I}$ as an $H\times W$ grid of visual tokens ${\mathbf{X}} _{H\times W}$, where each $\mathbf{X}_i \in \mathbb{R}^{ C}$ is a $C$ dimensional feature vector. Our goal is to learn nested sets of visual tokens $[ {\mathbf{X}} _{S_1}, {\mathbf{X}} _{S_2}, \cdots, {\mathbf{X}} _{S_M}]$ which encode the visual information in a coarse-to-fine manner. To this end, we enforce ${\mathbf{X}} _{S_i} \subset {\mathbf{X}} _{S_{i+1}}, \forall i$. Importantly, we do not introduce any new learnable parameters to the LMM. We instead optimize the CLIP visual encoder to learn the nested visual representation directly, and train the ensuing LLM to adapt to the learned nested set of tokens. For ease of exposition, we consider CLIP-ViT-L-336 [radford2021learning](http://arxiv.org/pdf/2404.19696v1) as the visual encoder, where an image is encoded as $24\times24$ visual tokens (576 total). We create $M$ sets of tokens, e.g., $|S_i| \in \{ 1, 9, 36, 144, 576 \}$, in which the visual tokens at the coarser level are derived directly from those at the finer level. Specifically, given the initial $24\times24$ visual tokens, we sequentially apply $2\times2$ pooling with stride 2, resulting in $12\times12, 6\times6$, and $3\times3$ visual tokens. Finally, we apply $3\times3$ pooling and get the most condensed single visual token. In this way, the sets of Matryoshka visual tokens can gradually preserve the spatial information in the original tokens while simultaneously forming a coarse-to-fine nested representation. We train $M^3$ by averaging the autoregressive next token prediction loss for each scale $S_i$ for each image $\mathbf{I}_i$. Specifically, given a Matryoshka visual representation ${\mathbf{X}} _{S_i}$ for scale $S_i$, we maximize the likelihood of the predicted tokens matching the ground-truth answer $\mathbf{X}_{\mathrm{a}}$: $$P(\mathbf{X}_{\mathrm{a}} \mid {\mathbf{X}}_{S_i}, \mathbf{X}_{\text{q}})=\prod_{j=1}^L P_{\boldsymbol{\theta}}(x_j \mid {\mathbf{X}}_{S_i}, \mathbf{X}_{\text{q}}, \mathbf{X}_{\mathrm{a},<j}).$$ The figure above shows our model architecture. The final objective averages over all $M$ visual token scales: $$\min_{\boldsymbol{\theta}} \frac{1}{M} \sum_{i=1}^M -\log P(\mathbf{X}_{\mathrm{a}} \mid {\mathbf{X}}_{S_i}, \mathbf{X}_{\text{q}}).$$ With this objective function, $M^3$ learns nested sets of visual tokens that gradually include more details with increasing scale. For example, in Figure [fig:detail-specturm-visualization], the smaller set of visual tokens describes the whole scene at a high level while the larger set of visual tokens includes more details such as the Pepsi cup.
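As a rough sketch of the nesting just described (assuming the 24×24 CLIP-ViT-L-336 feature grid and the scales {576, 144, 36, 9, 1}), the coarser token sets can be obtained by repeated average pooling of the finer ones, and the training loss is averaged over all scales. The snippet below is an illustration of this idea with an assumed model interface, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def matryoshka_token_sets(patch_tokens):
    """patch_tokens: (B, 576, C) from CLIP-ViT-L-336, i.e. a 24x24 grid.
    Returns nested sets of 576, 144, 36, 9 and 1 visual tokens, where each
    coarser set is an average-pooled version of the finer one."""
    B, N, C = patch_tokens.shape
    side = int(N ** 0.5)                                     # 24
    grid = patch_tokens.transpose(1, 2).reshape(B, C, side, side)
    sets = [patch_tokens]                                    # 576 tokens
    for _ in range(3):                                       # 12x12, 6x6, 3x3 grids
        grid = F.avg_pool2d(grid, kernel_size=2, stride=2)
        sets.append(grid.flatten(2).transpose(1, 2))
    grid = F.avg_pool2d(grid, kernel_size=3)                 # 3x3 -> 1x1
    sets.append(grid.flatten(2).transpose(1, 2))             # 1 token
    return sets                                              # lengths: 576, 144, 36, 9, 1

def m3_loss(model, patch_tokens, question_ids, answer_ids):
    """Average the autoregressive next-token loss over all Matryoshka scales.
    `model` is a hypothetical LMM that returns an object with a `.loss` field."""
    losses = [model(visual_tokens, question_ids, labels=answer_ids).loss
              for visual_tokens in matryoshka_token_sets(patch_tokens)]
    return torch.stack(losses).mean()
```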
Our training objective allows our model to conduct visual question answering under any granularity during inference. This can be particularly useful in resource-constrained applications; e.g., the visual granularity can be flexibly adjusted based on the anticipated simplicity or complexity of the visual content while taking into account compute and memory constraints.

# Experiments

In this section, we first detail the experiment settings in Sec 1.1. Then we show the performance of $M^3$ on both image-level benchmarks 1.2 and video-level benchmarks 1.3. Finally, we analyze the behavior of $M^3$ and provide ablations in Sec 1.4 and 1.5.

## Experiment Settings [sec:exp:setting]

#### Model We use LLaVA-1.5 [liu2023improvedllava](http://arxiv.org/pdf/2310.19145v1) and LLaVA-NeXT [liu2024llavanext](https://llava-vl.github.io/blog/2024-01-30-llava-next/) as the base LMMs, both with Vicuna 7B as the language model backbone. We finetune the whole model using the exact visual instruction data from LLaVA-1.5 and LLaVA-NeXT, respectively. The learning rate of the LLM is $2\times10^{-5}$ and $1\times10^{-5}$, respectively, for LLaVA-1.5 and LLaVA-NeXT. The learning rate for the visual encoder is $2\times10^{-5}$ for both models. We train both models for 1 epoch using 8 NVIDIA H100 GPUs. Instead of training the language model from scratch, we initialize the language model weights from pre-trained LLaVA-1.5 and LLaVA-NeXT, which empirically works better. We name our models LLaVA-1.5-$M^3$ and LLaVA-NeXT-$M^3$.

#### Visual Token Scales We design 5 scales for the visual tokens. LLaVA-1.5 [liu2023improvedllava](http://arxiv.org/pdf/2310.19145v1) and LLaVA-NeXT [liu2024llavanext](https://llava-vl.github.io/blog/2024-01-30-llava-next/) both leverage CLIP-ViT-L-336 [radford2021learning](http://arxiv.org/pdf/2404.19696v1) as the visual encoder, where an image is embedded into $24\times24$ visual tokens. We gradually apply $2\times2$ pooling with stride 2, resulting in $12\times12, 6\times6$, and $3\times3$ visual tokens, where we finally apply a $3\times3$ pooling to get the final single visual token. Therefore, the sizes of the Matryoshka visual token sets are $S \in \{ 1, 9, 36, 144, 576 \}$, following a nested manner. The efficiency analysis on the system level is shown in Appendix [sec: Efficiency Analysis], where $M^3$ boosts the speed of the LMM prefill process through diminished floating-point operations (FLOPs) and lessens computational memory requirements.

#### Evaluations. For **image understanding**, we evaluate LLaVA-1.5 and LLaVA-NeXT on (a) diverse multimodal benchmarks: POPE [li2023pope](http://arxiv.org/pdf/2402.15721v1), GQA [hudson2019gqa](http://arxiv.org/pdf/2112.05136v1), MMBench [liu2023mmbench](http://arxiv.org/pdf/2005.12661v2), VizWiz [gurari2018vizwiz](http://arxiv.org/pdf/1802.08218v4), SEEDBench [li2023seed](http://arxiv.org/pdf/2311.15759v1), ScienceQA [lu2022learnscienceqa](http://arxiv.org/pdf/2209.09513v2), MMMU [yue2023mmmu](http://arxiv.org/pdf/2311.16502v3), and (b) document understanding/optical character recognition (OCR) benchmarks: DocVQA [mathew2021docvqa](http://arxiv.org/pdf/2111.05547v1), ChartQA [masry-etal-2022-chartqa](https://doi.org/10.18653/v1/2022.findings-acl.177), AI2D [ai2d](http://arxiv.org/pdf/1603.07396v1) and TextVQA [singh2019textvqa](http://arxiv.org/pdf/1811.11903v1).
For **video understanding**, we use both (a) open-ended video question answering benchmarks evaluated by GPT-3.5: MSVD-QA [xu2017video](http://arxiv.org/pdf/1904.04357v1), MSRVTT-QA [xu2017video](http://arxiv.org/pdf/1904.04357v1) and ActivityNet-QA [yu2019activitynet](http://arxiv.org/pdf/1906.02467v1); and (b) multi-choice video question answering benchmarks: NExT-QA [xiao2021next](http://arxiv.org/pdf/2307.04412v1), IntentQA [Li2023IntentQACV](http://arxiv.org/pdf/2002.08945v1), and EgoSchema [mangalam2023egoschema](http://arxiv.org/pdf/2308.09126v1).

## Image Understanding [sec:exp:Image Understanding]

#### LLaVA-1.5-$M^3$

We evaluate LLaVA-1.5-$M^3$ on the common multimodal understanding and reasoning benchmarks. Results are shown in Table 1. LLaVA-1.5-$M^3$ with full tokens maintains the performance of LLaVA-1.5 across diverse benchmarks. More importantly, our approach shows strong performance even with 1 or 9 tokens. Specifically, in MMBench, a comprehensive multimodal understanding benchmark, LLaVA-1.5-$M^3$ with 9 tokens surpasses Qwen-VL-Chat with 256 tokens, and achieves performance similar to Qwen-VL-Chat even with only 1 token. Compared with InstructBLIP [instructblip](http://arxiv.org/pdf/2311.00233v2), LLaVA-1.5-$M^3$ with 9 tokens surpasses InstructBLIP-7B and InstructBLIP-13B across all benchmarks. This demonstrates that our model has both flexibility and strong empirical performance under diverse numbers of visual tokens.
| Model | \# Tokens | MMBench | GQA | POPE | VizWiz | SEEDBench |
|:---|:--:|:--:|:--:|:--:|:--:|:--:|
| Qwen-VL [Qwen-VL](http://arxiv.org/pdf/2308.12966v3) | 256 | 38.2 | 59.3 | \- | 35.2 | 56.3 |
| Qwen-VL-Chat [Qwen-VL](http://arxiv.org/pdf/2308.12966v3) | 256 | 60.6 | 57.5 | \- | 38.9 | 58.2 |
| InstructBLIP-7B [instructblip](http://arxiv.org/pdf/2311.00233v2) | 32 | 36.0 | 49.2 | \- | 34.5 | 53.4 |
| InstructBLIP-13B [instructblip](http://arxiv.org/pdf/2311.00233v2) | 32 | \- | 49.5 | 78.9 | 33.4 | \- |
| LLaVA-1.5-7B [liu2023improvedllava](http://arxiv.org/pdf/2310.19145v1) | 576 | 64.8 | **62.0** | 85.9 | 54.4 | 60.5 |
| LLaVA-1.5-$M^3$ | 576 | 65.9 | 61.9 | **87.4** | **54.9** | **60.6** |
| LLaVA-1.5-$M^3$ | 144 | **66.4** | 61.3 | 87.0 | 53.1 | 59.7 |
| LLaVA-1.5-$M^3$ | 36 | 64.8 | 60.3 | 85.5 | 52.8 | 58.0 |
| LLaVA-1.5-$M^3$ | 9 | 63.1 | 58.0 | 83.4 | 51.9 | 55.4 |
| LLaVA-1.5-$M^3$ | 1 | 59.5 | 52.6 | 78.4 | 49.4 | 50.1 |

Comparison of LLaVA-1.5-$M^3$ with other LMMs across various image understanding benchmarks. LLaVA-1.5-$M^3$ maintains the performance of LLaVA-1.5 while outperforming Qwen-VL and InstructBLIP with fewer tokens.
#### LLaVA-NeXT-$M^3$

We use the proposed $M^3$ approach to finetune LLaVA-NeXT, and compare LLaVA-NeXT-$M^3$ with a baseline trained under a **S**pecific **S**cale of visual tokens, also for 1 epoch. We also include the oracle upper-bound performance: 'Oracle' denotes the case where the best tradeoff between visual tokens and performance is picked, i.e., for each test instance we select the scale with the fewest tokens that can still answer the question correctly (a minimal sketch of this selection rule is given after the table below). Results are shown in Table 2. Our approach, $M^3$, is at least as good as the specific-scale baseline, while performing better on tasks such as document understanding (TextVQA and ChartQA) and common benchmarks such as MMBench [liu2023mmbench](http://arxiv.org/pdf/2005.12661v2).
| \# Tokens Per Grid | Approach | TextVQA | AI2D | ChartQA | DocVQA | MMBench | POPE | ScienceQA | MMMU |
|:---|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
| 576 | Specific Scale | 64.53 | 64.83 | 59.28 | 75.40 | 66.58 | 87.02 | 72.29 | 34.3 |
| 576 | $M^3$ | 63.13 | 66.71 | 58.96 | 72.61 | 67.96 | 87.20 | 72.46 | 34.0 |
| 144 | Specific Scale | 62.16 | 65.77 | 55.28 | 67.69 | 67.78 | 87.66 | 72.15 | 36.4 |
| 144 | $M^3$ | 62.61 | 68.07 | 57.04 | 66.48 | 69.50 | 87.67 | 72.32 | 36.1 |
| 36 | Specific Scale | 58.15 | 65.90 | 45.40 | 56.89 | 67.01 | 86.75 | 71.87 | 36.2 |
| 36 | $M^3$ | 58.71 | 67.36 | 50.24 | 55.94 | 68.56 | 87.29 | 72.11 | 36.8 |
| 9 | Specific Scale | 50.95 | 65.06 | 37.76 | 44.21 | 65.29 | 85.62 | 72.37 | 36.8 |
| 9 | $M^3$ | 51.97 | 66.77 | 42.00 | 43.52 | 67.35 | 86.17 | 71.85 | 35.2 |
| 1 | Specific Scale | 38.39 | 63.76 | 28.96 | 33.11 | 61.43 | 82.83 | 72.32 | 35.3 |
| 1 | $M^3$ | 38.92 | 64.57 | 31.04 | 31.63 | 62.97 | 83.38 | 71.19 | 34.8 |
| Oracle | \# Tokens | 31.39 | 11.54 | 41.78 | 64.09 | 8.90 | 6.08 | 7.43 | 22.85 |
| Oracle | Performance | 70.51 | 76.36 | 70.76 | 81.73 | 74.35 | 94.29 | 76.07 | 50.44 |

Comparison between the baseline trained with a **S**pecific **S**cale of visual tokens and $M^3$ across various benchmarks under LLaVA-NeXT [liu2024llavanext](https://llava-vl.github.io/blog/2024-01-30-llava-next/). Here \# Tokens denotes the number of visual tokens per image grid in LLaVA-NeXT. $M^3$ is at least as good as the specific-scale baseline, while performing better on tasks such as TextVQA, ChartQA, and MMBench. Oracle denotes the case where the best tradeoff between visual tokens and performance is picked.
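The oracle numbers above follow a simple per-sample selection rule: take the smallest token scale whose prediction is already correct, and (an assumption here) fall back to full tokens otherwise. A minimal sketch of this bookkeeping, assuming per-scale correctness flags have already been computed for each test sample:

```python
SCALES = [1, 9, 36, 144, 576]   # visual tokens per image grid, coarse to fine

def oracle_selection(correct_by_scale):
    """correct_by_scale: dict mapping scale -> list[bool], one flag per test sample.
    Returns (oracle accuracy, average number of visual tokens the oracle uses)."""
    num_samples = len(correct_by_scale[SCALES[0]])
    hits, tokens_used = 0, 0
    for i in range(num_samples):
        chosen = SCALES[-1]                  # default: full 576 tokens
        for scale in SCALES:                 # smallest scale that answers correctly
            if correct_by_scale[scale][i]:
                chosen, hits = scale, hits + 1
                break
        tokens_used += chosen
    return hits / num_samples, tokens_used / num_samples
```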
Our results also show that dataset-level biases towards the visual token scales do exist. For example, ScienceQA maintains consistent performance across all visual token scales. AI2D and MMBench only encounter a small performance drop even with as few as 9 or 1 tokens. On the other hand, dense visual perception tasks such as TextVQA and DocVQA show a significant performance drop with fewer tokens. This analysis shows that $M^3$ could serve as a framework to analyze the granularity that a benchmark needs. Furthermore, there is a large gap between the model's actual performance under full tokens and the upper-bound oracle. This indicates that using full tokens does not always result in the optimal performance for all samples; i.e., there is a large room for improvement towards the oracle point.

## Video Understanding [sec:exp:video understanding]

Following IG-VLM [kim2024image](http://arxiv.org/pdf/2403.18406v1), we directly conduct zero-shot inference on diverse video benchmarks using LLaVA-NeXT-$M^3$. Specifically, 6 frames are uniformly sampled over the entire video, then arranged as a collage, which is fed into LLaVA-NeXT along with the question to get the response. Results for LLaVA-NeXT-$M^3$ and recent video LMMs are shown in Table 3. LLaVA-NeXT-$M^3$ with full visual tokens again shows comparable performance with LLaVA-NeXT. More interestingly, the results indicate that full visual tokens usually *do not lead to the best performance* in video understanding tasks. Specifically, on 4 out of 6 benchmarks, full visual tokens show less desirable performance compared to 720 or 180 visual tokens. We suspect that a very long visual context could bring distraction (e.g., too much focus on potentially irrelevant background) to the model's prediction, where a compact representation of the video focusing on the more relevant information may be more advantageous. Finally, for most video understanding tasks such as ActivityNet, IntentQA and EgoSchema, with 9 tokens per image grid (45 tokens in total), the accuracy difference compared to full tokens (2880 in total) is less than 1%. This demonstrates that the video questions in these benchmarks usually require very sparse visual information, as the sources of such video understanding benchmarks mostly come from natural scenes, which matches our observation in image understanding benchmarks.
| Model | \# Tokens | MSVD | MSRVTT | ActivityNet | NextQA | IntentQA | EgoSchema |
|:---|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
| Video-LLaMA [zhang2023VideoLLAMA](None) | \- | 51.6 | 29.6 | 12.4 | \- | \- | \- |
| LLaMA-Adapter [Zhang2023LLaMAAdapterEF](http://arxiv.org/pdf/2207.10858v1) | \- | 54.9 | 43.8 | 34.2 | \- | \- | \- |
| Video-ChatGPT [Maaz2023VideoChatGPTTD](http://arxiv.org/pdf/2311.18445v1) | \- | 64.9 | 49.3 | 35.2 | \- | \- | \- |
| Video-LLaVA [Lin2023VideoLLaVALU](http://arxiv.org/pdf/2311.10122v2) | 2048 | 70.7 | 59.2 | 45.3 | \- | \- | \- |
| InternVideo [Wang2022InternVideoGV](http://arxiv.org/pdf/2212.03191v2) | \- | \- | \- | \- | 59.1 | \- | 32.1 |
| LLaVA-NeXT-7B [liu2024llavanext](https://llava-vl.github.io/blog/2024-01-30-llava-next/) | 2880 | 78.8 | 63.7 | 54.3 | **63.1** | **60.3** | 35.8 |
| LLaVA-NeXT-$M^3$ | 2880 | 78.2 | **64.5** | 53.9 | **63.1** | 58.8 | 36.8 |
| LLaVA-NeXT-$M^3$ | 720 | **79.0** | **64.5** | **55.0** | 62.6 | 59.6 | 37.2 |
| LLaVA-NeXT-$M^3$ | 180 | 77.9 | 63.7 | **55.0** | 61.4 | 59.3 | 37.6 |
| LLaVA-NeXT-$M^3$ | 45 | 75.8 | 63.0 | 53.2 | 59.5 | 58.7 | **38.8** |
| LLaVA-NeXT-$M^3$ | 5 | 73.5 | 62.7 | 50.8 | 56.5 | 56.7 | 36.2 |

Overall accuracy of LLaVA-NeXT-$M^3$ and recent video LMMs on various video understanding benchmarks. Here \# Tokens denotes the overall number of visual tokens across all frames.
## In-depth Analysis [sec:exp:analysis]

#### $M^3$ shows much stronger performance compared to heuristics-based sampling at test time. A simple training-free way to reduce the number of visual tokens is to conduct heuristic token merging or reduction. In Table 4, we compare $M^3$ with three training-free approaches: average pooling, spatial sampling, and sequential sampling (a sketch of these sampling strategies is given after the table below). $M^3$ is much more resilient when the number of tokens decreases, while the heuristics-based sampling approaches show a dramatic performance drop. A visualization of the spatial and sequential sampling is shown in Figure [fig:vis sampling inference].
| \# Tokens | $M^3$ | Average Pooling | Spatial Sampling | Sequential Sampling |
|:----------|:---:|:---------------:|:----------------:|:-------------------:|
| 576 | | 67.18 | 67.18 | 67.18 |
| 144 | | 61.68 | 65.81 | 60.14 |
| 36 | | 50.77 | 60.05 | 44.76 |
| 9 | | 45.45 | 45.45 | 31.96 |
| 1 | | 19.33 | 26.29 | 22.42 |

Comparison between $M^3$ and heuristics-based sampling baselines (average pooling, spatial sampling, and sequential sampling) at inference time on MMBench with the LLaVA-NeXT architecture.
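To make the three sampling baselines concrete, here is a hedged sketch of how they could be implemented on a 24×24 token grid; the function and its defaults are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def reduce_tokens(patch_tokens, keep, method="avg_pool"):
    """patch_tokens: (B, 576, C) on a 24x24 grid; keep must be a square number
    (e.g. 144, 36, 9). Returns (B, keep, C)."""
    B, N, C = patch_tokens.shape
    side, new_side = int(N ** 0.5), int(keep ** 0.5)
    if method == "sequential":                 # first `keep` tokens in raster order
        return patch_tokens[:, :keep]
    grid = patch_tokens.transpose(1, 2).reshape(B, C, side, side)
    if method == "avg_pool":                   # average pooling over local windows
        grid = F.adaptive_avg_pool2d(grid, new_side)
    elif method == "spatial":                  # strided sub-sampling of the grid
        idx = torch.linspace(0, side - 1, new_side).long()
        grid = grid[:, :, idx][:, :, :, idx]
    return grid.flatten(2).transpose(1, 2)
```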
#### $M^3$ serves as a good metric for image complexity. We extract the responses of LLaVA-NeXT-$M^3$ on the TextVQA benchmark, and show samples where using visual tokens at different scales answers the question correctly and incorrectly. Shown in Figure 1, the OCR performance aligns with the complexity of the images, which indicates that $M^3$ can be utilized as a metric of sample-level complexity.
TextVQA test samples with correct and incorrect predictions at different scales. Answers vary with the number of visual tokens. In addition, $M^3$ can serve as a framework to evaluate the complexity of images.
#### Large gap between oracle and actual performance. As shown in Table 2, the oracle upper bound can use very few ($6\sim64$) tokens yet achieve at least 10% better performance compared to full visual tokens. This suggests that a visual token scale predictor, where the model learns to automatically select the best visual token scale given the input image or both the input image and the question, has the potential to achieve a better tradeoff. This would be interesting future work.

#### Zero-shot generalization to longer visual sequences. Here we extend the length of the visual tokens at inference time to study the model's zero-shot generalization behavior. Results under LLaVA-NeXT are shown in Table 5. Here LLaVA-NeXT-$M^3$ is trained on $2\times2$ image grids but evaluated on $3\times3$ grids. We set the number of visual tokens to 144 per image grid during evaluation. The model obtains a significant improvement in document understanding, by 2.12, 1.80, and 4.11 points on TextVQA, ChartQA, and DocVQA, respectively, while maintaining the same performance on benchmarks mainly composed of natural scene images. The $3\times3$ image grids with 144 tokens per grid amount to 1440 tokens, yet achieve performance similar to the default LLaVA-NeXT $2\times2$ image grids with 2880 total tokens (576 tokens per grid). This indicates that it is promising to feed more sub-images while making the number of visual tokens within each sub-image much smaller.
| \# Grids | \# Tokens per grid | Overall \# Tokens | TextVQA | AI2D | ChartQA | DocVQA | MMBench | POPE | ScienceQA |
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
| $2\times2$ | 144 | 720 | 62.61 | 68.07 | 57.04 | 66.48 | 69.50 | 87.67 | 72.32 |
| $3\times3$ | 144 | 1440 | 64.73 | 67.75 | 58.84 | 70.59 | 69.50 | 87.67 | 72.22 |
| $2\times2$ | 576 | 2880 | 63.13 | 66.71 | 58.96 | 72.61 | 67.96 | 87.20 | 72.46 |

Performance comparison of different image grid configurations with LLaVA-NeXT-$M^3$.
## Ablation Studies [sec:exp:ablation]

We ablate the key designs in $M^3$: the sampling method for the Matryoshka visual tokens, and the training strategy.

#### Matryoshka visual token sampling. Here we compare three different ways to select the visual tokens for $M^3$: average pooling, spatial sampling, and sequential sampling, illustrated in Figure [fig:vis sampling inference]. Shown in Table 6, average pooling performs better than the two alternatives across diverse benchmarks. In general, sequential sampling performs the worst. We hypothesize that this is because the visual tokens carry spatial information, and sequential sampling does not naturally align with the spatial distribution of the visual tokens.
| Num of Vis Tokens | TextVQA (Avg Pooling) | TextVQA (Sequential) | TextVQA (Spatial) | MMBench (Avg Pooling) | MMBench (Sequential) | MMBench (Spatial) | AI2D (Avg Pooling) | AI2D (Sequential) | AI2D (Spatial) |
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
| 576 | 63.13 | 59.37 | 60.45 | 67.96 | 64.60 | 64.43 | 66.71 | 65.61 | 64.96 |
| 144 | 62.61 | 55.80 | 58.33 | 69.50 | 64.18 | 64.52 | 68.07 | 64.90 | 64.96 |
| 36 | 58.71 | 52.79 | 52.39 | 68.56 | 63.92 | 64.69 | 67.36 | 64.51 | 64.02 |
| 9 | 51.97 | 44.05 | 44.19 | 67.35 | 63.14 | 62.11 | 66.77 | 63.70 | 63.92 |
| 1 | 38.92 | 28.03 | 29.91 | 62.97 | 59.36 | 57.47 | 64.57 | 63.21 | 63.08 |

Ablation on Matryoshka visual token sampling: average pooling, sequential sampling, and spatial sampling.
| Num of Vis Tokens | TextVQA (w/ LLM) | TextVQA (w/o LLM) | MMBench (w/ LLM) | MMBench (w/o LLM) | AI2D (w/ LLM) | AI2D (w/o LLM) | DocVQA (w/ LLM) | DocVQA (w/o LLM) |
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
| 576 | 63.13 | 61.16 | 67.96 | 63.66 | 66.71 | 63.92 | 72.61 | 69.15 |
| 144 | 62.61 | 57.79 | 69.50 | 65.21 | 68.07 | 63.73 | 66.48 | 59.77 |
| 36 | 58.71 | 49.75 | 68.56 | 63.92 | 67.36 | 62.89 | 55.94 | 44.08 |
| 9 | 51.97 | 36.15 | 67.35 | 61.08 | 66.77 | 62.05 | 43.52 | 28.36 |
| 1 | 38.92 | 19.72 | 62.97 | 51.80 | 64.57 | 60.59 | 31.63 | 17.37 |

Performance comparison of training LLaVA-NeXT-$M^3$ with and without training the LLM across diverse benchmarks. We see a clear drop when freezing the LLM.
| Technique | TextVQA | | | | AI2D | | | | |:------------------------------:|:-------:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:| | Init LLM weights from LLaVA | | | | | | | | | | Average losses over all scales | | | | | | | | | | 576 | 60.36 | 62.25 | 61.01 | 63.13 | 62.40 | 65.06 | 65.84 | 66.71 | | 144 | 59.61 | 61.02 | 59.80 | 62.61 | 63.67 | 65.61 | 65.77 | 68.07 | | 36 | 54.86 | 55.91 | 55.32 | 58.71 | 63.67 | 65.32 | 66.68 | 67.36 | | 9 | 46.84 | 47.04 | 48.80 | 51.97 | 63.02 | 64.83 | 65.38 | 66.77 | | 1 | 33.78 | 33.68 | 36.05 | 38.92 | 61.53 | 63.21 | 63.37 | 64.57 | Impact of (a) initializing the LLM weights from LLaVA, and (b) averaging the loss from all scales vs randomly selecting a scale for each sample during training.
#### Training the entire LMM vs only training CLIP. Since the nested behavior of Matryoshka visual tokens is learned within the CLIP visual encoder, we next evaluate whether it is necessary to also finetune the LLM. Shown in Table 7, training the whole LLM achieves better performance. This demonstrates that by also training the LLM, the model can better adapt to the patterns of the visual tokens distributed in the Matryoshka manner. As explained in Sec. [sec:approach] and 1.1, we (a) initialize the LLM weights from LLaVA and (b) minimize the loss averaged over all visual token scales for each sample during training. An alternative choice is to randomly sample a visual token scale. Shown in Table 8, initializing the LLM weights from LLaVA and minimizing the losses over all scales show a consistent performance boost compared to using the vanilla text-only pre-trained LLM weights [Vicuna](https://vicuna.lmsys.org/) and randomly selecting a visual token scale. Initializing the LLM weights from LLaVA makes the training process of $M^3$ more stable. By learning all scales at once, the model is forced to learn the nested behavior for each sample, which leads to better performance.

# Conclusion and Future Work [sec:conclusion and limitation]

We introduced $M^3$: Matryoshka Multimodal Models, which learns to represent visual content as nested sets of visual tokens, capturing information across multiple coarse-to-fine granularities. LMMs equipped with $M^3$ afford explicit control over the visual granularity per test instance during inference. We also showed that $M^3$ can serve as an analysis framework to investigate the visual granularity needed for existing datasets, where we discovered that a large number of multimodal benchmarks only need as few as ~9 visual tokens to obtain accuracy similar to that of using all visual tokens, especially for video understanding. Furthermore, we disclosed a large performance-efficiency gap between the oracle upper bound and the model's performance. Our work can be naturally extended to other domains. For example, the long context in a text-only LLM or the vision tokens in dense vision tasks can also be represented as nested sets of tokens in a Matryoshka manner. One limitation of our current approach is that we lack an effective visual token predictor that can bridge the gap between the oracle and the LMM's actual performance at a specific scale. We believe this would be an exciting next direction of research in this space.

# Acknowledgement [acknowledgement]

This work was supported in part by NSF CAREER IIS2150012, and Institute of Information & communications Technology Planning & Evaluation (IITP) grants funded by the Korea government (MSIT) (No. 2022-0-00871, Development of AI Autonomy and Knowledge Enhancement for AI Agent Collaboration) and (No. RS2022-00187238, Development of Large Korean Language Model Technology for Efficient Pre-training), and the Microsoft Accelerate Foundation Models Research Program.

# Broader Impact [sec:boarder_impact]

The broader impact of $M^3$, a framework with nested visual representations, includes potential benefits and risks associated with its deployment and release. Our model is trained using the exact same architecture and data as LLaVA-1.5 [liu2023improvedllava](http://arxiv.org/pdf/2310.19145v1) and LLaVA-NeXT [liu2024llavanext](https://llava-vl.github.io/blog/2024-01-30-llava-next/). All the concerns are the same as for LLaVA. Specifically, as one example, LLaVA conducts instruction tuning using GPT-4 and GPT-4V generated data. The bias from GPT-4 and GPT-4V would still exist in LLaVA.
# Efficiency Analysis [sec: Efficiency Analysis]

To illuminate the computational benefits conferred by $M^3$, we employ the roofline-based LLM-Viewer analysis as detailed in [yuan2024llm](http://arxiv.org/pdf/2402.16363v6). Our analysis is set within a hypothetical context designed to emphasize the effects of $M^3$ on processing efficiency in LMMs. We study the LLaVA-1.5 case where a $336 \times 336$ resolution image is processed using a CLIP-ViT image encoder, resulting in 576 visual tokens. Accompanied by a text prompt with an assumed number of 30 tokens, the nested visual tokens in $M^3$ substantially lower the visual token count. The consequences of this reduction are substantial, as outlined in Table 1, which details the computational costs involved in the LMM prefill process (a rough back-of-the-envelope estimate follows the table). Notably, $M^3$ not only boosts the speed of the LMM prefill process through diminished floating-point operations (FLOPs) but also lessens computational memory requirements. It is crucial to highlight that the advantages of $M^3$ are not limited to just efficiency improvements. The token reduction approach of $M^3$ can also enhance other LMM acceleration methods, such as quantization and factorization, as referenced in [yuan2023asvd](http://arxiv.org/pdf/2403.07378v4). This complementary relationship accentuates the broad potential of $M^3$ to contribute to a wider array of efficiency-boosting strategies.
| \# Tokens | FLOPs (TB) | Prefill Time (ms) | Total Memory (GB) | Storing Activation (GB) |
|:--:|:--:|:--:|:--:|:--:|
| 576 | 8.0 | 58.1 | 21.6 | 3.8 |
| 144 | 2.2 | 19.5 | 15.0 | 0.7 |
| 36 | 0.9 | 18.0 | 13.8 | 0.3 |
| 9 | 0.5 | 17.7 | 13.6 | 0.2 |
| 1 | 0.4 | 17.6 | 13.5 | 0.1 |

Computation Cost Analysis. The development device is a Tesla V100 GPU, and the time estimated by the roofline model represents the theoretical performance that the hardware can achieve.
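As a rough back-of-the-envelope check (an approximation with assumed model sizes, not the roofline LLM-Viewer analysis used in the paper), prefill FLOPs scale roughly linearly with the number of input tokens for a fixed model, since each token costs on the order of 2 × (number of parameters) FLOPs for the dense layers, plus a quadratic self-attention term:

```python
def approx_prefill_flops(num_visual_tokens, num_text_tokens=30,
                         n_params=7e9, n_layers=32, d_model=4096):
    """Very rough prefill FLOP estimate for a 7B decoder (illustrative only):
    ~2 * n_params FLOPs per token for the dense layers, plus a
    4 * n_layers * d_model * T^2 self-attention term."""
    T = num_visual_tokens + num_text_tokens
    dense = 2 * n_params * T
    attention = 4 * n_layers * d_model * T * T
    return dense + attention

for tokens in [576, 144, 36, 9, 1]:
    print(tokens, f"{approx_prefill_flops(tokens) / 1e12:.2f} TFLOPs")
```

With these assumptions the estimates land in the same ballpark as the FLOP column of the table above, which is why shrinking the visual token count from 576 to 9 cuts prefill compute by more than an order of magnitude.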
# More Visualizations on Nested Visual Representation

Shown in Figure 1, with more visual tokens, LMMs can discover more details, such as furniture and human attributes. Besides, LMMs can generate higher quality descriptions with more visual tokens, as demonstrated by the OCR capability in Figure 1 (b).
More visualization examples. With more visual tokens, LMMs can discover more details, and generate higher quality descriptions. The images are from MSCOCO  validation set.
What matters when building vision-language models? 2024-05-03 Hugo Laurençon, Léo Tronchon, Matthieu Cord, Victor Sanh
The growing interest in vision-language models (VLMs) has been driven by improvements in large language models and vision transformers. Despite the abundance of literature on this subject, we observe that critical decisions regarding the design of VLMs are often not justified. We argue that these unsupported decisions impede progress in the field by making it difficult to identify which choices improve model performance. To address this issue, we conduct extensive experiments around pre-trained models, architecture choice, data, and training methods. Our consolidation of findings includes the development of Idefics2, an efficient foundational VLM of 8 billion parameters. Idefics2 achieves state-of-the-art performance within its size category across various multimodal benchmarks, and is often on par with models four times its size. We release the model (base, instructed, and chat) along with the datasets created for its training.
Idefics2-chatty analyzes the table to compute and answer the query.
# Introduction

Vision-language models (VLMs), which take images and texts as inputs and output texts, are useful for many tasks, like retrieving information in a scanned PDF [mPLUG-DocOwl-1.5](https://arxiv.org/pdf/2403.12895), explaining charts or diagrams [Chart-PaLI](https://arxiv.org/pdf/2403.12596), transcribing the text in an image [Nougat](https://arxiv.org/pdf/2308.13418), counting objects in a picture [VQAv2](https://doi.org/10.1109/CVPR.2017.670) or turning screenshots of webpages into code [WebSight](https://arxiv.org/pdf/2403.09029). The development of powerful open large language models [Llama2](https://arxiv.org/pdf/2307.09288), [Mistral7B](https://arxiv.org/pdf/2310.06825), [Gemma](https://arxiv.org/pdf/2403.08295) and image encoders [SigLIP](https://arxiv.org/pdf/2303.15343), [EVA-CLIP](https://arxiv.org/pdf/2303.15389), [CLIP](http://arxiv.org/pdf/2404.19696v1) enables researchers to build upon these unimodal pre-trained models to create advanced VLMs that solve these problems with increasing accuracy [InstructBLIP](https://openreview.net/forum?id=vvoWPYqZJA), [LLaVA](https://openreview.net/forum?id=w0H2xGHlkw), [Qwen-VL](https://arxiv.org/pdf/2308.12966), [VILA](https://arxiv.org/pdf/2312.07533), [SPHINX](https://arxiv.org/pdf/2311.07575), [Monkey](https://arxiv.org/pdf/2311.06607), [CogVLM](https://arxiv.org/pdf/2311.03079). Despite the progress in the field, the literature reveals many disparate design choices which are often not justified experimentally, or only very briefly. This situation makes it challenging to distinguish which decisions truly account for model performance, thereby making it difficult for the community to make meaningful and grounded progress. For instance, [Flamingo](https://proceedings.neurips.cc/paper_files/paper/2022/file/960a172bc7fbf0177ccccbb411a7d800-Paper-Conference.pdf), [OBELICS](https://openreview.net/forum?id=SKN2hflBIZ) use interleaved Transformer-based cross-attentions to fuse the image information into the language model, while [BLIP-2](http://arxiv.org/pdf/2301.12597v3), [LLaVA](https://openreview.net/forum?id=w0H2xGHlkw) concatenate the sequence of image hidden states with the sequence of text embeddings, and feed the concatenated sequence to the language model. To our knowledge, this choice has not been properly ablated, and trade-offs in terms of compute, data efficiency and performance are poorly understood. In this work, we aim to bring experimental clarity to some of these core design choices and pose the question: **What matters when building vision-language models?** We identify two areas where various works adopt different design choices: (a) model architecture, and in particular, connector modules that fuse the vision and text modalities and their impact on inference efficiency, (b) multimodal training procedure and its impact on training stability. For each of these areas, we rigorously compare different design choices in a controlled environment and extract experimental findings.
Notably, we find that (a) the progress of vision-language models is in large part driven by the progress of pre-trained unimodal backbones, (b) the more recent fully autoregressive architecture outperforms the cross-attention architecture, although it requires modifications to the optimization procedure to ensure a stable training, (c) adaptation of the pre-trained vision backbone and the modules connecting the text and vision modalities allow for more efficiency at inference time on one side, and handling images in their original ratio and size without harming downstream performance on the other side, and (d) modifications to the image processing enables trading inference cost for downstream performance. Our results are complementary with those presented in [prismatic](https://arxiv.org/pdf/2402.07865), [MM1](https://arxiv.org/pdf/2403.09611), [VILA](https://arxiv.org/pdf/2312.07533) which derive insights about multi-stage training, selective unfreezing of the pre-trained backbones, data repetition, and impact of training mixture on zero and few-shot performance. We specifically delve into unexplored aspects such as model architecture, training methods, stability, and efficiency improvements at inference. Learning from these insights, we train Idefics2, a foundational VLM with 8 billion parameters. Idefics2 achieves state-of-the-art performance in its size category on various benchmarks while being more efficient at inference, for both the base and the fine-tuned version. It is on par with state-of-the-art models 4 times larger on some vision-language benchmarks and matches the performance of Gemini 1.5 Pro on some challenging benchmarks. We release the base, instructed, and chat versions of Idefics2[^1] as resources for the VLM community along with the data created to train the model. [^1]: # Terminology [section:terminology] We first establish shared terminology for discussing the different design choices. Training VLMs typically requires gluing together a pre-trained vision backbone and a pre-trained language backbone by initializing new parameters to connect the two modalities. Training these new parameters is done during the *pre-training phase*. This stage commonly leverages a large multimodal dataset such as image-caption pairs. We note that even though it is most common to start from two separate unimodal pre-trained backbones, the parameters of these two backbones can be optionally shared and initialized from scratch as done in [fuyu](https://www.adept.ai/blog/fuyu-8b). As in the large language models literature, the pre-training stage is followed by an instruction fine-tuning stage, in which the model learns from task-oriented samples. Recent works explore two main choices to combine the visual inputs and the text inputs. In the *cross-attention architecture* [Flamingo](https://proceedings.neurips.cc/paper_files/paper/2022/file/960a172bc7fbf0177ccccbb411a7d800-Paper-Conference.pdf), [OBELICS](https://openreview.net/forum?id=SKN2hflBIZ), [OpenFlamingo](https://arxiv.org/pdf/2308.01390), the images encoded through the vision backbone are injected at different layers within the language model by interleaving cross-attention blocks in which the text cross-attends to the image hidden states. 
In contrast, in the *fully autoregressive architecture* [FROMAGe](http://arxiv.org/pdf/2301.13823v4), [PaLM-E](http://arxiv.org/pdf/2302.14030v3), [LLaVA](https://openreview.net/forum?id=w0H2xGHlkw), the output of the vision encoder is directly concatenated to the sequence of text embeddings, and the entire sequence is passed as input to the language model. The input sequence of the language model is thus the concatenation of *visual tokens* and text tokens. The sequence of visual tokens can be optionally pooled into a shorter sequence, providing more compute efficiency. We refer to the layers that map the vision hidden space to the text hidden space as *modality projection* layers. Figure 1 highlights the fully-autoregressive architecture we ultimately use for Idefics2.
Idefics2 fully-autoregressive architecture: Input images are processed by the Vision encoder. The resulting visual features are mapped (and optionally pooled) to the LLM input space to get the visual tokens (64 in our standard configuration). They are concatenated (and potentially interleaved) with the input sequence of text embeddings (green and red column). The concatenated sequence is fed to the language model (LLM), which predicts the text tokens output.
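To make the fully autoregressive pipeline concrete, here is a minimal PyTorch sketch of how projected (and optionally pooled) visual features can be concatenated with the text embeddings before being fed to the language model. The module names and dimensions are illustrative assumptions, not the actual Idefics2 implementation.

```python
import torch
import torch.nn as nn

class FullyAutoregressiveConnector(nn.Module):
    """Illustrative sketch: map vision hidden states to the LLM input space
    and concatenate them with text embeddings (hypothetical dimensions)."""

    def __init__(self, vision_dim=1152, llm_dim=4096):
        super().__init__()
        # modality projection: vision hidden space -> LLM hidden space
        self.modality_projection = nn.Linear(vision_dim, llm_dim)

    def forward(self, image_hidden_states, text_embeddings):
        # image_hidden_states: (batch, n_visual_tokens, vision_dim)
        # text_embeddings:     (batch, n_text_tokens, llm_dim)
        visual_tokens = self.modality_projection(image_hidden_states)
        # prepend the visual tokens to the text sequence; the LLM then attends
        # to both modalities with its standard causal self-attention
        return torch.cat([visual_tokens, text_embeddings], dim=1)

connector = FullyAutoregressiveConnector()
img = torch.randn(1, 64, 1152)    # e.g. 64 pooled visual tokens
txt = torch.randn(1, 32, 4096)    # embedded prompt tokens
llm_inputs = connector(img, txt)  # (1, 96, 4096), passed to the LLM as input embeddings
```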
# Exploring the design space of vision-language models In this section, we compare recurrent design choices in the vision-language model literature and highlight findings. Unless specified otherwise, we run the ablations for 6’000 steps and report the average score of the 4-shot performance on 4 downstream benchmarks measuring different capabilities: VQAv2 [VQAv2](https://doi.org/10.1109/CVPR.2017.670) for general visual question answering, TextVQA [textvqa](http://arxiv.org/pdf/1811.11903v1) for OCR abilities, OKVQA [okvqa](http://arxiv.org/pdf/1906.00067v2) for external knowledge, and COCO [coco](http://arxiv.org/pdf/2012.01295v1) for captioning. ## Are all pre-trained backbones equivalent for VLMs? Most recent VLMs start from pre-trained unimodal backbones. How does the choice of the backbones (vision and text) influence the performance of the resulting VLM?
We fix the size of the pretrained backbones, the data used for multimodal pre-training, and the number of training updates. Under the cross-attention architecture, we observe that the greatest improvement in the performance on vision-language benchmarks comes from changing the language model to a better one. More specifically, replacing LLaMA-1-7B [LLaMA](https://arxiv.org/pdf/2302.13971) (35.1% on MMLU [MMLU](https://openreview.net/forum?id=d7KBjmI3GmQ)) with Mistral-7B [Mistral7B](https://arxiv.org/pdf/2310.06825) (60.1% on MMLU) yields a boost of 5.1 points (see Table [tab:ablations_archi_lm_backbone]). Additionally, switching the vision encoder from CLIP-ViT-H [CLIP](http://arxiv.org/pdf/2404.19696v1) (78.0% on ImageNet [ImageNet](https://doi.org/10.1109/CVPR.2009.5206848)) to SigLIP-SO400M [SigLIP](https://arxiv.org/pdf/2303.15343) (83.2% on ImageNet) yields a 3.3-point increase in performance on the benchmarks (see Table [tab:ablations_archi_vision_encode_backbone]). This result on better vision backbones corroborates observations from [prismatic](https://arxiv.org/pdf/2402.07865).
We note that [PaLI-17B](http://arxiv.org/pdf/2402.18932v1) reports a stronger increase in performance by scaling the size of the vision encoder compared to scaling the size of the language model even though scaling the vision encoder leads to a smaller parameter count increase. Although EVA-CLIP-5B [EVA-CLIP](https://arxiv.org/pdf/2303.15389) is ten times bigger in parameter counts than SigLIP-SO400M [SigLIP](https://arxiv.org/pdf/2303.15343), we obtain similar performance across 4 benchmarks, suggesting that EVA-CLIP-5B could be heavily under-trained, and we acknowledge that the open VLM community is missing a large well-trained vision encoder.
#### ***Finding* 1.** For a fixed number of parameters, the quality of the language model backbone has a higher impact on the performance of the final VLM than the quality of the vision backbone.
## How does the fully autoregressive architecture compare to the cross-attention architecture? To our knowledge, there is no proper comparison between the fully autoregressive and the cross-attention architecture. We aim to fill this gap by considering their trade-offs, namely performance, parameter count, and inference cost.
Following [Flamingo](https://proceedings.neurips.cc/paper_files/paper/2022/file/960a172bc7fbf0177ccccbb411a7d800-Paper-Conference.pdf), we first compare the two architectures by freezing the unimodal backbones and training only the newly initialized parameters (cross-attention on one side, and modality projection along with learned pooling on the other side), while fixing the amount of training data. [Flamingo](https://proceedings.neurips.cc/paper_files/paper/2022/file/960a172bc7fbf0177ccccbb411a7d800-Paper-Conference.pdf) shows that the more frequently the cross-attention blocks are interleaved with the language model layers, the higher the vision-language performance. As such, we note that under this setup, the cross-attention architecture has 1.3B more trainable parameters (2B trainable parameters in total) than the fully autoregressive architecture. Additionally, at inference time, the former uses 10% more FLOPs than the latter. Under these conditions, we observe that the cross-attention architecture performs 7 points better (see Table [tab:ablations_archi_type_archi_method_training]). Out of the total number of parameters, approximately 15% for the fully autoregressive architecture and 25% for the cross-attention are trained. We hypothesize that this low proportion limits the expressivity of the training and hinders performance. To test that hypothesis, we compare the two architectures by unfreezing all parameters (newly initialized parameters and parameters of the pre-trained unimodal backbones). Under these conditions, training the fully autoregressive architecture would yield loss divergences, and we were not successful in stabilizing the training even by aggressively lowering the learning rate or gradually unfreezing various components. To overcome this stability challenge, we leverage Low-Rank Adaptation [LoRA](https://openreview.net/forum?id=nZeVKeeFYf9) to adapt the pre-trained parameters while using standard full fine-tuning for the newly initialized ones. This setup yields significantly more stable training runs, and more importantly, we observe a 12.9-point increase under the fully autoregressive architecture, and a 0.6-point increase under the cross-attention architecture. While the cross-attention architecture performs better than the fully autoregressive architecture with frozen backbones, it is worse when we add degrees of freedom for the pre-trained backbones. Besides, using LoRA allows training the unimodal backbones at a fraction of the GPU memory cost of full fine-tuning, and LoRA layers can be merged back into the original linear layers, yielding no additional cost at inference. We therefore choose the fully autoregressive architecture in the rest of this work. It is interesting to note that this finding contradicts [prismatic](https://arxiv.org/pdf/2402.07865), in which the authors observed that unfreezing the pre-trained visual backbone would significantly degrade the performance. We hypothesize that using parameter-efficient fine-tuning methods is a key difference.
#### ***Finding* 2.** The cross-attention architecture performs better than the fully autoregressive one when unimodal pre-trained backbones are kept frozen. However, when training the unimodal backbones, the fully autoregressive architecture outperforms the cross-attention one, even though the latter has more parameters.
#### ***Finding* 3.** Unfreezing the pre-trained backbones under the fully autoregressive architecture can lead to training divergences. Leveraging LoRA still adds expressivity to the training and stabilizes it.
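As a rough illustration of this training setup (LoRA on the pre-trained backbones, full fine-tuning of the newly initialized modules), here is a hedged sketch using the `peft` library; the attribute names (`model.llm`, `model.modality_projection`, `model.perceiver_resampler`) are hypothetical placeholders rather than the actual Idefics2 code.

```python
import torch.nn as nn
from peft import LoraConfig, get_peft_model

def prepare_for_training(model: nn.Module):
    """Sketch: adapt pre-trained backbones with LoRA while fully training
    the newly initialized connector modules (assumed attribute names)."""
    lora_cfg = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        # typical attention projection names in Llama/Mistral-style LLMs
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    )
    # LoRA on the language backbone: base weights stay frozen,
    # only the low-rank adapters receive gradients
    model.llm = get_peft_model(model.llm, lora_cfg)

    # newly initialized parameters are trained with standard full fine-tuning
    for module in (model.modality_projection, model.perceiver_resampler):
        for p in module.parameters():
            p.requires_grad = True
    return model
```

Because the LoRA updates are additive low-rank matrices, they can be merged back into the base linear layers after training, which is what makes this setup free at inference time.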
## Where are the efficiency gains? #### Number of visual tokens Recent VLMs typically route the entire sequence of the vision encoder’s hidden states directly into the modality projection layer, whose output is then fed into the language model, without any pooling. This is motivated by previous works in which adding a pooling strategy, like average pooling, was found to deteriorate the performance [DePALM](https://arxiv.org/pdf/2403.13499). This results in a high number of visual tokens for each image, ranging from 576 for DeepSeek-VL [DeepSeek-VL](https://arxiv.org/pdf/2403.05525) to 2890 for SPHINX-2k [SPHINX](https://arxiv.org/pdf/2311.07575). With the resulting sequence lengths, training is computationally costly, and in-context learning with interleaved images and texts is challenging because it requires modifications to the language models to handle very large context windows. We reduce the sequence length of each image’s hidden states by using a perceiver resampler [perceiver](https://proceedings.mlr.press/v139/jaegle21a.html), [Flamingo](https://proceedings.neurips.cc/paper_files/paper/2022/file/960a172bc7fbf0177ccccbb411a7d800-Paper-Conference.pdf), [Qwen-VL](https://arxiv.org/pdf/2308.12966) as a form of trainable Transformer-based pooling. The number of queries (also referred to as latents) corresponds to the number of resulting visual tokens after the pooling. We observe that the learned pooling is effective in two ways: it increases the performance by 8.5 points on average and reduces the number of visual tokens necessary for each image from 729 to 64 (see Table [tab:ablations_archi_type_archi_method_training]).
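A minimal sketch of what such a learned pooling can look like: a set of learned latent queries cross-attends to the image hidden states and returns a fixed number of visual tokens (64 here). This is a simplified single-block version for illustration, not the exact resampler used in Idefics2.

```python
import torch
import torch.nn as nn

class TinyPerceiverResampler(nn.Module):
    """Simplified learned pooling: n_latents learned queries cross-attend to
    the (arbitrarily long) sequence of image hidden states."""

    def __init__(self, dim=1152, n_latents=64, n_heads=16):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm = nn.LayerNorm(dim)

    def forward(self, image_hidden_states):
        # image_hidden_states: (batch, seq_len, dim), e.g. seq_len = 729
        b = image_hidden_states.shape[0]
        queries = self.latents.unsqueeze(0).expand(b, -1, -1)
        pooled, _ = self.cross_attn(queries, image_hidden_states, image_hidden_states)
        return self.norm(pooled + self.mlp(pooled))  # (batch, n_latents, dim)

resampler = TinyPerceiverResampler()
visual_tokens = resampler(torch.randn(2, 729, 1152))  # -> (2, 64, 1152)
```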
In contrast to [DePALM](https://arxiv.org/pdf/2403.13499), [MM1](https://arxiv.org/pdf/2403.09611) which find that the more visual tokens the higher the performance, we observe no gains when using more than 64 visual tokens. We hypothesize that in a hypothetical scenario of infinite training on unlimited data, performance might eventually improve, at the cost of a longer training time. Other variations over the Perceiver architecture [MAPL](https://doi.org/10.18653/v1/2023.eacl-main.185), [register-tokens](https://openreview.net/forum?id=2dnO3LLiJ1), [DePALM](https://arxiv.org/pdf/2403.13499) resulted in decreased performance.
#### ***Finding* 4.** Reducing the number of visual tokens with learned pooling significantly improves compute efficiency at training and inference while improving performance on downstream tasks.
#### Preserving the original aspect ratio and image resolution Vision encoders, such as SigLIP, are typically trained on fixed-size square images. Resizing images alters their original aspect ratio, which is problematic, for instance, for tasks requiring reading long texts. Furthermore, conditioning the training on a single resolution size inherently introduces limitations: a low resolution omits crucial visual details, while a high resolution leads to inefficiency in training and inference. Allowing the model to encode images at various resolutions allows users to decide how much compute is spent on each image.
Following [pix2struct](http://arxiv.org/pdf/2210.03347v2), [PatchNPack](https://openreview.net/forum?id=VpGFHmI7e5), we pass the image patches to the vision encoder without resizing the image or modifying its aspect ratio. Given that SigLIP was trained on fixed-size low-resolution square images, we interpolate the pre-trained positional embeddings to allow for a higher resolution and train the vision encoder with LoRA parameters to adapt to these modifications.[^1] Our findings indicate that the aspect ratio preserving strategy maintains performance levels on downstream tasks while unlocking computational flexibility during both training and inference (see Table [tab:ablations_archi_aspect_ratio_preserving]). In particular, not having to resize images to the same high resolution allows for saving GPU memory and handling images at the resolution they require.
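For illustration, the kind of positional-embedding interpolation described here can be sketched as follows for a ViT-style encoder; the grid sizes and the absence of a class token are assumptions for the example, not details of SigLIP-SO400M.

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed, new_h, new_w):
    """Sketch: resize a pre-trained grid of ViT positional embeddings
    (1, old_h*old_w, dim) to a new patch grid via bicubic interpolation."""
    _, n, dim = pos_embed.shape
    old = int(n ** 0.5)  # assume a square grid with no class token
    grid = pos_embed.reshape(1, old, old, dim).permute(0, 3, 1, 2)  # (1, dim, old, old)
    grid = F.interpolate(grid, size=(new_h, new_w), mode="bicubic", align_corners=False)
    return grid.permute(0, 2, 3, 1).reshape(1, new_h * new_w, dim)

# e.g. stretch a 27x27 grid (729 positions, as in a 384px SigLIP-style encoder)
# to a non-square 70x50 grid for a higher-resolution, non-square image
pretrained = torch.randn(1, 27 * 27, 1152)
resized = interpolate_pos_embed(pretrained, 70, 50)  # (1, 3500, 1152)
```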
#### ***Finding* 5.** Adapting a vision encoder pre-trained on fixed-size square images to preserve images’ original aspect ratio and resolution does not degrade performance while speeding up training and inference and reducing memory.
## How can one trade compute for performance? [SPHINX](https://arxiv.org/pdf/2311.07575), [Monkey](https://arxiv.org/pdf/2311.06607), [LLAVA-NeXT](https://llava-vl.github.io/blog/2024-01-30-llava-next/), [MM1](https://arxiv.org/pdf/2403.09611) show that splitting an image into sub-images allows boosting the downstream performance with no changes to the model’s signature. An image is decomposed into sub-images (for instance 4 equal sub-images), which are then concatenated to the original image to form a sequence of 5 images. Additionally, the sub-images are resized to the original image’s size. This strategy, however, comes at the cost of a much higher number of tokens to encode the images. We adopt this strategy during the instruction fine-tuning stage. Each single image becomes a list of 5 images: 4 crops and the original image. This way, at inference, the model is able to deal with standalone images (64 visual tokens per image), as well as artificially augmented images (320 visual tokens in total per image). We notice that this strategy is particularly useful for benchmarks like TextVQA and DocVQA, which require a sufficiently high resolution to extract the text in an image (see Table [table:perf_sft]). Moreover, when we apply image splitting to only 50% of the training samples (instead of 100% of the samples), we observe that it does not impair the performance increase that image splitting provides. Surprisingly, we find at evaluation time that increasing the resolution of the sub-images (and the standalone image) provides only a minor boost in performance compared to the improvement yielded by image splitting alone: 73.6% when increasing the resolution of the sub-images to the maximum vs 73.0% accuracy on our validation set of TextVQA, and respectively 72.7 vs 72.9 ANLS on the validation set of DocVQA.
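A possible implementation of this image-splitting strategy, sketched with PIL; the exact cropping and resizing details used for Idefics2 may differ.

```python
from PIL import Image

def split_image(image: Image.Image):
    """Sketch: decompose an image into 4 equal crops resized to the original
    size, and append the original image (5 images in total)."""
    w, h = image.size
    boxes = [
        (0, 0, w // 2, h // 2),        # top-left
        (w // 2, 0, w, h // 2),        # top-right
        (0, h // 2, w // 2, h),        # bottom-left
        (w // 2, h // 2, w, h),        # bottom-right
    ]
    crops = [image.crop(box).resize((w, h)) for box in boxes]
    return crops + [image]

# each of the 5 images is then encoded separately, e.g. 5 x 64 = 320 visual tokens
images = split_image(Image.new("RGB", (980, 700)))
```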
#### ***Finding* 6.** Splitting images into sub-images during training allows trading compute efficiency for more performance during inference. The increase in performance is particularly noticeable in tasks involving reading text in an image.
[^1]: Since SigLIP is trained with a fixed resolution, the positional embeddings can be interpreted both as absolute or relative positions. With the aspect ratio and resolution preserving, these positions become relative positional embeddings. # Idefics2 - an open state-of-the-art vision-language foundation model With these learnings in hand, we train an open 8B parameters vision-language model: Idefics2. This section describes the construction of the model, the choice of the dataset, the sequence of training phases and compares the resulting model against VLMs baselines. ## Multi-stage pre-training We start from SigLIP-SO400M and Mistral-7B-v0.1 and pre-train Idefics2 on 3 types of data. **Interleaved image-text documents** We use OBELICS [OBELICS](https://openreview.net/forum?id=SKN2hflBIZ), an open web-scale dataset of interleaved image-text documents with 350 million images and 115 billion text tokens. As shown by the authors, the long documents of OBELICS allow for preserving the performance of the language model while learning to deal with an arbitrary number of interleaved images and texts and long context. Additionally, the authors show that interleaved image-text documents are the biggest driving factor in increasing the performance on visual question answering (VQA) tasks, in particular in the in-context learning setup. We perform an additional removal of newly opted-out content in January 2024 using the Spawning API[^1] even though OBELICS had already been filtered to exclude opted-out content as of September 2023. We also removed the 5% of documents with the highest perplexity scores, as computed by Falcon-1B [RefinedWeb](https://openreview.net/forum?id=kM5eGcdCzq).
**Image-text pairs** Training on image-text pairs allows the model to learn the alignment between images and their associated texts. We use a combination of high-quality human-annotated image-text pairs from PMD [flava](https://doi.org/10.1109/CVPR52688.2022.01519) and higher-noise web-scale image-text pairs from [LAION-5B](https://proceedings.neurips.cc/paper_files/paper/2022/file/a1859debfb3b59d094f3504d5ebb6c25-Paper-Datasets_and_Benchmarks.pdf). To limit the amount of poor-quality data, we opt for the synthetic captions obtained through the LAION COCO[^2] version of the dataset, where images have been captioned with a model trained on COCO. This improves the quality of the training samples and thus the quality of the resulting model (see Table [tab:ablations_pretraining_type_captions]). We use an NSFW classifier[^3] with a high recall and remove 7% of the samples in LAION COCO. We manually inspected 5’000 examples and found 28 pornographic images in the original LAION COCO and only 1 after filtering. This filtering does not negatively impact the downstream performance.
**PDF documents** [multimodal-rlhf](https://arxiv.org/pdf/2309.14525) shows that a large proportion of mistakes of state-of-the-art VLMs stem from their failure to accurately extract text in images or documents. In order to obtain strong OCR and document understanding abilities, we train Idefics2 on different sources of PDF documents: 19 million industry documents from OCR-IDL [OCRIDL](https://arxiv.org/pdf/2202.12985) and 18 million pages from PDFA[^4]. Moreover, we add Rendered Text[^5] to complement the dataset with texts written with a wide variety of fonts and colors and on diverse backgrounds. These integrations significantly boost the performance on benchmarks that require reading text without decreasing the performance on other benchmarks (see Table [tab:ablations_finetuning_ocr]).

To maximize compute efficiency, we decompose the pre-training into two stages. In the first stage, we limit the max image resolution to 384 pixels, which allows us to use a large global batch size of 2’048 (17k images and 2.5M text tokens on average). We sample OBELICS for 70% of the examples with a maximum sequence length of 2’048, and the image-text pairs datasets for 30% of the examples with a maximum sequence length of 1’536. In the second stage, we introduce PDF documents. Since they require a higher image resolution for the text to be legible, we increase the resolution to a maximum of 980 pixels. We use the same global batch size, but have to decrease the per-device batch size and use gradient accumulation to compensate for the additional memory cost. OBELICS represents 45% of the examples with a maximum sequence length of 2’048, image-text pairs represent 35% of the examples with a maximum sequence length of 1’536, and PDF documents represent the remaining 20% of the examples with a maximum sequence length of 1’024. Additionally, we randomly scale up images to adequately cover the distribution of potential image sizes. We emphasize that the training stages are different from the ones ablated in [prismatic](https://arxiv.org/pdf/2402.07865): instead of selectively freezing/unfreezing parts of the model, we train the entire model during both stages (some parameters are trained with LoRA) and increase the image resolution from one stage to the other. We use a learning rate of $10^{-4}$ and do around 2 epochs on our training data. This corresponds to approximately 1.5 billion images and 225 billion text tokens. We note that this is orders of magnitude more training data than other open VLMs use. For example, ShareGPT4V [ShareGPT4V](https://arxiv.org/pdf/2311.12793) uses 1.2 million images, while Monkey [Monkey](https://arxiv.org/pdf/2311.06607) uses 1.4 million for training.
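To summarize the two pre-training stages described above, here is a configuration sketch; the field names and schema are ours, the values are taken from the paragraph above, and this is not an actual Idefics2 training config.

```python
# Illustrative summary of the two pre-training stages (assumed schema).
PRETRAINING_STAGES = [
    {
        "name": "stage_1",
        "max_image_resolution": 384,
        "global_batch_size": 2048,
        "data_mixture": [
            {"source": "OBELICS", "proportion": 0.70, "max_seq_len": 2048},
            {"source": "image_text_pairs", "proportion": 0.30, "max_seq_len": 1536},
        ],
    },
    {
        "name": "stage_2",  # introduces PDF documents at higher resolution
        "max_image_resolution": 980,
        "global_batch_size": 2048,  # smaller per-device batch + gradient accumulation
        "data_mixture": [
            {"source": "OBELICS", "proportion": 0.45, "max_seq_len": 2048},
            {"source": "image_text_pairs", "proportion": 0.35, "max_seq_len": 1536},
            {"source": "pdf_documents", "proportion": 0.20, "max_seq_len": 1024},
        ],
    },
]
LEARNING_RATE = 1e-4  # roughly 2 epochs over the training data
```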
| **Model** | **Size** | **Archi.** | **\# tokens per image** | **VQAv2** | **TextVQA** | **OKVQA** | **COCO** |
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
| OpenFlamingo | 9B | CA | \- | 54.8 | 29.1 | 41.1 | 96.3 |
| Idefics1 | 9B | CA | \- | 56.4 | 27.5 | 47.7 | 97.0 |
| Flamingo | 9B | CA | \- | 58.0 | 33.6 | 50.0 | 99.0 |
| MM1 | 7B | FA | 144 | 63.6 | 46.3 | 51.4 | **116.3** |
| Idefics2-base | 8B | FA | **64** | **70.3** | **57.9** | **54.6** | 116.0 |

Performance of Idefics2-base against state-of-the-art base VLMs. The evaluations were done with 8 random in-context examples, and in an open-ended setting for VQA tasks. *FA: fully autoregressive architecture. CA: cross-attention architecture.* *(Task, Metric, Split): (VQAv2, VQA acc., testdev), (TextVQA, VQA acc., val), (OKVQA, VQA acc., val), (COCO, CIDEr, test)*
To evaluate the base model, we consider VQAv2 [VQAv2](https://doi.org/10.1109/CVPR.2017.670), TextVQA [textvqa](http://arxiv.org/pdf/1811.11903v1), OKVQA [okvqa](http://arxiv.org/pdf/1906.00067v2), and COCO [coco](http://arxiv.org/pdf/2012.01295v1). Table 1 presents the results. While having fewer tokens per image, and thus being more efficient, Idefics2 performs favorably compared to the other current best base VLMs (OpenFlamingo [OpenFlamingo](https://arxiv.org/pdf/2308.01390), Idefics1 [OBELICS](https://openreview.net/forum?id=SKN2hflBIZ), Flamingo [Flamingo](https://proceedings.neurips.cc/paper_files/paper/2022/file/960a172bc7fbf0177ccccbb411a7d800-Paper-Conference.pdf), and MM1 [MM1](https://arxiv.org/pdf/2403.09611)). It is notably much better at reading texts in an image. Figure 1 shows an example of an output from the base model on a task similar to the pre-training.
An example of text transcription with Idefics2-base.
[^1]: [^2]: [^3]: [^4]: [^5]: ## Instruction fine-tuning We continue the training with an instruction fine-tuning phase. To do so, we create and release The Cauldron[^1], a massive collection of 50 vision-language datasets, covering a wide range of tasks: general visual question answering, counting, captioning, text transcription, document understanding, chart/figure understanding, table understanding, visual reasoning, geometry, spotting differences between 2 images or converting a screenshot to a functional code. Similarly to [T0](https://openreview.net/forum?id=9Vrb9D0WI4), [flan](https://openreview.net/forum?id=gEZrGCozdqR), [promptsource](https://doi.org/10.18653/v1/2022.acl-demo.9), [InstructBLIP](https://openreview.net/forum?id=vvoWPYqZJA), [m3it](https://arxiv.org/pdf/2306.04387), each dataset is prompted into a shared question/answer format. When there are multiple question/answer pairs per image, we concatenate the pairs into a multi-turn conversation. We deduplicate the training set against the evaluation sets, ensuring that there is minimum contamination from the training to the evaluation. In addition to these vision-language datasets and following insights from [MM1](https://arxiv.org/pdf/2403.09611), we add text-only instruction datasets to the mixture. The datasets aim at teaching the model to follow complex instructions, solve mathematical problems, or do arithmetic calculations. We give more details about the chosen datasets, the number of images, question-answer pairs, and size of each of the subsets, as well as our selected mixture proportion in Table [table:mixture_sft] in Appendix [subsection:details_the_cauldron].
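A small sketch of the kind of reformatting this implies: turning several question/answer pairs about the same image into a single multi-turn conversation. The message schema below is an assumption for illustration, not the exact format used in The Cauldron.

```python
def qa_pairs_to_conversation(qa_pairs):
    """Sketch: concatenate multiple Q/A pairs on the same image into one
    multi-turn user/assistant conversation (assumed message schema)."""
    conversation = []
    for question, answer in qa_pairs:
        conversation.append({"role": "user", "content": question})
        conversation.append({"role": "assistant", "content": answer})
    return conversation

example = qa_pairs_to_conversation([
    ("How many bars are in the chart?", "4"),
    ("Which bar has the highest value?", "The 2021 bar."),
])
```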
| **Model** | **Size** | **\# tokens per image** | **MMMU** | **MathVista** | **TextVQA** | **MMBench** |
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
| LLaVA-NeXT | 13B | 2880 | 36.2/- | 35.3 | 67.1 | 70.0 |
| DeepSeek-VL | 7B | 576 | 36.6/- | 36.1 | 64.4 | 73.2 |
| MM1-Chat | 7B | 720 | 37.0/35.6 | 35.9 | 72.8 | 72.3 |
| Idefics2 | 8B | **64** | **43.5**/**37.9** | **51.6** | 70.4 | **76.8** |
| Idefics2 | 8B | 320 | 43.0/37.7 | 51.4 | **73.0** | 76.7 |

Performance of Idefics2 against state-of-the-art VLMs up to a size of 14B parameters. The evaluations are done in zero shot. Idefics2 with 64 or 320 tokens per image is the same model (same weights), only the inference differs. The full table is present in Appendix [subsection:expanded_evals]. *(Benchmark, Split, Metric): (MMMU, val/test, MMMU score), (MathVista, testmini, MMMU score), (TextVQA, val, VQA acc.), (MMBench, test, accuracy).*
We instruction-tune the base model using DoRA [DoRA](https://arxiv.org/pdf/2402.09353) (a variant of LoRA). During the fine-tuning, we only compute the loss on the tokens of the answers in the Q/A pairs. Since we are doing many epochs over some of the datasets, we employ several strategies to lower the risk of overfitting. First, we add noise to the embeddings with the NEFTune [NEFTune](https://openreview.net/forum?id=0bMmZ3fkCk) technique. Then, we scale up randomly the resolution of the images during the training. Finally, when applicable, we shuffle the multiple user/assistant turns randomly before feeding the example to the model. We evaluate Idefics2 on commonly adopted benchmarks: MMMU [MMMU](http://arxiv.org/pdf/2311.16502v3) for multidiscipline college-level problems, MathVista [mathvista](http://arxiv.org/pdf/2310.02255v3) for mathematical reasoning, TextVQA [textvqa](http://arxiv.org/pdf/1811.11903v1) for text reading on natural images, and MMBench [MMBench](https://arxiv.org/pdf/2307.06281) for various perception and reasoning tasks. Table 1 presents the results (see Table [table:perf_sft_full] for the complete result table) of Idefics2 against the current strongest VLMs in its class size: LLaVA-Next [LLAVA-NeXT](https://llava-vl.github.io/blog/2024-01-30-llava-next/), DeepSeek-VL [DeepSeek-VL](https://arxiv.org/pdf/2403.05525) and MM1-Chat [MM1](https://arxiv.org/pdf/2403.09611). While being computationally much more efficient at inference, Idefics2 exhibits strong performance on various benchmarks, outperforming the current best foundation VLMs in its size category. It is on par with state-of-the-art models 4x its size, or with closed-source models like Gemini 1.5 Pro on several benchmarks like MathVista or TextVQA. ## Optimizing for chat scenarios The evaluation benchmarks expect very short answers, but humans prefer long generations when interacting with a model. We find that Idefics2 can exhibit difficulties in precisely following instructions about the expected format, making it difficult to reconcile “chattiness“ and downstream performance. As such, after instruction fine-tuning, we further train Idefics2 on dialogue data. We fine-tune Idefics2 for a few hundred steps on LLaVA-Conv [LLaVA](https://openreview.net/forum?id=w0H2xGHlkw) and ShareGPT4V [ShareGPT4V](https://arxiv.org/pdf/2311.12793), with a large batch size. Our blind human evaluations reveal that Idefics2-chatty is overwhelmingly preferred over its instruction fine-tuned version in many user interactions. We also adversarially stress-tested the model to generate inaccurate, biased, or offensive responses and reported the findings in Appendix [sec:red_teaming]. We show examples of generations with Idefics2-chatty in Figure [fig:qualitative_gen_0], and in Appendix in Figures [fig:qualitative_gen_1], [fig:qualitative_gen_2] and [fig:qualitative_gen_3]. [^1]: # Conclusion In this work, we re-examine common choices made in the VLM literature and rigorously compare these choices in controlled experiments. Our findings touch upon the effectiveness of different architectures, their performance/inference cost trade-offs as well as training stability. With these learnings at hand, we train Idefics2, an open 8B parameters vision-language model. Idefics2 is state-of-the-art on various benchmarks in its category size and is much more efficient at inference. 
By releasing our findings, as well as our models and our training dataset, we aim to contribute to the ongoing evolution of VLMs and their applications in solving complex real-world problems. # Acknowledgement [acknowledgement] We thank Mustafa Shukor for helpful suggestions on the paper, and Yacine Jernite, Sasha Luccioni, Margaret Mitchell, Giada Pistilli, Lucie-Aimée Kaffee, and Jack Kumar for red-teaming the model. # Appendix ## Further experimental details of the ablations ### Cross-attention vs. fully autoregressive architectures We apply LoRA modules to the LLM for the fully autoregressive architecture and to the cross-attention modules and the LLM for the cross-attention architecture. In Figure 1, we report the average performance with respect to the number of steps, the number of images, as well as the number of text tokens. We see an improvement across the board with the fully autoregressive architecture. Comparing the average score with these different axes is essential because the cross-attention architecture feeds a single token per image to the language model, against 64 for the fully autoregressive architecture with perceiver pooling. This implies that for the same training sequence length, the number of images and text tokens is different for the two architectures. Equivalently, the same multimodal document will yield different sequence lengths. Even though we fix the batch size in the comparison, the number of text tokens and number of images grow at different paces under the two architectures.
Comparison of the cross-attention and fully autoregressive architectures through the number of steps, the number of images and the number of text tokens.
### Comparing various vision backbones We present in Table [tab:ablations_archi_vision_encode_backbone_detailed] the detailed results of comparing multiple vision backbones. While EVA-CLIP-5B performs similarly to SigLIP-SO400M, we emphasize that it has 11 times more parameters. We also noticed in early experiments that TextVQA is the most sensitive benchmark to image resolution, which accounts for the performance increase. ### Comparing various pooling strategies We compare multiple pooling strategies: a simple linear layer that takes the flattened sequence of vision hidden states and projects it into a shorter sequence of visual tokens, as well as a Mapping Network [MAPL](https://doi.org/10.18653/v1/2023.eacl-main.185). The perceiver resampler significantly outperforms these two options (see Table [tab:vision_language_adaptor_ablation]). We also ablate the number of layers in the perceiver resampler, and find no statistically significant differences when increasing the number of layers, similarly to results from [palm2vadapter](https://arxiv.org/pdf/2402.10896). We settle on 3 layers out of caution to avoid any potential capacity bottleneck. Finally, we add a 2-layer modality projection MLP on top of the vision encoder hidden states to project the vision hidden dimension to the language model hidden dimension prior to the perceiver resampler. These changes yield better performance as well (see Table [tab:modality_projection_prior_to_perceiver]). ### Ablations on OCR data We hypothesize that adding PDF documents helps the model learn to read text from images. In Table [tab:ablations_finetuning_ocr], we compare checkpoints trained with and without OCR documents, along with image resolution increase to ensure that the text is legible. We do not observe statistically significant differences when evaluating checkpoints in zero or few shot. Instead, we fine-tune the checkpoints on DocVQA for 500 steps with a learning rate of $1e-5$, leading to checkpoints showing much stronger differences. ## Details of the instruction fine-tuning ### Statistics of The Cauldron [subsection:details_the_cauldron] In Table 1, we present the statistics of the datasets included in The Cauldron, as well as the text-only instruction datasets used for the supervised fine-tuning. For each dataset, we give the number of different images it contains, the number of question-answer pairs, the total number of tokens for the answers in the question-answer pairs, and the selected percentage of tokens it represents in our final mixture after upsampling or downsampling.
| **Dataset** | | **\# Q/A pairs** | **\# tokens** | **% mixture** | |:---|:--:|:--:|:--:|:--:| | | | | | | | *General visual question answering* | | | | | | VQAv2 [VQAv2](https://doi.org/10.1109/CVPR.2017.670) | 82,772 | 443,757 | 1,595,929 | 5.72% | | COCO-QA [CocoQA](https://proceedings.neurips.cc/paper_files/paper/2015/file/831c2f88a604a07ca94314b56a4921b8-Paper.pdf) | 46,287 | 78,736 | 286,982 | 1.47% | | Visual7W [Visual7w](None) | 14,366 | 69,817 | 279,268 | 1.43% | | A-OKVQA [A-OKVQA](https://doi.org/10.1007/978-3-031-20074-8_9) | 16,539 | 17,056 | 236,492 | 1.21% | | TallyQA [TallyQA](http://arxiv.org/pdf/1810.12440v2) | 98,680 | 183,986 | 738,254 | 0.57% | | OK-VQA [okvqa](http://arxiv.org/pdf/1906.00067v2) | 8,998 | 9,009 | 38,853 | 0.40% | | HatefulMemes [hatefulmeme](https://proceedings.neurips.cc/paper_files/paper/2020/file/1b84c4cee2b8b3d823b30e2d604b1878-Paper.pdf) | 8,500 | 8,500 | 25,500 | 0.13% | | VQA-RAD [VQA-RAD](https://doi.org/10.1038/sdata.2018.251) | 313 | 1,793 | 8,418 | 0.09% | | | | | | | | *Captioning* | | | | | | LNarratives [LocalizedNarratives](http://arxiv.org/pdf/2302.11217v2) | 507,444 | 507,444 | 21,328,731 | 4.56% | | Screen2Words [screen2words](https://doi.org/10.1145/3472749.3474765) | 15,730 | 15,743 | 143,103 | 0.37% | | VSR [VSR](https://doi.org/10.1162/tacl_a_00566) | 2,157 | 3,354 | 10,062 | 0.21% | | | | | | | | *OCR, document understanding, text transcription* | | | | | | RenderedText[^1] | 999,000 | 999,000 | 27,207,774 | 5.57% | | DocVQA [DocVQA](https://doi.org/10.1109/WACV48630.2021.00225) | 10,189 | 39,463 | 337,829 | 3.46% | | TextCaps [textcaps](http://arxiv.org/pdf/2003.12462v2) | 21,953 | 21,953 | 389,658 | 2.00% | | TextVQA [textvqa](http://arxiv.org/pdf/1811.11903v1) | 21,953 | 34,602 | 181,918 | 1.86% | | ST-VQA [STVQA](https://doi.org/10.1109/ICCV.2019.00439) | 17,247 | 23,121 | 127,846 | 1.31% | | OCR-VQA [OCR-VQA](https://doi.org/10.1109/ICDAR.2019.00156) | 165,746 | 801,579 | 6,073,824 | 0.93% | | VisualMRC [VisualMRC](http://arxiv.org/pdf/2101.11272v2) | 3,027 | 11,988 | 168,828 | 0.86% | | IAM [IAM](https://doi.org/10.1007/s100320200071) | 5,663 | 5,663 | 144,216 | 0.74% | | InfoVQA [InfographicVQA](https://doi.org/10.1109/WACV51458.2022.00264) | 2,118 | 10,074 | 61,048 | 0.63% | | Diagram image-to-text[^2] | 300 | 300 | 22,196 | 0.11% | | | | | | | | *Chart/figure understanding* | | | | | | Chart2Text [Chart2Text](https://doi.org/10.18653/v1/2020.inlg-1.20) | 26,985 | 30,242 | 2,852,827 | 4.38% | | DVQA [DVQA](http://arxiv.org/pdf/1810.02358v2) | 200,000 | 2,325,316 | 8,346,234 | 4.27% | | VisText [VisText](http://vis.csail.mit.edu/pubs/vistext) | 7,057 | 9,969 | 1,245,485 | 1.91% | | ChartQA [ChartQA](https://doi.org/10.18653/v1/2022.findings-acl.177) | 18,271 | 28,299 | 185,835 | 1.90% | | PlotQA [PlotQA](http://arxiv.org/pdf/1906.04124v2) | 157,070 | 20,249,479 | 8478299.278 | 0.65% | | FigureQA [FigureQA](https://arxiv.org/pdf/1710.07300) | 100,000 | 1,327,368 | 3,982,104 | 0.61% | | MapQA [MapQA](https://openreview.net/forum?id=znKbVjeR0yI) | 37,417 | 483,416 | 6,470,485 | 0.33% | | | | | | | | *Table understanding* | | | | | | TabMWP [TabMWP](http://arxiv.org/pdf/2209.14610v3) | 22,729 | 23,059 | 1,948,166 | 2.49% | | TAT-QA [TAT-QA](https://doi.org/10.18653/v1/2021.acl-long.254) | 2,199 | 13,215 | 283,776 | 2.18% | | HiTab [Hitab](https://doi.org/10.18653/v1/2022.acl-long.78) | 2,500 | 7,782 | 351,299 | 1.80% | | MultiHiertt [Multihiertt](https://aclanthology.org/2022.acl-long.454) | 7,619 | 7,830 | 267,615 | 1.37% | 
| FinQA [FinQA](https://doi.org/10.18653/v1/2021.emnlp-main.300) | 5,276 | 6,251 | 242,561 | 0.99% | | WikiSQL [WikiSQL](https://arxiv.org/pdf/1709.00103) | 74,989 | 86,202 | 9,680,673 | 0.99% | | SQA [SQA](https://doi.org/10.18653/v1/P17-1167) | 8,514 | 34,141 | 1,894,824 | 0.97% | | WTQ [WTQ](https://doi.org/10.3115/v1/P15-1142) | 38,246 | 44,096 | 6,677,013 | 0.51% | | | | | | | | *Reasoning, logic, maths* | | | | | | GeomVerse [GeomVerse](https://openreview.net/forum?id=A9NOAS0hn1) | 9,303 | 9,339 | 2,489,459 | 3.83% | | CLEVR-Math [CLEVR-Math](https://doi.org/10.48550/ARXIV.2208.05358) | 70,000 | 788,650 | 3,184,656 | 3.26% | | CLEVR [CLEVR](https://doi.org/10.1109/CVPR.2017.215) | 70,000 | 699,989 | 2,396,781 | 1.23% | | IconQA [IconQA](http://arxiv.org/pdf/2110.13214v4) | 27,315 | 29,859 | 112,969 | 1.16% | | RAVEN [RAVEN](http://arxiv.org/pdf/2207.00590v1) | 42,000 | 42,000 | 105,081 | 0.67% | | Inter-GPs [Inter-GPS](http://arxiv.org/pdf/2105.04165v3) | 1,451 | 2,101 | 8,404 | 0.17% | | | | | | | | *Textbook/academic questions* | | | | | | AI2D [AI2D](http://arxiv.org/pdf/1603.07396v1) | 3,099 | 9,708 | 38,832 | 0.80% | | TQA [TQA](https://doi.org/10.1109/CVPR.2017.571) | 1,496 | 6,501 | 26,004 | 0.53% | | ScienceQA [ScienceQA](https://proceedings.neurips.cc/paper_files/paper/2022/file/11332b6b6cf4485b84afadb1352d3a9a-Paper-Conference.pdf) | 4,985 | 6,218 | 24,872 | 0.25% | | | | | | | | *Differences between 2 images* | | | | | | NLVR2 [NLVR2](https://doi.org/10.18653/v1/P19-1644) | 50,426 | 86,373 | 259,119 | 1.33% | | GSD [MIMIC-IT-General-Scene-Difference](https://arxiv.org/pdf/2306.05425) | 70,939 | 141,869 | 4,637,229 | 0.48% | | Spot the diff [SpotTheDiff](https://doi.org/10.18653/v1/D18-1436) | 8,566 | 9,524 | 221,477 | 0.57% | | | | | | | | *Screenshot to code* | | | | | | WebSight [WebSight](https://arxiv.org/pdf/2403.09029) | 500,000 | 500,000 | 276,743,299 | 0.28% | | DaTikz [DaTikz](https://arxiv.org/pdf/2310.00367) | 47,974 | 48,296 | 59,556,252 | 0.03% | | | | | | | | | | | | | | *Text-only general instructions, math problems, arithmetic calculations* | | | | | | OpenHermes-2.5 [OpenHermes](https://huggingface.co/datasets/teknium/OpenHermes-2.5) | 0 | 1,006,223 | 248,553,747 | 12.73% | | LIMA [LIMA](https://openreview.net/forum?id=KBMOKmX2he) | 0 | 1,052 | 633,867 | 0.81% | | Dolly [Dolly](https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm) | 0 | 14,972 | 1,329,999 | 0.68% | | MetaMathQA [MetaMathQA](https://openreview.net/forum?id=N8N0hgNDRt) | 0 | 395,000 | 74,328,255 | 3.81% | | MathInstruct [MathInstruct](https://openreview.net/forum?id=yLClGs770I) | 0 | 261,781 | 45,393,559 | 2.33% | | OrcaMath [Orca-Math](https://arxiv.org/pdf/2402.14830) | 0 | 200,031 | 63,780,702 | 1.63% | | CamelAIMath [CamelAIMath](https://openreview.net/forum?id=3IyL2XWDkG) | 0 | 49,744 | 21,873,629 | 0.06% | | AtlasMathSets[^3] | 0 | 17,807,579 | 455,411,624 | 3.50% | | Goat [Goat](https://arxiv.org/pdf/2305.14201) | 0 | 1,746,300 | 167,695,693 | 0.86% | | | | | | | | | | | | | The statistics of datasets used for instruction fine-tuning. \# tokens is the total number of tokens for each dataset for the answers only. % mixture is our selected percentage of answer tokens for each dataset in the final mixture.
## Details of the evaluations ### Evaluation setup We perform all evaluations with a batch size of 1 and greedy decoding. For the multi-choice questions in MMMU, MathVista, MMBench, we evaluate with the same prompt used for similar types of datasets during the instruction fine-tuning:
Question: {question} Choices: A. {choice_a} B. {choice_b} C. {choice_c} ... Answer with the letter.
For the open-ended questions in TextVQA, DocVQA, and VQAv2, we evaluate with the prompt:
Question: {question} Give a very brief answer.
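For concreteness, a sketch of how these two templates might be filled at evaluation time with plain Python string formatting; the exact prompt construction used for Idefics2 may differ.

```python
MC_TEMPLATE = "Question: {question}\nChoices:\n{choices}\nAnswer with the letter."
OPEN_ENDED_TEMPLATE = "Question: {question}\nGive a very brief answer."

def format_multiple_choice(question, choices):
    # choices: list of strings -> "A. ...", "B. ...", ...
    lettered = "\n".join(f"{chr(ord('A') + i)}. {c}" for i, c in enumerate(choices))
    return MC_TEMPLATE.format(question=question, choices=lettered)

print(format_multiple_choice("What is shown in the image?", ["A bar chart", "A map", "A photo"]))
print(OPEN_ENDED_TEMPLATE.format(question="What is the title of the document?"))
```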
We use the stop words `Question`, `User`, `` and `` to stop a generation. ### Expanded evaluation table [subsection:expanded_evals] We report the expanded evaluation of Idefics2 and the comparison to other models in Table 2. This includes scores on VQAv2 [VQAv2](https://doi.org/10.1109/CVPR.2017.670), which is widely adopted for evaluation. We acknowledge, though, that the metric used for the open-ended visual question answering benchmarks strongly penalizes models that do not generate in the same format as the ground truth. For example, answering "large" when the ground truth is "big" or more verbose reformulations will be counted as incorrect. Our manual qualitative analysis reveals that on benchmarks like VQAv2, the generations of two models differing by 5 points would be barely noticeable. This problem is less concerning for other open-ended benchmarks like TextVQA or DocVQA which require finding a text in an image, making the expected answer less prone to ambiguity.
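To illustrate why such metrics penalize reformulations, here is the commonly used simplified form of the VQA accuracy rule, under which a prediction only scores if it matches enough of the human-provided reference answers; the answer-normalization steps and subset averaging of the official evaluation are omitted in this sketch.

```python
def vqa_accuracy(prediction: str, human_answers: list[str]) -> float:
    """Simplified VQA accuracy: min(#matching human answers / 3, 1).
    The official metric also normalizes answers and averages over subsets."""
    matches = sum(prediction.strip().lower() == a.strip().lower() for a in human_answers)
    return min(matches / 3.0, 1.0)

humans = ["big"] * 8 + ["large"] * 2
print(vqa_accuracy("big", humans))    # 1.0
print(vqa_accuracy("large", humans))  # ~0.67 -- a valid synonym is penalized
```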
| **Model** | **Size** | **\# tokens per image** | **MMMU** | **MathVista** | **TextVQA** | **MMBench** | **DocVQA** | **VQAv2** |
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
| *7B-14B models* | | | | | | | | |
| LLaVA-NeXT | 13B | 2880 | 36.2/- | 35.3 | 67.1 | 70.0 | \- | 82.8 |
| DeepSeek-VL | 7B | 576 | 36.6/- | 36.1 | 64.4 | 73.2 | 49.6 | \- |
| MM1-Chat | 7B | 720 | 37.0/35.6 | 35.9 | 72.8 | 72.3 | \- | 82.8 |
| Idefics2 | 8B | 64 | 43.5/37.9 | 51.6 | 70.4 | 76.8 | 67.3 | 80.8 |
| Idefics2 | 8B | 320 | 43.0/37.7 | 51.4 | 73.0 | 76.7 | 74.0 | 81.2 |
| *$\geq$ 30B models* | | | | | | | | |
| Mini-Gemini-HD | 34B | 2880 | 48.0/44.9 | 43.3 | 74.1 | 80.6 | \- | \- |
| MM1-Chat | 30B | 720 | 44.7/40.3 | 39.4 | 73.5 | 75.1 | \- | 83.7 |
| LLaVA-NeXT | 34B | 2880 | 51.1/44.7 | 46.5 | 69.5 | 79.3 | \- | 83.7 |
| *Proprietary* | | | | | | | | |
| Gemini 1.0 Pro | \- | \- | 47.9/- | 45.2 | 74.6 | \- | 88.1 | 71.2 |
| Claude 3 Haiku | \- | \- | 50.2/- | 46.4 | \- | \- | 88.8 | \- |
| Claude 3 Sonnet | \- | \- | 53.1/- | 47.9 | \- | \- | 89.5 | \- |
| Gemini 1.5 Pro | \- | \- | 58.5/- | 52.1 | 73.5 | \- | 86.5 | 73.2 |

Performance of Idefics2 against state-of-the-art VLMs across different sizes. The evaluations are done in zero shot. Idefics2 with 64 or 320 tokens per image only differs by the image splitting. *(Benchmark, Split, Metric): (MMMU, val/test, MMMU score), (MathVista, testmini/test, MMMU score), (TextVQA, val, VQA acc.), (MMBench, test, accuracy), (DocVQA, test, ANLS score), (VQAv2, testdev, VQA acc.).*
### Qualitative evaluation We show in Figures 2, 3, and 4, examples of generations with Idefics2-chatty.
Idefics2-chatty finds the requested information in the resume, and organizes it in JSON format.
Idefics2-chatty describes an AI-generated image.
Idefics2-chatty answers a question on a scientific diagram.
## Red-teaming [sec:red_teaming] In the context of a red-teaming exercise, our objective is to evaluate the propensity of the model to generate inaccurate, biased, or offensive responses. We evaluate more specifically the chat-optimized checkpoint[^4]. While the model typically refrains from responding to offensive inputs, we observe that through repeated trials or guided interactions, it tends to hastily form judgments in situations necessitating nuanced contextual understanding, often perpetuating harmful stereotypes. Noteworthy instances include: - Speculating or passing judgments, or perpetuating historical disparities on individuals’ professions, social status, or insurance eligibility based solely on visual cues (e.g., age, attire, gender, facial expressions). - Generating content that promotes online harassment or offensive memes reinforcing harmful associations from a portrait, or from a benign image. - Assuming emotional states or mental conditions based on outward appearances. - Evaluating individuals’ attractiveness solely based on their visual appearance. Additionally, we identify behaviors that increase security risks that already exist: - Successfully solving CAPTCHAs featuring distorted text within images. - Developing phishing schemes from screenshots of legitimate websites to deceive users into divulging their credentials. - Crafting step-by-step guides on constructing small-scale explosives using readily available chemicals from common supermarkets or manipulating firearms to do maximum damage. It’s important to note that these security concerns are currently limited by the model’s occasional inability to accurately read text within images. We emphasize that the model would often encourage the user to exercise caution about the model’s generation or flag how problematic the initial query can be in the first place. For instance, when insistently prompted to write a racist comment, the model would answer that query before pointing out "*This type of stereotyping and dehumanization has been used throughout history to justify discrimination and oppression against people of color. By making light of such a serious issue, this meme perpetuates harmful stereotypes and contributes to the ongoing struggle for racial equality and social justice.*". However, certain formulations can circumvent (i.e. "jailbreak") these cautionary prompts, emphasizing the need for critical thinking and discretion when engaging with the model’s outputs. While jail-breaking text LLMs is an active research area, jail-breaking vision-language models have recently emerged as a new challenge as vision-language models become more capable and prominent [jailbreak](https://openreview.net/forum?id=plmBsXHxgR). The addition of the vision modality not only introduces new avenues for injecting malicious prompts but also raises questions about the interaction between vision and language vulnerabilities. [^1]: [^2]: [^3]: [^4]:
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality 2023-04-27 Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, Chenliang Li, Yuanhong Xu, Hehong Chen, Junfeng Tian, Qi Qian, Ji Zhang, Fei Huang, Jingren Zhou
Large language models (LLMs) have demonstrated impressive zero-shot abilities on a variety of open-ended tasks, while recent research has also explored the use of LLMs for multi-modal generation. In this study, we introduce mPLUG-Owl, a novel training paradigm that equips LLMs with multi-modal abilities through modularized learning of foundation LLM, a visual knowledge module, and a visual abstractor module. This approach can support multiple modalities and facilitate diverse unimodal and multimodal abilities through modality collaboration. The training paradigm of mPLUG-Owl involves a two-stage method for aligning image and text, which learns visual knowledge with the assistance of LLM while maintaining and even improving the generation abilities of LLM. In the first stage, the visual knowledge module and abstractor module are trained with a frozen LLM module to align the image and text. In the second stage, language-only and multi-modal supervised datasets are used to jointly fine-tune a low-rank adaption (LoRA) module on LLM and the abstractor module by freezing the visual knowledge module. We carefully build a visually-related instruction evaluation set OwlEval. Experimental results show that our model outperforms existing multi-modal models, demonstrating mPLUG-Owl's impressive instruction and visual understanding ability, multi-turn conversation ability, and knowledge reasoning ability. Besides, we observe some unexpected and exciting abilities such as multi-image correlation and scene text understanding, which makes it possible to leverage it for harder real scenarios, such as vision-only document comprehension. Our code, pre-trained model, instruction-tuned models, and evaluation set are available at https://github.com/X-PLUG/mPLUG-Owl. The online demo is available at https://www.modelscope.cn/studios/damo/mPLUG-Owl.
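The two-stage scheme described in the abstract can be summarized with a short freezing sketch; the module names (`vision_encoder`, `visual_abstractor`, `llm`) are placeholders for illustration, not the actual mPLUG-Owl code.

```python
import torch.nn as nn
from peft import LoraConfig, get_peft_model

def set_trainable(module: nn.Module, trainable: bool):
    for p in module.parameters():
        p.requires_grad = trainable

def configure_stage(model, stage: int):
    """Sketch of mPLUG-Owl's two-stage training (assumed attribute names)."""
    if stage == 1:
        # multimodal pre-training: train the visual knowledge module and the
        # abstractor, keep the LLM frozen
        set_trainable(model.vision_encoder, True)
        set_trainable(model.visual_abstractor, True)
        set_trainable(model.llm, False)
    else:
        # joint instruction tuning: freeze the vision encoder, train the
        # abstractor and LoRA adapters on the LLM
        set_trainable(model.vision_encoder, False)
        set_trainable(model.visual_abstractor, True)
        model.llm = get_peft_model(
            model.llm, LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
        )
    return model
```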
# Introduction Large language models (LLMs) such as GPT-3 [gpt3](http://arxiv.org/pdf/2112.07522v2), BLOOM [bloom](None), LLaMA [llama](http://arxiv.org/pdf/2402.08075v1) have experienced rapid development to make general artificial intelligence possible, which demonstrates impressive zero-shot abilities on various linguistic applications. However, except GPT-4 [gpt4](None), current general LLMs cannot support different modalities of input and develop impressive multimodal abilities. Although GPT-4 [gpt4](None) has exhibited remarkable multimodal abilities, the methods behind its extraordinary abilities remain a mystery. Recently, researchers have been extending LLMs to understand visual inputs in two different paradigms: systematic collaboration and end-to-end trained models. However, systematic collaboration approaches, including Visual ChatGPT [visualchatgpt](None), MM-REACT [mmreact](None), and HuggingGPT [hugginggpt](None), are designed to facilitate the coordination of various vision models or tools to express visual information with text descriptions. However, these approaches may not be able to comprehend specific multimodal instructions due to their lack of alignment with different modalities. Additionally, these approaches may encounter challenges related to inference efficiency and cost. End-to-end models, such as BLIP-2 [blip2](None), LLaVA [llava](http://arxiv.org/pdf/2402.11690v1), and MiniGPT-4 [minigpt4](http://arxiv.org/pdf/2402.17510v1) aim to use unified models to support different modalities. However, these models have some limitations as they take frozen visual models, which may lead to inadequate alignment due to the limited number of parameters. Moreover, they cannot unlock various abilities due to missing unimodal and multimodal instruction. In this paper, we present mPLUG-Owl with an innovative modularized training paradigm for large multi-modal language models that can support multiple modalities concurrently, drawing inspiration from the concept of modularization [mplug2](None), [mplug](None), [e2evlp](None), [hitea](https://doi.org/10.48550/arXiv.2212.14546). Our method harnesses the power of pre-trained LLM, visual knowledge module, and connected visual abstractor module to achieve effective alignment between images and text, and utilizes a two-stage training scheme to stimulate impressive unimodal and multimodal abilities. Our approach even enhances the strong generation abilities of LLM by modality collaboration between modalities. In the first step, we align the image and text to acquire comprehensive visual knowledge using text-image pairs, which is accomplished by training the visual knowledge module and abstractor module with the frozen LLM module. Subsequently, we fine-tune mPLUG-Owl with language-only and multi-modal instructions to unlock a range of unimodal and multimodal abilities. We freeze the visual knowledge module and train low-rank adaption (LoRA) [lora](http://arxiv.org/pdf/2402.11485v1) on LLM and visual abstractor module jointly. This approach allows for the effective integration of textual and visual information, facilitating the development of versatile and robust cognitive abilities. Our experiments on a carefully-built visually related instruction evaluation set OwlEval shows that mPLUG-Owl outperforms existing models such as MiniGPT-4 [minigpt4](http://arxiv.org/pdf/2402.17510v1) and LLaVA [llava](http://arxiv.org/pdf/2402.11690v1). 
We separately verify mPLUG-Owl’s remarkable abilities in instruction understanding, visual understanding, knowledge transfer, and multi-turn dialogue. Abundant ablation studies are performed to show the effectiveness of our training paradigm. Furthermore, we find some unexpected emerging abilities such as multi-image correlation, multilingual conversation and scene text understanding. Our main contributions can be highlighted as follows: - We propose mPLUG-Owl, a novel training paradigm for large language models through modularization. - We carefully construct an instruction evaluation set, dubbed **OwlEval**, to assess the capabilities of different models in the context of visual-related tasks. - Experimental results demonstrate that mPLUG-Owl excels in multi-modal instruction understanding and multi-turn dialogue, surpassing the performance of existing models. # Related Work ## Large Language Models In recent times, Large Language Models (LLMs) have garnered increasing attention for their exceptional performance in diverse natural language processing (NLP) tasks. Initially, transformer models such as BERT [bert](None), GPT [gpt1](http://arxiv.org/pdf/2310.01427v1), and T5 [t5](None) were developed with different pre-training objectives. However, the emergence of GPT-3 [gpt3](http://arxiv.org/pdf/2112.07522v2), which scales up the number of model parameters and data size, showcases significant zero-shot generalization abilities, enabling them to perform commendably on previously unseen tasks. Consequently, numerous LLMs such as OPT [opt](None), BLOOM [bloom](None), PaLM [palm](http://arxiv.org/pdf/2209.05735v4), and LLaMA [llama](http://arxiv.org/pdf/2402.08075v1) are created, ushering in the success of LLMs. Additionally, Ouyang et al. [instructgpt](http://arxiv.org/pdf/2302.05206v1) propose InstructGPT by aligning human instruction and feedback with GPT-3. Furthermore, it has been applied to ChatGPT [chatgpt](https://openai.com/blog/chatgpt), which facilitates conversational interaction with humans by responding to a broad range of diverse and intricate queries and instructions. ## Multi-Modal Large Language Models Despite the successful applications of LLMs in natural language processing, LLMs still struggle to perceive other modalities such as vision and audio. Recently, researchers have been extending language models to understand visual inputs in two different paradigms: systematic collaboration and end-to-end trained models. Systematic collaboration approaches, such as Visual ChatGPT [visualchatgpt](None), MM-REACT [mmreact](None), and HuggingGPT [hugginggpt](None), leverage various vision experts or tools to express visual information with text descriptions. Subsequently, large language models, such as ChatGPT, can act as the agents, and be prompted to select the appropriate experts and tools for visual understanding. Finally, LLMs would summarize the output of these experts to answer user queries. On the other hand, some approaches [blip2](None), [flamingo](http://arxiv.org/pdf/2205.07065v1), [llava](http://arxiv.org/pdf/2402.11690v1) leverage the pre-trained large language model to build unified models for multi-modality. For example, Flamingo [flamingo](http://arxiv.org/pdf/2205.07065v1) freezes the pre-trained vision encoder and large language model and fuses vision and language modalities with gated cross-attention, showing impressive few-shot capabilities.
Additionally, BLIP-2 [blip2](None) designs a Q-Former to align the visual features from a frozen visual encoder with large language models such as Flan-T5 [flant5](http://arxiv.org/pdf/2202.03371v1) and OPT [opt](None). Moreover, PaLM-E [palm-e](http://arxiv.org/pdf/2302.14030v3) feeds features from sensor modalities directly into PaLM [palm](http://arxiv.org/pdf/2209.05735v4), which has 520 billion parameters, contributing to robust performance on real-world perception tasks. Furthermore, some powerful instruction-tuned language models built upon the open-sourced foundation model LLaMA [llama](http://arxiv.org/pdf/2402.08075v1), such as Alpaca [alpaca](https://github.com/tatsu-lab/stanford_alpaca) and Vicuna [vicuna](https://github.com/lm-sys/FastChat), exhibit performance comparable to ChatGPT [chatgpt](https://openai.com/blog/chatgpt) and GPT-4 [gpt4](None). MiniGPT-4 [minigpt4](http://arxiv.org/pdf/2402.17510v1) and LLaVA [llava](http://arxiv.org/pdf/2402.11690v1) align these finetuned models with visual features extracted from a frozen visual backbone. In contrast, mPLUG-Owl not only aligns the representations of the vision and language foundation models (e.g., CLIP and LLaMA) in terms of knowledge acquisition and grounding in the real world, but can also understand language and multi-modal instructions, showcasing strong zero-shot generalization and multi-turn conversation capabilities. # mPLUG-Owl
Comparison between different training paradigms. All of these methods are trained in a two-stage fashion. Stage 1 stands for pre-training and Stage 2 represents instruction tuning.
## Architecture Overview As illustrated in Figure 1, there are mainly three types of end-to-end multimodal LLMs: 1) models that utilize a limited number of trainable parameters, with the LLM and visual models frozen during both pretraining and instruction tuning, such as MiniGPT4; 2) models that incorporate trainable LLMs and frozen visual models, exemplified by Kosmos-1; and 3) models with trainable LLMs during instruction tuning and frozen visual models, as seen in LLaVA. Nevertheless, these models exhibit certain constraints since they depend on frozen visual models, which can lead to insufficient alignment due to the limited number of trainable parameters. Furthermore, they fail to effectively stimulate a diverse set of abilities, as they lack both unimodal and multimodal instruction tuning. To this end, we propose mPLUG-Owl, a multi-modal language model that is capable of perceiving various modalities while taking the visual context and information into account and generating corresponding outputs. Specifically, as illustrated in Figure [fig:model], mPLUG-Owl consists of a vision foundation model $f_{\mathbf{V}}$ to encode the visual knowledge, a language foundation model $f_{\mathbf{L}}$, and a visual abstractor module $f_{\mathbf{K}}$. We first obtain dense image representations from the pre-trained visual foundation model $f_{\mathbf{V}}$. However, such dense features would fragment the fine-grained image information and incur a large computational cost due to the lengthy sequence when fed into $f_{\mathbf{L}}$. To mitigate this issue, we employ the visual abstractor module $f_{\mathbf{K}}$ to summarize the visual information within several learnable tokens, thereby obtaining higher-level semantic visual representations and reducing computation, as illustrated in Figure [fig:model] (a minimal sketch of this abstractor is given below). The visual representations are combined with text queries and fed into the language model to generate the response. ## Training Scheme #### Multimodal Pretraining Large-scale language models, such as GPT-3 [gpt3](http://arxiv.org/pdf/2112.07522v2) and LLaMA [llama](http://arxiv.org/pdf/2402.08075v1), are trained on extensive and diverse data collected from the internet, providing them with a comprehensive understanding of the world. This vast knowledge base endows these models with remarkable capabilities across a range of tasks. However, the utilization of visual information in such models remains underexplored. Previous approaches [minigpt4](http://arxiv.org/pdf/2402.17510v1), [llava](http://arxiv.org/pdf/2402.11690v1) have employed a limited number of additional parameters to learn the alignment between visual data and language models, constraining their capacity to comprehend complex visual information. To enhance the ability of large-scale language models to perceive visual information while integrating their internal abilities, we propose a novel training paradigm that incorporates a trainable visual backbone $f_{\mathbf{V}}$ and an additional visual abstractor $f_{\mathbf{K}}$, while keeping the pre-trained language model $f_{\mathbf{L}}$ frozen. This approach enables the model to effectively capture both low-level and high-level semantic visual information and align it with the pre-trained language model without compromising its performance. #### Joint Instruction Tuning Upon completion of the prior phase, the model acquires the ability to retain a considerable amount of knowledge and provide reasonable answers to human queries. Nonetheless, it continues to exhibit challenges in generating coherent linguistic responses.
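Before turning to the instruction tuning details, here is a rough sketch of the visual abstractor $f_{\mathbf{K}}$ introduced above (PyTorch, with placeholder dimensions and a single cross-attention block, since the exact layer configuration is not specified here): a small set of learnable query tokens summarizes the long sequence of dense patch features before it reaches the LLM.

```python
import torch
import torch.nn as nn

class VisualAbstractor(nn.Module):
    """Sketch of f_K: summarize dense visual features into a few learnable tokens."""

    def __init__(self, vis_dim=1024, llm_dim=4096, num_queries=64, num_heads=8):
        super().__init__()
        # Learnable query tokens that will carry the summarized visual information.
        self.queries = nn.Parameter(torch.randn(num_queries, llm_dim))
        # Project dense ViT patch features into the query/LLM dimension.
        self.vis_proj = nn.Linear(vis_dim, llm_dim)
        # Cross-attention: queries attend to the projected visual features.
        self.cross_attn = nn.MultiheadAttention(llm_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(llm_dim)

    def forward(self, dense_visual_feats):
        # dense_visual_feats: (batch, num_patches, vis_dim) from the vision encoder f_V.
        kv = self.vis_proj(dense_visual_feats)                    # (B, P, llm_dim)
        q = self.queries.unsqueeze(0).expand(kv.size(0), -1, -1)  # (B, Q, llm_dim)
        summarized, _ = self.cross_attn(q, kv, kv)                # (B, Q, llm_dim)
        # The Q summarized tokens are concatenated with text embeddings and fed to f_L.
        return self.norm(summarized)
```

The real abstractor may stack several such blocks; the point is simply that a handful of query tokens replaces the lengthy patch sequence, preserving higher-level semantics while reducing the LLM's input length.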
As posited in GPT-3 [gpt3](http://arxiv.org/pdf/2112.07522v2), refining the model through instruction tuning is essential for accurately discerning user intentions. Previous attempts [mplug](None), [mplug2](None) in multi-modal learning have demonstrated that joint learning from uni-modal and multi-modal sources can lead to significant improvements owing to the collaboration between different modalities. Building on this insight, we present a novel vision-language joint instruction tuning strategy to facilitate better alignment between mPLUG-Owl and human instructions and intentions. Specifically, given that the model can comprehend the visual concepts and knowledge depicted in images through visual knowledge learning, we freeze the entire model and employ low-rank adaptation (i.e., LoRA [lora](http://arxiv.org/pdf/2402.11485v1)) to adapt $f_{\mathbf{L}}$ by training multiple low-rank matrices for efficient alignment with human instructions (a schematic sketch is given below). For each data record, we unify it into a conversation snippet following the Vicuna [vicuna](https://github.com/lm-sys/FastChat) format, and we compute the loss only on the response. During training, we accumulate gradients over multiple batches of text-only instruction data and multi-modal instruction data before updating the parameters. Therefore, by joint training with both language and multi-modal instructions, mPLUG-Owl can better understand a wide range of instructions and respond with more natural and reliable outputs. Moreover, our approach can easily handle various text and multi-modal instructions without requiring realignment of the vision and language models, as is needed by methods such as MiniGPT-4 [minigpt4](http://arxiv.org/pdf/2402.17510v1) and LLaVA [llava](http://arxiv.org/pdf/2402.11690v1). #### Training Objective The model is trained using the language modeling task, which entails learning to generate subsequent tokens based on the preceding context. The primary objective of the training process is to maximize the log-likelihood of the tokens. It is important to note that only discrete tokens, such as text tokens, are considered in the calculation of the training loss. Most significantly, the diverse capabilities that emerge from this training task during the joint instruction tuning stage enhance the performance of mPLUG-Owl in downstream applications. # Experiment ## Experimental Setup #### Model Settings. We choose ViT-L/14 [vit](http://arxiv.org/pdf/2105.15075v2) as the visual foundation model $f_{\mathbf{V}}$; it has 24 layers, a hidden dimension of 1024, and a patch size of 14. For faster convergence, the ViT is initialized from the CLIP ViT-L/14 model pre-trained via contrastive learning. Different from LLaVA [llava](http://arxiv.org/pdf/2402.11690v1) and MiniGPT-4 [minigpt4](http://arxiv.org/pdf/2402.17510v1), and to demonstrate the effectiveness and generalization ability of our approach, we utilize the raw LLaMA-7B [llama](http://arxiv.org/pdf/2402.08075v1) rather than its instruction-tuned variants such as Alpaca [alpaca](https://github.com/tatsu-lab/stanford_alpaca) and Vicuna [vicuna](https://github.com/lm-sys/FastChat). The total number of parameters of mPLUG-Owl is about 7.2B. More details about the hyper-parameters can be found in the Appendix. #### Data and Training Details. For the first stage, we utilize image-caption pairs from several datasets, including LAION-400M [laion400m](None), COYO-700M [coyo700m](https://github.com/kakaobrain/coyo-dataset), Conceptual Captions [conceptualcap](None) and MSCOCO [cococap](None).
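Returning to the joint instruction tuning stage described above, a schematic sketch of the LoRA adaptation (PyTorch; the rank and scaling values are placeholders, not the ones used by the authors) wraps a frozen linear layer of the LLM with a trainable low-rank update, which is the only part of $f_{\mathbf{L}}$ that receives gradients in this stage:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update (sketch)."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # original LLM weights stay frozen
            p.requires_grad_(False)
        in_f, out_f = base.in_features, base.out_features
        self.lora_a = nn.Parameter(torch.randn(in_f, rank) * 0.01)  # trainable
        self.lora_b = nn.Parameter(torch.zeros(rank, out_f))        # trainable, init to 0
        self.scaling = alpha / rank

    def forward(self, x):
        # Base projection plus the low-rank correction; only lora_a / lora_b are updated.
        return self.base(x) + (x @ self.lora_a @ self.lora_b) * self.scaling
```

In this scheme only adapters of this kind, together with the visual abstractor, would be handed to the optimizer, while the vision encoder (frozen in this stage) and the base LLM weights stay untouched.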
We use a batch size of 2.1 million tokens and train mPLUG-Owl for 50k steps, corresponding to about 104 billion tokens. We adopt the AdamW optimizer with $\beta=(0.9, 0.98)$, and set the learning rate and weight decay to 0.0001 and 0.1 respectively. We warm up the training with 2k warm-up steps and then decay the learning rate with a cosine schedule. The input image is randomly resized to $224\times 224$. Besides, we tokenize the text input with the SentencePiece [sentencepiece](None) tokenizer. For the second stage, we gather pure text instruction data from three distinct sources: 102k samples from Alpaca [alpaca](https://github.com/tatsu-lab/stanford_alpaca), 90k from Vicuna [vicuna](https://github.com/lm-sys/FastChat), and 50k from Baize [baize](None). Additionally, we utilize 150k multi-modal instruction samples from the LLaVA dataset [llava](http://arxiv.org/pdf/2402.11690v1). We train mPLUG-Owl for 2k steps with a batch size of 256 and a learning rate of 2e-5. #### Baselines. We compare our mPLUG-Owl with end-to-end models and systematic collaboration approaches as follows: - *OpenFlamingo* [openflamingo](None) is an open-source version of the Flamingo [flamingo](http://arxiv.org/pdf/2205.07065v1) model. We use the released code of OpenFlamingo-9B[^3] to run zero-shot generation. - *BLIP-2* [blip2](None) is pre-trained through bootstrapped learning from off-the-shelf frozen pre-trained image models and large language models using an efficient pre-training strategy. We use the released code of BLIP-2 ViT-G FlanT5$_{XXL}$[^4] to perform zero-shot generation. - *MiniGPT-4* [minigpt4](http://arxiv.org/pdf/2402.17510v1) utilizes a single projection layer to align visual information from a pre-trained vision encoder with an LLM. Specifically, it employs the same visual encoder as BLIP-2, a ViT coupled with its pre-trained Q-Former, and Vicuna as the LLM. We use the released demonstration[^5] to perform image-instruction generation. - *LLaVA* [llava](http://arxiv.org/pdf/2402.11690v1) applies a single projection layer to convert image features from the pre-trained CLIP ViT-L/14 visual encoder into the language embedding space of Vicuna. We use their released demonstration[^6] to perform image-instruction generation. - *MM-REACT* [mmreact](None) integrates ChatGPT/GPT-4 with various specialized vision experts to achieve multimodal reasoning and action. We use their released demonstration[^7] to get responses. ## Quantitative analysis
The comparison between mPLUG-Owl and baselines on OwlEval with manual evaluation metrics. The order of response quality ranking is as follows: A > B > C > D.
In order to comprehensively evaluate various models, we construct a visually related evaluation set, **OwlEval**, by collecting 82 artificially constructed questions based on 50 images, of which 21 are from MiniGPT-4, 13 from MM-REACT, 9 from BLIP-2, 3 from GPT-4, and 4 collected by us. Some images have multiple rounds of questions, corresponding to multi-turn conversation cases. These questions examine a variety of model capabilities including natural image understanding, diagram and flowchart comprehension, optical character recognition (OCR), multi-modal creation, knowledge-intensive QA, and referential interaction QA. As the questions are open-ended, we employ manual evaluation metrics to rate the model’s responses as A, B, C, or D following the rating method proposed in Self-Instruct [self-instruct](https://doi.org/10.48550/arXiv.2212.10560). We manually score the 82 responses given by mPLUG-Owl and the baselines. The comparison results are shown in Figure 2. First, mPLUG-Owl gets 66 $A$ and $B$ ratings, while the most competitive baseline, MiniGPT-4, gets 54. Second, mPLUG-Owl doesn’t get any $D$ scores, outperforming all the other models. These results suggest that mPLUG-Owl can better understand both instructions and images, which results in a stronger capability to generate satisfactory responses. For a fair comparison, we have excluded those cases in which MM-REACT failed to make predictions. The results are shown separately in Figure 14, and mPLUG-Owl still exhibits superior performance. To separately examine the single-turn and multi-turn conversation capabilities, we reorganize the 82 questions into a single-turn conversation set and a multi-turn conversation set. The former contains the first question from each of the 50 images. The latter contains the 52 questions from the multi-turn conversation cases. As shown in Figure 3, mPLUG-Owl achieves outstanding performance in both single-turn and multi-turn conversations.
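Since OwlEval relies on manual A–D ratings, aggregating the results is just a matter of counting labels per model. A small sketch with made-up ratings (the real annotations are not reproduced here):

```python
from collections import Counter

# Hypothetical manual ratings for one model's 82 OwlEval responses (truncated here).
ratings = ["A", "B", "A", "C", "B", "A", "B", "C"]

counts = Counter(ratings)
a_and_b = counts["A"] + counts["B"]
print(f"A: {counts['A']}, B: {counts['B']}, C: {counts['C']}, D: {counts['D']}")
print(f"A+B responses: {a_and_b} / {len(ratings)}")
```

The same tally, restricted to the first question of each image or to the multi-turn cases, yields the single-turn and multi-turn breakdowns reported above.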
The comparison results of 50 single-turn responses (left) and 52 multi-turn responses (right) among mPLUG-Owl and baselines on OwlEval with manual evaluation metrics.
## Ablation Study We ablate the two-stage training scheme and the data modality of instruction tuning. Six dimensions of ability required to complete visually related tasks are defined, as shown in Table [fig:mult-modle-level]. For each question, we manually label the required abilities and annotate which abilities are reflected in the model’s response. Table [tb:ablation] shows the ability accuracy of different variants of mPLUG-Owl.
**Training Strategy Ablation.** As shown in Table [tb:ablation], without joint instruction tuning, the model is not good at instruction understanding and fails to generalize its pre-training abilities to other tasks (r1 vs r5). With instruction tuning alone, although the model can better comprehend instructions, it is incapable of achieving promising performance on visual knowledge-related tasks due to the lack of visually related knowledge pretraining (r2 vs r5). With both multimodal pretraining and joint instruction tuning, the model achieves the best performance, demonstrating the effectiveness of our two-stage training scheme. **Instruction Data Ablation.** Comparing r3 with r4, text-only instruction tuning brings more improvement in instruction understanding, while multi-modal instruction tuning achieves better knowledge and reasoning capabilities. This is because visual question answering mainly requires the alignment of vision and language knowledge, which is not optimized during text-only instruction tuning. Besides, we also verify that introducing multi-modal data during instruction tuning can further improve the model’s performance on text-only tasks, as shown in Table [tab:text-only result] (r5 vs r4). Concretely, following the evaluation setting of Vicuna [vicuna](https://github.com/lm-sys/FastChat), for each question we pair the response of each model with the one given by ChatGPT and prompt ChatGPT[^8] to give two scores, one for each of the two responses. Table [tab:text-only result] shows the total score and the score ratio with the ChatGPT score as a reference.
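The text-only comparison above follows Vicuna's practice of asking ChatGPT to score each model response against ChatGPT's own answer. The sketch below shows the bookkeeping involved; `ask_chatgpt` and the judging prompt are stand-ins, since neither the API wrapper nor the exact prompt wording is given here:

```python
def ask_chatgpt(prompt: str) -> str:
    """Stand-in for a call to the judge model (ChatGPT); not implemented here."""
    raise NotImplementedError

def judge_pair(question: str, reference_answer: str, candidate_answer: str):
    # Illustrative judging prompt; the wording used in the paper is not reproduced.
    prompt = (
        f"Question: {question}\n"
        f"Answer 1 (ChatGPT): {reference_answer}\n"
        f"Answer 2 (candidate): {candidate_answer}\n"
        "Rate each answer on a scale from 1 to 10. "
        "Reply with the two scores separated by a space."
    )
    ref_score, cand_score = map(float, ask_chatgpt(prompt).split())
    return ref_score, cand_score

def total_and_ratio(score_pairs):
    """Total candidate score and its ratio to the ChatGPT reference total."""
    ref_total = sum(r for r, _ in score_pairs)
    cand_total = sum(c for _, c in score_pairs)
    return cand_total, cand_total / ref_total
```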
## Qualitative Analysis In this section, we show qualitative results from our evaluation set OwlEval.
A comparison of Knowledge-intensive QA.
#### Knowledge-intensive QA As shown in Figure 4, the instruction expects the model to identify the movie characters in the image. MM-REACT is unable to provide an effective response to the instruction, while MiniGPT-4 understands the instruction but fails to name the movie characters. In contrast, mPLUG-Owl identifies four of the five characters present in the image. This demonstrates that mPLUG-Owl has a better understanding of the knowledge contained in the image.
A comparison of Multi-turn Conversation.
#### Multi-turn Conversation The instruction in Figure 5 requires the model to identify the content of the image based on referential information. The baseline models often make mistakes when faced with referential expressions related to spatial orientation, human behavior, and target attributes, whereas mPLUG-Owl provides the most accurate response. This capability stems from mPLUG-Owl’s fine-grained understanding of the image, which allows it to locate the corresponding part of the image based on the referential information in the instruction.
A comparison of Reasoning QA.
#### Reasoning Figure 6 shows an instruction asking the models to make a prediction based on visual information and explain the reason. mPLUG-Owl analyzes the characteristics of the two teams in terms of lineup and tactics and uses them to reason about the outcome. Although MiniGPT-4 also performs well, its reasoning is slightly less persuasive than that of mPLUG-Owl.
A comparison of Joke Understanding.
More cases of Jokes Comprehension by mPLUG-Owl.
#### Joke Comprehension The case in Figure 7 comes from GPT-4 [gpt4](None) and requires the model to understand and explain a visually related joke. GPT-4 not only follows the instruction to perform the analysis panel by panel but also almost perfectly understands the humor of the charging method. mPLUG-Owl also understands this unusual humor, but it incorrectly identifies the “VGA” connector as “USB”. This is mainly due to the limited visual information in our training data. More cases of joke comprehension are shown in Figure 8. # Discussion and Limitation In this section, we show some nascent abilities of mPLUG-Owl that are not yet fully developed and discuss the limitations. Some of the cases (without scores) in this section are not in OwlEval.
Multi-image correlation cases.
#### Multi-image Correlation In Figure 9, mPLUG-Owl shows an emerging but not yet strong vision correlation capability across multiple images. In the left case, the model can identify the same person in two images and correctly tell the difference in clothing color. In the right case, however, the model fails to relate the four images and produces some textual hallucinations.
Example prompt of multilingual understanding which showcases the multilingual abilities across Chinese, French, and Japanese, respectively.
#### Multilingual Conversation Besides English, we further test the model’s multilingual ability. As shown in Figure 10, although there is no multilingual data in our two-stage training, mPLUG-Owl shows promising multilingual understanding for Chinese, French and Japanese. We mainly attribute this ability to the raw text knowledge in LLaMA [llama](http://arxiv.org/pdf/2402.08075v1). However, due to the lack of multilingual training, mPLUG-Owl may fail to respond in the corresponding language. #### Scene Text Understanding In Figure 15, mPLUG-Owl demonstrates its OCR ability in some simple scenes, but we can see that the model’s perception of numbers in images is still limited. For the OCR of complex scenes, as shown in Figures 16-17, the performance of mPLUG-Owl is weaker, mainly because its perception of numbers in images is weak, which affects the subsequent reasoning and calculation. #### Vision-only Document Comprehension Although we did not use any document annotation data for training, the model exhibits some text recognition and document understanding capabilities. Hence, we delve deeper into the combination of document understanding and functionality of our model. As illustrated in Figure 11, we explore movie review writing, code generation, code explanation, chat summary, and application guidance. The model shows decent performance in (a) and (b), but still makes some errors, and it is unable to provide usable responses in (d), (e), and (f). Therefore, there is further scope to explore our model’s potential in document understanding and downstream applications.
Examples of various document understanding tasks and applications.
#### Open-ended Creation mPLUG-Owl performs well in the creation of poetry, lyrics, advertisements and other works based on images. Its performance in some cases is shown in Figure 12-13. However, further exploration is needed for more functional and practical creations.
Open-ended creation cases.
Copywriting cases.
# Conclusion We propose mPLUG-Owl, a novel training paradigm that enhances the multi-modal abilities of large language models (LLMs). Our approach consists of modularized learning of a foundation LLM, a visual knowledge module, and a visual abstractor module, which can support multiple modalities and facilitate diverse unimodal and multimodal abilities through modality collaboration. We employ a two-stage method for aligning image and text, which learns visual knowledge with the assistance of the LLM while maintaining and even improving the generation abilities of the LLM. Experimental results demonstrate the impressive capabilities of mPLUG-Owl, indicating its potential for various applications in multi-modal generation. # Training Hyperparameters We report the detailed model training hyperparameters for visual knowledge learning in Table 1 and for vision-language joint instruction tuning in Table 2.
| **Hyperparameters**               |             |
|:----------------------------------|:-----------:|
| Training steps                    | 50,000      |
| Warmup steps                      | 375         |
| Max length                        | 512         |
| Batch size of image-caption pairs | 4,096       |
| Optimizer                         | AdamW       |
| Learning rate                     | 2e-4        |
| Learning rate decay               | Cosine      |
| Adam $\epsilon$                   | 1e-6        |
| Adam $\beta$                      | (0.9, 0.98) |
| Weight decay                      | 0.01        |

Training hyperparameters for the multi-modal pre-training stage.
| **Hyperparameters**                        |              |
|:-------------------------------------------|:------------:|
| Training steps                             | 2,000        |
| Warmup steps                               | 50           |
| Max length                                 | 1,024        |
| Batch size of text instruction data        | 128          |
| Batch size of multi-modal instruction data | 128          |
| Optimizer                                  | AdamW        |
| Learning rate                              | 2e-5         |
| Learning rate decay                        | Cosine       |
| AdamW $\epsilon$                           | 1e-6         |
| AdamW $\beta$                              | (0.9, 0.999) |
| Weight decay                               | 0.0001       |

Training hyperparameters for the vision-language joint instruction tuning stage.
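For reference, a minimal sketch (PyTorch; `trainable_params` would be the LoRA matrices and the visual abstractor parameters) of how the joint instruction tuning optimizer and schedule in Table 2 could be instantiated, assuming linear warmup followed by cosine decay:

```python
import math
import torch

def build_optimizer_and_scheduler(trainable_params, total_steps=2000, warmup_steps=50):
    # Values mirror Table 2: AdamW, lr 2e-5, betas (0.9, 0.999),
    # eps 1e-6, weight decay 1e-4, cosine learning-rate decay.
    optimizer = torch.optim.AdamW(
        trainable_params, lr=2e-5, betas=(0.9, 0.999), eps=1e-6, weight_decay=1e-4
    )

    def lr_lambda(step):
        if step < warmup_steps:                                   # linear warmup
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))         # cosine decay to 0

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```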
# Comparison with MM-REACT
The comparison results which exclude the cases that were generated unsuccessfully by MM-REACT.
OCR of simple scenes (a): mostly scenes with few numbers and no calculation required.
OCR of complex scenes (a).
OCR of complex scenes (b).
[^1]: Equal contribution

[^2]: Corresponding author

[^3]:

[^4]:

[^5]:

[^6]:

[^7]:

[^8]: Without access to GPT-4, we use ChatGPT as a suboptimal scorer.
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models 2023-04-20 Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, Mohamed Elhoseiny
The recent GPT-4 has demonstrated extraordinary multi-modal abilities, such as directly generating websites from handwritten text and identifying humorous elements within images. These features are rarely observed in previous vision-language models. However, the technical details behind GPT-4 continue to remain undisclosed. We believe that the enhanced multi-modal generation capabilities of GPT-4 stem from the utilization of sophisticated large language models (LLM). To examine this phenomenon, we present MiniGPT-4, which aligns a frozen visual encoder with a frozen advanced LLM, Vicuna, using one projection layer. Our work, for the first time, uncovers that properly aligning the visual features with an advanced large language model can possess numerous advanced multi-modal abilities demonstrated by GPT-4, such as detailed image description generation and website creation from hand-drawn drafts. Furthermore, we also observe other emerging capabilities in MiniGPT-4, including writing stories and poems inspired by given images, teaching users how to cook based on food photos, and so on. In our experiment, we found that the model trained on short image caption pairs could produce unnatural language outputs (e.g., repetition and fragmentation). To address this problem, we curate a detailed image description dataset in the second stage to finetune the model, which consequently improves the model's generation reliability and overall usability. Our code, pre-trained model, and collected dataset are available at https://minigpt-4.github.io/.
# Introduction In recent years, large language models (LLMs) have experienced rapid advancements [instructGPT](http://arxiv.org/pdf/2302.05206v1), [chatgpt](http://arxiv.org/pdf/2307.11380v2), [gpt3](http://arxiv.org/pdf/2112.07522v2), [bloom](http://arxiv.org/pdf/2106.06683v2), [llama](http://arxiv.org/pdf/2402.08075v1), [chowdhery2022palm](http://arxiv.org/pdf/2209.05735v4), [hoffmann2022training](http://arxiv.org/pdf/2202.03371v1). With exceptional language understanding capabilities, these models can perform a variety of intricate linguistic tasks in a zero-shot manner. Notably, GPT-4, a large-scale multimodal model, has been recently introduced and demonstrated several impressive capabilities of vision-language understanding and generation [gpt4](http://arxiv.org/pdf/2311.15732v2). For example, GPT-4 can produce detailed and accurate image descriptions, explain unusual visual phenomena, and even construct websites based on handwritten text instructions. Although GPT-4 has exhibited remarkable vision language capabilities, the methods behind its exceptional abilities are still a mystery [gpt4](http://arxiv.org/pdf/2311.15732v2). We believe that these impressive skills may stem from the utilization of a more advanced large language model (LLM). LLMs have demonstrated various emergent abilities, as evidenced in GPT-3’s few-shot prompting setup [gpt3](http://arxiv.org/pdf/2112.07522v2) and the findings of Wei *et al*. (2022) [wei2022emergent](https://openreview.net/forum?id=yzkSU5zdwD). Such emergent properties are hard to find in smaller-scale models. It is conjectured that these emergent abilities are also applicable to multi-modal models, which could be the foundation of GPT-4’s impressive visual description capabilities. To substantiate our hypothesis, we present a novel vision-language model named MiniGPT-4.
It utilizes an advanced large language model (LLM), Vicuna [vicuna2023](https://vicuna.lmsys.org), which is built upon LLaMA [llama](http://arxiv.org/pdf/2402.08075v1) and reported to achieve 90% of ChatGPT’s quality as per GPT-4’s evaluation, as the language decoder. In terms of visual perception, we employ the same pretrained vision components as BLIP-2 [blip2](http://arxiv.org/pdf/2301.12597v3), which consist of a ViT-G/14 from EVA-CLIP [fang2022eva](http://arxiv.org/pdf/2402.18128v1) and a Q-Former network. MiniGPT-4 adds a single projection layer to align the encoded visual features with the Vicuna language model and freezes all the other vision and language components. MiniGPT-4 is initially trained for 20k steps using a batch size of 256 on 4 A100 GPUs, leveraging a combined image captioning dataset that includes images from LAION [laion](http://arxiv.org/pdf/2111.02114v1), Conceptual Captions [changpinyo2021conceptual](http://arxiv.org/pdf/2102.08981v2), [sharma2018conceptual](http://arxiv.org/pdf/2304.13130v1), and SBU [ordonez2011im2text](http://arxiv.org/pdf/2204.00679v1) to align visual features with the Vicuna language model. Nevertheless, merely aligning visual features with the language model (LLM) is inadequate to ensure robust visual conversation capabilities resembling those of a chatbot. The presence of underlying noise in raw image-text pairs can lead to subpar language outputs. Therefore, we collect another 3,500 detailed image description pairs to further fine-tune the model with a designed conversational template in order to improve the naturalness of the generated language and its usability.
The architecture of MiniGPT-4. It consists of a vision encoder with a pretrained ViT and Q-Former, a single linear projection layer, and an advanced Vicuna large language model. MiniGPT-4 only requires training the linear projection layer to align the visual features with the Vicuna.
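A minimal sketch of this alignment step (PyTorch, with placeholder dimensions): the output tokens of the frozen ViT + Q-Former are passed through a single trainable linear layer and prepended to the text embeddings as a soft prompt for the frozen Vicuna model.

```python
import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    """Single linear layer mapping frozen Q-Former outputs into the LLM embedding space."""

    def __init__(self, qformer_dim=768, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(qformer_dim, llm_dim)  # the only trainable component

    def forward(self, qformer_tokens):
        # qformer_tokens: (batch, num_query_tokens, qformer_dim) from the frozen
        # ViT + Q-Former; the projected tokens act as a soft prompt for the frozen LLM.
        return self.proj(qformer_tokens)

def build_llm_inputs(visual_soft_prompt, text_embeddings):
    # Prepend the projected visual tokens to the embedded text prompt
    # before the (frozen) language model's forward pass.
    return torch.cat([visual_soft_prompt, text_embeddings], dim=1)
```

Because only this projection carries gradients, the alignment stage is cheap, which is consistent with the roughly 10 hours on 4 A100 GPUs reported below.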
In our experiments, we discovered that MiniGPT-4 possesses numerous capabilities similar to those demonstrated by GPT-4. For instance, MiniGPT-4 can generate intricate image descriptions, create websites based on handwritten text instructions, and explain unusual visual phenomena. Furthermore, our findings revealed that MiniGPT-4 also has a variety of other intriguing abilities not showcased in the GPT-4 demonstrations. For example, MiniGPT-4 can directly generate detailed cooking recipes from food photos, write stories or poems inspired by images, write advertisements for products in images, identify problems shown in photos and provide corresponding solutions, and retrieve rich facts about people, movies, or art directly from images, among other capabilities. These abilities are absent in previous vision-language models like Kosmos-1 [kosmos](http://arxiv.org/pdf/2302.14045v2) and BLIP-2 [blip2](http://arxiv.org/pdf/2301.12597v3) that use less powerful language models. This further validates that integrating visual features with an advanced language model is one of the keys to enhancing vision-language models. We present a summary of our key findings: - Our research reveals with compelling evidence that by aligning visual features with advanced large language models like Vicuna, MiniGPT-4 can achieve advanced vision-language capabilities comparable to those exhibited in the GPT-4 demonstrations. - Our findings suggest that training merely one projection layer can effectively align a pretrained vision encoder with the large language model. Our MiniGPT-4 only requires training for approximately 10 hours on 4 A100 GPUs. - We discovered that simply aligning visual features with large language models using short image caption pairs is not sufficient for developing a well-performing model and leads to unnatural language generation. Further finetuning with a small set of detailed image description pairs can address this limitation and significantly improve its usability. **Large language models** have experienced tremendous success in recent years due to the scaling up of training data and an increase in the number of parameters. Early models, such as BERT [bert](http://arxiv.org/pdf/1810.04805v2), GPT-2 [gpt2](http://arxiv.org/pdf/2203.12926v1), and T5 [t5](http://arxiv.org/pdf/1910.10683v4), laid the foundation for this progress. Subsequently, GPT-3 [gpt3](http://arxiv.org/pdf/2112.07522v2), with a massive scale of 175 billion parameters, was introduced, demonstrating significant breakthroughs across numerous language benchmarks. This development inspired the creation of various other large language models, including Megatron-Turing NLG [smith2022using](http://arxiv.org/pdf/2201.11990v3), Chinchilla [hoffmann2022training](http://arxiv.org/pdf/2202.03371v1), PaLM [chowdhery2022palm](http://arxiv.org/pdf/2209.05735v4), OPT [zhang2022opt](http://arxiv.org/pdf/2405.04515v2), BLOOM [scao2022bloom](http://arxiv.org/pdf/2106.06683v2), and LLaMA [llama](http://arxiv.org/pdf/2402.08075v1), among others. Wei *et al.* [wei2022emergent](https://openreview.net/forum?id=yzkSU5zdwD) further discovered several *emergent abilities*, which appear exclusively in large models. The emergence of these abilities underscores the importance of scaling up in the development of large language models.
Moreover, by aligning the pre-trained large language model GPT-3 with human intent, instructions and human feedback, InstructGPT [instructGPT](http://arxiv.org/pdf/2302.05206v1) and ChatGPT [chatgpt](http://arxiv.org/pdf/2307.11380v2) enable conversational interactions with humans and can answer a wide range of diverse and complex questions. More recently, several open-sourced models, such as Alpaca [alpaca](https://github.com/tatsu-lab/stanford_alpaca) and Vicuna [vicuna2023](https://vicuna.lmsys.org), have been developed based on LLaMA [llama](http://arxiv.org/pdf/2402.08075v1) and also exhibit similar performance. **Leveraging Pre-trained LLMs in Vision-Language Tasks.** In recent years, the trend of using autoregressive language models as decoders in vision-language tasks has gained significant traction [visualgpt](http://arxiv.org/pdf/2102.10407v5), [kosmos](http://arxiv.org/pdf/2302.14045v2), [yang2022zero](http://arxiv.org/pdf/2206.08155v2), [tiong2022plug](http://arxiv.org/pdf/2210.08773v3), [alayrac2022flamingo](http://arxiv.org/pdf/2205.07065v1), [blip2](http://arxiv.org/pdf/2301.12597v3), [blip1](http://arxiv.org/pdf/2311.01038v2), [palm_e](http://arxiv.org/pdf/2302.14030v3). This approach takes advantage of cross-modal transfer, allowing knowledge to be shared between language and multimodal domains. Pioneering studies like VisualGPT [visualgpt](http://arxiv.org/pdf/2102.10407v5) and Frozen [tsimpoukelli2021multimodal](http://arxiv.org/pdf/2106.13884v2) have demonstrated the benefits of employing a pre-trained language model as a vision-language model decoder. Flamingo [alayrac2022flamingo](http://arxiv.org/pdf/2205.07065v1) was then developed to align a pre-trained vision encoder and language model using gated cross-attention, and was trained on billions of image-text pairs, showcasing impressive in-context few-shot learning capabilities. Following that, BLIP-2 [blip2](http://arxiv.org/pdf/2301.12597v3) was introduced, employing a Flan-T5 [flanT5](http://arxiv.org/pdf/2202.03371v1) with a Q-Former to efficiently align visual features with the language model. Most recently, PaLM-E [palm_e](http://arxiv.org/pdf/2302.14030v3), featuring 562 billion parameters, has been developed to integrate real-world continuous sensor modalities into an LLM, thereby establishing a connection between real-world perceptions and human languages. GPT-4 [gpt4](http://arxiv.org/pdf/2311.15732v2) has also been recently released, showcasing more powerful visual understanding and reasoning abilities after pre-training on a vast collection of aligned image-text data. LLMs, such as ChatGPT, have proven to be powerful tools in enhancing the performance of vision-language tasks by collaborating with other specialized models. For instance, Visual ChatGPT [visualChatGPT](http://arxiv.org/pdf/2303.04671v1) and MM-REACT [yang2023mmreact](http://arxiv.org/pdf/2303.11381v1) showcase how ChatGPT can act as a coordinator, integrating with diverse visual foundation models and facilitating their collaboration to tackle more complex challenges. ChatCaptioner [chatcaptioner](http://arxiv.org/pdf/2303.06594v1) treats ChatGPT as a questioner, prompting diverse questions for BLIP-2 to answer. Through multi-round conversations, ChatGPT extracts visual information from BLIP-2 and effectively summarizes the image content. Video ChatCaptioner [chen2023video](http://arxiv.org/pdf/2304.04227v3) extends this approach, applying it to video spatiotemporal understanding. 
ViperGPT [vipergpt](http://arxiv.org/pdf/1905.11127v1) demonstrates the potential of combining an LLM with different vision models to address complex visual queries programmatically. In contrast, MiniGPT-4 directly aligns visual information with the language model to accomplish diverse vision-language tasks without the use of external vision models. MiniGPT-4 aims to align visual information from a pretrained vision encoder with an advanced large language model (LLM). Specifically, we utilize Vicuna [vicuna2023](https://vicuna.lmsys.org) as our language decoder, which is constructed upon LLaMA [llama](http://arxiv.org/pdf/2402.08075v1) and can perform a wide range of complex linguistic tasks. For visual perception, we employ the same visual encoder as used in BLIP-2 [blip2](http://arxiv.org/pdf/2301.12597v3), a ViT backbone [fang2022eva](http://arxiv.org/pdf/2402.18128v1) coupled with its pre-trained Q-Former. Both the language and vision models are open-sourced. We aim to bridge the gap between the visual encoder and the LLM using a linear projection layer, with an overview of our model displayed in Fig.[fig:overview]. To achieve an effective MiniGPT-4, we propose a two-stage training approach. The initial stage involves pretraining the model on a large collection of aligned image-text pairs to acquire vision-language knowledge. In the second stage, we finetune the pretrained model with a smaller but high-quality image-text dataset and a designed conversational template to enhance generation reliability and usability. ## First pretraining stage During the initial pretraining stage, the model is designed to acquire vision-language knowledge from a large collection of aligned image-text pairs. We regard the output from the injected projection layer as a soft prompt for the LLM, prompting it to generate the corresponding ground-truth texts. Throughout the entire pretraining process, both the pretrained vision encoder and the LLM remain frozen, with only the linear projection layer being pretrained. We use a combined dataset of Conceptual Caption [changpinyo2021conceptual](http://arxiv.org/pdf/2102.08981v2), [sharma2018conceptual](http://arxiv.org/pdf/2304.13130v1), SBU [ordonez2011im2text](http://arxiv.org/pdf/2204.00679v1) and LAION [laion](http://arxiv.org/pdf/2111.02114v1) to train our model. Our model undergoes 20,000 training steps with a batch size of 256, covering approximately 5 million image-text pairs. The entire process takes about 10 hours to complete, utilizing 4 A100 (80GB) GPUs. **Issues of the first pretraining stage** Following the first pretraining stage, our MiniGPT-4 demonstrates the capacity to hold a wealth of knowledge and offer reasonable responses to human inquiries. However, we have observed instances where it produces incoherent linguistic outputs, such as repetitive words or sentences, fragmented sentences, or irrelevant content. These issues hinder MiniGPT-4’s ability to engage in a fluent visual conversation with humans. Similar challenges were also observed with GPT-3. Despite its pretraining on an extensive language dataset, GPT-3 struggles to generate language outputs that are accurately aligned with users’ intentions. Through a process of instruction fine-tuning and reinforcement learning from human feedback, GPT-3 evolves into GPT-3.5 [instructGPT](http://arxiv.org/pdf/2302.05206v1), [chatgpt](http://arxiv.org/pdf/2307.11380v2) and becomes capable of producing more human-friendly outputs.
This phenomenon bears a resemblance to the current state of MiniGPT-4 following its initial pretraining stage. As such, it is not surprising that our model may struggle to generate fluent and natural human language outputs at this stage. ## Curating a high-quality alignment dataset for the vision-language domain To achieve greater naturalness in the generated language and enhance the model’s usability, a second-stage alignment process is essential. While in the realm of NLP, instruction fine-tuning datasets [alpaca](https://github.com/tatsu-lab/stanford_alpaca) and conversations [sharegpt](https://github.com/domeccleston/sharegpt) are easily accessible, no equivalent datasets exist for the vision-language domain. To address this deficiency, we carefully curated a detailed image description dataset, specifically tailored for vision-language alignment purposes. This dataset is subsequently utilized to fine-tune our MiniGPT-4 during the second-stage alignment process. #### Initial aligned image-text generation In the initial phase, we employ the model derived from the first pretraining stage to generate comprehensive descriptions of input images. To enable our model to produce more detailed image descriptions, we designed a prompt that adheres to the conversational format of the Vicuna [vicuna2023](https://vicuna.lmsys.org) language model, as shown below. In this prompt, *\<ImageFeature\>* represents the visual features produced by the linear projection layer. *\###Human: \<Img\>\<ImageFeature\>\</Img\> Describe this image in detail. Give as many details as possible. Say everything you see. \###Assistant:* To identify incomplete sentences, we examine whether the generated sentence exceeds 80 tokens. If it does not, we incorporate an additional prompt, *\###Human: Continue \###Assistant:*, prompting our MiniGPT-4 to extend the generation process. By concatenating the outputs from both steps, we can create a more comprehensive image description. This approach enables us to generate image-text pairs with detailed and informative image descriptions. We randomly select 5,000 images from the Conceptual Caption dataset [changpinyo2021conceptual](http://arxiv.org/pdf/2102.08981v2), [sharma2018conceptual](http://arxiv.org/pdf/2304.13130v1) and use the pretrained model to generate corresponding language descriptions for each image. #### Data post-processing The automatically generated image descriptions above contain noisy or incoherent passages, such as repetition of words or sentences, fragmented sentences, or irrelevant content. In order to fix these issues, we employ ChatGPT to mend the descriptions by utilizing the following prompt: *Fix the error in the given paragraph. Remove any repeating sentences, meaningless characters, not English sentences, and so on. Remove unnecessary repetition. Rewrite any incomplete sentences. Return directly the results without explanation. Return directly the input paragraph if it is already correct without explanation.* Upon completing the post-processing stage, we manually verify the correctness of each image description to guarantee its high quality. Specifically, we first identified several frequently shown errors (*“I’m sorry I made a mistake...”, or “I apologize for that ...”*) and then hard-coded rules to automatically filter them out. We also manually refine the generated captions by eliminating redundant words or sentences that ChatGPT fails to detect.
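A sketch of the kind of rule-based filtering described above; the exact patterns and rules used by the authors are not listed, so the ones below are purely illustrative:

```python
import re

# Illustrative patterns for the frequent ChatGPT artifacts mentioned above.
ARTIFACT_PATTERNS = [
    r"I'm sorry.*?(\.|$)",
    r"I apologize for.*?(\.|$)",
]

def clean_description(text: str) -> str:
    """Strip apology-style artifacts and drop verbatim repeated sentences."""
    for pattern in ARTIFACT_PATTERNS:
        text = re.sub(pattern, "", text, flags=re.IGNORECASE)
    seen, kept = set(), []
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        if sentence and sentence not in seen:   # remove exact duplicate sentences
            seen.add(sentence)
            kept.append(sentence)
    return " ".join(kept)
```

Captions that remain noisy after this automatic pass are refined or discarded manually.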
Finally, only approximately 3,500 out of the 5,000 image-text pairs satisfy our requirement, and these pairs are subsequently utilized for the second-stage alignment process. ## Second-stage finetuning During the second stage, we finetune our pretrained model with the curated high-quality image-text pairs. During the finetuning, we use predefined prompts following this template: *\###Human: \<Img\>\<ImageFeature\>\</Img\> \<Instruction\> \###Assistant:* In this prompt, *\<Instruction\>* represents a randomly sampled instruction from our predefined instruction set containing variant forms of instructions such as *“Describe this image in detail”* or *“Could you describe the contents of this image for me”*. It is important to note that we do not calculate the regression loss for this specific text-image prompt. As a result, MiniGPT-4 is capable of producing more natural and reliable language outputs. Furthermore, we observed that this fine-tuning process is remarkably efficient, requiring a mere 400 training steps with a batch size of 12, which takes around 7 minutes on a single A100 GPU. In the experiments, we aim to showcase the diverse and emergent capabilities of our MiniGPT-4 model through various qualitative examples. These abilities include generating detailed image descriptions, identifying amusing aspects within memes, providing food recipes from photos, writing poems for images, etc. Additionally, we present quantitative results on the task of image captioning.
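To make concrete the note above that no loss is computed on the text-image prompt, here is a minimal sketch (PyTorch; the tokenizer and token IDs are placeholders) of the usual way such masking is done, by setting prompt positions to the index ignored by the cross-entropy loss:

```python
import torch

IGNORE_INDEX = -100  # positions with this label are skipped by PyTorch's cross-entropy loss

def build_inputs_and_labels(prompt_ids: torch.Tensor, response_ids: torch.Tensor):
    """Concatenate prompt and response token IDs, masking the prompt out of the loss."""
    input_ids = torch.cat([prompt_ids, response_ids])
    labels = input_ids.clone()
    labels[: prompt_ids.numel()] = IGNORE_INDEX  # no loss on the image/instruction prompt
    # The language-modeling loss is then computed on `labels` (shifted by one position
    # inside the usual causal LM objective), so gradients only come from the response.
    return input_ids, labels
```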
Advertisement promotion
## Uncovering emergent abilities with MiniGPT-4 through qualitative examples MiniGPT-4 demonstrates many advanced abilities compared to traditional vision-language models. For example, it can describe images in detail and interpret the humorous aspects of a given meme. Here, we qualitatively compare our model to one of the leading vision-language models, BLIP-2 [blip2](http://arxiv.org/pdf/2301.12597v3), with eight distinct examples, each highlighting a different ability. An example in Fig.[fig:detailed] demonstrates that MiniGPT-4 effectively identifies various elements within the image, such as busy city streets, clock towers, shops, restaurants, motorcycles, people, streetlights, and clouds. In contrast, BLIP-2 only covers city streets, people, and motorcycles in its generated caption. Another example presented in Fig.2 shows that MiniGPT-4 successfully explains why the meme is humorous. It interprets that the lying dog is feeling the same way as many people do on Monday, which is often considered to be the most dreaded day of the week. In contrast, BLIP-2 only briefly describes the image content and fails to comprehend the amusing aspects of the image. We also showcase MiniGPT-4’s other distinctive abilities. These include creating advertising promotions based on a given image (Fig.1), retrieving factual information from a movie photograph (Fig.[fig:movie]), generating a food recipe from a food image (Fig.[fig:cook]), diagnosing plant diseases and suggesting treatment plans (Fig.[fig:plant]), creating a website from a hand-written draft (Fig.3), and writing poems inspired by an image (Fig.[fig:poem]). These abilities are absent in traditional vision-language models like BLIP-2 (utilizing Flan-T5 XXL [flanT5](http://arxiv.org/pdf/2202.03371v1) as a language model), which use less powerful language models (LLMs). This contrast indicates that those advanced vision-language abilities only emerge when the visual features are properly aligned with an advanced LLM such as Vicuna [vicuna2023](https://vicuna.lmsys.org).
Meme explaining
Website Creating
Model generations from BLIP-2, BLIP-2 finetuned our second stage data (BLIP-2 FT), MiniGPT-4 finetuned with Local Narrative data in the second stage (MiniGPT-4 LocNa), MiniGPT-4 model without Q-Former (MiniGPT-4 No Q-Former), and MiniGPT-4.
## Quantitative analysis
An example of MiniGPT-4’s limitations. MiniGPT-4 hallucinates nonexistent tablecloths and cannot locate the windows correctly.
#### Advanced Abilities To quantify performance on advanced vision-language tasks, we compiled a small evaluation dataset comprising 4 tasks: meme interpretation with the question “Explain why this meme is funny.”, recipe generation with the question “How should I make something like this?”, advertisement creation with the prompt “Help me draft a professional advertisement for this.”, and poem composition with “Can you craft a beautiful poem about this image?”. In total, we collect 100 diverse images, with 25 images allocated to each task. We asked human evaluators to determine whether the model generation satisfies the request. We compared our results with BLIP-2 [blip2](http://arxiv.org/pdf/2301.12597v3) and present the findings in Tab.[tab: quanti_adv]. In meme interpretation, poem writing, and advertisement creation, BLIP-2 largely struggles to fulfill any requests. For recipe generation, BLIP-2 succeeds in 4 out of 25 cases. In contrast, MiniGPT-4 manages to address the requests for recipes, advertisements, and poem generation in nearly 80% of the instances. Furthermore, MiniGPT-4 correctly comprehends the challenging humor in memes in 8 out of 25 cases. #### Image Captioning We evaluate the performance of MiniGPT-4 on the COCO caption benchmark and compare it with BLIP-2 [blip2](http://arxiv.org/pdf/2301.12597v3). Our model’s generated captions typically contain rich visual details. As such, conventional similarity-based image-caption evaluation metrics struggle to provide an accurate evaluation of our model. In this regard, we evaluate the performance by checking, with the help of ChatGPT, whether the generated captions cover all of the ground-truth captions’ information; details can be found in Appx.[appx: caption_eval]. Results in Tab.[human_evaluation] show that MiniGPT-4 outperforms BLIP-2 in generating captions that are more closely aligned with the ground-truth visual objects and relationships. With a success rate of 66.2%, MiniGPT-4 is considerably more accurate than BLIP-2, which achieves only 27.5%. Further evaluation on traditional VQA tasks can be found in Appx.[appx: vqa]. ## Analysis on the second-stage finetuning #### Effectiveness of the second-stage finetuning Using only the model pretrained after the first pretraining stage may result in failures, such as repetitive words or sentences, fragmented sentences, or irrelevant content. However, these issues have been largely mitigated through the second-stage finetuning process. This can be observed in Fig.[fig:secondstage], where MiniGPT-4 generates incomplete captions before the second-stage finetuning, but after the second-stage finetuning it is capable of generating complete and fluent captions. In this section, we investigate the importance and effectiveness of the second-stage finetuning approach. To quantify the impact of second-stage finetuning, we randomly sampled 100 images from the COCO test set and investigated the model performance on two tasks: detailed description generation and poem writing. The prompts used were “*Describe the image in detail.*” and “*Can you write a beautiful poem about this image?*”. These tasks were performed by the model both before and after second-stage finetuning. We manually counted the number of failed generations for the model in each stage. The results are presented in Tab.[exp:stage2ablation]. Prior to the second-stage finetuning, approximately 1/3 of the generated outputs failed to match ground-truth captions or poems.
In contrast, the model after second-stage finetuning has fewer than two failure cases out of the 100 test images for both tasks. These experimental results demonstrate that second-stage finetuning yields a significant improvement in the quality of generated outputs. A qualitative example of the model generation before and after the second-stage finetuning is shown in Fig.[fig:secondstage]. #### Can the original BLIP-2 benefit from the second-stage data? In this study, we finetune BLIP-2 [blip2](http://arxiv.org/pdf/2301.12597v3) with our second-stage data in the same way as MiniGPT-4, and check whether it can obtain advanced abilities similar to MiniGPT-4’s. The finetuned BLIP-2 is denoted as BLIP-2 FT. Note that MiniGPT-4 uses the same visual module as BLIP-2, while BLIP-2 uses FlanT5 XXL [flanT5](http://arxiv.org/pdf/2202.03371v1) as the language model, which is not as strong as the Vicuna [vicuna2023](https://vicuna.lmsys.org) model used in our MiniGPT-4 model. We rely on the same prompts to assess the advanced capabilities of our model. Qualitative results are shown in Fig.4, [fig:ab_cook], and [fig:ab_des]. We discover that BLIP-2 FT still generates short responses and fails to generalize to advanced tasks like meme explaining and website coding (Fig.4). Our finding suggests that BLIP-2’s relatively weaker language model, FlanT5 XXL, benefits less from such a small dataset, and highlights the effectiveness of a more advanced LLM in a VLM system. #### Second stage with Localized Narratives The dataset Localized Narratives [pont2020connecting](http://arxiv.org/pdf/2302.11217v2) is a detailed image description dataset where annotators describe images while simultaneously localizing the corresponding regions. Here, we test the performance of our model by replacing our self-collected dataset in the second stage with the Localized Narratives dataset. The resulting model is denoted as MiniGPT-4 LocNa. Qualitative results in Fig.4, [fig:ab_cook], and [fig:ab_des] show that MiniGPT-4 LocNa can generate long image descriptions (Fig.[fig:ab_des]). However, the generated outputs have lower quality, with monotonous expressions. Besides, MiniGPT-4 LocNa does not generalize as well as the original MiniGPT-4 to other complex tasks like explaining why the meme is funny (Fig.2). The performance gap may be due to the monotonous and repeated image descriptions in Localized Narratives. ## Ablation on the architecture designs To further demonstrate the effectiveness of using one single linear layer to align visual features with the LLM, we conduct experiments with different architecture designs, including (a) removing the Q-Former and directly mapping the ViT’s output to Vicuna’s embedding space (i.e., without Q-Former), (b) using three linear layers instead of one layer, and (c) additionally finetuning the Q-Former in the vision module. All the variants are trained in the same way as the original design. Results on the AOK-VQA [schwenk2022okvqa](http://arxiv.org/pdf/2206.01718v1) and GQA [hudson2019gqa](http://arxiv.org/pdf/2112.05136v1) datasets in Tab.[tab: ablation] show that variant (a) **MiniGPT-4 w/o Q-Former** has a performance similar to the original design. Qualitative results for this variant in Fig.4, [fig:ab_cook], and [fig:ab_des] also show similar advanced skills. This reveals that the Q-Former from BLIP-2 does not play a critical role for advanced skills. Besides, both variants (b) **MiniGPT-4 + 3 Layers** and (c) **MiniGPT-4 + finetuning Q-Former** perform slightly worse than the original MiniGPT-4.
This indicates that a single projection layer is sufficient to align the vision encoder and the large language model in our limited-training-data setting. #### Hallucination As MiniGPT-4 is built upon LLMs, it inherits LLMs’ limitations such as hallucinating nonexistent knowledge. An example in Fig. [fig:Limitation] shows that MiniGPT-4 incorrectly identifies the presence of white tablecloths in the image, despite their absence. Here, we use the metric $\text{CHAIR}_i$ [rohrbach2018object](http://arxiv.org/pdf/1809.02156v2) to gauge the hallucination rate of the generation, with two distinct prompts to control the model generation length: *MiniGPT-4 (long)*: Please describe this image as detailed as possible. *MiniGPT-4 (short)*: Please describe the image shortly and precisely, in less than 20 words. Results in Tab.[tab:hallu] show that longer captions tend to have higher hallucination rates. For example, MiniGPT-4 (long) generates captions averaging 175 words with a higher hallucination rate, while MiniGPT-4 (short) averages 28.8 words with a lower rate. BLIP-2, averaging 6.5 words, hallucinates less but covers fewer objects, as seen in Tab.[human_evaluation]. Hallucination in detailed image descriptions is still an unresolved issue. Using reinforcement learning with AI feedback together with hallucination detection modules may be a potential solution. #### Spatial Information Understanding MiniGPT-4’s visual perception remains limited. It may struggle to differentiate spatial locations. For example, MiniGPT-4 in Fig. [fig:Limitation] fails to identify the location of the windows. This limitation may stem from a lack of aligned image-text data designed for spatial information understanding. Training on datasets such as RefCOCO [kazemzadeh2014referitgame](http://arxiv.org/pdf/1808.08754v1) or Visual Genome [krishna2017visual](http://arxiv.org/pdf/1602.07332v1) could potentially alleviate this issue. How does MiniGPT-4 obtain these advanced abilities? Many of the advanced vision-language capabilities demonstrated by GPT-4 can be understood as compositional skills rooted in two foundational skills: image understanding and language generation. Take the task of image-based poem writing as an example. Advanced LLMs like ChatGPT and Vicuna can already craft poems based on users’ instructions. If they acquire the ability to understand images, compositionally generalizing to the task of image-based poem writing is possible even without image-poem pairs in their training data. In the first pretraining stage, MiniGPT-4 learns to understand images by modeling the correlation between images and short image descriptions from image caption datasets. However, the language style in these image caption datasets differs from that of modern LLMs’ generation, which leads to distorted language generation and hinders successful compositional generalization. Therefore, we introduce a second-stage finetuning to restore the language generation ability. MiniGPT-4 after the two-stage training successfully generalizes to many advanced compositional vision-language abilities like website coding from drafts or meme interpretation, which verifies our assumption. Future research might delve deeper into the mechanism of compositional generalization and seek ways to enhance it. We hope our work, as an early exploration of these vision-based LLM capabilities, will spur further investigations in this domain. ## More Qualitative Results
Factual retrieval
Poem writing
Plant cultivating
## Evaluation in traditional VQA benchmarks [appx: vqa] The aim of this study is to replicate the remarkable multi-modal capabilities demonstrated in GPT-4, such as generating detailed image descriptions and creating websites from hand-drawn drafts. To emphasize the most crucial component of advanced vision-language skills, the methodology of MiniGPT-4 is intentionally kept minimal. For instance, the learnable model capacity is limited (only one linear layer), and MiniGPT-4 is trained with just 5 million pairs, in contrast to BLIP-2 with 129 million image-text pairs. Such a pared-down approach is anticipated to yield suboptimal results on traditional benchmarks. While this isn’t our primary goal, we offer a quantitative analysis of the VQA datasets A-OKVQA (multi-choice) [schwenk2022okvqa](http://arxiv.org/pdf/2206.01718v1) and GQA [hudson2019gqa](http://arxiv.org/pdf/2112.05136v1). Additionally, to showcase the potential of MiniGPT-4 with traditional benchmarks, we conduct a straightforward ablation study. Here, we simply unfreeze the LLM using LoRA [hu2021lora](http://arxiv.org/pdf/2402.11485v1) and incorporate more training data from the VQAv2, OKVQA, and A-OKVQA datasets during the second finetuning stage. Results in Tab. [tab_supp] indicate that the original MiniGPT-4 lags behind BLIP-2 by a reasonable margin, and merely augmenting the learning capacity and the training data results in a substantial performance improvement, which confirms our expectations. We believe our model’s performance on conventional vision benchmarks can be enhanced with a carefully designed training strategy (e.g., dataset sample ratios, learning rate schedule, etc.), more training data/datasets, and additional learnable parameters. Since enhancing performance on traditional vision benchmarks isn’t this project’s objective, we reserve this aspect for future research. ## Details of Caption Evaluation [appx: caption_eval] We employ ChatGPT to determine whether the baseline models cover all the objects and visual relations presented in the ground-truth captions. For the COCO evaluation dataset, we randomly choose one ground-truth caption and treat it as the reference caption. We apply the following prompt to perform the evaluation. *There is one image caption1 ‘{ground-truth caption}’, and there is another image caption2 ‘{comparison caption}’. Does image caption2 cover all the objects and visual relations shown in image caption1? Only answer yes or no without any explanation.* ## More qualitative ablation results
Ablation Study on Detailed Description
Visual Instruction Tuning 2023-04-17 Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee
Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. In this paper, we present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and LLM for general-purpose visual and language understanding. Our early experiments show that LLaVA demonstrates impressive multimodal chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions, and yields an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%. We make GPT-4 generated visual instruction tuning data, our model and code base publicly available.
# Introduction Humans interact with the world through many channels such as vision and language, as each individual channel has a unique advantage in representing and communicating certain concepts, and thus facilitates a better understanding of the world. One of the core aspirations in artificial intelligence is to develop a general-purpose assistant that can effectively follow multi-modal vision-and-language instructions, aligned with human intent to complete various real-world tasks in the wild [askell2021general](http://arxiv.org/pdf/2112.00861v3), [li2022elevater](None), [li2023multimodal](http://arxiv.org/pdf/2309.10020v1). To this end, the community has witnessed an emergent interest in developing language-augmented foundation vision models [li2022elevater](None), [gan2022vision](http://arxiv.org/pdf/2110.03902v1), with strong capabilities in open-world visual understanding such as classification [radford2021learning](http://arxiv.org/pdf/2404.19696v1), [openclip](https://doi.org/10.5281/zenodo.5143773), [yuan2021florence](http://arxiv.org/pdf/2301.05065v2), [yang2022unicl](http://arxiv.org/pdf/2107.11945v1), [pham2021combined](http://arxiv.org/pdf/1602.01255v2), detection [li2022grounded](http://arxiv.org/pdf/2210.09263v1), [zhong2022regionclip](http://arxiv.org/pdf/1606.00540v1), [liu2023grounding](http://arxiv.org/pdf/2401.02361v2), segmentation [li2022language](http://arxiv.org/pdf/2304.10326v1), [zou2022generalized](http://arxiv.org/pdf/2212.11270v1), [zhang2023simple](http://arxiv.org/pdf/1809.09299v1) and captioning [wang2022git](http://arxiv.org/pdf/2204.07780v1), [li2023blip](http://arxiv.org/pdf/2301.12597v3), as well as visual generation and editing [DALLE2](http://arxiv.org/pdf/2204.06125v1), [LDM](http://arxiv.org/pdf/2307.10094v1), [PARTI](http://arxiv.org/pdf/2102.11495v1), [MAKEASCENE](http://arxiv.org/pdf/2211.01602v2), [Imagen](http://arxiv.org/pdf/2205.11487v1), [li2023gligen](http://arxiv.org/pdf/2311.09144v2). We refer readers to the *Computer Vision in the Wild* reading list for a more up-to-date literature compilation [cvinw](https://github.com/Computer-Vision-in-the-Wild/CVinW_Readings). In this line of work, each task is solved independently by one single large vision model, with the task instruction implicitly considered in the model design. Further, language is only utilized to describe the image content. While this allows language to play an important role in mapping visual signals to language semantics—a common channel for human communication, it leads to models that usually have a fixed interface with limited interactivity and adaptability to the user’s instructions. Large language models (LLM), on the other hand, have shown that language can play a wider role: a universal interface for a general-purpose assistant, where various task instructions can be explicitly represented in language and guide the end-to-end trained neural assistant to switch to the task of interest to solve it. For example, the recent success of ChatGPT [chatgpt](https://openai.com/blog/chatgpt/) and GPT-4 [gpt4](https://arxiv.org/pdf/2303.08774) have demonstrated the power of aligned LLMs in following human instructions, and have stimulated tremendous interest in developing open-source LLMs. Among them, LLaMA [touvron2023llama](http://arxiv.org/pdf/2402.08075v1) is an open-source LLM that matches the performance of GPT-3. 
Alpaca [alpaca](https://github.com/tatsu-lab/stanford_alpaca), Vicuna [vicuna](https://lmsys.org/blog/2023-03-30-vicuna/), GPT-4-LLM [peng2023instruction](None) utilize various machine-generated high-quality instruction-following samples to improve the LLM’s alignment ability, reporting impressive performance compared with proprietary LLMs. Importantly, this line of work is *text-only*. In this paper, we present *visual instruction-tuning*, the first attempt to extend instruction-tuning to the language-image multimodal space, to pave the way towards building a general-purpose visual assistant. In particular, our paper makes the following contributions: - *Multimodal instruction-following data*. One key challenge is the lack of vision-language instruction-following data. We present a data reformation perspective and pipeline to convert image-text pairs into an appropriate instruction-following format, using ChatGPT/GPT-4. - *Large multimodal models*. We develop a large multimodal model (LMM), by connecting the open-set visual encoder of CLIP [radford2021learning](http://arxiv.org/pdf/2404.19696v1) with the language decoder Vicuna [vicuna](https://lmsys.org/blog/2023-03-30-vicuna/), and fine-tuning end-to-end on our generated instructional vision-language data. Our empirical study validates the effectiveness of using generated data for LMM instruction-tuning, and suggests practical tips for building a general-purpose instruction-following visual agent. When ensembled with GPT-4, our approach achieves SoTA on the Science QA [lu2022learn](http://arxiv.org/pdf/2209.09513v2) multimodal reasoning dataset. - *Multimodal instruction-following benchmark*. We present LLaVA-Bench with two challenging benchmarks, with a diverse selection of paired images, instructions and detailed annotations. - *Open-source*. We release the following assets to the public: the generated multimodal instruction data, the codebase, the model checkpoints, and a visual chat demo. # Related Work **Multimodal Instruction-following Agents.**  In computer vision, existing works that build instruction-following agents can be broadly categorized into two classes: $(i)$ End-to-end trained models, which are separately explored for each specific research topic. For example, the vision-language navigation task [anderson2018vision](http://arxiv.org/pdf/2402.11498v1), [hao2020towards](http://arxiv.org/pdf/2304.04907v1) and Habitat [szot2021habitat](http://arxiv.org/pdf/2106.14405v2) require the embodied AI agent to follow natural language instructions and take a sequence of actions to complete goals in visual environments. In the image editing domain, given an input image and a written instruction that tells the agent what to do, InstructPix2Pix [brooks2022instructpix2pix](http://arxiv.org/pdf/2405.14785v1) edits images by following the human instructions. $(ii)$ A system that coordinates various models via LangChain [langchain](https://github.com/hwchase17/langchain) / LLMs [chatgpt](https://openai.com/blog/chatgpt/), such as Visual ChatGPT [wu2023visual](http://arxiv.org/pdf/2303.04671v1), X-GPT [zou2022generalized](http://arxiv.org/pdf/2212.11270v1), MM-REACT [yang2023mm](http://arxiv.org/pdf/2303.11381v1), VisProg [gupta2022visual](http://arxiv.org/pdf/2203.15442v1), and ViperGPT [suris2023vipergpt](http://arxiv.org/pdf/1905.11127v1). While sharing the same goal in building instruction-following agents, we focus on developing an end-to-end trained language-vision multimodal model for *multiple* tasks. 
**Instruction Tuning.** In the natural language processing (NLP) community, to enable LLMs such as GPT-3 [brown2020language](http://arxiv.org/pdf/2112.07522v2), T5 [raffel2020exploring](http://arxiv.org/pdf/1910.10683v4), PaLM [chowdhery2022palm](http://arxiv.org/pdf/2209.05735v4), and OPT [zhang2022opt](None) to follow natural language instructions and complete real-world tasks, researchers have explored methods for LLM instruction-tuning [ouyang2022training](http://arxiv.org/pdf/2302.05206v1), [wang2022benchmarking](http://arxiv.org/pdf/2212.12017v3), [wang2022self](http://arxiv.org/pdf/2311.00233v2), leading to instruction-tuned counterparts such as InstructGPT [ouyang2022training](http://arxiv.org/pdf/2302.05206v1)/ChatGPT [chatgpt](https://openai.com/blog/chatgpt/), FLAN-T5 [chung2022scaling](http://arxiv.org/pdf/2202.03371v1), FLAN-PaLM [chung2022scaling](http://arxiv.org/pdf/2202.03371v1), and OPT-IML [iyer2022opt](http://arxiv.org/pdf/2210.11617v1), respectively. It turns out that this simple approach can effectively improve the zero- and few-shot generalization abilities of LLMs. It is thus natural to borrow the idea from NLP to computer vision. More broadly, the teacher-student distillation ideas with foundation models have been studied in other topics such as image classification [faghri2023reinforce](http://arxiv.org/pdf/2303.08983v3). Flamingo [alayrac2022flamingo](http://arxiv.org/pdf/2205.07065v1) can be viewed as the GPT-3 moment in the multimodal domain, due to its strong performance on zero-shot task transfer and in-context-learning. Other LMMs trained on image-text pairs include BLIP-2 [li2023blip](http://arxiv.org/pdf/2301.12597v3), FROMAGe [koh2023grounding](http://arxiv.org/pdf/2401.13388v2), and KOSMOS-1 [huang2023language](http://arxiv.org/pdf/2302.14045v2). PaLM-E [driess2023palm](None) is an LMM for embodied AI. Based on the recent “best” open-source LLM LLaMA, OpenFlamingo [anas_awadalla_2023_7733589](https://doi.org/10.5281/zenodo.7733589) and LLaMA-Adapter [zhang2023llama](http://arxiv.org/pdf/2207.10858v1) are open-source efforts that enable LLaMA to use image inputs, paving the way to build open-source multimodal LLMs. While these models present promising task transfer generalization performance, they are not explicitly tuned with vision-language instruction data, and their performance in multimodal tasks usually falls short compared to language-only tasks. In this paper, we aim to fill this gap and study its effectiveness. Finally, note that visual instruction tuning is different from visual prompt tuning [jia2022visual](http://arxiv.org/pdf/2309.01155v2): the former aims to improve the model’s instruction-following abilities, while the latter aims to improve the parameter-efficiency in model adaptation. # GPT-assisted Visual Instruction Data Generation [sec:visual_instruc_data] The community has witnessed a surge in the amount of public multimodal data such as image-text pairs, ranging from CC [changpinyo2021conceptual](http://arxiv.org/pdf/2102.08981v2) to LAION [schuhmann2022laion](http://arxiv.org/pdf/2312.15897v1). However, when it comes to multimodal instruction-following data, the available amount is limited, partially because the process for creating such data is time-consuming and less well-defined when human crowd-scouring is considered. 
Inspired by the success of recent GPT models in text-annotation tasks [gilardi2023chatgpt](http://arxiv.org/pdf/2303.15056v2), we propose to leverage ChatGPT/GPT-4 for multimodal instruction-following data collection, based on widely available image-text pair data. For an image $\Xmat_{\texttt{v}}$ and its associated caption $\Xmat_{\texttt{c}}$, it is natural to create a set of questions $\Xmat_{\texttt{q}}$ with the intent to instruct the assistant to describe the image content. We prompt GPT-4 to curate such a list of questions (see details in Appendix). Therefore, a simple way to expand an image-text pair to its instruction-following version is $\texttt{Human}: \Xmat_{\texttt{q}}~\Xmat_{\texttt{v}}~\texttt{<STOP>}~\texttt{Assistant}: \Xmat_{\texttt{c}}~\texttt{<STOP>}$. Though cheap to construct, this simple expanded version lacks diversity and in-depth reasoning in both the instructions and responses.
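As a concrete illustration of this naive expansion, here is a minimal Python sketch; the question pool, the `<image>` placeholder, and the `<STOP>` delimiter are illustrative stand-ins rather than the exact tokens of the released pipeline.

```python
import random

# Hypothetical pool of brief-description instructions (the paper curates such a
# list with GPT-4; these three are placeholders).
BRIEF_QUESTIONS = [
    "Describe the image concisely.",
    "Provide a brief description of the given image.",
    "Summarize the visual content of the image.",
]

def naive_expand(caption: str, image_token: str = "<image>", stop: str = "<STOP>") -> str:
    """Turn a plain image-text pair into a single-turn instruction sample:
    Human: X_q X_v <STOP> Assistant: X_c <STOP>."""
    question = random.choice(BRIEF_QUESTIONS)
    return f"Human: {question} {image_token}{stop} Assistant: {caption}{stop}"

print(naive_expand("A group of people standing outside of a black vehicle with various luggage."))
```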
**Context type 1: Captions**
- A group of people standing outside of a black vehicle with various luggage.
- Luggage surrounds a vehicle in an underground parking area.
- People try to fit all of their luggage in an SUV.
- The sport utility vehicle is parked in the public garage, being packed for a trip.
- Some people with luggage near a van that is transporting it.

**Context type 2: Boxes**
person: [0.681, 0.242, 0.774, 0.694], backpack: [0.384, 0.696, 0.485, 0.914], suitcase: ...

**Response type 1: conversation**
Question: What type of vehicle is featured in the image?
Answer: The image features a black sport utility vehicle (SUV) ...

**Response type 2: detailed description**
The image is an underground parking area with a black sport utility vehicle (SUV) parked. There are three people in the scene, with one person standing closer to the left side of the vehicle, another person in the middle, and the third person on the right side. They are all working together to pack their luggage into the SUV for a trip. ...

**Response type 3: complex reasoning**
Question: What challenges do these people face?
Answer: In the image, a group of people is standing outside a black SUV in a parking area, surrounded by various pieces of luggage, including suitcases and backpacks. They are facing the challenge of fitting all their luggage into the black SUV. There are multiple suitcases and backpacks to be packed, which suggests that the group has a significant amount of belongings ...
To mitigate this issue, we leverage language-only GPT-4 or ChatGPT as the strong teacher (both accept only text as input), to create instruction-following data involving visual content. Specifically, in order to encode an image into its visual features to prompt a text-only GPT, we use two types of symbolic representations: $(i)$ *Captions* typically describe the visual scene from various perspectives; $(ii)$ *Bounding boxes* usually localize the objects in the scene, and each box encodes the object concept and its spatial location. One example is shown in the top block of Table [tab:full_example_car_bbox]. This symbolic representation allows us to encode the image as an LLM-recognizable sequence. We use COCO images [lin2014microsoft](None) and generate three types of instruction-following data. One example per type is shown in the bottom block of Table [tab:full_example_car_bbox]. For each type, we first manually design a few examples. They are the only human annotations we have during data collection, and are used as seed examples in in-context-learning to query GPT-4. - *Conversation*. We design a conversation between the assistant and a person asking questions about this photo. The answers are in a tone as if the assistant is seeing the image and answering the question. A diverse set of questions are asked about the visual content of the image, including the object types, object counts, object actions, object locations, and relative positions between objects. Only questions that have definite answers are considered. Please see Appendix for the detailed prompt. - *Detailed description*. To include a rich and comprehensive description for an image, we create a list of questions with such an intent. We prompt GPT-4 and then curate the list (see detailed prompts and curation process in Appendix). For each image, we randomly sample one question from the list to ask GPT-4 to generate the detailed description. - *Complex reasoning*. The above two types focus on the visual content itself, based on which we further create in-depth reasoning questions. The answers typically require a step-by-step reasoning process by following rigorous logic. We collect 158K unique language-image instruction-following samples in total, including 58K in conversations, 23K in detailed description, and 77K in complex reasoning, respectively. We ablated the use of ChatGPT and GPT-4 in our early experiments, and found that GPT-4 consistently provides higher quality instruction-following data, such as spatial reasoning. # Visual Instruction Tuning ## Architecture The primary goal is to effectively leverage the capabilities of both the pre-trained LLM and visual model. The network architecture is illustrated in Figure 1. We choose Vicuna [vicuna](https://lmsys.org/blog/2023-03-30-vicuna/) as our LLM $f_{\phiv}(\cdot)$ parameterized by $\phiv$, as it has the best instruction following capabilities in language tasks among publicly available checkpoints [alpaca](https://github.com/tatsu-lab/stanford_alpaca), [vicuna](https://lmsys.org/blog/2023-03-30-vicuna/), [peng2023instruction](None).
network architecture.
For an input image $\Xmat_{\texttt{v}}$, we consider the pre-trained CLIP visual encoder ViT-L/14 [radford2021learning](http://arxiv.org/pdf/2404.19696v1), which provides the visual feature $\Zmat_{\texttt{v}} = g(\Xmat_{\texttt{v}})$. The grid features before and after the last Transformer layer are considered in our experiments. We consider a simple linear layer to connect image features into the word embedding space. Specifically, we apply a trainable projection matrix $\Wmat$ to convert $\Zmat_{\texttt{v}}$ into language embedding tokens $\Hmat_{\texttt{v}}$, which have the same dimensionality as the word embedding space in the language model: $$\Hmat_{\texttt{v}} = \Wmat \cdot \Zmat_{\texttt{v}},~ \text{with}~~ \Zmat_{\texttt{v}} = g(\Xmat_{\texttt{v}}) \label{eq:image_encoding}$$ Thus, we have a sequence of visual tokens $\Hmat_{\texttt{v}}$. Note that our simple projection scheme is lightweight, which allows us to iterate on data-centric experiments quickly. More sophisticated schemes to connect the image and language representations can also be considered, such as gated cross-attention in Flamingo [alayrac2022flamingo](http://arxiv.org/pdf/2205.07065v1) and the Q-Former in BLIP-2 [li2023blip](http://arxiv.org/pdf/2301.12597v3). We leave exploring possibly more effective and sophisticated architecture designs for LLaVA as future work. ## Training For each image $\Xmat_{\texttt{v}}$, we generate multi-turn conversation data $(\Xmat_{\texttt{q}}^1, \Xmat_{\texttt{a}}^1, \cdots, \Xmat_{\texttt{q}}^T, \Xmat_{\texttt{a}}^T )$, where $T$ is the total number of turns. We organize them as a sequence, by treating all answers as the assistant's response, and the instruction $\Xmat_{\texttt{instruct}}^t$ at the $t$-th turn as: $$\begin{aligned} \label{eq:organize_data_turn_rule} \Xmat_{\texttt{instruct}}^t = \left\{\begin{matrix} & \text{Randomly choose}~~ [\Xmat_{\texttt{q}}^1, \Xmat_{\texttt{v}}] ~~\text{or}~~ [ \Xmat_{\texttt{v}}, \Xmat_{\texttt{q}}^1] , ~~~\text{the first turn}~t=1 \\ & \Xmat_{\texttt{q}}^t, \hspace{45mm} \text{the remaining turns}~t>1 \end{matrix}\right. \end{aligned}$$ This leads to the unified format for the multimodal instruction-following sequence illustrated in Table [tab:input_sequence]. We perform instruction-tuning of the LLM on the prediction tokens, using its original auto-regressive training objective.
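As a side note on the architecture described above, the following is a minimal PyTorch sketch of the trainable projection $\Wmat$; the feature and embedding dimensions are illustrative assumptions rather than the exact LLaVA configuration.

```python
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    """Trainable projection W mapping frozen CLIP grid features Z_v into the
    LLM word-embedding space, i.e. H_v = W . Z_v (dimensions are assumed)."""
    def __init__(self, clip_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.W = nn.Linear(clip_dim, llm_dim, bias=False)

    def forward(self, z_v: torch.Tensor) -> torch.Tensor:
        # z_v: (batch, num_patches, clip_dim) grid features from the frozen ViT
        return self.W(z_v)  # H_v: (batch, num_patches, llm_dim) visual tokens

# Usage sketch: 256 patch tokens become 256 "visual tokens" in the LLM space.
h_v = VisionLanguageConnector()(torch.randn(1, 256, 1024))
print(h_v.shape)  # torch.Size([1, 256, 4096])
```

The unified sequence format referenced above is shown next.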
$\Xmat_{\texttt{system-message}}~\texttt{<STOP>}$ $\texttt{Human}: \Xmat_{\texttt{instruct}}^1~\texttt{<STOP>}$ $\texttt{Assistant}: \Xmat_{\texttt{a}}^1~\texttt{<STOP>}$ $\texttt{Human}: \Xmat_{\texttt{instruct}}^2~\texttt{<STOP>}$ $\texttt{Assistant}: \Xmat_{\texttt{a}}^2~\texttt{<STOP>}$ $\cdots$
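To make this template concrete, the sketch below assembles a multi-turn training sequence and a parallel mask over the segments the model is trained to predict (the assistant answers and their stop tokens); the system message, the `<STOP>` string, and the character-level mask are simplifications of the actual tokenized pipeline.

```python
def build_training_sequence(system_message, turns, stop="<STOP>"):
    """Assemble the unified multi-turn sequence and a parallel mask that is True
    only on the segments the model is trained to predict (assistant answers and
    their stop tokens). Strings stand in for token ids in this sketch."""
    segments = [(system_message + stop, False)]
    for instruct, answer in turns:            # turns: list of (X_instruct, X_a)
        segments.append((f"Human: {instruct}{stop} Assistant: ", False))
        segments.append((f"{answer}{stop}", True))
    text = "".join(seg for seg, _ in segments)
    mask = [is_target for seg, is_target in segments for _ in seg]
    return text, mask

text, mask = build_training_sequence(
    "A chat between a curious human and an artificial intelligence assistant.",
    [("<image> What type of vehicle is featured in the image?",
      "The image features a black sport utility vehicle (SUV).")],
)
```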
Specifically, for a sequence of length $L$, we compute the probability of the target answers $\Xmat_{\texttt{a}}$ by: $$p( \Xmat_{\texttt{a}} | \Xmat_{\texttt{v}}, \Xmat_{\texttt{instruct}}) = \prod_{i=1}^{L} p_{\thetav} ( \xv_i | \Xmat_{\texttt{v}}, \Xmat_{\texttt{instruct}, <i}, \Xmat_{\texttt{a}, <i}), \label{eq:auto_regressive}$$ where $\thetav$ denotes the trainable parameters, and $\Xmat_{\texttt{instruct}, <i}$ and $\Xmat_{\texttt{a}, <i}$ are the instruction and answer tokens in all turns before the current prediction token $\xv_i$, respectively. See Table [tab:input_sequence] for an illustration of the prediction tokens. For the conditionals in [eq:auto_regressive], we explicitly add $\Xmat_{\texttt{v}}$ to emphasize the fact that the image is grounded for all answers, and we omit $\Xmat_{\texttt{system-message}}$ and all previous $\texttt{<STOP>}$ tokens for better readability. For model training, we consider a two-stage instruction-tuning procedure. #### Stage 1: Pre-training for Feature Alignment. To strike a balance between concept coverage and training efficiency, we filter CC3M to 595K image-text pairs. Please see Appendix for details of the filtering process. These pairs are converted to the instruction-following data using the naive expansion method described in Section 1. Each sample can be treated as a single-turn conversation. To construct the input $\Xmat_{\texttt{instruct}}$ in [eq:organize_data_turn_rule], for an image $\Xmat_{\texttt{v}}$, a question $\Xmat_{\texttt{q}}$ is randomly sampled, which is a language instruction to request the assistant to describe the image briefly. The ground-truth prediction answer $\Xmat_{\texttt{a}}$ is the original caption. In training, we keep both the visual encoder and LLM weights frozen, and maximize the likelihood of [eq:auto_regressive] with trainable parameters $\thetav = \Wmat$ (the projection matrix) only. In this way, the image features $\Hmat_{\texttt{v}}$ can be aligned with the pre-trained LLM word embedding. This stage can be understood as training a compatible visual tokenizer for the frozen LLM. #### Stage 2: Fine-tuning End-to-End. We always keep the visual encoder weights frozen, and continue to update both the pre-trained weights of the projection layer and the LLM in LLaVA; i.e., the trainable parameters are $\thetav = \{\Wmat, \phiv \}$ in [eq:auto_regressive]. We consider two specific use case scenarios: - *Multimodal Chatbot*. We develop a Chatbot by fine-tuning on the 158K language-image instruction-following data in Section 1. Among the three types of responses, conversation is multi-turn while the other two are single-turn. They are uniformly sampled in training. - *Science QA*. We study our method on the ScienceQA benchmark [lu2022learn](http://arxiv.org/pdf/2209.09513v2), the first large-scale multimodal science question dataset that annotates the answers with detailed lectures and explanations. Each question is provided a context in the form of natural language or an image. The assistant provides the reasoning process in natural language and selects the answer among multiple choices. For training in [eq:organize_data_turn_rule], we organize the data as a single turn conversation, the question & context as $\Xmat_{\texttt{instruct}}$, and reasoning & answer as $\Xmat_{\texttt{a}}$. # Experiments We assess the performance of LLaVA in instruction-following and visual reasoning capabilities with two primary experimental settings: multimodal chatbot and the ScienceQA dataset, respectively. We train all models with 8$\times$ A100s, following Vicuna's hyperparameters [vicuna](https://lmsys.org/blog/2023-03-30-vicuna/). 
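The two training stages above differ only in which parameter groups receive gradients; here is a minimal sketch of that freezing schedule, assuming generic `vision_encoder`, `projector`, and `llm` modules (hypothetical names).

```python
def set_trainable(vision_encoder, projector, llm, stage: int):
    """Stage 1: train only the projection W (theta = {W}); Stage 2: also unfreeze
    the LLM (theta = {W, phi}). The vision encoder stays frozen in both stages."""
    for p in vision_encoder.parameters():
        p.requires_grad = False
    for p in projector.parameters():
        p.requires_grad = True
    for p in llm.parameters():
        p.requires_grad = (stage == 2)
```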
We pre-train our model on the filtered CC-595K subset for 1 epoch with a learning rate of 2e-3 and a batch size of 128, and fine-tune on the proposed LLaVA-Instruct-158K dataset for 3 epochs, with a learning rate of 2e-5 and a batch size of 32. See Appendix for more training details. ## Multimodal Chatbot We developed a chatbot demo to show the image understanding and conversation abilities of LLaVA, and to study how well LLaVA is able to digest visual inputs and exhibit instruction-following capabilities. We first use the examples in the original GPT-4 paper [gpt4](https://arxiv.org/pdf/2303.08774), shown in Table [tab:visual_example_ironing] (more examples in Appendix), that require in-depth image understanding. For comparisons, we quote the prompt and response of the multimodal GPT-4 from their paper, and query BLIP-2 and OpenFlamingo model checkpoints to get their responses. Surprisingly, although LLaVA is trained with a small multimodal instruction-following dataset ($\sim$80K unique images), it demonstrates reasoning results quite similar to multimodal GPT-4 on these examples. Note that while these images are out-of-domain for LLaVA, it is still able to understand the scenes and follow the question instruction to provide a reasonable response. In contrast, BLIP-2 and OpenFlamingo focus on describing the image, instead of following the user instruction to answer in an appropriate manner. #### Quantitative Evaluation. To gain a systematic understanding of the performance of LLaVA, we propose a quantitative metric to measure the model's instruction-following capability on multimodal data. Inspired by [vicuna](https://lmsys.org/blog/2023-03-30-vicuna/), we leverage GPT-4 to measure the quality of generated responses. Specifically, we create triplets consisting of image, ground-truth textual descriptions, and question. The candidate models (e.g., LLaVA) predict the answers based on the question and the image. To provide an *approximate theoretical upper bound*, we create a reference prediction based on the question and the *ground-truth* textual descriptions, using the text-only GPT-4. After obtaining the responses from both models, we feed the question, visual information (in the format of textual descriptions), and the generated responses from both assistants, to the judge (text-only GPT-4). It evaluates the helpfulness, relevance, accuracy, and level of detail of the responses from the assistants, and gives an overall score on a scale of 1 to 10, where a higher score indicates better overall performance. It is also asked to provide a comprehensive explanation for the evaluation, for us to better understand the models. We report relative scores *w.r.t.* the text-only GPT-4 model that uses the textual ground-truth description as visual input. We create two benchmarks to evaluate the model's performance. #### LLaVA-Bench (COCO). We randomly select 30 images from COCO-Val-2014, and for each image, we generate three types of questions (conversation, detailed description, complex reasoning) using the proposed data generation pipeline in Sec. [sec:visual_instruc_data], totaling 90 questions. This benchmark studies the model's alignment behavior and capabilities with consistent visual inputs. We vary the training datasets to study the effectiveness of different types of instruction-following data, and show the results in Table [tab:results]. First, with instruction tuning, the model's ability to follow user instructions improves significantly by over 50 points. 
Second, adding a small amount of detailed description and complex reasoning questions contributes to a considerable improvement of the model's overall capability by 7 points. Furthermore, it also improves the model's performance on conversational questions, suggesting that improvements in reasoning capabilities complement conversational abilities. Finally, we show that having all three types of data yields the best performance at 85.1%. #### LLaVA-Bench (In-the-Wild). To evaluate the model's capability in more challenging tasks and generalizability to novel domains, we collect a diverse set of 24 images with 60 questions in total, including indoor and outdoor scenes, memes, paintings, sketches, etc., and associate each image with a highly-detailed and manually-curated description and a proper selection of questions. We compare LLaVA, BLIP-2, and OpenFlamingo in Table [tab:results_wild]. Thanks to visual instruction tuning, LLaVA achieves significantly better performance compared with BLIP-2 (+29%) and OpenFlamingo (+48%). Compared to the text-only GPT-4 that has access to ground-truth labels, LLaVA achieves an impressive 81.7% performance on complex reasoning questions, with an overall score of 67.3%. #### Limitations. LLaVA-Bench (In-the-Wild) is designed to be challenging and to reveal a model's weaknesses. We provide two examples with associated captions and questions in Table [tab:example_bench]. For the ramen example (left), correctly answering the name of the restaurant requires the model to have a large knowledge coverage and multilingual understanding capability; correctly describing the side dishes may require the model to retrieve relevant multimodal information from the Internet. For the fridge example (right), perceiving the correct brand of the yogurt requires the model to process high-resolution images and possess extensive knowledge coverage. We also observed an interesting failure of LLaVA, as it responds with *yes* when asked if strawberry-flavored yogurt is present, even though the fridge contains only yogurt *and* strawberries. This indicates that, at times, LLaVA perceives the image as a "bag of patches", failing to grasp the complex semantics within the image. We hope LLaVA serves as a solid baseline on the benchmarks, on which our findings can inspire future work in developing more capable LMMs. ## ScienceQA ScienceQA [lu2022learn](http://arxiv.org/pdf/2209.09513v2) contains 21k multimodal multiple-choice questions with rich domain diversity across 3 subjects, 26 topics, 127 categories, and 379 skills. The benchmark dataset is split into training, validation, and test splits with 12726, 4241, and 4241 examples, respectively. We consider two representative methods, including the GPT-3.5 model (text-davinci-002) with and without chain-of-thought (CoT), LLaMA-Adapter [zhang2023llama](http://arxiv.org/pdf/2207.10858v1), as well as multimodal chain-of-thought (MM-CoT) [zhang2023multimodal](http://arxiv.org/pdf/2401.06805v2), which is the current SoTA method on this dataset. For more baseline numbers, please see [lu2022learn](http://arxiv.org/pdf/2209.09513v2). The results are reported in Table [tab:scienceqa_model_performance]. For LLaVA, we use the visual features before the last layer, ask the model to first predict reasons and then the answer, and train it for 12 epochs. It yields 90.92% accuracy, which is quite close to the SoTA 91.68%. To explore the limit of LLMs, we also prompt GPT-4 using 2-shot in-context-learning and achieve 82.69% accuracy, which is a 7.52% absolute gain compared with 75.17% from GPT-3.5. 
For a substantial number of questions, we note that GPT-4 fails simply because it reports that there is insufficient context such as images or plots. We consider two schemes to combine the outcomes from our model and GPT-4. $(i)$ *A GPT-4 complement*. Whenever GPT-4 fails to provide answers, we use the prediction from our method. This scheme yields 90.97% accuracy, which is almost the same as applying our method alone. $(ii)$ *GPT-4 as the judge*. Whenever GPT-4 and LLaVA produce different answers, we prompt GPT-4 again, asking it to provide its own final answer based on the question and two outcomes. The spirit is similar to CoT, but with the external knowledge from the other model. Surprisingly, this scheme is able to provide consistent improvement over all question classes, and achieves a new SoTA accuracy of 92.53%. Interestingly, the text-only GPT-4, which cannot process images, improves the overall performance of the model on questions that have an image as context. This is because some of these questions do not actually require the image context for a correct answer. The GPT-4 judge can identify such cases and correct some of the errors that LLaVA makes. See the example in Appendix. To the best of our knowledge, this is the first time that GPT-4 is used for model ensembling. We hope this finding can encourage future research to explore more effective methods to leverage LLMs for model ensembling.
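A hedged sketch of the two ensembling schemes just described; `ask_gpt4` and the abstention check are hypothetical helpers, and the prompt wording is illustrative rather than the exact prompt used in the paper.

```python
def ensemble_answer(question, llava_answer, gpt4_answer, ask_gpt4):
    """Two schemes, sketched with a hypothetical ask_gpt4() helper.
    (i) GPT-4 complement: fall back to LLaVA whenever GPT-4 abstains.
    (ii) GPT-4 as judge: when the two disagree, ask GPT-4 to arbitrate."""
    gpt4_failed = "insufficient context" in gpt4_answer.lower()

    # (i) Complement: use LLaVA's prediction when GPT-4 cannot answer.
    complement = llava_answer if gpt4_failed else gpt4_answer

    # (ii) Judge: re-prompt GPT-4 with both candidate answers on disagreement.
    if llava_answer != gpt4_answer:
        judge = ask_gpt4(
            f"Question: {question}\n"
            f"Answer 1: {llava_answer}\nAnswer 2: {gpt4_answer}\n"
            "Based on the question and the two answers, give your own final answer."
        )
    else:
        judge = llava_answer
    return complement, judge
```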
#### Ablations. We ablate several design choices on ScienceQA in Table [tab:scienceqa_ablation]. $(i)$ *Visual features*. We tried using the last layer feature from the CLIP vision encoder, which yields 89.96% and is 0.96% lower than the feature before the last layer. We hypothesize that this is because CLIP's last layer features may focus more on global and abstract image properties compared to the layer before it, which can focus more on localized properties that are useful for understanding specific image details. $(ii)$ *Chain-of-thought*. To decide the order between the answer and reasoning process in the model prediction, we run both variants and observe that answer-first reports the best number, 89.77% accuracy, in 12 epochs, while reasoning-first can quickly reach 89.77% accuracy in 6 epochs, but shows no further improvement with more training. Training the model for 24 epochs does not improve the performance. We conclude that a CoT-like reasoning-first strategy can largely improve convergence, but contributes relatively little to the final performance. $(iii)$ *Pre-training*. We skip pre-training and directly train on Science QA from scratch – performance drops to 85.81% accuracy. The 5.11% absolute degradation indicates the importance of our pre-training stage, in aligning multimodal features while preserving the vast pre-trained knowledge. $(iv)$ *Model size*. We keep all configurations the same as our best 13B model, and train a 7B model. This yields 89.84% accuracy, which is 1.08% lower than 90.92%, demonstrating the importance of model scale. # Conclusion This paper demonstrated the effectiveness of visual instruction tuning. We presented an automatic pipeline to create language-image instruction-following data, based on which we train LLaVA, a multimodal model to follow human intent to complete visual tasks. It achieves the new SoTA accuracy when fine-tuned on ScienceQA, and excellent visual chat capabilities when fine-tuned on multimodal chat data. Besides, we present the first benchmark to study multimodal instruction-following capability. This paper is an initial step in visual instruction tuning, and mainly focuses on real-life tasks. For more quantitative results of LLaVA on academic benchmarks, please refer to the improved baselines with visual instruction tuning [liu2023improvedllava](http://arxiv.org/pdf/2310.19145v1). We hope our work can inspire future research on building more capable multimodal models. **Acknowledgements.** We thank Baolin Peng and Pan Lu for valuable discussions on instruction-tuning language models and Science QA, respectively. We thank the LLaMA team for giving us access to their models, and open-source projects, including Alpaca and Vicuna. This work was supported in part by NSF CAREER IIS2150012, and Institute of Information & communications Technology Planning & Evaluation(IITP) grants funded by the Korea government(MSIT) (No. 2022-0-00871, Development of AI Autonomy and Knowledge Enhancement for AI Agent Collaboration) and (No. RS-2022-00187238, Development of Large Korean Language Model Technology for Efficient Pre-training). # Broader Impact The deployment and release of LLaVA, a general-purpose visual assistant, carries potential benefits and risks. Some considerations are unique to LLaVA due to its visual nature, while others share similarities with existing instruction-following LLMs (Alpaca, Vicuna, etc.). As LLaVA is built upon LLaMA, Vicuna, and CLIP, it inherits some of the issues associated with LLMs and vision encoders. 
In the following, we outline both the risks and mitigation strategies in place for the release of this model. #### Malicious input. To minimize potential misuse and harmful consequences, we employ two precautionary measures for LLaVA: (1) *OpenAI Filter API* for user input text to prevent harmful or inappropriate text instructions from being processed by the model, and (2) *NSFW Filter* for uploaded user images to detect and block Not Safe For Work (NSFW) content or any other potentially harmful image inputs. #### Hallucination. Similar to LLMs, LLaVA might generate outputs that aren't grounded in facts or input data. This raises concerns about inferences made, especially in critical applications (e.g., medical). #### Biases. Bias can be transferred from the base models to LLaVA, both from the vision encoder (CLIP) and the language decoder (LLaMA/Vicuna). This may lead to biased outcomes or unfair representations of diverse content. #### Energy consumption. Though energy consumption is not a primary concern for LLaVA due to a smaller pretraining dataset (see details in Sec. 3), it may become a concern when scaling up the pretraining dataset or increasing the model size, e.g., to a larger LLaMA version like the 65B model. #### Evaluation complexities. Assessing the performance of LLaVA is challenging as it involves both language and visual tasks. Our evaluation benchmark covers several aspects, including accuracy, concept coverage, reasoning ability, and creativity. However, additional aspects need consideration, such as the degree of visual content hallucination and fine-grained understanding of visual content. While text-only GPT-4 based multimodal evaluation is consistent and accurate in our study, its robustness in different situations and capability to evaluate other unexplored aspects are subjects for future work. Despite these risks, we believe that the benefits of releasing LLaVA to the research community outweigh the potential harm. It allows for ongoing investigation and improvement of the model and engages the community in developing better mitigation strategies to address these concerns. Moreover, the release of LLaVA can spur the development of new applications and research directions, ultimately contributing to the progress and responsible deployment of foundation models in vision-language tasks. # More Results We present more qualitative results of LLaVA to analyze its emergent behaviors and observed weaknesses. For more quantitative results of LLaVA on academic benchmarks, please refer to the improved baselines with visual instruction tuning [liu2023improvedllava](http://arxiv.org/pdf/2310.19145v1). In Table [tab:visual_example_chichken], LLaVA demonstrates a similar behavior to GPT-4 in another example from its paper. Similar to the GPT-4 live demo by OpenAI, LLaVA is capable of generating the HTML/JS/CSS code for an interactive joke website based on a simplified user input sketch in Fig. 1, despite a minor error. As shown in Fig. 2, LLaVA can follow the user's instructions in a conversational style and provide detailed responses or creative writings. Furthermore, LLaVA is able to relate the visual content to the textual knowledge from the pretrained LLM, as demonstrated in Fig. 3 and Fig. 4. One interesting emergent behavior of LLaVA is that it is able to understand visual contents that are not covered in the training. For example, in Fig. 
5, it is able to recognize Elon Musk both in a headshot and in a humorous meme where he is dressed as a doge, even though Elon Musk *never* appears in the training data for either the visual feature alignment or visual instruction tuning stages of LLaVA. LLaVA also demonstrates impressive OCR (optical character recognition) ability in Table [tab:visual_example_chichken] and Fig. 1, which is rarely covered in our training data. We hope these additional results and observations showcase the potential of LLaVA in various application areas. In future work, it is important to investigate these emergent behaviors more thoroughly and to understand the underlying mechanisms that enable LLaVA to demonstrate such generalization abilities. This will pave the way towards building better LMMs, including enhancing robustness, reducing biases, and improving the alignment and the scope of the learned vision-language representations.
LLaVA generates HTML/JS code for an interactive website based on user sketch inputs. The interactive interface works after fixing a minor error (in red) in the generated output. There is room for improvement in LLaVA's output, such as splitting the joke and punchline into two rows, and only revealing the punchline upon button click, to better reflect the user's intent.

LLaVA is capable of recognizing the visual content following the user's intent, without directly prompting for visual recognition. It also provides a detailed response when prompted with a follow-up request, and the generated response is closely related to the provided visual content.
LLaVA relates the movie scenes to the textual knowledge from the pretrained LLM.
LLaVA recognizes the famous artwork, the Mona Lisa, by Leonardo da Vinci. When we start a new conversation, it also explains the humorous artwork created on the web, mimicking the Mona Lisa.
An interesting emergent behavior of LLaVA is its ability to recognize Elon Musk both in a headshot and in a humorous meme where he is dressed as a doge. This implies that the pre-trained CLIP vision encoder may have seen images of Elon Musk. However, it is still surprising because Elon Musk never appears in the training data for either the visual feature alignment or visual instruction tuning stages of LLaVA, which indicates that the base language model generalizes to unseen visual concepts.
| | | |:---|:--:| | | | | Which material is this rocking chair made of? | | | Options: (A) wood (B) silk | | | | | | LECTURE: A material is a type of matter. | | | Wood, glass, metal, and plastic are common materials. | | | Some objects are made of just one material. | | | Most nails are made of metal. | | | Other objects are made of more than one material. | | | This hammer is made of metal and wood. | | | SOLUTION: Look at the picture of the rocking chair. The rocking chair is made of two different materials. The legs are made of wood, and the back and seat are made of silk. | | | The answer is B. | | | | | | The answer is A. BECAUSE: Rocking chairs are typically made of sturdy materials that can support the weight of a person and withstand the rocking motion. Wood is a common material used for making rocking chairs due to its strength, durability, and natural appearance. | | | | | | | | | Since there is no context or image to refer to, I cannot independently analyze the rocking chair in question. However, I can try to analyze the reasoning given by the assistants. | | | Assistant 1, who claims to have seen the image, mentions that the legs of the rocking chair are made of wood, and the back and seat are made of silk, implying that it’s a combination of two materials. However, this response seems less plausible, as silk is not a common material used for the construction of rocking chairs due to issues with stability and durability. Typically, one would expect the frame, legs, and seat or back to be made of wood, with cushions and padding made of fabrics, such as cotton or polyester. | | | Assistant 2 argues that rocking chairs are typically made of sturdy materials, and wood is a common material used for making rocking chairs due to its strength, durability, and natural appearance. This reasoning stands true and is based on common knowledge. | | | Given the information available, the most plausible answer is that the rocking chair is made of wood, as suggested by Assistant 1. The answer is A. | |
# Training Details [sec:appendix_training_details] We pre-train our model on the filtered CC-595K subset for 1 epoch with a learning rate of 2e-3 and a batch size of 128, and fine-tune on the proposed LLaVA-Instruct-158K dataset for 3 epochs, with a learning rate of 2e-5 and a batch size of 32. Following Vicuna, we use the Adam optimizer with no weight decay and a cosine learning rate with a warmup ratio of 3%. During finetuning, FSDP (Fully Sharded Data Parallel) and gradient checkpointing are used to save GPU memory, and offloading is not used. BF16 and TF32 are enabled to achieve a balance between speed and precision. We train all models with 8$\times$ A100s. Pretraining on CC-595K completes within 4 hours. Finetuning on Instruct-158K completes within 10 hours. Finetuning on ScienceQA completes within 4 hours. # Assets Our source code, generated instruction-tuning data, and proposed benchmark are uploaded to the anonymized GitHub repository: [LLaVA-Annonymous/LLaVA](https://github.com/LLaVA-Annonymous/LLaVA). 1. Source Code: [link](https://github.com/LLaVA-Annonymous/LLaVA) 2. README: [link](https://github.com/LLaVA-Annonymous/LLaVA) 3. Instructions to launch the demo: [link](https://github.com/LLaVA-Annonymous/LLaVA#web-ui) 4. All prompts and few shot examples for querying GPT-4: [link](https://github.com/LLaVA-Annonymous/LLaVA/tree/master/playground/data/prompts) 5. LLaVA-Instruct-158K: [link](https://github.com/LLaVA-Annonymous/LLaVA/blob/master/playground/data/llava_instruct_150k.json) 6. LLaVA-Bench: [COCO](https://github.com/LLaVA-Annonymous/LLaVA/blob/master/playground/data/coco2014_val_gpt4_qa_30x3.jsonl), [In-The-Wild](https://github.com/LLaVA-Annonymous/LLaVA/tree/master/playground/data/llava_bench_in_the_wild) 7. Model checkpoints. The size of the model checkpoints after compression is 25GB, which exceeds the 5GB limit of GitHub LFS (Large File Storage). We'll release the checkpoints to the public, or share them with the reviewers of this submission upon request. # Data [sec:appendix_data] #### Instructions for brief image description. The list of instructions used to briefly describe the image content is shown in Table [tab:concise_describe_instructions]. They present the same meaning with natural language variance.
- "Describe the image concisely." - "Provide a brief description of the given image." - "Offer a succinct explanation of the picture presented." - "Summarize the visual content of the image." - "Give a short and clear explanation of the subsequent image." - "Share a concise interpretation of the image provided." - "Present a compact description of the photo’s key features." - "Relay a brief, clear account of the picture shown." - "Render a clear and concise summary of the photo." - "Write a terse but informative summary of the picture." - "Create a compact narrative representing the image presented."
#### Instructions for detailed image description. The list of instructions used to describe the image content in detail are shown in Table [tab:detailed_describe_instructions]. They present the same meaning with natural language variance.
- "Describe the following image in detail" - "Provide a detailed description of the given image" - "Give an elaborate explanation of the image you see" - "Share a comprehensive rundown of the presented image" - "Offer a thorough analysis of the image" - "Explain the various aspects of the image before you" - "Clarify the contents of the displayed image with great detail" - "Characterize the image using a well-detailed description" - "Break down the elements of the image in a detailed manner" - "Walk through the important details of the image" - "Portray the image with a rich, descriptive narrative" - "Narrate the contents of the image with precision" - "Analyze the image in a comprehensive and detailed manner" - "Illustrate the image through a descriptive explanation" - "Examine the image closely and share its details" - "Write an exhaustive depiction of the given image"
#### CC3M. We extract noun-phrases using spaCy for each caption over the whole CC3M dataset, and count the frequency of each unique noun-phrase. We skip noun-phrases whose frequency is smaller than $3$, as they are usually rare combinations of concepts and attributes that are already covered by other captions. We then start from the noun-phrases with the lowest remaining frequency and add the captions that contain each noun-phrase to the candidate pool. If the frequency of the noun-phrase is larger than $100$, we randomly choose a subset of size $100$ out of all its captions. This results in around 595K image-text pairs. The comparison of noun-phrase statistics before and after filtering CC3M is shown in Figure [fig:cmp_noun_phrase_counter]. The filtered dataset shows a good coverage of concepts whose frequency is higher than 3, but with a smaller number of image-text pairs.
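A sketch of this filtering procedure, assuming spaCy's noun-chunk extraction; the helper name and per-caption bookkeeping are illustrative, while the frequency thresholds (3 and 100) come from the description above.

```python
import random
from collections import Counter, defaultdict

import spacy

nlp = spacy.load("en_core_web_sm")  # any spaCy English pipeline with a parser

def filter_cc3m(captions, min_freq=3, cap_per_phrase=100):
    """Count noun-phrase frequencies, drop phrases seen fewer than min_freq
    times, then sweep phrases from rarest to most frequent, adding at most
    cap_per_phrase of their captions to the candidate pool."""
    phrase_to_captions = defaultdict(set)
    counts = Counter()
    for i, cap in enumerate(captions):
        for chunk in nlp(cap).noun_chunks:
            phrase = chunk.text.lower()
            counts[phrase] += 1
            phrase_to_captions[phrase].add(i)

    selected = set()
    for phrase, freq in sorted(counts.items(), key=lambda kv: kv[1]):
        if freq < min_freq:
            continue
        candidates = list(phrase_to_captions[phrase])
        if freq > cap_per_phrase:
            candidates = random.sample(candidates, min(cap_per_phrase, len(candidates)))
        selected.update(candidates)
    return [captions[i] for i in sorted(selected)]
```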
# Prompts The prompt used to generate image-based conversation from ChatGPT/GPT-4 is shown in Table [tab:prompt_conversation].
| | |:---| | | | Design a conversation between you and a person asking about this photo. The answers should be in a tone that a visual AI assistant is seeing the image and answering the question. Ask diverse questions and give corresponding answers. | | Include questions asking about the visual content of the image, including the **object types, counting the objects, object actions, object locations, relative positions between objects**, etc. Only include questions that have definite answers: | | \(1\) one can see the content in the image that the question asks about and can answer confidently; | | \(2\) one can determine confidently from the image that it is not in the image. Do not ask any question that cannot be answered confidently. | | Also include complex questions that are relevant to the content in the image, for example, asking about background knowledge of the objects in the image, asking to discuss about events happening in the image, etc. Again, do not ask about uncertain details. Provide detailed answers when answering complex questions. For example, give detailed examples or reasoning steps to make the content more convincing and well-organized. You can include multiple paragraphs if necessary. | |
| | | |:---|:--:| | | | | A group of people standing outside of a black vehicle with various luggage. | | | Luggage surrounds a vehicle in an underground parking area | | | People try to fit all of their luggage in an SUV. | | | The sport utility vehicle is parked in the public garage, being packed for a trip | | | Some people with luggage near a van that is transporting it. | | | | | | person: \[0.681, 0.242, 0.774, 0.694\], person: \[0.63, 0.222, 0.686, 0.516\], person: \[0.444, 0.233, 0.487, 0.34\], backpack: \[0.384, 0.696, 0.485, 0.914\], backpack: \[0.755, 0.413, 0.846, 0.692\], suitcase: \[0.758, 0.413, 0.845, 0.69\], suitcase: \[0.1, 0.497, 0.173, 0.579\], bicycle: \[0.282, 0.363, 0.327, 0.442\], car: \[0.786, 0.25, 0.848, 0.322\], car: \[0.783, 0.27, 0.827, 0.335\], car: \[0.86, 0.254, 0.891, 0.3\], car: \[0.261, 0.101, 0.787, 0.626\] | | | | | | | | | Question: What type of vehicle is featured in the image? | | | Answer: The image features a black sport utility vehicle (SUV). | | | Question: Where is the vehicle parked? | | | Answer: The vehicle is parked in an underground parking area, likely in a public garage. | | | Question: What are the people in the image doing? | | | Answer: The people are trying to fit all of their luggage into the SUV, likely preparing for a trip. | | | | | | The image is an underground parking area with a black sport utility vehicle (SUV) parked. There are three people in the scene, with one person standing closer to the left side of the vehicle, another person in the middle, and the third person on the right side. They are all working together to pack their luggage into the SUV for a trip. | | | In the parking area, various luggage items are scattered around the vehicle. There are two backpacks, one located near the left rear wheel and the other closer to the right side of the vehicle. Additionally, there are two suitcases, one on the right side of the car and another further away near the center of the parking area. A bicycle can also be seen on the left side of the vehicle. | | | Other cars are parked around the main SUV, with one car positioned behind it and slightly to the left, another behind and slightly to the right, and the third car further behind on the right side. | | | | | | Question: What challenges do these people face? | | | Answer: In the image, a group of people is standing outside a black SUV in a parking area, surrounded by various pieces of luggage, including suitcases and backpacks. They are facing the challenge of fitting all their luggage into the black SUV. There are multiple suitcases and backpacks to be packed, which suggests that the group has a significant amount of belongings to accommodate. They might have to strategize and arrange the luggage efficiently to ensure that everything fits properly into the vehicle. Additionally, they need to consider the comfort of the passengers and visibility while driving, so the placement of the luggage must not obstruct the driver’s view or make the passengers uncomfortable during the trip. | |
**Context captions:**

- There is a movie theater that displays the show times above the doors.
- A red fire hydrant is deep in the snow.
- The fire hydrant is in the snow near a recently plowed sidewalk.
- This city has had a very hard winter with snow.
- A hotel for dogs in the snow in winter.

**Question:** What color is the fire hydrant in the image?

**Answer:** The fire hydrant in the image is red.

**Question:** Is the fire hydrant fully covered by snow or partially visible?

**Answer:** The fire hydrant is partially visible, with a significant portion of it covered by the deep snow.

**Question:** What is the condition of the sidewalk near the fire hydrant?

**Answer:** The sidewalk near the fire hydrant has been recently plowed, so it is relatively clear of snow compared to the surrounding area.

**Question:** What does the image suggest about the winter season in the city?

**Answer:** The image suggests that the winter season in the city has been quite harsh and challenging. The significant snow accumulation seen in the image indicates that the city has experienced heavy snowfall, which can lead to various difficulties and disruptions for its residents.

For instance, the deep snow covering the fire hydrant could potentially hinder its accessibility in case of emergencies, which poses safety concerns. The recently plowed sidewalk implies that the city’s maintenance crews have been actively working to keep the walkways clear and safe for pedestrians, but the sheer amount of snowfall might make it difficult to keep up with the continuous snow removal efforts.

Furthermore, such extreme winter conditions can impact transportation, leading to traffic issues, delays in public transit, and increased risks of accidents due to slippery roads. It can also cause problems for businesses and schools, as people might face challenges commuting to work or attending classes. Additionally, the heavy snow can put extra strain on infrastructure, such as roofs and power lines, increasing the likelihood of structural damage or power outages.

In conclusion, the image of the red fire hydrant deep in the snow and the recently plowed sidewalk suggest that the city has faced a particularly severe winter season, with substantial snowfall that has likely caused various challenges and disruptions for its residents and infrastructure.

**Context captions:**

- A man is skiing in the open snow covered hills
- A skier is making his way into the snow.
- A skier on the foothills of a huge mountain range.
- A skier looks at mountains while standing near a trail sign.
- a single person skiing by an area with a lot of bushes

**Question:** What is the skier doing?

**Answer:** The skier is skiing in the snow-covered hills, making his way through the snow and enjoying the beautiful mountain scenery. They are also standing near a trail sign, which indicates their location on the mountain.

**Question:** Are there any other people or skiers visible in the image?

**Answer:** No, there is only a single person skiing in the image, and they are surrounded by an area with a lot of bushes.

**Question:** What kind of skiing is the person doing? Cross-country or downhill?

**Answer:** Based on what I can see in the photo, it is not possible to definitively determine whether the skier is doing cross-country or downhill skiing. However, given the presence of a trail sign and the fact that they are skiing in the foothills, it is more likely that the skier is engaged in cross-country skiing. Downhill skiing typically takes place on more steep slopes and involves ski lifts, whereas cross-country skiing often occurs on relatively flatter terrain with marked trails.

**Question:** Can you explain cross-country and downhill skiing a bit more?

**Answer:** Cross-country skiing and downhill skiing are two distinct types of skiing, each with its own techniques, equipment, and terrain preferences.

1. Cross-Country Skiing: Also known as Nordic skiing, cross-country skiing involves traversing relatively flat terrain or gentle slopes using skis and poles. It is a popular recreational and competitive sport that focuses on endurance, balance, and aerobic fitness. Cross-country skiers use their own locomotion to move across snow-covered terrain, rather than relying on ski lifts or other forms of assistance. There are two main techniques in cross-country skiing:
   - Classic: This technique involves skiing in parallel tracks, using a forward-and-backward gliding motion while keeping the skis parallel to each other. The skier uses their poles to propel themselves forward.
   - Skate skiing: This technique is similar to ice skating, where the skier pushes off one ski at an angle and glides on the other. It requires a smooth, groomed surface and is typically faster than the classic technique.

   Cross-country ski equipment consists of lightweight skis, boots, bindings, and poles. The boots are more flexible than those used in downhill skiing, allowing for better ankle movement and control.

2. Downhill Skiing: Also known as alpine skiing, downhill skiing involves descending slopes at high speeds using skis and poles for balance and control. This sport is more focused on speed, technique, and navigating challenging terrain, including steep slopes, moguls, and even jumps. Downhill skiing can be further categorized into several disciplines, such as slalom, giant slalom, super-G, and downhill racing. Each discipline has its own set of rules, courses, and ski equipment. Downhill ski equipment includes heavier and stiffer skis, boots, bindings, and poles than those used in cross-country skiing. The boots are more rigid to provide better support and control during high-speed descents and sharp turns.

In summary, cross-country skiing is an endurance-based sport that involves traveling across flat or gently sloping terrain, while downhill skiing is focused on speed and technique as skiers navigate steeper slopes and challenging terrain. Both sports require specialized equipment and techniques, but they offer different experiences and challenges to participants.
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models 2023-01-30 Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi
The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models. BLIP-2 bridges the modality gap with a lightweight Querying Transformer, which is pre-trained in two stages. The first stage bootstraps vision-language representation learning from a frozen image encoder. The second stage bootstraps vision-to-language generative learning from a frozen language model. BLIP-2 achieves state-of-the-art performance on various vision-language tasks, despite having significantly fewer trainable parameters than existing methods. For example, our model outperforms Flamingo80B by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters. We also demonstrate the model's emerging capabilities of zero-shot image-to-text generation that can follow natural language instructions.
# Introduction Vision-language pre-training (VLP) research has witnessed a rapid advancement in the past few years, where pre-trained models with increasingly larger scale have been developed to continuously push the state-of-the-art on various downstream tasks [clip](None), [ALBEF](None), [blip](None), [ofa](None), [flamingo](None), [beit3](None). However, most state-of-the-art vision-language models incur a high computation cost during pre-training, due to end-to-end training using large-scale models and datasets. Vision-language research sits at the intersection between vision and language, therefore it is naturally expected that vision-language models can harvest from the readily-available unimodal models from the vision and natural language communities. In this paper, we propose a *generic* and *compute-efficient* VLP method by bootstrapping from off-the-shelf pre-trained vision models and language models. Pre-trained vision models offer high-quality visual representation. Pre-trained language models, in particular *large language models* (LLMs), offer strong language generation and zero-shot transfer abilities. To reduce computation cost and counteract the issue of catastrophic forgetting, the unimodal pre-trained models remain frozen during the pre-training. In order to leverage pre-trained unimodal models for VLP, it is key to facilitate cross-modal alignment. However, since LLMs have not seen images during their unimodal pre-training, freezing them makes vision-language alignment particularly challenging. In this regard, existing methods (e.g., Frozen [Frozen](None), Flamingo [flamingo](None)) resort to an image-to-text generation loss, which we show is insufficient to bridge the modality gap. To achieve effective vision-language alignment with frozen unimodal models, we propose a Querying Transformer (Q-Former) pre-trained with a new two-stage pre-training strategy. As shown in Figure [fig:teaser], the Q-Former is a lightweight transformer which employs a set of learnable query vectors to extract visual features from the frozen image encoder. It acts as an information bottleneck between the frozen image encoder and the frozen LLM, where it feeds the most useful visual features for the LLM to output the desired text. In the first pre-training stage, we perform vision-language representation learning which enforces the Q-Former to learn the visual representation most relevant to the text. In the second pre-training stage, we perform vision-to-language generative learning by connecting the output of the Q-Former to a frozen LLM, and training the Q-Former such that its output visual representation can be interpreted by the LLM. We name our VLP framework BLIP-2: Bootstrapping Language-Image Pre-training with frozen unimodal models. The key advantages of BLIP-2 include: - BLIP-2 effectively leverages both frozen pre-trained image models and language models. We bridge the modality gap using a Q-Former pre-trained in two stages: a representation learning stage and a generative learning stage. BLIP-2 achieves state-of-the-art performance on various vision-language tasks including visual question answering, image captioning, and image-text retrieval. - Powered by LLMs (e.g., OPT [opt](None), FlanT5 [flanT5](None)), BLIP-2 can be prompted to perform zero-shot image-to-text generation that follows natural language instructions, which enables emerging capabilities such as visual knowledge reasoning, visual conversation, etc. (see Figure [fig:example] for examples).
- Due to the use of frozen unimodal models and a lightweight Q-Former, BLIP-2 is more compute-efficient than existing state-of-the-art methods. For example, BLIP-2 outperforms Flamingo [flamingo](None) by 8.7% on zero-shot VQAv2, while using 54$\times$ fewer trainable parameters. Furthermore, our results show that BLIP-2 is a generic method that can harvest more advanced unimodal models for better VLP performance. # Related Work ## End-to-end Vision-Language Pre-training Vision-language pre-training aims to learn multimodal foundation models with improved performance on various vision-and-language tasks. Depending on the downstream task, different model architectures have been proposed, including the dual-encoder architecture [clip](None), [align](None), the fusion-encoder architecture [LXMERT](None), [ALBEF](None), the encoder-decoder architecture [VL_T5](None), [simvlm](None), [pali](None), and more recently, the unified transformer architecture [blip](None), [beit3](None). Various pre-training objectives have also been proposed over the years, and have progressively converged to a few time-tested ones: image-text contrastive learning [clip](None), [filip](None), [ALBEF](None), [blip](None), image-text matching [ALBEF](None), [blip](None), [VLMo](None), and (masked) language modeling [ALBEF](None), [blip](None), [coca](None), [beit3](None). Most VLP methods perform end-to-end pre-training using large-scale image-text pair datasets. As the model size keeps increasing, the pre-training can incur an extremely high computation cost. Moreover, it is inflexible for end-to-end pre-trained models to leverage readily-available unimodal pre-trained models, such as LLMs [gpt3](None), [opt](None), [flanT5](None). ## Modular Vision-Language Pre-training More similar to us are methods that leverage off-the-shelf pre-trained models and keep them frozen during VLP. Some methods freeze the image encoder, including the early work which adopts a frozen object detector to extract visual features [uniter](None), [oscar](None), [vinvl](None), and the recent LiT [LiT](None) which uses a frozen pre-trained image encoder for CLIP [clip](None) pre-training. Some methods freeze the language model to use the knowledge from LLMs for vision-to-language generation tasks [Frozen](None), [flamingo](None), [vgpt](None), [mapl](None), [pnp-vqa](None), [img2prompt](None). The key challenge in using a frozen LLM is to align visual features to the text space. To achieve this, Frozen [Frozen](None) finetunes an image encoder whose outputs are directly used as soft prompts for the LLM. Flamingo [flamingo](None) inserts new cross-attention layers into the LLM to inject visual features, and pre-trains the new layers on billions of image-text pairs. Both methods adopt the language modeling loss, where the language model generates texts conditioned on the image. Different from existing methods, BLIP-2 can effectively and efficiently leverage both frozen image encoders and frozen LLMs for various vision-language tasks, achieving stronger performance at a lower computation cost. # Method [sec:method] We propose BLIP-2, a new vision-language pre-training method that bootstraps from frozen pre-trained unimodal models. In order to bridge the modality gap, we propose a Querying Transformer (Q-Former) pre-trained in two stages: (1) a vision-language representation learning stage with a frozen image encoder and (2) a vision-to-language generative learning stage with a frozen LLM.
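Before the detailed description of the Q-Former that follows, here is a minimal PyTorch-style sketch of how the frozen image encoder, the trainable Q-Former, and the frozen LLM are wired together. Module names, shapes, and the `frozen_vit` / `qformer` / `frozen_llm` stand-ins are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class BLIP2WiringSketch(nn.Module):
    """Illustrative wiring only: frozen ViT -> Q-Former -> linear projection -> frozen LLM."""

    def __init__(self, frozen_vit, qformer, frozen_llm, llm_dim, num_queries=32, q_dim=768):
        super().__init__()
        for module in (frozen_vit, frozen_llm):        # both unimodal models stay frozen
            for p in module.parameters():
                p.requires_grad_(False)
        self.frozen_vit = frozen_vit
        self.qformer = qformer                          # the trainable bridge (~188M parameters in the paper)
        self.frozen_llm = frozen_llm                    # assumed to accept precomputed input embeddings
        self.queries = nn.Parameter(torch.zeros(1, num_queries, q_dim))   # learnable query embeddings
        self.proj = nn.Linear(q_dim, llm_dim)           # stage-2 fully-connected projection

    def forward(self, pixels, text_embeds):
        with torch.no_grad():
            img_feats = self.frozen_vit(pixels)         # e.g. (B, 257, 1024) for ViT-L/14
        batch = img_feats.size(0)
        z = self.qformer(self.queries.expand(batch, -1, -1), img_feats)   # (B, 32, 768) query outputs
        soft_prompt = self.proj(z)                      # map into the LLM embedding space
        llm_inputs = torch.cat([soft_prompt, text_embeds], dim=1)         # prepend as a soft visual prompt
        return self.frozen_llm(inputs_embeds=llm_inputs)
```

The point of the sketch is that only the Q-Former and the small projection layer receive gradients; everything upstream and downstream is reused as-is.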
This section first introduces the model architecture of the Q-Former, and then delineates the two-stage pre-training procedures. ## Model Architecture We propose the Q-Former as the trainable module to bridge the gap between a frozen image encoder and a frozen LLM. It extracts a fixed number of output features from the image encoder, independent of input image resolution. As shown in Figure [fig:stage1], the Q-Former consists of two transformer submodules that share the same self-attention layers: (1) an image transformer that interacts with the frozen image encoder for visual feature extraction, (2) a text transformer that can function as both a text encoder and a text decoder. We create a set number of learnable query embeddings as input to the image transformer. The queries interact with each other through self-attention layers, and interact with frozen image features through cross-attention layers (inserted every other transformer block). The queries can additionally interact with the text through the same self-attention layers. Depending on the pre-training task, we apply different self-attention masks to control query-text interaction. We initialize the Q-Former with the pre-trained weights of BERT$_\text{base}$ [bert](None), whereas the cross-attention layers are randomly initialized. In total, the Q-Former contains 188M parameters. Note that the queries are considered as model parameters. In our experiments, we use 32 queries where each query has a dimension of 768 (same as the hidden dimension of the Q-Former). We use $Z$ to denote the output query representation. The size of $Z$ ($32\times768$) is much smaller than the size of the frozen image features (e.g., $257\times1024$ for ViT-L/14). This bottleneck architecture works together with our pre-training objectives to force the queries to extract the visual information that is most relevant to the text. ## Bootstrap Vision-Language Representation Learning from a Frozen Image Encoder In the representation learning stage, we connect the Q-Former to a frozen image encoder and perform pre-training using image-text pairs. We aim to train the Q-Former such that the queries can learn to extract the visual representation that is most informative of the text. Inspired by BLIP [blip](None), we jointly optimize three pre-training objectives that share the same input format and model parameters. Each objective employs a different attention masking strategy between queries and text to control their interaction (see Figure [fig:stage1]). **Image-Text Contrastive Learning** (ITC) learns to align image representation and text representation such that their mutual information is maximized. It does so by contrasting the image-text similarity of a positive pair against those of negative pairs. We align the output query representation $Z$ from the image transformer with the text representation $t$ from the text transformer, where $t$ is the output embedding of the `[CLS]` token. Since $Z$ contains multiple output embeddings (one from each query), we first compute the pairwise similarity between each query output and $t$, and then select the highest one as the image-text similarity. To avoid information leakage, we employ a unimodal self-attention mask, where the queries and text are not allowed to see each other. Due to the use of a frozen image encoder, we can fit more samples per GPU compared to end-to-end methods. Therefore, we use in-batch negatives instead of the momentum queue in BLIP. **Image-grounded Text Generation** (ITG) loss trains the Q-Former to generate texts, given input images as the condition.
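Before continuing with the ITG objective, here is a small sketch of the ITC similarity just described: each of the 32 query outputs is compared to the text `[CLS]` embedding and the highest score is kept, with in-batch negatives providing the contrastive targets. Function names, shapes, and the temperature value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def itc_similarity(z, t, temperature=0.07):
    """Max-over-queries image-text similarity (sketch of the ITC objective).

    z: (B, 32, D) query outputs from the image transformer
    t: (B, D)     [CLS] embedding from the text transformer
    Returns a (B, B) image-to-text similarity matrix.
    """
    z = F.normalize(z, dim=-1)
    t = F.normalize(t, dim=-1)
    sim = torch.einsum("bqd,cd->bcq", z, t)   # similarity of every query to every text
    sim = sim.max(dim=-1).values              # keep the highest-scoring query per image-text pair
    return sim / temperature

def itc_loss(z, t):
    """Symmetric in-batch contrastive loss over the similarity matrix."""
    sim = itc_similarity(z, t)
    targets = torch.arange(sim.size(0), device=sim.device)
    return 0.5 * (F.cross_entropy(sim, targets) + F.cross_entropy(sim.t(), targets))
```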
Since the architecture of the Q-Former does not allow direct interactions between the frozen image encoder and the text tokens, the information required for generating the text must be first extracted by the queries, and then passed to the text tokens via self-attention layers. Therefore, the queries are forced to extract visual features that capture all the information about the text. We employ a multimodal causal self-attention mask to control query-text interaction, similar to the one used in UniLM [UniLM](None). The queries can attend to each other but not the text tokens. Each text token can attend to all queries and its previous text tokens. We also replace the `[CLS]` token with a new `[DEC]` token as the first text token to signal the decoding task. **Image-Text Matching** (ITM) aims to learn fine-grained alignment between image and text representation. It is a binary classification task where the model is asked to predict whether an image-text pair is positive (matched) or negative (unmatched). We use a bi-directional self-attention mask where all queries and texts can attend to each other. The output query embeddings $Z$ thus capture multimodal information. We feed each output query embedding into a two-class linear classifier to obtain a logit, and average the logits across all queries as the output matching score. We adopt the hard negative mining strategy from [ALBEF](None), [blip](None) to create informative negative pairs. ## Bootstrap Vision-to-Language Generative Learning from a Frozen LLM In the generative pre-training stage, we connect the Q-Former (with the frozen image encoder attached) to a frozen LLM to harvest the LLM’s generative language capability. As shown in Figure [fig:stage2], we use a fully-connected (FC) layer to linearly project the output query embeddings $Z$ into the same dimension as the text embeddings of the LLM. The projected query embeddings are then prepended to the input text embeddings. They function as *soft visual prompts* that condition the LLM on the visual representation extracted by the Q-Former. Since the Q-Former has been pre-trained to extract language-informative visual representations, it effectively functions as an information bottleneck that feeds the most useful information to the LLM while removing irrelevant visual information. This reduces the burden of the LLM to learn vision-language alignment, thus mitigating the catastrophic forgetting problem. We experiment with two types of LLMs: decoder-based LLMs and encoder-decoder-based LLMs. For decoder-based LLMs, we pre-train with the language modeling loss, where the frozen LLM is tasked to generate the text conditioned on the visual representation from the Q-Former. For encoder-decoder-based LLMs, we pre-train with the prefix language modeling loss, where we split a text into two parts. The prefix text is concatenated with the visual representation as input to the LLM’s encoder. The suffix text is used as the generation target for the LLM’s decoder. ## Model Pre-training **Pre-training data.** We use the same pre-training dataset as BLIP with 129M images in total, including COCO [coco](None), Visual Genome [VG](None), CC3M [CC](None), CC12M [cc12m](None), SBU [sbu](None), and 115M images from the LAION400M dataset [laion](None). We adopt the CapFilt method [blip](None) to create synthetic captions for the web images. Specifically, we generate 10 captions using the BLIP$_\mathrm{large}$ captioning model, and rank the synthetic captions along with the original web caption based on the image-text similarity produced by a CLIP ViT-L/14 model.
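As a concrete illustration of the caption-ranking step just mentioned, the snippet below scores candidate captions against an image with an off-the-shelf CLIP ViT-L/14 model and sorts them by similarity; the continuation after the code describes how the top-ranked captions are then used during pre-training. The checkpoint name and the helper function are assumptions, not the authors' pipeline.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")       # CLIP ViT-L/14
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def rank_captions(image: Image.Image, captions: list) -> list:
    """Return the captions sorted by CLIP image-text similarity (best first)."""
    inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = clip.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = clip.get_text_features(input_ids=inputs["input_ids"],
                                         attention_mask=inputs["attention_mask"])
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    scores = (txt_emb @ img_emb.T).squeeze(-1)    # one similarity score per caption
    order = scores.argsort(descending=True)
    return [captions[i] for i in order.tolist()]
```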
We keep the top-two captions per image as training data and randomly sample one at each pre-training step. **Pre-trained image encoder and LLM.** For the frozen image encoder, we explore two state-of-the-art pre-trained vision transformer models: (1) ViT-L/14 from CLIP [clip](None) and (2) ViT-g/14 from EVA-CLIP [eva](None). We remove the last layer of the ViT and use the second-to-last layer’s output features, which leads to slightly better performance. For the frozen language model, we explore the unsupervised-trained OPT model family [opt](None) for decoder-based LLMs, and the instruction-trained FlanT5 model family [flanT5](None) for encoder-decoder-based LLMs. **Pre-training settings.** We pre-train for 250k steps in the first stage and 80k steps in the second stage. We use a batch size of 2320/1680 for ViT-L/ViT-g in the first stage and a batch size of 1920/1520 for OPT/FlanT5 in the second stage. During pre-training, we convert the frozen ViTs’ and LLMs’ parameters into FP16, except for FlanT5 where we use BFloat16. We found no performance degradation compared to using 32-bit models. Due to the use of frozen models, our pre-training is more computationally friendly than existing large-scale VLP methods. For example, using a single 16-A100(40G) machine, our largest model with ViT-g and FlanT5-XXL requires less than 6 days for the first stage and less than 3 days for the second stage. The same set of pre-training hyper-parameters is used for all models. We use the AdamW [adamw](None) optimizer with $\beta_1=0.9$, $\beta_2=0.98$, and a weight decay of 0.05. We use a cosine learning rate decay with a peak learning rate of 1e-4 and a linear warmup of 2k steps. The minimum learning rate at the second stage is 5e-5. We use images of size 224$\times$224, augmented with random resized cropping and horizontal flipping. # Experiment Table 1 provides an overview of the performance of BLIP-2 on various zero-shot vision-language tasks. Compared to previous state-of-the-art models, BLIP-2 achieves improved performance while requiring substantially fewer trainable parameters during vision-language pre-training. ## Instructed Zero-shot Image-to-Text Generation BLIP-2 effectively enables an LLM to understand images while preserving its capability in following text prompts, which allows us to control image-to-text generation with instructions. We simply append the text prompt after the visual prompt as input to the LLM. Figure [fig:example] shows examples that demonstrate a wide range of zero-shot image-to-text capabilities including visual knowledge reasoning, visual commonsense reasoning, visual conversation, personalized image-to-text generation, etc. **Zero-shot VQA**. We perform quantitative evaluation on the zero-shot visual question answering task. For OPT models, we use the prompt “Question: {} Answer:”. For FlanT5 models, we use the prompt “Question: {} Short answer:”. During generation, we use beam search with a beam width of 5. We also set the length-penalty to -1, which encourages shorter answers that align better with human annotation. As shown in Table [tbl:vqa_zeroshot], BLIP-2 achieves state-of-the-art results on the VQAv2 [VQA2](None) and GQA [GQA](None) datasets. It outperforms Flamingo80B by 8.7% on VQAv2, despite having 54x fewer trainable parameters. On the OK-VQA [okvqa](None) dataset, BLIP-2 comes second to Flamingo80B.
We hypothesize that this is because OK-VQA focuses more on open-world knowledge than visual understanding, and the 70B Chinchilla [chinchilla](None) language model from Flamingo80B possesses more knowledge than the 11B FlanT5$_\text{XXL}$. We make a promising observation from Table [tbl:vqa_zeroshot]: **a stronger image encoder or a stronger LLM both lead to better performance.** This observation is supported by several facts: (1) ViT-g outperforms ViT-L for both OPT and FlanT5. (2) Within the same LLM family, larger models outperform smaller ones. (3) FlanT5, an instruction-tuned LLM, outperforms the unsupervised-trained OPT on VQA. This observation validates BLIP-2 as a **generic vision-language pre-training method** that can efficiently harvest the rapid advances in the vision and natural language communities. **Effect of Vision-Language Representation Learning.** The first-stage representation learning pre-trains the Q-Former to learn visual features relevant to the text, which reduces the burden of the LLM to learn vision-language alignment. Without the representation learning stage, the Q-Former relies solely on the vision-to-language generative learning to bridge the modality gap, which is similar to the Perceiver Resampler in Flamingo. Figure [fig:qformer_effect] shows the effect of representation learning on generative learning. Without representation learning, both types of LLMs give substantially lower performance on zero-shot VQA. In particular, OPT suffers from catastrophic forgetting where performance drastically degrades as training proceeds. ## Image Captioning We finetune BLIP-2 models for the image captioning task, which asks the model to generate a text description for the image’s visual content. We use the prompt “a photo of” as an initial input to the LLM and train the model to generate the caption with the language modeling loss. We keep the LLM frozen during finetuning, and update the parameters of the Q-Former together with the image encoder. We experiment with ViT-g and various LLMs. Detailed hyperparameters can be found in the appendix. We perform finetuning on COCO, and evaluate on both the COCO test set and zero-shot transfer to the NoCaps [nocaps](None) validation set. The results are shown in Table [tbl:caption]. BLIP-2 achieves state-of-the-art performance with significant improvement on NoCaps over existing methods, demonstrating strong generalization ability to out-of-domain images. ## Visual Question Answering Given annotated VQA data, we finetune the parameters of the Q-Former and the image encoder while keeping the LLM frozen. We finetune with the open-ended answer generation loss, where the LLM receives the Q-Former’s output and the question as input, and is asked to generate the answer. In order to extract image features that are more relevant to the question, we additionally condition the Q-Former on the question. Specifically, the question tokens are given as input to the Q-Former and interact with the queries via the self-attention layers, which can guide the Q-Former’s cross-attention layers to focus on more informative image regions. Following BLIP, our VQA data includes the training and validation splits from VQAv2, as well as training samples from Visual Genome. Table [tbl:vqa_finetune] demonstrates the state-of-the-art results of BLIP-2 among open-ended generation models. ## Image-Text Retrieval Since image-text retrieval does not involve language generation, we directly finetune the first-stage-pretrained model without the LLM.
Specifically, we finetune the image encoder together with the Q-Former on COCO using the same objectives (i.e., ITC, ITM, and ITG) as pre-training. We then evaluate the model for both image-to-text retrieval and text-to-image retrieval on the COCO and Flickr30K [flickr](None) datasets. During inference, we follow [ALBEF](None), [blip](None), which first selects $k=128$ candidates based on the image-text feature similarity, followed by a re-ranking based on pairwise ITM scores. We experiment with both ViT-L and ViT-g as the image encoder. Detailed hyperparameters can be found in the appendix. The results are shown in Table [tbl:retrieval]. BLIP-2 achieves state-of-the-art performance with significant improvement over existing methods on zero-shot image-text retrieval. The ITC and ITM losses are essential for image-text retrieval as they directly learn image-text similarity. In Table [tbl:retrieval_ablation], we show that the ITG (image-grounded text generation) loss is also beneficial for image-text retrieval. This result supports our intuition in designing the representation learning objectives: the ITG loss enforces the queries to extract visual features most relevant to the text, thus improving vision-language alignment. # Limitation [sec:limitation] Recent LLMs can perform in-context learning given few-shot examples. However, our experiments with BLIP-2 do not observe an improved VQA performance when providing the LLM with in-context VQA examples. We attribute the lack of in-context learning capability to our pre-training dataset, which only contains a single image-text pair per sample. The LLMs cannot learn from it the correlation among multiple image-text pairs in a single sequence. The same observation is also reported in the Flamingo paper, which uses a closed-source interleaved image and text dataset (M3W) with multiple image-text pairs per sequence. We aim to create a similar dataset in future work. BLIP-2’s image-to-text generation could have unsatisfactory results due to various reasons including inaccurate knowledge from the LLM, activating an incorrect reasoning path, or not having up-to-date information about new image content (see Figure [fig:example_limitation]). Furthermore, due to the use of frozen models, BLIP-2 inherits the risks of LLMs, such as outputting offensive language, propagating social bias, or leaking private information. Remediation approaches include using instructions to guide the model’s generation or training on a filtered dataset with harmful content removed. # Conclusion We propose BLIP-2, a generic and compute-efficient method for vision-language pre-training that leverages frozen pre-trained image encoders and LLMs. BLIP-2 achieves state-of-the-art performance on various vision-language tasks while having a small number of trainable parameters during pre-training. BLIP-2 also demonstrates emerging capabilities in zero-shot instructed image-to-text generation. We consider BLIP-2 an important step towards building a multimodal conversational AI agent.
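Before moving on to Flamingo, here is a usage-level sketch of the zero-shot VQA prompting described in the Experiment section above, assuming the Hugging Face `transformers` port of the released BLIP-2 checkpoints; the checkpoint name, the image path, and the question are placeholders.

```python
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

image = Image.open("example_image.png").convert("RGB")           # placeholder image
prompt = "Question: What color is the fire hydrant? Answer:"     # OPT-style prompt from the paper

inputs = processor(images=image, text=prompt, return_tensors="pt")
# beam search with width 5 and a negative length penalty, as in the zero-shot VQA setup
output_ids = model.generate(**inputs, num_beams=5, length_penalty=-1.0, max_new_tokens=20)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip())
```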
Flamingo: a Visual Language Model for Few-Shot Learning 2022-04-29 Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, Karen Simonyan
Building models that can be rapidly adapted to novel tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research. We introduce Flamingo, a family of Visual Language Models (VLM) with this ability. We propose key architectural innovations to: (i) bridge powerful pretrained vision-only and language-only models, (ii) handle sequences of arbitrarily interleaved visual and textual data, and (iii) seamlessly ingest images or videos as inputs. Thanks to their flexibility, Flamingo models can be trained on large-scale multimodal web corpora containing arbitrarily interleaved text and images, which is key to endow them with in-context few-shot learning capabilities. We perform a thorough evaluation of our models, exploring and measuring their ability to rapidly adapt to a variety of image and video tasks. These include open-ended tasks such as visual question-answering, where the model is prompted with a question which it has to answer; captioning tasks, which evaluate the ability to describe a scene or an event; and close-ended tasks such as multiple-choice visual question-answering. For tasks lying anywhere on this spectrum, a single Flamingo model can achieve a new state of the art with few-shot learning, simply by prompting the model with task-specific examples. On numerous benchmarks, Flamingo outperforms models fine-tuned on thousands of times more task-specific data.
#### Acknowledgments and Disclosure of Funding. This research was funded by DeepMind. We would like to thank many colleagues for useful discussions, suggestions, feedback, and advice, including: Samuel Albanie, Relja Arandjelović, Kareem Ayoub, Lorrayne Bennett, Adria Recasens Continente, Tom Eccles, Nando de Freitas, Sander Dieleman, Conor Durkan, Aleksa Gordić, Raia Hadsell, Will Hawkins, Lisa Anne Hendricks, Felix Hill, Jordan Hoffmann, Geoffrey Irving, Drew Jaegle, Koray Kavukcuoglu, Agustin Dal Lago, Mateusz Malinowski, Soňa Mokrá, Gaby Pearl, Toby Pohlen, Jack Rae, Laurent Sifre, Francis Song, Maria Tsimpoukelli, Gregory Wayne, and Boxi Wu. # Appendix [appendix]
Results overview. Left: our largest model, dubbed Flamingo, outperforms state-of-the-art fine-tuned models on 6 of the 16 tasks we consider with no fine-tuning. For the 9 tasks with published few-shot results, Flamingo sets the new few-shot state of the art. Note: we omit RareAct, our 16th benchmark, as it is a zero-shot benchmark with no available fine-tuned results to compare to. Right: Flamingo performance improves with model size and number of shots.
# Introduction One key aspect of intelligence is the ability to quickly learn to perform a new task given a short instruction [griffiths2019doing](None), [markman1989categorization](None). While initial progress has been made towards a similar capability in computer vision, the most widely used paradigm still consists of first pretraining on a large amount of supervised data, before fine-tuning the model on the task of interest [lu2019vilbert](None), [wang2021ufo](None), [zellers2022merlot](None). However, successful fine-tuning often requires many thousands of annotated data points. In addition, it often requires careful per-task hyperparameter tuning and is also resource intensive. Recently, multimodal vision-language models trained with a contrastive objective [align](None), [clip](None) have enabled zero-shot adaptation to novel tasks, without the need for fine-tuning. However, because these models simply provide a similarity score between a text and an image, they can only address limited use cases such as classification, where a finite set of outcomes is provided beforehand. They crucially lack the ability to generate language, which makes them less suitable to more open-ended tasks such as captioning or visual question-answering. Others have explored visually-conditioned language generation [wang2021simvlm](None), [tsimpoukelli2021multimodal](None), [cho2021unifying](None), [wang2022unifying](None), [xu2021vlm](None) but have not yet shown good performance in low-data regimes. We introduce Flamingo, a Visual Language Model (VLM) that sets a new state of the art in few-shot learning on a wide range of open-ended vision and language tasks, simply by being prompted with a few input/output examples, as illustrated in Figure [fig:teaser]. Of the 16 tasks we consider, Flamingo also surpasses the fine-tuned state of the art on 6 tasks, despite using orders of magnitude less task-specific training data (see Figure 1). To achieve this, Flamingo takes inspiration from recent work on large language models (LMs) which are good few-shot learners [gpt3](None), [gopher](None), [chinchilla](None), [chowdhery2022palm](None). A single large LM can achieve strong performance on many tasks using only its text interface: a few examples of a task are provided to the model as a prompt, along with a query input, and the model generates a continuation to produce a predicted output for that query. We show that the same can be done for image and video understanding tasks such as classification, captioning, or question-answering: these can be cast as text prediction problems with visual input conditioning. The difference from an LM is that the model must be able to ingest a multimodal prompt containing images and/or videos interleaved with text. Flamingo models have this capability—they are visually-conditioned autoregressive text generation models able to ingest a sequence of text tokens interleaved with images and/or videos, and produce text as output. Flamingo models leverage two complementary pre-trained and frozen models: a vision model which can “perceive” visual scenes and a large LM which performs a basic form of reasoning. Novel architecture components are added in between these models to connect them in a way that preserves the knowledge they have accumulated during computationally intensive pre-training.
Flamingo models are also able to ingest high-resolution images or videos thanks to a Perceiver-based [jaegle2021perceiver](None) architecture that can produce a small fixed number of visual tokens per image/video, given a large and variable number of visual input features. A crucial aspect for the performance of large LMs is that they are trained on a large amount of text data. This training provides general-purpose generation capabilities that allow these LMs to perform well when prompted with task examples. Similarly, we demonstrate that the way we train the Flamingo models is crucial for their final performance. They are trained on a carefully chosen mixture of complementary large-scale multimodal data coming only from the web, *without using any data annotated for machine learning purposes*. After this training, a Flamingo model can be directly adapted to vision tasks via simple few-shot learning without any task-specific tuning. **Contributions.** In summary, our contributions are the following: **(i)** We introduce the Flamingo family of VLMs which can perform various multimodal tasks (such as captioning, visual dialogue, or visual question-answering) from only a few input/output examples. Thanks to architectural innovations, the Flamingo models can efficiently accept arbitrarily interleaved visual data and text as input and generate text in an open-ended manner. **(ii)** We quantitatively evaluate how Flamingo models can be adapted to various tasks via few-shot learning. We notably reserve a large set of held-out benchmarks which have not been used for validation of any design decisions or hyperparameters of the approach. We use these to estimate unbiased few-shot performance. **(iii)** Flamingo sets a new state of the art in few-shot learning on a wide array of 16 multimodal language and image/video understanding tasks. On 6 of these 16 tasks, Flamingo also outperforms the fine-tuned state of the art despite using only 32 task-specific examples, around 1000 times less task-specific training data than the current state of the art. With a larger annotation budget, Flamingo can also be effectively fine-tuned to set a new state of the art on five additional challenging benchmarks: VQAv2, VATEX, VizWiz, MSRVTTQA, and HatefulMemes. # Approach [sec:approach]
architecture overview. Flamingo is a family of visual language models (VLMs) that take as input visual data interleaved with text and produce free-form text as output.
This section describes Flamingo: a visual language model that accepts text interleaved with images/videos as input and outputs free-form text. The key architectural components shown in Figure 2 are chosen to leverage pretrained vision and language models and bridge them effectively. First, the Perceiver Resampler (Section 2.1) receives spatio-temporal features from the Vision Encoder (obtained from either an image or a video) and outputs a fixed number of visual tokens. Second, these visual tokens are used to condition the frozen LM using freshly initialised cross-attention layers (Section 2.2) that are interleaved between the pretrained LM layers. These new layers offer an expressive way for the LM to incorporate visual information for the next-token prediction task. Flamingo models the likelihood of text $y$ conditioned on interleaved images and videos $x$ as follows: $$\begin{aligned} p(y | x) = \prod_{\ell=1}^L p(y_\ell | y_{< \ell}, x_{\leq \ell}), \label{eq:modeling} \end{aligned}$$ where $y_{\ell}$ is the $\ell$-th language token of the input text, $y_{<\ell}$ is the set of preceding tokens, $x_{\leq \ell}$ is the set of images/videos preceding token $y_{\ell}$ in the interleaved sequence and $p$ is parametrized by a model. The ability to handle interleaved text and visual sequences (Section 2.3) makes it natural to use models for in-context few-shot learning, analogously to GPT-3 with few-shot text prompting. The model is trained on a diverse mixture of datasets as described in Section 2.4. ## Visual processing and the Perceiver Resampler [sec:transformer_resampler] **Vision Encoder: from pixels to features.** Our vision encoder is a pretrained and frozen Normalizer-Free ResNet (NFNet) [nfnets](None) – we use the F6 model. We pretrain the vision encoder using a contrastive objective on our datasets of image and text pairs, using the two-term contrastive loss from [clip](None). We use the output of the final stage, a 2D spatial grid of features that is flattened to a 1D sequence. For video inputs, frames are sampled at 1 FPS and encoded independently to obtain a 3D spatio-temporal grid of features to which learned temporal embeddings are added. Features are then flattened to 1D before being fed to the Perceiver Resampler. More details on the contrastive model training and performance are given in Appendix [app:contrastive_details][app:contrastive_details] and Appendix [app:contrastive_ablation][app:contrastive_ablation], respectively. **Perceiver Resampler: from varying-size large feature maps to few visual tokens.** This module connects the vision encoder to the frozen language model as shown in Figure 2. It takes as input a variable number of image or video features from the vision encoder and produces a fixed number of visual outputs (64), reducing the computational complexity of the vision-text cross-attention. Similar to Perceiver [jaegle2021perceiver](None) and DETR [carion2020end](None), we learn a predefined number of latent input queries which are fed to a Transformer and cross-attend to the visual features. We show in our ablation studies (Section 3.3) that using such a vision-language resampler module outperforms a plain Transformer and an MLP. We provide an illustration, more architectural details, and pseudo-code in Appendix [app:transformer_resampler][app:transformer_resampler]. 
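A minimal sketch of the Perceiver Resampler idea described above: a fixed set of learned latent queries cross-attends to a variable-length sequence of visual features and returns a fixed number of visual tokens (64 in the paper). The layer count, dimensions, and simplified attention pattern below are illustrative assumptions, not the released architecture.

```python
import torch
import torch.nn as nn

class PerceiverResamplerSketch(nn.Module):
    """Map a variable number of visual features to a fixed set of visual tokens."""

    def __init__(self, dim=1024, num_latents=64, depth=6, heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)   # learned latent queries
        self.layers = nn.ModuleList([
            nn.ModuleDict({
                "attn": nn.MultiheadAttention(dim, heads, batch_first=True),
                "ff": nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)),
            })
            for _ in range(depth)
        ])

    def forward(self, visual_feats):
        # visual_feats: (B, N, dim) flattened image/video features, where N can vary
        x = self.latents.unsqueeze(0).expand(visual_feats.size(0), -1, -1)
        for layer in self.layers:
            attn_out, _ = layer["attn"](x, visual_feats, visual_feats)  # latents attend to the visual features
            x = x + attn_out
            x = x + layer["ff"](x)
        return x  # (B, 64, dim): fixed-size visual tokens for the cross-attention layers in the LM
```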
## Conditioning frozen language models on visual representations [sec:xattn_dense] Text generation is performed by a Transformer decoder, conditioned on the visual representations produced by the Perceiver Resampler. We interleave pretrained and frozen text-only LM blocks with blocks trained from scratch that cross-attend to the visual output from the Perceiver Resampler.
gated xattn-dense layers. To condition the LM on visual inputs, we insert new cross-attention layers between existing pretrained and frozen LM layers. The keys and values in these layers are obtained from the vision features while the queries are derived from the language inputs. They are followed by dense feed-forward layers. These layers are gated so that the LM is kept intact at initialization for improved stability and performance.
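A sketch of the tanh gating described in the caption above (and detailed in the following paragraph): the output of each newly inserted cross-attention block and of its feed-forward block is scaled by the tanh of a learnable scalar initialised to zero, so the frozen LM is reproduced exactly at initialisation. Dimensions and layer details are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GatedXAttnDenseSketch(nn.Module):
    """One gated cross-attention + feed-forward block inserted between frozen LM layers (sketch)."""

    def __init__(self, dim=2048, heads=16):
        super().__init__()
        self.xattn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # per-layer learnable gates, initialised to 0 so that tanh(gate) = 0 at the start of training
        self.attn_gate = nn.Parameter(torch.zeros(1))
        self.ff_gate = nn.Parameter(torch.zeros(1))

    def forward(self, lang_tokens, visual_tokens):
        # queries come from the language stream, keys/values from the resampled visual tokens
        attn_out, _ = self.xattn(lang_tokens, visual_tokens, visual_tokens)
        x = lang_tokens + torch.tanh(self.attn_gate) * attn_out
        x = x + torch.tanh(self.ff_gate) * self.ff(x)
        return x  # equals lang_tokens at initialisation, so the frozen LM behaviour is preserved
```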
**Interleaving new gated xattn-dense layers within a frozen pretrained LM.** We freeze the pretrained LM blocks, and insert *gated cross-attention dense* blocks (Figure 3) between the original layers, trained from scratch. To ensure that at initialization, the conditioned model yields the same results as the original language model, we use a $\tanh$-gating mechanism [hochreiter1997long](http://arxiv.org/pdf/2103.15232v1). This multiplies the output of a newly added layer by $\tanh(\alpha)$ before adding it to the input representation from the residual connection, where $\alpha$ is a layer-specific learnable scalar initialized to $0$ [bachlechner2021rezero](None). Thus, at initialization, the model output matches that of the pretrained LM, improving training stability and final performance. In our ablation studies (Section 3.3), we compare the proposed gated xattn-dense layers against recent alternatives [desai2021virtex](None), [luo2022vc](None) and explore the effect of how frequently these additional layers are inserted to trade off between efficiency and expressivity. See Appendix [app:xattn_dense][app:xattn_dense] for more details. **Varying model sizes.** We perform experiments across three model sizes, building on the 1.4B, 7B, and 70B parameter Chinchilla models [chinchilla](None), calling them Flamingo-3B, Flamingo-9B, and Flamingo-80B respectively. For brevity, we refer to the last as Flamingo throughout the paper. While increasing the parameter count of the frozen LM and the trainable vision-text gated xattn-dense modules, we maintain a fixed-size frozen vision encoder and trainable Perceiver Resampler across the different models (small relative to the full model size). See Appendix [sec:models_details][sec:models_details] for further details. ## Multi-visual input support: per-image/video attention masking [sec:multi_im_att] The image-causal modelling introduced in Equation [eq:modeling] is obtained by masking the full text-to-image cross-attention matrix, limiting which visual tokens the model sees at each text token. At a given text token, the model attends to the visual tokens of the image that appeared just before it in the interleaved sequence, rather than to all previous images (formalized and illustrated in Appendix [app:multi-visual-details][app:multi-visual-details]). Though the model only *directly* attends to a single image at a time, the dependency on all previous images remains via self-attention in the LM. This single-image cross-attention scheme importantly allows the model to seamlessly generalise to any number of visual inputs, regardless of how many are used during training. In particular, we use only up to 5 images per sequence when training on our interleaved datasets, yet our model is able to benefit from sequences of up to 32 pairs (or “shots”) of images/videos and corresponding texts during evaluation. We show in Section 3.3 that this scheme is more effective than allowing the model to cross-attend to all previous images directly. ## Training on a mixture of vision and language datasets [sec:datasets] We train the Flamingo models on a mixture of three kinds of datasets, all scraped from the web: an interleaved image and text dataset derived from webpages, image-text pairs, and video-text pairs. **M3W: Interleaved image and text dataset.** The few-shot capabilities of Flamingo models rely on training on interleaved text and image data. For this purpose, we collect the *MultiModal MassiveWeb* (M3W) dataset.
We extract both text and images from the HTML of approximately 43 million webpages, determining the positions of images relative to the text based on the relative positions of the text and image elements in the Document Object Model (DOM). An example is then constructed by inserting `<image>` tags in plain text at the locations of the images on the page, and inserting a special `<EOC>` (*end of chunk*) token (added to the vocabulary and learnt) prior to any image and at the end of the document. From each document, we sample a random subsequence of $L=256$ tokens and take up to the first $N=5$ images included in the sampled sequence. Further images are discarded in order to save compute. More details are provided in Appendix [app:datasets][app:datasets]. **Pairs of image/video and text.** For our image and text pairs we first leverage the ALIGN [align](None) dataset, composed of 1.8 billion images paired with alt-text. To complement this dataset, we collect our own dataset of image and text pairs targeting better quality and longer descriptions: LTIP (Long Text & Image Pairs), which consists of 312 million image and text pairs. We also collect a similar dataset but with videos instead of still images: VTP (Video & Text Pairs) consists of 27 million short videos (approximately 22 seconds on average) paired with sentence descriptions. We align the syntax of paired datasets with the syntax of M3W by prepending `<image>` and appending `<EOC>` to each training caption (see Appendix [app:vtp_and_itp][app:vtp_and_itp] for details). **Multi-objective training and optimisation strategy.** We train our models by minimizing a weighted sum of per-dataset expected negative log-likelihoods of text, given the visual inputs: $$\sum_{m=1}^{M} \lambda_m \cdot \mathbb{E}_{(x, y)\sim \mathcal{D}_m} \left[ -\sum_{\ell=1}^L \log p(y_\ell | y_{< \ell}, x_{\leq \ell})\right],$$ where $\mathcal{D}_m$ and $\lambda_m$ are the $m$-th dataset and its weighting, respectively. Tuning the per-dataset weights $\lambda_m$ is key to performance. We accumulate gradients over all datasets, which we found outperforms a “round-robin” approach [cho2021unifying](None). We provide further training details and ablations in Appendix [app:large_scale_training][app:large_scale_training]. ## Task adaptation with few-shot in-context learning [sec:adapt-vlm] Once Flamingo is trained, we use it to tackle a visual task by conditioning it on a multimodal interleaved prompt. We evaluate the ability of our models to rapidly adapt to new tasks using **in-context learning**, analogously to GPT-3 [gpt3](None), by interleaving support example pairs in the form of $(image, text)$ or $(video, text)$, followed by the query visual input, to build a prompt (details in Appendix [app:in_context_eval_details][app:in_context_eval_details]). We perform **open-ended** evaluations using beam search for decoding, and **close-ended** evaluations using our model’s log-likelihood to score each possible answer. We explore **zero-shot generalization** by prompting the model with two text-only examples from the task, with no corresponding images. Evaluation hyperparameters and additional details are given in Appendix [app:fewshot-eval-hyper][app:fewshot-eval-hyper]. # Experiments [sec:experiments] Our goal is to develop models that can rapidly adapt to diverse and challenging tasks. For this, we consider a wide array of 16 popular multimodal image/video and language benchmarks.
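To make the few-shot in-context protocol above concrete, here is a small sketch of how an interleaved prompt could be assembled from support examples plus a query, reusing the `<image>` / `<EOC>` conventions described for M3W. The helper, file names, and exact string format are illustrative assumptions, not the authors' evaluation code; the images themselves would be fed to the vision encoder separately, with the `<image>` tags only marking their positions in the text stream.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Shot:
    image_path: str   # support image (processed separately by the vision encoder)
    text: str         # e.g. "Question: ... Answer: ..." or a caption

def build_interleaved_prompt(shots: List[Shot], query_text: str) -> str:
    """Interleave (image, text) support pairs, then append the query image and its open-ended prefix."""
    parts = [f"<image>{shot.text}<EOC>" for shot in shots]
    parts.append(f"<image>{query_text}")
    return "".join(parts)

shots = [
    Shot("support1.jpg", "Question: What animal is shown? Answer: a dog."),
    Shot("support2.jpg", "Question: What is the person holding? Answer: a racket."),
]
print(build_interleaved_prompt(shots, "Question: What is the skier doing? Answer:"))
```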
In order to validate model design decisions during the course of the project, 5 of these benchmarks were used as part of our development (dev) set: COCO, OKVQA, VQAv2, MSVDQA and VATEX. Performance estimates on the dev benchmarks may be biased, as a result of model selection. We note that this is also the case for prior work which makes use of similar benchmarks to validate and ablate design decisions. To account for this, we report performance on an additional set of 11 benchmarks, spanning captioning, video question-answering, as well as some less commonly explored capabilities such as visual dialogue and multi-choice question-answering tasks. The evaluation benchmarks are described in Appendix [sec:eval_benchmarks][sec:eval_benchmarks]. We keep all evaluation hyperparameters fixed across all benchmarks. Depending on the task, we use four few-shot prompt templates we describe in more detail in Appendix [app:fewshot-eval-hyper][app:fewshot-eval-hyper]. We emphasize that *we do not validate any design decisions on these 11 benchmarks* and use them solely to estimate unbiased few-shot learning performance of our models. Concretely, estimating few-shot learning performance of a model involves prompting it with a set of *support* samples and evaluating it on a set of *query* samples. For the dev benchmarks that are used both to validate design decisions and hyperparameters, as well as to report final performance, we therefore use four subsets: *validation support*, *validation query*, *test support* and *test query*. For other benchmarks, we need only the latter two. We report in Appendix [sec:eval_benchmarks][sec:eval_benchmarks] how we form these subsets. We report the results of the Flamingo models on few-shot learning in Section 3.1. Section 3.2 gives Flamingo's fine-tuned results. An ablation study is given in Section 3.3. Appendix [app:more_performance][app:more_performance] provides more results, including Flamingo's performance on the ImageNet and Kinetics700 classification tasks and our contrastive model's performance. Appendix [app:qual_res][app:qual_res] includes additional qualitative results. ## Few-shot learning on vision-language tasks [sec:fewshot_openended] **Few-shot results.** Results are given in Table [tab:fewshot_all_tasks]. Flamingo outperforms *all* previous zero-shot or few-shot methods by a large margin on the 16 benchmarks considered. This is achieved with as few as four examples per task, demonstrating practical and efficient adaptation of vision models to new tasks. More importantly, Flamingo is often competitive with state-of-the-art methods additionally fine-tuned on up to hundreds of thousands of annotated examples. On six tasks, Flamingo even outperforms the fine-tuned SotA despite using a *single* set of model weights and only 32 task-specific examples. Finally, despite having only used the dev benchmarks for design decisions, our results generalize well to the other benchmarks, confirming the generality of our approach. **Scaling with respect to parameters and shots.** As shown in Figure 1, the larger the model, the better the few-shot performance, similar to GPT-3 [gpt3](None). The performance also improves with the number of shots. We further find that the largest model better exploits larger numbers of shots. Interestingly, even though our models were trained with sequences limited to only 5 images on M3W, they are still able to benefit from up to 32 images or videos during inference. This demonstrates the flexibility of the architecture for processing a variable number of videos or images.
## Fine-tuning as a pretrained vision-language model [sec:ft_results] While not the main focus of our work, we verify that when given more data, Flamingo models can be adapted to a task by fine-tuning their weights. In Table [tab:ft-sota-table-compressed], we explore fine-tuning our largest model, Flamingo, for a given task with no limit on the annotation budget. In short, we do so by fine-tuning the model on a short schedule with a small learning rate, additionally unfreezing the vision backbone to accommodate a higher input resolution (details in Appendix [app:finetuning][app:finetuning]). We find that we can improve results over our previously presented in-context few-shot learning results, setting a new state of the art on five additional tasks: VQAv2, VATEX, VizWiz, MSRVTTQA, and HatefulMemes. ## Ablation studies [sec:ablations] In Table [tab:ablation-table-no-classif], we report our ablation results using Flamingo-3B on the *validation* subsets of the five dev benchmarks with 4 shots. Note that we use smaller batch sizes and a shorter training schedule compared to the final models. The **Overall score** is obtained by dividing each benchmark score by its state-of-the-art (SotA) performance from Table [tab:fewshot_all_tasks] and averaging the results. More details and results are given in Appendix [app:all_ablation_studies][app:all_ablation_studies] and Table [tab:ablation-table-appendix]. **Importance of the training data mixture.** As shown in row **(i)**, getting the right training data plays a crucial role. In fact, removing the interleaved image-text dataset leads to a *decrease of more than $17\%$* in performance while removing the conventional paired image-text pairs also decreases performance (by $9.8\%$), demonstrating the need for different types of datasets. Moreover, removing our paired video-text dataset negatively affects performance on all video tasks. We ablate replacing our image-text pairs (ITP) by the publicly available LAION-400M dataset [schuhmann2021laion](None), which leads to a slight degradation in performance. We show in row **(ii)** the importance of our gradient accumulation strategy compared to using round-robin updates [cho2021unifying](None). **Visual conditioning of the frozen LM.** We ablate the use of the 0-initialized tanh gating when merging the cross-attention output to the frozen LM output in row **(iii)**. Without it, we see a drop of $4.2\%$ in our overall score. Moreover, we have noticed that disabling the 0-initialized tanh gating leads to training instabilities. Next, we ablate different conditioning architectures in row **(iv)**. Vanilla xattn refers to the vanilla cross-attention from the original Transformer decoder [vaswani2017attention](None). In the grafting approach from [luo2022vc](None), the frozen LM is used as is with no additional layers inserted, and a stack of interleaved self-attention and cross-attention layers that take the frozen LM output are learnt from scratch. Overall, we show that our gated xattn-dense conditioning approach works best. **Compute/Memory vs. performance trade-offs.** In row **(v)**, we ablate the frequency at which we add new gated xattn-dense blocks. Although adding them at every layer is better, it significantly increases the number of trainable parameters and time complexity of the model. Notably, inserting them every fourth block accelerates training by $66\%$ while only decreasing the overall score by $1.9\%$.
In light of this trade-off, we maximize the number of added layers under hardware constraints and add a gated xattn-dense every fourth layer for Flamingo-9B and every seventh for Flamingo-80B. We further compare in row **(vi)** the Perceiver Resampler to an MLP and a vanilla Transformer given a parameter budget. Both underperform the Perceiver Resampler while also being slower. **Vision encoder.** In row **(vii)**, we compare our NFNet-F6 vision encoder pretrained with contrastive learning (details in Appendix [app:contrastive_details][app:contrastive_details]) to the publicly available CLIP ViT-L/14 [clip](None) model trained at 224 resolution. Our NFNet-F6 has a $+5.8\%$ advantage over the CLIP ViT-L/14 and $+8.0\%$ over a smaller NFNet-F0 encoder, which highlights the importance of using a strong vision backbone. **Freezing LM components prevents catastrophic forgetting.** We verify the importance of freezing the LM layers at training in row **(viii)**. If trained from scratch, we observe a large performance decrease of $-12.9\%$. Interestingly, fine-tuning our pretrained LM also leads to a drop in performance of $-8.0\%$. This indicates an instance of “catastrophic forgetting” [mccloskey1989catastrophic](None), in which the model progressively forgets its pretraining while training on a new objective. In our setting, freezing the language model is a better alternative to training with the pre-training dataset (MassiveText) in the mixture. # Related work **Language modelling and few-shot adaptation.** Language modelling has recently made substantial progress following the introduction of Transformers [vaswani2017attention](None). The paradigm of first pretraining on a vast amount of data followed by an adaptation on a downstream task has become standard [mikolov2010recurrent](None), [graves2013generating](None), [jozefowicz2016exploring](None), [howard2018universal](None), [bert](None), [t5](None), [sutskever2011generating](None), [gpt3](None). In this work, we build on the 70B Chinchilla language model [chinchilla](None) as the base LM for Flamingo. Numerous works have explored techniques to adapt language models to novel tasks using a few examples. These include adding small adapter modules [houlsby2019parameter](None), fine-tuning a small part of the LM [zaken_bitfit_2022](None), showing in-context examples in the prompt [gpt3](None), or optimizing the prompt [li2021prefix](None), [lester2021power](None) through gradient descent. In this paper, we take inspiration from the in-context [gpt3](None) few-shot learning technique instead of more involved few-shot learning approaches based on metric learning [doersch2020crosstransformers](None), [vinyals2016matching](None), [snell2017prototypical](None), [tian2020rethinking](None) or meta-learning [finn2017model](None), [bertinetto2018meta](None), [zintgraf2019fast](None), [requeima2019fast](None), [gordon2018meta](None), [bertinetto2016learning](None). **When language meets vision.** These LM breakthroughs have been influential for vision-language modelling. In particular, BERT [bert](None) inspired a large body of vision-language work [lu2019vilbert](None), [su2019vl](None), [chen2020uniter](None), [hendricks2021decoupling](None), [wang2021vlmo](None), [li2020oscar](None), [tan2019lxmert](None), [zhu2020actbert](None), [wang2021ufo](None), [li2020hero](None), [gan2020large](None), [fu2021violet](None), [zellers2021merlot](None), [zellers2022merlot](None), [singh2021flava](None), [sun2019videobert](None).
We differ from these approaches as Flamingo models do not require fine-tuning on new tasks. Another family of vision-language models is based on contrastive learning [alayrac2020self](None), [clip](None), [align](None), [zhai2021lit](None), [pham2021combined](None), [miech2020end](None), [bain2021frozen](None), [yuan2021florence](None), [li2021align](None), [yao2021filip](None), [jain2021mural](None). Flamingo differs from contrastive models as it can generate text, although we build and rely upon them for our vision encoder. Similar to our work are VLMs able to generate text in an autoregressive manner [vinyals2015show](None), [donahue2015long](None), [luo2020univl](None), [hu2021scaling](None), [dai2022](None). Concurrent works [wang2021simvlm](None), [cho2021unifying](None), [wang2022unifying](None), [zhu2021uni](None), [li2022blip](None) also propose to formulate numerous vision tasks as text generation problems. Building on top of powerful pretrained language models has been explored in several recent works. One recent line of work [tsimpoukelli2021multimodal](None), [eichenberg2021magma](None), [mokady2021clipcap](None), [luo2022vc](None), [yang2021empirical](None), [zeng2022socraticmodels](None) proposes to freeze the pretrained LM weights to prevent catastrophic forgetting [mccloskey1989catastrophic](None). We follow this idea by freezing the Chinchilla LM layers [chinchilla](None) and adding learnable layers within the frozen LM. We differ from prior work by introducing the first LM that can ingest arbitrarily interleaved images, videos, and text. **Web-scale vision and language training datasets.** Manually annotated vision and language datasets are costly to obtain and thus relatively small (10k-100k) in scale [young2014image](None), [chen2015microsoft](None), [antol2015vqa](None), [marino2019ok](None), [wang2019vatex](None), [xiao2021next](None). To alleviate this lack of data, numerous works [align](None), [sharma2018conceptual](None), [changpinyo2021conceptual](None), [thomee2016yfcc100m](None) automatically scrape readily available paired vision-text data. In addition to such paired data, we show the importance of also training on entire multimodal webpages containing interleaved images and text as a single sequence. Concurrent work CM3 [aghajanyan2022cm3](None) proposes to generate HTML markup from pages, while we simplify the text prediction task by only generating plain text. We emphasize few-shot learning and vision tasks while CM3 [aghajanyan2022cm3](None) primarily evaluates on language-only benchmarks in a zero-shot or fine-tuned setup. # Discussion [sec:discussion] **Limitations.** First, our models build on pretrained LMs, and as a side effect, directly inherit their weaknesses. For example, LM priors are generally helpful, but may play a role in occasional hallucinations and ungrounded guesses. Furthermore, LMs generalise poorly to sequences longer than the training ones. They also suffer from poor sample efficiency during training. Addressing these issues can accelerate progress in the field and enhance the abilities of VLMs like Flamingo. Second, the classification performance of Flamingo lags behind that of state-of-the-art contrastive models [clip](None), [pham2021combined](None). These models directly optimize for text-image retrieval, of which classification is a special case. In contrast, our models handle a wider range of tasks, such as open-ended ones. A unified approach to achieve the best of both worlds is an important research direction.
Third, in-context learning has significant advantages over gradient-based few-shot learning methods, but also suffers from drawbacks depending on the characteristics of the application at hand. We demonstrate the effectiveness of in-context learning when access is limited to only a few dozen examples. In-context learning also enables simple deployment, requiring only inference, generally with no hyperparameter tuning needed. However, in-context learning is known to be highly sensitive to various aspects of the demonstrations [zhao2021calibrate](None), [truefewshot](None), and its inference compute cost and absolute performance scale poorly with the number of shots beyond this low-data regime. There may be opportunities to combine few-shot learning methods to leverage their complementary benefits. We discuss the limitations of our work in more depth in Appendix [sec:limitations][sec:limitations].

**Societal impacts.** In terms of societal impacts, Flamingo offers a number of benefits while carrying some risks. Its ability to rapidly adapt to a broad range of tasks has the potential to enable non-expert users to obtain good performance in data-starved regimes, lowering the barriers to both beneficial and malicious applications. Flamingo is exposed to the same risks as large language models, such as outputting offensive language, propagating social biases and stereotypes, as well as leaking private information [weidinger2021harms](None), [chinchilla](None). Its ability to additionally handle visual inputs poses specific risks such as gender and racial biases relating to the contents of the input images, similar to a number of visual recognition systems [hendricks2018women](None), [zhao2021understanding](None), [buolamwini2018gender](None), [de2019does](None), [schwemmer2020diagnosing](None). We refer the reader to Appendix [sec:broader_impact][sec:broader_impact] for a more extensive discussion of the societal impacts of our work, both positive and negative; as well as mitigation strategies and early investigations of risks relating to racial or gender bias and toxic outputs. Finally we note that, following prior work focusing on language models [thoppilan2022lamda](None), [perez2022red](None), [menick2022teaching](None), the few-shot capabilities of Flamingo could be useful for mitigating such risks.

**Conclusion.** We proposed Flamingo, a general-purpose family of models that can be applied to image and video tasks with minimal task-specific training data. We also qualitatively explored interactive abilities of Flamingo such as “chatting” with the model, demonstrating flexibility beyond traditional vision benchmarks. Our results suggest that connecting pre-trained large language models with powerful visual models is an important step towards general-purpose visual understanding.
Classic VLMs trained on document images

These classic Vision-Language Models are trained on natural images to perform tasks such as image-based question answering. However, to adapt them to text-rich images like documents (for which data is far less abundant than natural images), some works have fine-tuned these vision-language models on datasets containing text-rich images such as documents. Here are some examples of classic vision-language models fine-tuned on text-rich data:
mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model 2023-11-30 Anwen Hu, Yaya Shi, Haiyang Xu, Jiabo Ye, Qinghao Ye, Ming Yan, Chenliang Li, Qi Qian, Ji Zhang, Fei Huang
Recently, the strong text creation ability of Large Language Models(LLMs) has given rise to many tools for assisting paper reading or even writing. However, the weak diagram analysis abilities of LLMs or Multimodal LLMs greatly limit their application scenarios, especially for scientific academic paper writing. In this work, towards a more versatile copilot for academic paper writing, we mainly focus on strengthening the multi-modal diagram analysis ability of Multimodal LLMs. By parsing Latex source files of high-quality papers, we carefully build a multi-modal diagram understanding dataset M-Paper. By aligning diagrams in the paper with related paragraphs, we construct professional diagram analysis samples for training and evaluation. M-Paper is the first dataset to support joint comprehension of multiple scientific diagrams, including figures and tables in the format of images or Latex codes. Besides, to better align the copilot with the user's intention, we introduce the `outline' as the control signal, which could be directly given by the user or revised based on auto-generated ones. Comprehensive experiments with a state-of-the-art Multimodal LLM demonstrate that training on our dataset shows stronger scientific diagram understanding performance, including diagram captioning, diagram analysis, and outline recommendation. The dataset, code, and model are available at https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/PaperOwl.
Show Paper Content
[^1]: Equal Contribution. [^2]: Corresponding Author

# Introduction [sec:intro]

The strong text creation ability of Large Language Models (LLMs) [llama](http://arxiv.org/pdf/2402.08075v1), [gpt3](http://arxiv.org/pdf/2112.07522v2), [vicuna](https://github.com/lm-sys/FastChat), [alpaca](https://github.com/tatsu-lab/stanford_alpaca) has recently inspired the development of paper-writing copilots, such as jenni[^1]. However, existing LLMs or Multimodal LLMs are still not fully competent to assist academic paper writing due to their weak scientific diagram analysis abilities.
An illustration of a scientific diagram analysis copilot.
As shown in 1, to assist the user in writing academic analysis about scientific diagrams, the copilot should be equipped with three major abilities. **First and most basically**, the model should be able to understand multiple diagrams of various types (figures, tables, etc.) and in different formats (image or Latex). **Second**, the diagram analysis should remain consistent with the preceding texts, which requires the model to correlate multimodal context and diagram information. **Third**, to better align with the user’s intention, the copilot should be interactable with the user, which requires the model to be controllable.

Recently, many Multimodal Large Language Models (MLLMs) [Alayrac2022FlamingoAV](http://arxiv.org/pdf/2205.07065v1), [ye2023mplugowl2](https://arxiv.org/pdf/2311.04257), [qwenvl](None), [minigpt4](http://arxiv.org/pdf/2402.17510v1), [llava](http://arxiv.org/pdf/2402.11690v1), [llava1.5](http://arxiv.org/pdf/2310.19145v1), [instructblip](None), [cogvlm2023](http://arxiv.org/pdf/2210.00066v1) have been proposed, which connect a vision encoder with a Large Language Model as the language decoder. These MLLMs are good at chatting about a general image but poor at understanding diagrams. Some works [ureader](None), [docowl](None) have tried to develop MLLMs for Multimodal Document Understanding, covering tables, charts, webpages, etc. However, these models mainly focus on strengthening the vision comprehension of a single diagram and can’t generate detailed scientific analysis.

In this work, to develop scientific diagram analysis skills for the paper-writing copilot, we first build a comprehensive dataset M-Paper to support the learning of the three critical abilities mentioned above. By parsing Latex source files of high-quality academic papers, we carefully extract diagrams in both image and Latex formats and align them with their captions and paragraph analysis. To simulate two main scenarios of scientific diagrammatic understanding, we design two main tasks, namely ***Multimodal Diagram Captioning*** and ***Multimodal Diagram Analysis***, where multiple diagrams are the main objects to comprehend. In addition, we provide the preceding text, namely $[Context]$, as part of the input to teach the model how to utilize background knowledge and maintain fluency with previous content. Furthermore, to better align with users’ writing intentions, we design $[Outline]$ as control signals, which are comprised of concise key points to be covered in the analysis. We utilize ChatGPT to construct $[Outline]$ based on the ground-truth paragraph analysis and feed it as the input for *Multimodal Diagram Analysis*. For more user-friendly interaction, recommending $[Outline]$ could inspire users or reduce interaction costs. Thus, we set up another ***Outline Recommendation*** task to make the copilot more versatile and user-friendly. To accurately evaluate the diagram analysis quality, besides commonly used n-gram-based metrics (e.g. CIDEr [cider](http://arxiv.org/pdf/2106.15553v1)), we carefully design a $\rm{CIDEr}^{gpt}$ score to evaluate both n-gram matching and semantic similarity with the help of ChatGPT. We benchmark multiple state-of-the-art MLLMs on our dataset, validating the challenge of our three tasks. Based on DocOwl [docowl](None), we perform instruction-tuning on a combination of training data from the three tasks and propose a strong generalist as the baseline, named PaperOwl. Comprehensive experiments validate the effectiveness of introducing $[Context]$ and $[Outline]$ as inputs.
Besides, we perform sufficient ablation studies on vision encoding to provide insights into model improvement, such as increasing the image resolution and enhancing the ability to correlate multiple diagrams. In summary, our contributions are three-fold:

- We build the first high-quality scientific diagram analysis dataset M-Paper to support the learning of correlating multiple diagrams, keeping consistency with the preceding content, and being interactable with users.
- Simulating real paper-writing scenarios, we carefully design three multimodal tasks and propose a GPT-based metric, $\rm{CIDEr}^{gpt}$, to measure the paragraph analysis quality by considering both detailed n-gram and overall semantic similarity.
- We carefully tune a generalist based on an existing MLLM as the baseline and perform comprehensive experiments to validate the effectiveness of multimodal inputs and training strategies.

# Related Work [sec:rela]

**Text-only Paper Understanding** [S2ORC2020](None), [AnPaperSum2021](http://arxiv.org/pdf/2104.03057v1), [AbuPaperSum2011](http://arxiv.org/pdf/1907.01272v2), [SaierF19](None), [Ammar18](http://arxiv.org/pdf/2301.10140v1), [ShenMW18](http://arxiv.org/pdf/1805.12216v1) focuses on text and citation graph comprehension in academic papers. Such models are competent for a number of text-only paper comprehension tasks, including information extraction, text classification, paper summarization, or citation recommendation. Benefiting from the strong text understanding ability of Large Language Models (LLMs), many LLM-based tools have been developed as paper-reading assistants, such as ChatDoc[^1], ChatPDF[^2] and Zhiwen[^3]. However, they are still not capable of assisting paper writing due to a lack of multimodal abilities to understand vision information and generate helpful diagram analyses, which are indispensable in scientific papers.

**Multimodal Document Understanding** aims to develop multimodal comprehension abilities for images with rich text information, including charts [chartqa2022](None), [chart2text2022](None), [VisText2023](None), tables [wikitableqa](http://arxiv.org/pdf/2009.13845v2), [TabFact](http://arxiv.org/pdf/2311.06592v1), documents [docvqa](None), [mpmqa](None), [deepform](http://arxiv.org/pdf/2303.13839v1), [klc](None) and infographic images [infovqa](http://arxiv.org/pdf/2104.12756v2), etc. In particular, some works [SciGraphQA2023](None), [scicap2021](http://arxiv.org/pdf/2403.17784v1), [scicap+2023](None) focus on understanding scientific figures from papers. Task formats of these works range from Information Extraction [deepform](http://arxiv.org/pdf/2303.13839v1), [klc](None), Question Answering [docvqa](None), [chartqa2022](None), [infovqa](http://arxiv.org/pdf/2104.12756v2), Natural Language Inference [TabFact](http://arxiv.org/pdf/2311.06592v1) to Image Captioning [chart2text2022](None), [VisText2023](None), [scicap2021](http://arxiv.org/pdf/2403.17784v1), [scicap+2023](None). Recently, some works [docowl](None), [ureader](None), [llavar](http://arxiv.org/pdf/2306.17107v2), [qwenvl](None), [feng2023unidoc](http://arxiv.org/pdf/2308.11592v2), [wang2023tgdoc](http://arxiv.org/pdf/2311.13194v2) have proposed Multimodal Large Language Models with visually-situated text understanding ability. For example, UReader [ureader](None) performs instruction tuning on an ensembled dataset covering various types of images and designs a Shape-adaptive Cropping Module to process high-resolution document images.
However, these MLLMs are still far from acting as a paper-writing copilot for scientific diagram analysis due to two main shortcomings. First, they can only generate a short answer or description and lack comprehensive diagram analysis abilities. Second, they are all trained to understand a single image, and thus can’t correlate context and multiple figures or tables for accurate multimodal analysis. To empower MLLMs with such abilities, we carefully build a scientific diagram analysis dataset M-Paper based on high-quality academic papers. Fine-tuned on this dataset, our PaperOwl shows stronger multimodal diagram analysis abilities and moves a step closer to a paper-writing copilot.
# M-Paper [sec:data]

Towards a paper-writing copilot, this work aims to build M-Paper to help develop multimodal scientific diagram analysis abilities. The dataset construction and task definition are shown in [fig:data_process].

## Paper Collection

The arXiv[^1] is an open-access repository of electronic preprints and postprints, consisting of scientific papers in computer science, mathematics, physics, etc. Due to the field gap, diagrams, writing, and analysis styles differ considerably across these fields. In this work, we chose ‘Computer Science’ as the study object. Because not all papers are reviewed by peers before posting, paper quality on arXiv varies a lot, and low-quality papers may hurt the model’s logical analysis abilities. Considering that PapersWithCode[^2] is a community-driven platform for learning about state-of-the-art research papers on machine learning, we regard the quality of papers listed on PapersWithCode as reliable enough. Therefore, with the PapersWithCode API[^3], we collect 48k arXiv ids, ranging from 2012 to 2023 and covering 15 categories, and then download their corresponding Latex source files following official instructions[^4].

## Paper Parse

PDF and Latex are two commonly used paper formats in paper-related research. In this work, we choose to parse Latex source files for two main reasons. Firstly, by comparing the content in the ‘$\backslash$`ref{.}`’ tag and ‘$\backslash$`label{.}`’ tag in Latex files, it’s easy to accurately align diagrams with paragraph analysis in papers. Secondly, the Latex format is more natural and general for an LLM to understand or generate diverse texts, including plain text, mathematical expressions, etc. Taking into account these two points, Latex-style text understanding and generation is more suitable for a paper-writing copilot. Following S2ORC [S2ORC2020](None), we first parse Latex source files into XML format and then extract diagrams and correlate them with captions and paragraphs. More details on text cleaning can be found in the supplementary material.

Both figures and tables are widely used in scientific academic papers. By parsing the Latex source file, it’s easy to align figure references with figures in image format (e.g., ‘jpg’) via the ‘$\backslash$`includegraphics`’ tag. But for tables, only Latex code is available and no image-format files are provided. Towards wider application scenarios, a diagram analysis copilot needs to understand tables in both Latex and image formats. To support learning such abilities, we further collect table images as inputs. Directly extracting table bounding boxes from PDF-format papers with pdf-parsing tools (e.g., GROBID[^5]) and then cropping the table image is a naive approach. However, due to the diverse layouts in scientific papers, table coordinates given by such tools are not accurate enough. In this work, we collect accurate table images by following three steps. Firstly, we revise the Latex source file to ensure that each table will occupy a separate page after PDF compiling. This operation greatly reduces the difficulty of table recognition. Then, for each PDF page containing a table, we utilize the classical Canny edge detection algorithm [canny1986](http://arxiv.org/pdf/1711.01606v2) to recognize the table bounding box. Finally, the table image is cropped from the PDF page according to the table coordinates.
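The table-cropping step above (compile one table per page, then use Canny edges to find its bounding box) can be sketched roughly as follows with OpenCV; the helper function, thresholds and margin are illustrative assumptions, not the authors’ released pipeline.

```python
import cv2
import numpy as np

def crop_table(page_image_path: str, out_path: str, margin: int = 10) -> None:
    """Rough sketch: locate the single table on a rendered PDF page via Canny
    edges and crop it (illustrative thresholds, not the paper's exact code)."""
    page = cv2.imread(page_image_path)
    gray = cv2.cvtColor(page, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)   # table rules and text strokes become edges
    ys, xs = np.nonzero(edges)         # coordinates of all edge pixels
    if len(xs) == 0:
        raise ValueError("no edges found on this page")
    x0, x1 = xs.min(), xs.max()
    y0, y1 = ys.min(), ys.max()
    h, w = gray.shape
    crop = page[max(0, y0 - margin): min(h, y1 + margin),
                max(0, x0 - margin): min(w, x1 + margin)]
    cv2.imwrite(out_path, crop)
```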
It’s worth noting that, to also support the table captioning task and avoid leaking caption information in the cropped table image, the content within the ‘$\backslash$`caption{.}`’ tag is removed during the first step.

During paper writing, different co-authors can analyze an identical figure or table from different perspectives. Therefore, although a paper-writing copilot can give a comprehensive analysis of a diagram, its analysis may still go against the author’s wishes or be inconsistent with the preceding texts. To better cater to users’ intentions, we propose to use the ‘outline’ as an intermediate control signal during diagram analysis. Besides directly generating the paragraph analysis, the copilot should also be able to analyze the diagram more accurately following the provided key points, namely the ‘outline’. During paper writing, the outline could be given by users or generated by the copilot and then revised by users. To develop such a versatile and controllable copilot, it’s necessary to construct appropriate training data for outline generation and analysis generation with outlines. To construct these training samples, in this work, we utilize GPT-3.5[^6] to generate corresponding outlines for each paragraph by in-context learning. More details can be found in the supplementary material.

## Task Definition

After processing Latex source files as mentioned above, we carefully organize these data to support the training and test of multiple tasks designed for the paper-writing copilot, including *Multimodal Diagram Captioning*, *Multimodal Diagram Analysis*, and *Outline Recommendation*.

Different from conventional Image Captioning, which aims to describe the attributes and relations between objects, Diagram Captioning requires the model to accurately summarize the content of the figure or table, including concrete mathematical symbols and proper nouns. Besides, since some diagrams are combinations of sub-diagrams, it also requires the model to correlate multiple images. Further, a table during paper writing can be an image or Latex code, which requires the model to understand different input formats. By parsing the Latex source file, it’s easy to get diagram captions by extracting content from the ‘$\backslash$`caption{.}`’ tag. To generate captions more consistent with the paper content and better mention proper nouns, we also provide the preceding text as the textual input, denoted as $[Context]$. To keep the completeness of semantics, the preceding text is comprised of multiple un-truncated paragraphs before the first reference of the diagram, with a maximum of 512 tokens. Thus, the input of Multimodal Diagram Captioning is a triplet of $\langle[Context], [Diagrams], [Inst]\rangle$, where $[Diagrams]$ can be images of a diagram or the Latex code of a table, and $[Inst]$ is the instruction. Following classical image captioning tasks, we utilize BLEU [papineni2002bleu](http://arxiv.org/pdf/2202.11027v1), METEOR [banerjee2005meteor](None), ROUGE-L [lin2004rouge](http://arxiv.org/pdf/2209.06517v2), and CIDEr [vedantam2015cider](http://arxiv.org/pdf/2106.15553v1) as evaluation metrics. CIDEr is valued most because it puts higher weight on rarer tokens (e.g., proper nouns), which are more informative.

Much more difficult than writing a caption, Diagram Analysis requires the model to generate a paragraph analysis according to multiple diagrams, even a combination of figures and tables. Besides, diagram analysis is more open-ended than captioning.
Different people can analyze a diagram from quite different perspectives. As a paper-writing copilot, the diagram analysis should follow users’ intentions as well as possible; otherwise, it will not improve writing efficiency. Therefore, besides providing the preceding text as in the Multimodal Diagram Captioning task to imply the author’s intention, we further design the ‘outline’ as an explicit control signal, which specifies key points to discuss with the diagrams. Overall, the input of Multimodal Diagram Analysis is a quartet of $\langle[Context], [Outline], [Diagrams], [Inst]\rangle$.

Captioning metrics are not quite suitable for paragraph analysis because they mainly measure n-gram similarity and neglect overall semantic matching. To better evaluate the analysis quality, we design a metric to measure semantic similarity based on GPT-3.5, namely $F1^{gpt}$. Concretely, given the predicted analysis and the ground-truth one, we first prompt GPT to extract their key points in list format, respectively. Then, we prompt GPT to judge whether each pair of predicted and ground-truth key points matches or not. Finally, we calculate the semantic precision, recall, and F1 score ($F1^{gpt}$) based on GPT’s judgment. The detailed prompt design for these two steps can be found in the supplementary material. $F1^{gpt}$ is good at measuring semantic similarity but poor at assessing the quality of detailed descriptions, which is what CIDEr is good at. For paragraph analysis, accurately describing key points is more important, and we are more tolerant of the form of expression. Considering that $F1^{gpt}$ reflects the percentage of key points mentioned and CIDEr measures the n-gram similarity of the whole paragraph, we multiply CIDEr by $F1^{gpt}$ as the final evaluation metric $\rm{CIDEr}^{gpt}$, where $F1^{gpt}$ plays the critical role. As shown in 1, prediction A gets a lower CIDEr score because it mentions fewer n-grams from the ground truth. However, it describes the semantics more accurately and therefore gets a higher $\rm{CIDEr}^{gpt}$ score.
A case comparing CIDEr and $\rm{CIDEr}^{gpt}$.
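A minimal sketch of how this two-step metric could be computed, assuming the GPT extraction and matching calls are wrapped in hypothetical helpers (`extract_key_points`, `judge_match`) and that a standard CIDEr implementation provides the base score:

```python
from typing import Callable, List

def f1_gpt(pred: str, gt: str,
           extract_key_points: Callable[[str], List[str]],
           judge_match: Callable[[str, str], bool]) -> float:
    """F1 over GPT-judged key-point matches (sketch with hypothetical helpers)."""
    pred_pts = extract_key_points(pred)   # prompt GPT for key points of the prediction
    gt_pts = extract_key_points(gt)       # prompt GPT for key points of the ground truth
    matched_pred = sum(any(judge_match(p, g) for g in gt_pts) for p in pred_pts)
    matched_gt = sum(any(judge_match(p, g) for p in pred_pts) for g in gt_pts)
    precision = matched_pred / max(len(pred_pts), 1)
    recall = matched_gt / max(len(gt_pts), 1)
    return 2 * precision * recall / max(precision + recall, 1e-8)

def cider_gpt(cider_score: float, f1: float) -> float:
    """CIDEr^gpt = CIDEr * F1^gpt, so overall semantic matching dominates."""
    return cider_score * f1
```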
Towards a user-friendly paper-writing copilot, the ‘outline’ can be given directly by users or generated by the copilot and then revised by the user. So recommending outlines accurately is also an important ability for inspiring users or improving writing efficiency. In this work, to develop such an ability, we also design an Outline Recommendation task, where the input can be $\langle[Context], [Inst]\rangle$ or $\langle[Context], [Diagrams], [Inst]\rangle$ and the target is $[Outline]$. Captioning metrics are used to evaluate this task. Instructions for these three tasks can be found in the supplementary material.

## Statistic [sec:statistic]

**Paper Category.** M-Paper contains 48,688 papers from more than 15 categories, covering almost all popular research directions in ‘Deep Learning’, especially Computer Vision (CV) and Natural Language Processing (NLP). The detailed category distribution can be found in the supplementary material.

**Dataset Splits.** 1 shows the split statistics of *Multimodal Diagram Captioning*, *Multimodal Diagram Analysis* and *Outline Recommendation*. For each task, there is no paper overlap across the training, validation and test splits. Both *Multimodal Diagram Captioning* and *Multimodal Diagram Analysis* cover more than 40k papers and provide sufficient training samples. As for *Outline Recommendation*, considering that ‘outlines’ are just intermediate control signals used to interact with users, we don’t expect perfect quality of the generated outlines. Thus, only part of the papers are processed to support the training and test of this task.
| **Task** | | **Train** | **Val** | **Test** |
|:--------------|:------|---------:|-------:|--------:|
| Diagram Captioning | paper | 46,649 | 479 | 455 |
| Diagram Captioning | sample | 343,546 | 1,131 | 1,133 |
| Diagram Analysis | paper | 40,567 | 412 | 449 |
| Diagram Analysis | sample | 267,476 | 1,087 | 1,195 |
| Outline Recommendation | paper | 2,548 | 543 | 577 |
| Outline Recommendation | sample | 78,041 | 3,425 | 3,442 |

Statistics of M-Paper training, validation and test sets.
**Diagram.** As shown in 2, the distribution of diagram counts varies across different tasks. For *Multimodal Diagram Analysis*, more than 25% of samples have multiple diagrams as inputs, much more than for *Multimodal Diagram Captioning*. This indicates that correlating multiple diagrams is a major challenge for *Multimodal Diagram Analysis*. Besides, 3 shows the distribution of diagram types in the *Multimodal Diagram Analysis* task. Our dataset is not limited to a single diagram type but covers a mixture of figures and tables, in image form or as Latex code. In particular, to better evaluate analysis ability on different diagram types, we slightly balance the diagram type distribution in the test set.
The distribution (%) of diagram count across 3 tasks.
The distribution of diagram types on the training and test set of Multimodal Diagram Analysis.
**Token Length.** 2 presents the token length statistics of different textual components in our tasks. The average caption length is much smaller than that of the paragraph analysis, indicating that the *Multimodal Diagram Analysis* task requires both more comprehensive diagram understanding and more detailed description. Besides, the ‘outline’ is much shorter than the ‘analysis’, showing that the input ‘outline’ does not leak too much information about the target analysis but just points out some key points to discuss.
| | **Context** | **Outline** | **Table Latex** | **Caption** | **Analysis** |
|:----:|-----------:|-----------:|---------------:|-----------:|------------:|
| Mean | 410 | 36 | 177 | 58 | 135 |
| Max | 512 | 126 | 256 | 256 | 256 |

Token length statistic of different textual components.
# mPLUG-PaperOwl [sec:model]
The overall architecture of PaperOwl.
Existing Multimodal Large Language Models (MLLMs) [mplugowl](http://arxiv.org/pdf/2405.00390v2), [minigpt4](http://arxiv.org/pdf/2402.17510v1), [llava](http://arxiv.org/pdf/2402.11690v1), [qwenvl](None) follow a three-module framework, consisting of a vision encoder, a vision-to-text connector, and a Large Language Model as the language decoder. Models with such a framework are easy to adapt to our multimodal tasks by constructing image-text interleaved sequences. In this work, we choose one of the state-of-the-art MLLMs, mPLUG-DocOwl [docowl](None), as the base model and perform instruction-tuning on our M-Paper dataset.

## Model Architecture

The overall architecture of PaperOwl is shown in 1. Following UReader [ureader](None), to better recognize texts in the image, we utilize a parameter-free Cropping Module to cut a 448x448 image into 4 sub-images of 224x224 resolution and then feed each sub-image to the following Vision Encoder independently. The ViT-L/14 [vit2021](http://arxiv.org/pdf/2105.15075v2) is utilized as the Vision Encoder, comprised of 24 transformer layers with 16 attention heads and the dimension of hidden states set to 1024. Each image $I$ in $[Diagrams]$ is represented as a sequence of visual features $V=\{v_{1}, ...,v_{n}\}$ after the Vision Encoder. The Vision Abstractor is used to align visual features with the language decoder and to aggregate or filter vision semantics. It consists of 6 transformer layers with 8 attention heads, and the dimension of hidden states is set to 1024. With 64 learnable tokens $Q=\{q_1,...,q_k\}$ as the query and the concatenated sequence $[V:Q]$ as the key and value, the visual features are finally condensed to $\hat{V}=\{\hat{v}_{1}, ...,\hat{v}_{k}\}$ after cross-attention. The architecture of the Language Decoder is the same as LLaMA-7B [llama](http://arxiv.org/pdf/2402.08075v1). To adapt to vision-and-language tasks and alleviate catastrophic forgetting, LoRA [hu2022lora](https://openreview.net/forum?id=nZeVKeeFYf9) is utilized in the LLM with the rank set to 8.

## Model Training

To develop a versatile paper-writing copilot for scientific diagram understanding, we perform instruction-tuning to enhance an existing MLLM to be a generalist capable of Multimodal Diagram Captioning, Multimodal Diagram Analysis, and Outline Recommendation. Therefore, the training data is a combination of the three tasks. Besides, for *Multimodal Diagram Analysis*, to avoid the model relying too heavily on the ‘outline’ to guess the paragraph analysis, samples with outlines removed from the inputs are also added to the training data to strengthen the vision understanding ability. Finally, the total number of instruction-tuning samples is 702,247. Following most MLLMs [mplugowl](http://arxiv.org/pdf/2405.00390v2), [minigpt4](http://arxiv.org/pdf/2402.17510v1), [llava](http://arxiv.org/pdf/2402.11690v1), the Vision Encoder in PaperOwl is frozen during instruction-tuning to avoid hurting the strong vision representation ability learned during large-scale pretraining. The Vision Abstractor is fine-tuned to better learn how to filter useful visual diagram information for generating analysis. The raw parameters of LLaMA-7B are frozen, and only the LoRA in the Language Decoder is updated to learn the analysis logic of academic papers. Our model is trained for 10 epochs with the learning rate set to $1e-4$ and the batch size to 256, costing 64 A100 days.
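As a rough illustration of the vision side described above, here is a minimal PyTorch sketch of the parameter-free cropping and a single cross-attention step of the query-based Vision Abstractor; it uses one attention layer and illustrative shapes of our own, whereas the actual module stacks 6 transformer layers.

```python
import torch
import torch.nn as nn

def crop_to_subimages(image: torch.Tensor, grid: int = 2) -> torch.Tensor:
    """Parameter-free cropping: split a (3, 448, 448) image into grid*grid
    sub-images of (3, 224, 224), each encoded by the ViT independently (sketch)."""
    c, h, w = image.shape
    sub_h, sub_w = h // grid, w // grid
    subs = [image[:, i * sub_h:(i + 1) * sub_h, j * sub_w:(j + 1) * sub_w]
            for i in range(grid) for j in range(grid)]
    return torch.stack(subs)  # (grid*grid, 3, 224, 224)

class VisionAbstractor(nn.Module):
    """Condenses ViT features into 64 tokens via learnable queries (illustrative
    single-layer version of the 6-layer module described in the text)."""

    def __init__(self, dim: int = 1024, num_queries: int = 64, n_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.xattn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, vit_features: torch.Tensor) -> torch.Tensor:
        # vit_features: (batch, n_patches, dim); keys/values = [V : Q]
        b = vit_features.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        kv = torch.cat([vit_features, q], dim=1)
        condensed, _ = self.xattn(q, kv, kv)
        return condensed  # (batch, 64, dim), passed on to the language decoder
```

# Experiments [sec:exper]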
| **Model** | **Img** | **Text** | **Doc** | **Cap. B4** | **Cap. R** | **Cap. M** | **Cap. C** | **Out. B4** | **Out. R** | **Out. M** | **Out. C** | **Ana. B4** | **Ana. R** | **Ana. M** | **Ana. C** | **Ana. $F1^{gpt}$** | **Ana. $\rm{C}^{gpt}$** |
|:---|:---:|:---:|:---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| mPLUG-Owl [mplugowl](http://arxiv.org/pdf/2405.00390v2) | 224 | | $\times$ | 0.36 | 8.60 | 5.30 | 0.74 | 0.62 | 9.12 | 8.55 | 0.32 | 2.48 | 15.12 | 14.67 | 0.53 | 0.21 | 0.15 |
| mPLUG-Owl2 [ye2023mplugowl2](https://arxiv.org/pdf/2311.04257) | 448 | | $\times$ | 1.62 | 10.33 | 5.30 | 5.63 | 1.30 | 11.99 | **10.48** | 2.71 | 6.92 | 19.65 | 14.96 | 11.85 | 0.25 | 3.89 |
| LLaVA 1.5 [llava1.5](http://arxiv.org/pdf/2310.19145v1) | 336 | | $\times$ | 0.97 | 10.71 | 6.78 | 2.74 | 1.32 | 11.79 | 10.46 | 0.79 | 6.11 | 18.83 | 12.43 | 13.70 | 0.20 | 4.64 |
| Qwen-VL [qwenvl](None) | 448 | | | 1.84 | 7.64 | 6.61 | 2.31 | 1.32 | 7.29 | 8.52 | 0.53 | 6.72 | 10.26 | 10.74 | 3.68 | 0.27 | 1.39 |
| UReader [ureader](None) | 448 | $\times$ | | 0.56 | 9.84 | 3.34 | 5.95 | 0.25 | 8.17 | 2.88 | 4.59 | 1.22 | 10.59 | 4.33 | 1.02 | 0.05 | 0.05 |
| DocOwl [docowl](None) | 448 | | | 0.87 | 10.40 | 3.64 | 8.08 | 0.45 | 9.20 | 5.98 | 2.51 | 1.90 | 14.33 | 10.28 | 4.78 | 0.19 | 1.23 |
| PaperOwl (ours) | | | | **2.37** | **18.31** | **7.19** | **25.50** | **2.16** | **17.96** | 7.33 | **30.65** | **14.89** | **30.03** | **17.56** | **22.38** | **0.39** | **11.50** |

Zero-shot comparison with state-of-the-art MLLMs on the three tasks: Multimodal Diagram Captioning (Cap.), Outline Recommendation (Out.) and Multimodal Diagram Analysis (Ana.). ‘Img’ denotes the input image resolution.
## Comparison with SOTA MLLMs.

We first compare the zero-shot performance of existing MLLMs on our three tasks. As shown in [tab:sota_mllm], mPLUG-Owl [mplugowl](http://arxiv.org/pdf/2405.00390v2) achieves the worst performance, showing the importance of high resolution for our tasks. After increasing the image resolution, mPLUG-Owl2 [ye2023mplugowl2](https://arxiv.org/pdf/2311.04257) and LLaVA 1.5 [llava1.5](http://arxiv.org/pdf/2310.19145v1) outperform the other 3 models trained with multimodal document understanding samples on the *Multimodal Diagram Analysis* task. Besides, UReader [ureader](None), a model fine-tuned only on document benchmarks, achieves the worst analysis performance. This validates that existing multimodal document understanding data is far from sufficient to unlock the comprehensive diagram analysis ability of MLLMs and may cause overfitting on question answering or information extraction benchmarks. However, Owl2, LLaVA 1.5 and Qwen-VL all optimize the whole LLM during instruction-tuning, while UReader and DocOwl only tune the LoRA. Considering both the performance and training costs, we finally chose DocOwl as our base model. After fine-tuning on a combination of the three tasks, PaperOwl achieves much better performance across all three tasks.

## Ablation Study

To comprehensively analyze the critical elements for developing a scientific diagram analysis copilot, we perform extensive comparison experiments to validate the effectiveness of $[Context]$ and $[Outline]$, and present the influence of vision encoding strategies.

For the *Multimodal Diagram Captioning* and *Multimodal Diagram Analysis* tasks, we provide $[Context]$ as an auxiliary input to implicitly represent the user’s next writing intention and provide some background information on proper nouns. We first utilize Owl [mplugowl](http://arxiv.org/pdf/2405.00390v2) as the base model to study whether to use $[Context]$ during training and testing. All models are trained only on the captioning and analysis tasks, with $[Outline]$ removed from the inputs. As shown in [tab:context_abla], for the model trained without $[Context]$, providing $[Context]$ during inference improves the captioning performance (r2 vs r1), showing that $[Context]$ is critical for Diagram Captioning. However, adding $[Context]$ only at test time hurts the analysis performance, indicating that the model struggles to balance the comprehension of preceding texts and multiple diagrams for paragraph analysis generation. After adding $[Context]$ in training, the model achieves better performance on both tasks (r3 vs r2), validating that for better scientific diagram comprehension, it’s necessary to incorporate $[Context]$ during both training and inference.
| | **$[Context]$ Train** | **$[Context]$ Test** | **Cap. R** | **Cap. M** | **Cap. C** | **Ana. R** | **Ana. M** | **Ana. C** | **Ana. $\rm{C}^{gpt}$** |
|:---:|:---:|:---:|---:|---:|---:|---:|---:|---:|---:|
| r1 | $\times$ | $\times$ | 15.43 | 5.45 | 14.67 | 16.56 | 8.71 | 4.45 | 1.47 |
| r2 | $\times$ | ✓ | 16.62 | **6.82** | 17.72 | 14.44 | 7.66 | 2.87 | 0.94 |
| r3 | ✓ | ✓ | **17.08** | 6.76 | **21.36** | **19.25** | **10.97** | **7.02** | **1.81** |

Ablation of using $[Context]$ during training and testing, on Multimodal Diagram Captioning (Cap.) and Multimodal Diagram Analysis (Ana.).
To better align the diagram analysis from a paper-writing copilot with users’ intentions, we propose to introduce $[Outline]$ as an explicit control signal. To validate the effectiveness of $[Outline]$, we further compare variants of Owl on whether to utilize $[Outline]$ during training and testing. As presented in [tab:outline_abla], whether models are trained with $[Outline]$ as inputs or not, adding $[Outline]$ during inference improves the performance (r2 vs r1, r5 vs r3), showing that the ‘outline’ is an effective control signal for guiding diagram analysis. Besides, even when pseudo $[Outline]$s generated by the model itself are added as inputs, the analysis quality is also improved (r4 vs r3). This indicates that ‘recommending $[Outline]$ first and then generating diagram analysis’ may be a better two-step framework, where the user could also control the copilot by slightly revising the recommended $[Outline]$. Finally, training with $[Outline]$ brings a significant improvement (r5 vs r2), validating that it’s essential to teach the model how to correlate the multimodal $[Context]$, $[Outline]$ and $[Diagrams]$ for scientific diagram analysis.
| | **$[Outline]$ Train** | **$[Outline]$ Test** | **B4** | **R** | **M** | **C** | **$F1^{gpt}$** | **$\rm{C}^{gpt}$** |
|:---:|:---:|:---:|---:|---:|---:|---:|---:|---:|
| r1 | $\times$ | $\times$ | 6.28 | 19.25 | 10.97 | 7.02 | 0.18 | 1.81 |
| r2 | $\times$ | gpt | 7.23 | 19.86 | 11.24 | 8.99 | 0.22 | 3.10 |
| r3 | gpt | $\times$ | 6.42 | 19.47 | 11.15 | 7.90 | 0.17 | 2.13 |
| r4 | gpt | auto | 5.98 | 19.58 | 11.23 | 9.10 | 0.19 | 2.59 |
| r5 | gpt | gpt | **15.27** | **30.36** | **17.49** | **21.85** | **0.41** | **11.23** |

Ablation of utilizing $[Outline]$ during training and testing on Multimodal Diagram Analysis (‘gpt’ denotes GPT-constructed outlines, ‘auto’ denotes outlines generated by the model itself).
For vision-and-language tasks, the visual features play a major role in the final performance. In this section, we compare the influence of different vision-representation strategies, including the image resolution, whether to fine-tune the Vision Abstractor, and whether to crop the image. As shown in [tab:vision_abla], during instruction-tuning, freezing the Vision Abstractor greatly hurts the diagram analysis performance (r1 vs r2), validating that fine-tuning the Vision Abstractor is important for adapting an existing MLLM to professional diagram understanding. Besides, when the Vision Encoder is frozen, directly increasing the image resolution and expanding the patch position embeddings by bicubic interpolation doesn’t bring significant improvement (r3 vs r2), showing that only fine-tuning the Vision Abstractor is not enough to adapt to higher-resolution images. When equipped with a parameter-free Cropping Module, as in UReader [ureader](None), to cut the 448x448 image into 4 sub-images of 224x224 resolution, the model achieves significantly better performance on the diagram captioning task (r4 vs r2), showing that when the Vision Encoder is frozen, cropping images is a better solution for leveraging higher-resolution images. However, compared with the diagram captioning task, the cropping module brings only a weak improvement to the analysis task. This is mainly because the cropping module results in many more visual tokens (up to 1024 tokens from 16 sub-images) and therefore greatly increases the difficulty of multimodal understanding for the language decoder. This shows that how to better encode high-resolution images and balance multimodal inputs is a major challenge for the *Multimodal Diagram Analysis* task.
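The ‘expanding patch position embeddings by bicubic interpolation’ step used for r3 is a standard resolution-adaptation trick; here is a minimal sketch of it (our own illustration, using a ViT-L/14-style 16x16 patch grid at 224px grown to 32x32 at 448px as an example):

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed: torch.Tensor, old_grid: int, new_grid: int) -> torch.Tensor:
    """Bicubic interpolation of ViT patch position embeddings (sketch).

    pos_embed: (1, 1 + old_grid**2, dim) with a leading [CLS] embedding,
    e.g. old_grid=16 for 224px/14px patches, new_grid=32 for 448px inputs.
    """
    cls_embed, patch_embed = pos_embed[:, :1], pos_embed[:, 1:]
    dim = patch_embed.shape[-1]
    # (1, N, dim) -> (1, dim, old_grid, old_grid) for 2-D interpolation
    patch_embed = patch_embed.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patch_embed = F.interpolate(patch_embed, size=(new_grid, new_grid),
                                mode="bicubic", align_corners=False)
    patch_embed = patch_embed.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([cls_embed, patch_embed], dim=1)
```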
| | **Img** | **VA** | **Crop** | **Cap. M** | **Cap. C** | **Ana. M** | **Ana. C** | **Ana. $F1^{gpt}$** | **Ana. $\rm{C}^{gpt}$** |
|:---:|:---:|:---:|:---:|---:|---:|---:|---:|---:|---:|
| r1 | 224 | $\times$ | $\times$ | 5.94 | 23.73 | 16.70 | 18.73 | 0.29 | 8.78 |
| r2 | 224 | ✓ | $\times$ | 6.89 | 22.18 | 17.49 | 21.85 | **0.41** | 11.23 |
| r3 | 448 | ✓ | $\times$ | 6.83 | 21.86 | 17.45 | **22.94** | 0.40 | 11.46 |
| r4 | 448 | ✓ | ✓ | **7.19** | **25.50** | **17.56** | 22.38 | 0.39 | **11.50** |

Ablation of vision encoding strategies: image resolution (Img), fine-tuning the Vision Abstractor (VA) and cropping the image (Crop), evaluated on Multimodal Diagram Captioning (Cap.) and Multimodal Diagram Analysis (Ana.).
## Qualitative Results

[fig:case] presents a qualitative result of *Multimodal Diagram Analysis*. With the preceding texts as input and a simple $[Outline]$ as the control signal, PaperOwl generates a paragraph analysis following the $[Outline]$ and describes more details about the diagrams. However, PaperOwl still makes some mistakes about the concrete numbers in the figure, showing the challenge of accurately understanding details among multiple scientific diagrams. More qualitative results of *Multimodal Diagram Captioning* and a comparison of using $[Outline]$ or not can be found in the supplementary material.

# Conclusion

To enhance the scientific diagram analysis ability of Multimodal LLMs, we carefully build a multimodal dataset M-Paper based on high-quality Latex files of papers by aligning diagrams with captions and paragraph analysis. Simulating real scenarios of paper writing, we design Multimodal Diagram Captioning, Multimodal Diagram Analysis, and Outline Recommendation tasks. To better evaluate the analysis quality, we propose a GPT-based metric to measure both detailed n-gram matching and overall semantic similarity. We benchmark multiple state-of-the-art MLLMs and propose a strong baseline, PaperOwl, by performing instruction tuning on ensembled training data. Comprehensive experiments validate the effectiveness of the preceding text and outline as inputs. Finally, our ablation study provides insights into model improvement, such as increasing image resolution to see more details and how to balance the multimodal information of context, outline and diagrams.

## Text Cleaning [sup:text_clean]

Towards a paper-writing copilot, this work focuses on improving the model’s multimodal diagram analysis abilities and pays little attention to other writing abilities, such as equation generation or citation recommendation. Both formulas and paper references are virtually impossible to infer from diagrams or preceding texts. Therefore, we further clean paragraph texts by removing such unnecessary information. Concretely, we first replace all citation tags ‘$\backslash$`cite{.}`’ with a special token ‘``’ to remove citation references. Besides, to avoid generating overly long equations, paragraphs containing long equations ($>40$ chars) within ‘`$.$`’ tags are dropped.

## Outline Construction

Taking into account that the ‘outline’ given by users could be multiple content-related key points or a highly concise summary, such as ‘the overall architecture of our model’, we construct two types of outlines by designing different prompts and in-context demonstrations for GPT-3.5, as shown in [tab:simple_summary_prompt] and [tab:detailed_summary_prompt].
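Returning to the text-cleaning step above, here is a rough sketch of the two rules; the regexes and the placeholder token are our own illustration, since the paper does not specify the exact token or release the preprocessing script here.

```python
import re
from typing import Optional

CITE_PATTERN = re.compile(r"\\cite[tp]?\*?(?:\[[^\]]*\])*\{[^}]*\}")
INLINE_MATH_PATTERN = re.compile(r"\$([^$]+)\$")

def clean_paragraph(paragraph: str, cite_token: str = "<cit>",
                    max_eq_chars: int = 40) -> Optional[str]:
    """Replace citation tags with a placeholder token (assumed name) and drop
    paragraphs containing an inline equation longer than `max_eq_chars` chars."""
    cleaned = CITE_PATTERN.sub(cite_token, paragraph)
    for match in INLINE_MATH_PATTERN.finditer(cleaned):
        if len(match.group(1)) > max_eq_chars:
            return None  # drop the whole paragraph
    return cleaned
```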
The category distribution of 48,688 academic papers.
## Statistic

The detailed category distribution of papers in M-Paper is shown in 1.

## Task Instruction

As shown in [tab:instruct_templates], for each task we apply various instructions to enhance the model’s instruction-following ability.

# GPT-based Metric

To evaluate the overall semantic similarity between a predicted diagram analysis and the ground-truth one, we design a GPT-based metric, namely $F1^{gpt}$. We first prompt GPT to extract the key points of the prediction and the ground truth. Then, for each pair of predicted and ground-truth key points, we further prompt GPT to judge whether it matches or not. Finally, based on GPT’s judgments, we calculate the precision, recall, and F1 score ($F1^{gpt}$). The prompts used in these two steps are shown in [tab:gpt_metric]. In particular, during the key-point extraction process, we prompt GPT to simultaneously process both the prediction and the ground truth to better capture their similarities and differences.

# Experiments

## Influence of Table Format

To develop a copilot capable of analyzing different formats of diagrams during paper writing, M-Paper evaluates table understanding in both image and Latex formats. As shown in [tab:table_abla], for writing a caption that summarizes the table content, understanding Latex is much easier than understanding the image because all data is well organized in text. However, the Latex format doesn’t bring significant improvement for *Multimodal Diagram Analysis* and even causes a decrease in the CIDEr score. This is because, when provided with the Latex code of a table, the model tends to describe more rare proper nouns or numbers from the table, which may not be necessary for the discussion and do not appear in the ground-truth analysis. This shows that the main challenge of generating diagram analysis lies in correlating $[Context]$, $[Outline]$, and $[Diagrams]$, rather than merely understanding the diagram content.
| **Table Format** | **Cap. R** | **Cap. M** | **Cap. C** | **Ana. B4** | **Ana. R** | **Ana. M** | **Ana. C** | **Ana. $F1^{gpt}$** | **Ana. $\rm{C}^{gpt}$** |
|:---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| Image | 22.51 | 9.60 | 51.77 | 12.25 | 30.52 | 17.83 | 25.64 | 0.45 | 14.50 |
| Latex | 26.69 | 10.54 | 80.03 | 12.03 | 30.38 | 18.11 | 21.56 | 0.47 | 10.51 |

Influence of the table format (image vs. Latex code) on table captioning (Cap.) and diagram analysis (Ana.).
## More Qualitative Results [fig:case_table_cap] and [fig:case_figure_cap] show more qualitative results of Multimodal Diagram Captioning, including multiple sub-figures and tables in the image or latex code format. Besides, as shown in [fig:case_figure_analysis], without the $[Outline]$, PaperOwl could generate analysis related to diagrams but different from the author’s intention. With a concise $[Outline]$, it generates a much better analysis with minor errors, showing the necessity of utilizing $[Outline]$ as the control signal. As mentioned in [sec:statistic], during outline construction, the average length of the $[Outline]$ is around 36.43% of the target diagram analysis to avoid leaking too much information about diagrams. Although we perform such data preprocess to enforce the model learning to describe more diagram details during training,  sometimes still makes little revisions based on the outline and doesn’t provide informative analysis about the diagram, as presented in [fig:case_analysis_bad]. Therefore, it is also a challenge of Multimodal Diagram Analysis to encourage the model to follow the outline while giving more details by understanding diagrams.
| | |:---| | Please provide the main point of the following paragraph which is from a scientific paper. The main point is the central issue in the paragraph and the format like some items in the outline, and it should be as concise and brief as possible!!!! | | | | Due to the paragraph being from a scientific paper, it can be like: the background of some tasks, or the challenge of previous methods, our methods involve A and B modules, etc for the paragraph from the Introduction section; or experiments results on some datasets for the paragraph from Experiments section, or the pipeline of feature extractor, or the detailed design of some network for the paragraph from Method section. | | | | Please provide a highly abstract writing purpose for this paragraph like an outline, rather than simply summarizing the content of the paragraph. | | | | And please generate the main point with less than 20 words! less than 20 words! less than 20 words!!! | | | | There are some examples of "Paragraph" and "Main Points" pairs. The examples are split by "##############################": | | | | \############################## | | Paragraph: | | \noindent \textbf{Low Reference Dependency} The Kendall and Spearman correlations between automatic metrics and human judgments with the different numbers of references are shown in Fig.\ref{fig:changing_reference_number}. Our EMScore without any references can achieve competitive results, compared with reference-based metrics which need at least 4 or 5 references, such as BLEU_1 and Improved_BERTScore. Besides, our EMScore_ref with only one reference can achieve comparable results with reference-based metrics, which need at least 8 or 9 references, such as CIDEr and BERTScore. The results show that our metric has lower reference dependency, which benefits from the introduction of video content in evaluation. | | | | Main Points: | | Our metric has a lower reference dependency. | | \############################## | | Paragraph: | | Fig.\ref{fig:fine_grained_matching} visualizes how fine-grained EMScore matches the most similar visual elements to the tokens (as the calculation of precision). For the first example, “bubbles” occurs in the 106th frame, “another boy” occurs in the 160th and 187th frames, and compared with other frames, “face paint” appears in a larger proportion in the 4th and 6th frames. For the second example, the visual concept “boy” appears as the main visual element in the 53rd frame, so the token ’boy’ matches this frame instead of 84th\$\sim\$298th frames where multiple visual elements appear. Compared with coarse-grained embedding matching, our fine-grained one can take into account the characteristics of the video, and provide more interpretability for EMScore. | | | | Main Points: | | The visualization results of fine-grained EMScore. | | \############################## | | | | Paragraph: $[Paragraph]$ | | Main Points: $[Main~Points]$ |
Please use one or several concise sentences to summarize the main points of the following paragraph which is from a scientific paper.
And please note that:
(1) Each sentence should strive to express one main point as succinctly as possible.
(2) Please summarize the most critical points, preferably no more than 3. And one main point is enough for some short paragraphs!!!
(3) If there are multiple main points, use “1. 2. 3." to list them and use “\n" to split them.
There are some wrong formats with prefix like this: “The article introduces xxx".
“The authors conduct experiments xxx".
“They introduce xx".
“xxx proposed by the author".
Please directly generate the key points of the paragraph, and don’t use the prefix like above.
There are some examples of "Paragraph" and "Main Points" pairs. The examples are split by "##############################":
##############################
Paragraph:
Video Captioning\cite{DBLP:journals/tcsv/DengLZWZH22} aims to generate a text describing the visual content of a given video. Driven by the neural encoder-decoder paradigm, research in video captioning has made significant progress \cite{DBLP:conf/iccv/VenugopalanRDMD15, DBLP:conf/cvpr/ZhangSY0WHZ20}. To make further advances in video captioning, it is essential to accurately evaluate generated captions. The most ideal metric is human evaluation while carrying human judgments is time-consuming and labor-intensive. Thus, various automatic metrics are applied for video caption evaluation.
Main Points:
Accurately evaluating the generated descriptions is necessary, and due to the time-consuming and labor-intensive nature of human judgments, automatic evaluation metrics are widely used.
##############################
Paragraph:
However, most of the widely applied video caption metrics like BLEU\cite{DBLP:conf/acl/PapineniRWZ02}, ROUGE\cite{lin-2004-rouge}, CIDEr\cite{7299087}, and BERTScore\cite{DBLP:conf/iclr/ZhangKWWA20} come from the other tasks, such as machine translation, text summarization and image captioning, which may neglect the special characteristic of video captioning and then limit the development of video captioning. Furthermore, these automatic metrics require human-labeled references — and thus they are called reference-based metrics — and such requirements cause three intrinsic drawbacks: (1) They can not be used when provided videos have no human-labeled references, which is not uncommon in this age that millions of reference-free videos are produced online every day. (2) They may over-penalize the correct captions since references hardly describe all details of videos due to the one-to-many nature\cite{DBLP:conf/acl/YiDH20} of captioning task, especially when the number of references is limited. Fig.\ref{fig:introductionexample} (a) shows one such example where a candidate caption correctly describes the “a rock” while reference-based metrics punish this word since references do not contain it. (3) As pointed by \cite{rohrbach-etal-2018-object}, these reference-based metrics may under-penalize the captions with “hallucinating” descriptions since these metrics only measure similarity to references, and the visual relevance cannot be fully captured. For example, as shown in Fig.\ref{fig:introductionexample} (b), due to the word “games” appearing in the references, some reference-metrics return higher scores for caption B than caption A, even though “different games” is a “hallucinating” phrase which is not related to the video.
Main Points:
1. Commonly used video caption metrics come from other tasks and may not fully capture the unique characteristics of video captioning.

2. The requirement of reference causes three intrinsic drawbacks: (1) Cannot be applied in real time. (2) Over-penalize the correct captions. (3) Under-penalize the captions with “hallucinating” descriptions.

##############################

Paragraph: [Paragraph]
Main Points: [Main Points]
| **Multimodal Diagram Captioning** | |:---| | Describe $[object]$ concisely. | | Write a caption of $[object]$. | | Provide a brief description of $[object]$. | | Write a short caption for $[object]$. | | come up with a concise caption that captures the essence of $[object]$. | | Encapsulate the key information presented in $[object]$ in a brief statement. | | I need a succinct caption for $[object]$. | | Please provide a pithy summary of $[object]$ that effectively communicates its message. | | Can you provide a snappy caption that perfectly encapsulates the message conveyed by $[object]$? | | Please write a brief but compelling caption that grabs the reader’s attention and draws them into $[object]$. | | Give a short caption that accurately conveys the main idea of $[object]$. | | **Multimodal Diagram Anaysis** | | Based on the previous content and the outline, write a detailed and fluent paragraph analysis. | | With reference to the preceding content and the given summary, compose a comprehensive and articulate paragraph analysis. | | Considering the information provided earlier and following the provided outline, produce a detailed and fluent analysis in paragraph form. | | Drawing from the preceding content and adhering to the outlined structure, write a thorough and coherent paragraph analysis. | | Based on the previous content and guided by the summary, construct a detailed and fluid analysis in paragraph format. | | Taking into account the preceding information and following the provided outline, generate a comprehensive and well-developed paragraph analysis. | | Considering the content discussed earlier and following the provided outline, present a detailed and fluent analysis in paragraph form. | | With reference to the previous content and the summary, provide a comprehensive and articulate paragraph analysis. | | Based on the preceding discussion and in accordance with the outlined structure, compose a detailed and coherent paragraph analysis. | | Considering the information presented earlier and adhering to the provided summary, formulate a thorough and seamless paragraph analysis. | | **Outline Recommendation** | | *more than 1 diagrams* | | Based on the previous content and $[object]$, list some key points that should be covered in the next paragraph. | | Considering the preceding text with $[object]$, the next paragraph needs to address these essential aspects. | | Drawing from the preceding text and image information, what crucial points should be focused on in the ensuing paragraph? | | Given the multimodal information provided earlier, write some key factors for the next paragraph. | | With reference to the previous discussion and $[object]$, the next paragraph should discuss the following important elements. | | In light of the preceding content with $[object]$, which significant points should be analyzed in the subsequent paragraph? | | Based on the previous text and $[object]$, the next paragraph should delve into these core aspects. | | Considering the text and vision information presented before, give some main factors that should be addressed in the ensuing paragraph. | | Taking into account the preceding discussion and $[object]$, what primary points should be emphasized in the next paragraph? | | Given the previous context with $[object]$, generate some key elements that should be discussed in the next paragraph should discuss. | | *no diagrams* | | Based on the previous content, list some key points that should be covered in the next paragraph. 
| | Considering the preceding text, the next paragraph needs to address these essential aspects. | | Drawing from the preceding information, what crucial points should be focused on in the ensuing paragraph? | | Given the information provided earlier, write some key factors for the next paragraph. | | With reference to the previous discussion, the next paragraph should discuss the following important elements. | | In light of the preceding content, which significant points should be analyzed in the subsequent paragraph? | | Based on the previous text, the next paragraph should delve into these core aspects. | | Considering the information presented before, give some main factors that should be addressed in the ensuing paragraph. | | Taking into account the preceding discussion, what primary points should be emphasized in the next paragraph? | | Given the previous context, generate some key elements that should be discussed in the next paragraph should discuss. |
| **Prompt GPT for Extracting Key Points** |
|:---|
| Please summarize the main points of the prediction and ground truth. And strictly with the format: |
| 1\. xxx. |
| 2\. xxx. |
| ... |
| Please ensure that the generated main points comprehensively condense the information of the original text (prediction or ground truth). The number of generated main points can be as many as possible, but no more than 10. |
| |
| If there are parts of the prediction and ground truth that are the same, reflect that in main points, such as some main points of them are the same, and other main points summarize the unique content of themselves. |
| |
| Please note that if there are any overlapping contents between the prediction and ground truth, the main points for these contents should remain consistent. However, for different content of them, please provide separate main points for each. |
| |
| The format is as follows: |
| $\#\#\#\#\#\#\#$ |
| Predicted text: xxx. |
| |
| Ground Truth text: xxx. |
| |
| The main points of the predicted text: |
| 1\. xx |
| 2\. xx |
| ... |
| |
| The main points of the ground truth text: |
| 1\. xx |
| 2\. xx |
| ... |
| $\#\#\#\#\#\#\#$ |
| |
| Now, please generate the main points of the given prediction and ground truth, please strictly use the prompt ’The main points of the xxx’ in the response. |
| |
| Predicted text: $[Prediction]$ |
| Ground Truth text: $[Ground~Truth]$ |
| **Prompt GPT for Judging Semantic Matching** |
| Given a predicted text and a reference text, please judge whether the semantics of the predicted text can match the reference text. |
| And use Yes or No to represent match or mismatch. |
| The format is as follows: |
| Predicted text: xxx. |
| Reference text: xxx. |
| Yes/No |
| ———- |
| Predicted text: $[Predicted~Point]$ |
| Reference text: $[GT~Point]$ |
UniDoc: A Universal Large Multimodal Model for Simultaneous Text Detection, Recognition, Spotting and Understanding 2023-08-19 Hao Feng, Zijian Wang, Jingqun Tang, Jinghui Lu, Wengang Zhou, Houqiang Li, Can Huang
In the era of Large Language Models (LLMs), tremendous strides have been made in the field of multimodal understanding. However, existing advanced algorithms are limited to effectively utilizing the immense representation capabilities and rich world knowledge inherent to these large pre-trained models, and the beneficial connections among tasks within the context of text-rich scenarios have not been sufficiently explored. In this work, we introduce UniDoc, a novel multimodal model equipped with text detection and recognition capabilities, which are deficient in existing approaches. Moreover, UniDoc capitalizes on the beneficial interactions among tasks to enhance the performance of each individual task. To implement UniDoc, we perform unified multimodal instruct tuning on the contributed large-scale instruction following datasets. Quantitative and qualitative experimental results show that UniDoc sets state-of-the-art scores across multiple challenging benchmarks. To the best of our knowledge, this is the first large multimodal model capable of simultaneous text detection, recognition, spotting, and understanding.
# Introduction

Nowadays, considerable advancements have been observed in the domain of Large Language Models (LLMs), such as ChatGPT, [^1] BLOOM [scao2022bloom](http://arxiv.org/pdf/2106.06683v2), and LLaMA [touvron2023llama](http://arxiv.org/pdf/2402.08075v1), [touvron2023llama2](http://arxiv.org/pdf/2403.00858v4). These developments constitute significant strides towards the achievement of artificial general intelligence (AGI) and exhibit superior zero-shot proficiency across various linguistic applications. By employing these LLMs as language decoders, their multimodal counterparts (LMMs), which include models like BLIP [li2023blip](http://arxiv.org/pdf/2301.12597v3), MiniGPT-4 [zhu2023minigpt](http://arxiv.org/pdf/2402.17510v1), LLaVA [liu2023visual](http://arxiv.org/pdf/2402.11690v1), and mPLUG-Owl [ye2023mplug](http://arxiv.org/pdf/2405.00390v2), have showcased noteworthy efficacy in understanding visual and linguistic data.

While these large multimodal models exhibit astonishing zero-shot multimodal understanding capabilities, their comprehension of text-rich images remains limited [liu2023hidden](http://arxiv.org/pdf/2305.07895v5). To address this gap, LLaVAR [zhang2023LLaVAR](zhang2023LLaVAR) proposes incorporating a text recognition pre-training task to enhance the understanding of text-rich images. Besides, mPLUG-DocOwl [ye2023mplug](http://arxiv.org/pdf/2405.00390v2) constructs a large-scale dataset for document image understanding. Although their text-rich scene understanding capabilities have shown notable promise, the vast potential of these pretrained large visual and language models remains largely unexplored and underutilized, as analyzed next.

Firstly, a salient absence of text detection capabilities is observed in the current large multimodal models. Since these large visual and linguistic models are pre-trained on extremely large-scale datasets, they possess powerful representational capabilities and a wealth of world knowledge, endowing them with the ability to localize objects/text in images. Their potential can be further harnessed and explored. Secondly, the training strategies of advanced methods suffer from data distribution inconsistencies between the pre-training and fine-tuning phases [brown2020language](http://arxiv.org/pdf/2112.07522v2), leading to suboptimal performance. Typically, LLaVAR [zhang2023LLaVAR](zhang2023LLaVAR) solely conducts text recognition tasks during the pre-training phase and proceeds with document understanding training in the fine-tuning phase. Thirdly, text detection and recognition inherently fall under the umbrella of high-level scene understanding tasks, with the location and content of the text being associated with scene semantics. Existing LMMs for text-rich image understanding have not effectively capitalized on these beneficial connections among OCR tasks [li2017towards](http://arxiv.org/pdf/1707.03985v1) to enhance the performance of the individual tasks.

To this end, we introduce UniDoc, a universal large multimodal model for simultaneous text detection, recognition, spotting, and understanding. UniDoc aims to establish comprehensive OCR and multimodal understanding capabilities tailored for text-rich images. We integrate all these tasks into a cohesive framework driven by natural language instructions for multimodal understanding, as shown in Fig. [fig1].
Based on such unified multimodal instruct tuning, not only have we endowed our UniDoc with various OCR capabilities, but the beneficial interactions among these tasks have also enhanced the performance across individual tasks. To implement our UniDoc, we collected and annotated a large-scale instruction-following dataset for these tasks. Extensive quantitative and qualitative experimental results demonstrate the superior performance of UniDoc and its strong generalization ability. To the best of our knowledge, this is the first large multimodal model capable of simultaneous text detection, recognition, spotting, and understanding. In summary, we make three-fold contributions as follows:

- We introduce UniDoc, the first large multimodal model capable of simultaneous text detection, recognition, spotting, and multimodal understanding of text-rich images.
- We contribute a large-scale multimodal instruction tuning dataset, tailored for tasks of text detection, recognition, and spotting within text-rich images.
- We achieve state-of-the-art performance on multiple publicly available benchmark datasets. Moreover, we conduct extensive quantitative and qualitative experiments to validate the effectiveness of UniDoc.

# Related Work

In this section, we broadly review the recent research on instruction tuning and multimodal instruction tuning.
## Instruction Tuning

Instruction tuning is an effective technique to align large language models (LLMs) with human intents. It aims to teach language models to follow natural language instructions (including prompts, positive or negative examples, constraints, etc.), so as to perform better multi-task learning on training tasks and to generalize better to unseen tasks. Recently, models like GPT-3 [brown2020language](http://arxiv.org/pdf/2112.07522v2) and others have significantly leveraged instructional fine-tuning. Typically, Stanford’s Alpaca [alpaca](https://github.com/tatsu-lab/stanford_alpaca) employs self-instruction [wang2022self](http://arxiv.org/pdf/2311.00233v2) to provide a cost-effective approach to obtain instruction data for fine-tuning LLaMA. Vicuna [chiang2023vicuna](None), an instruction-fine-tuned LLaMA trained on dialogues between users and ChatGPT, achieves performance comparable to ChatGPT [zheng2023judging](https://arxiv.org/pdf/2306.05685).

## Multimodal Instruction Tuning

Recent advancements in the confluence of natural language processing and computer vision have seen the rise of Large Multimodal Models (LMMs), which integrate large language models and visual encoders to address complex tasks involving both text and vision. Prominent works in this domain include MiniGPT-4 [zhu2023minigpt](http://arxiv.org/pdf/2402.17510v1), which fuses components from BLIP-2 [li2023blip](http://arxiv.org/pdf/2301.12597v3) and Vicuna [chiang2023vicuna](None) for modality mapping and adopts a two-stage fine-tuning strategy. The LLaVA model, on the other hand, employs a supplementary linear layer to map visual features to the text space and undergoes additional fine-tuning under multimodal instructions. In the same vein, mPLUG-Owl from Alibaba’s DAMO Academy incorporates Flamingo’s Perceiver Resampler structure to facilitate visual and language modality alignment. Another significant contribution is from InstructBLIP, which introduces a novel multimodal instruction dataset and uses Q-Former and Vicuna as its image encoder and language model, respectively. Finally, X-LLM has introduced a Chinese multimodal instruction dataset and employs several adapters to map different modalities to the text space. While these multimodal large models exhibit promising visual-linguistic understanding capabilities, their potential is yet to be fully harnessed in specific domains. To bridge this divide, LLaVAR [zhang2023LLaVAR](zhang2023LLaVAR) puts forward the inclusion of a text recognition pre-training task, thus bolstering the comprehension of text-heavy imagery. In addition, mPLUG-DocOwl [ye2023mplug](http://arxiv.org/pdf/2405.00390v2) has compiled an expansive dataset designed specifically for the fine-tuning of document comprehension tasks. Shikra [chen2023shikra](http://arxiv.org/pdf/2306.15195v2) integrates LMMs with visual grounding ability by recasting the detection task as a prompt-guided seq2seq task. Although these approaches somewhat augment the multimodal comprehension ability of models in text-rich scenarios, they fall short in offering a comprehensive ability for text detection, recognition and spotting. Moreover, they do not effectively harness the potential reciprocal enhancements that could be achieved by learning these capabilities in tandem.

# Methodology

## Model Architecture

Fig. [frame] presents an overview of our UniDoc. Our design follows the paradigm established by MiniGPT-4 [zhu2023minigpt](http://arxiv.org/pdf/2402.17510v1) and LLaVA [liu2023visual](http://arxiv.org/pdf/2402.11690v1).
Specifically, given an input *RGB* image $\bm{I} \in \mathbb{R}^{H\times W\times3}$ and a natural language instruction $\bm{Q}$, UniDoc first abstracts the visual features from $\bm{I}$ utilizing CLIP-ViT-L/14 [radford2021learning](http://arxiv.org/pdf/2404.19696v1) as the visual encoder. Both pre- and post- Transformer layer grid features are incorporated in our method. The extracted feature map is then flattened into a visual embedding sequence and projected into the embedding dimension of the LLM with a linear layer. The output sequence $\bm{E}_v$ is then concatenated with the embedding sequence $\bm{E}_l$ tokenized from the language instruction $\bm{Q}$. Thereafter, the concatenated embedding sequence is fed into Vicuna [chiang2023vicuna](None), a large language model derived from LLaMA [touvron2023llama](http://arxiv.org/pdf/2402.08075v1) and specifically tuned with instruction-following data. Vicuna [chiang2023vicuna](None) then generates the response based on the received visual and text cues. Note that the visual embedding here can be considered as a soft prompt for the LLM.

## Unified Multimodal Instruct Tuning

Our training process is divided into two stages. Both stages employ our unified multimodal instruct tuning. The first pre-training phase aims to align the output features from the pre-trained visual encoder with the feature space of the large language model. During the second fine-tuning stage, we further optimize the weights of the large language model. Concretely, during the pre-training phase, we freeze both the pre-trained large visual and language models, training only the linear projector to align the visual and language features. Our instruction-following data involves four tasks: text detection, recognition, spotting, and image captioning. We argue that detection, recognition, and spotting inherently involve high-level semantic understanding, as the position and content of text within an image often have a strong correlation with their surrounding context. The image captioning task enhances the model’s understanding of natural scene images. All of these tasks are performed in a natural language instruction-following manner.
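To make the projection-based design and the two-stage freezing scheme above concrete, here is a minimal sketch (not UniDoc's released code): grid features from the vision encoder are projected into the LLM embedding space by a linear layer and prepended to the instruction embeddings. The module interfaces, dimensions, and the `configure_stage` helper are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ProjectionVLM(nn.Module):
    def __init__(self, vision_encoder, llm, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder                 # e.g. a CLIP-ViT-L/14 wrapper
        self.llm = llm                                       # e.g. a Vicuna-style decoder
        self.projector = nn.Linear(vision_dim, llm_dim)      # vision-language connector

    def forward(self, image, instruction_embeds):
        patch_feats = self.vision_encoder(image)             # (B, num_patches, vision_dim)
        visual_embeds = self.projector(patch_feats)          # (B, num_patches, llm_dim)
        # The projected visual tokens act as a soft prompt concatenated with the text.
        inputs_embeds = torch.cat([visual_embeds, instruction_embeds], dim=1)
        return self.llm(inputs_embeds=inputs_embeds)

def configure_stage(model: ProjectionVLM, stage: str):
    """Stage 'pretrain': train only the projector; stage 'finetune': also unfreeze the LLM."""
    for p in model.parameters():
        p.requires_grad = False
    for p in model.projector.parameters():
        p.requires_grad = True
    if stage == "finetune":
        for p in model.llm.parameters():
            p.requires_grad = True
```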
| **Stage** | **Data** | **Image** | **Instruction** | **\# Conv** | **Task** |
|:---|:---|:--:|:--:|:--:|:--:|
| Pre-train | LLaVA | CC3M | CC3M | 595K | $\mathcal{C}$ |
| | UniDoc | LAION | OCR | 600K | $\mathcal{D},\mathcal{R},\mathcal{S},\mathcal{C}$ |
| Fine-tune | LLaVA | COCO | GPT-4 | 158K | $\mathcal{U}$ |
| | LLaVAR | LAION | GPT-4 | 16K | $\mathcal{D},\mathcal{R},\mathcal{S},\mathcal{U}$ |
| | UniDoc | LAION | GPT-4 + OCR | 186K | $\mathcal{D},\mathcal{R},\mathcal{S},\mathcal{U}$ |

Summary of the dataset statistics. The symbols $\mathcal{C},\mathcal{D},\mathcal{R},\mathcal{S},\mathcal{U}$ correspond to the different instruction-following tasks, namely captioning, detection, recognition, spotting, and multimodal understanding.
In the fine-tuning phase, we unfreeze both the large language model and the projector. Besides the training tasks involved in the pre-training stage, we further incorporate an additional multimodal understanding task for text-rich images, which requires a more advanced level of semantic comprehension. These tasks mutually reinforce each other during learning. Through this unified multimodal instruction fine-tuning, UniDoc achieves a comprehensive recognition and understanding capability for text-rich scenarios.

# Dataset Construction

To train UniDoc, we construct a large-scale multimodal instruction-following dataset. We detail it in the following.

**Pre-training.** The pre-training data consists of two parts: one portion includes 595K natural scene images along with their captions, sourced from the CC3M dataset and filtered by LLaVA [liu2023visual](http://arxiv.org/pdf/2402.11690v1); the other portion comprises 600K image-text pairs from PowerPoint presentations that we created. The data were collected from the “Common Crawl” dataset, a vast web corpus containing publicly available web pages. [^2] We opt for PowerPoint files based on two primary considerations. On one hand, PowerPoint presentations are characterized by a rich assortment of elements and their complex combinations, such as various fonts, images, and tables, as shown in Fig. 1. These elements are interrelated, making them highly conducive to training multimodal understanding tasks in text-rich scenarios. On the other hand, the text within the slides is relatively large, making it legible for existing pre-trained visual models [radford2021learning](http://arxiv.org/pdf/2404.19696v1). In other words, if the text in an image is too small, it becomes unrecognizable when input into the model. To ensure high-quality visuals suitable for our purposes, we conducted rigorous quality assurance checks, eliminating noisy data to avoid any negative impact on training. Specifically, we first applied text size optimization, excluding images with small-sized text. Then, an in-house OCR tool extracts the text and box annotations from each image, and we construct OCR instructions based on them. The instructions here are categorized into three types: text detection, recognition, and understanding. Furthermore, we employed GPT-4 to generate diverse expressions for each type of instruction. The data for detection, recognition, and spotting each account for one-third of the total.
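As a rough illustration of how such OCR-derived instruction data can be assembled, the snippet below builds detection, recognition, and spotting samples from OCR output. The instruction templates and the record format are assumptions for illustration, not the paper's exact templates.

```python
import random

# Hypothetical template pools; in practice each pool would hold many GPT-4-reworded variants.
DETECT = ["Output all the text locations in this photo."]
RECOGNIZE = ["Identify any text visible in the provided image."]
SPOT = ["Detect and recognize all the text in this image."]

def make_sample(image_path, ocr_items, task):
    """ocr_items: list of (text, (x1, y1, x2, y2)) pairs returned by an OCR tool."""
    if task == "detection":
        instruction = random.choice(DETECT)
        response = "; ".join(str(box) for _, box in ocr_items)          # locations only
    elif task == "recognition":
        instruction = random.choice(RECOGNIZE)
        response = " ".join(text for text, _ in ocr_items)              # transcripts only
    else:  # spotting: location and content together
        instruction = random.choice(SPOT)
        response = "; ".join(f"{box}: {text}" for text, box in ocr_items)
    return {"image": image_path, "instruction": instruction, "response": response}
```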
Example instances from the proposed dataset, featuring diverse fonts in terms of size, style, and color, and a rich array of visual elements.
**Fine-tuning.** During fine-tuning, we extend the 16K instruction-following data collected from LAION-5B [schuhmann2022laion](http://arxiv.org/pdf/2312.15897v1) and constructed by LLaVAR [zhang2023LLaVAR](zhang2023LLaVAR). Initially, we curated this dataset, employing the same cleansing methodology as used for the pre-training set. Subsequently, for each image, we constructed OCR instruction-following data, adhering to the approach established during the pre-training phase. The data for detection, recognition, and spotting each account for one-third of the total. Furthermore, we incorporated an additional 150K OCR instruction data, as in the pre-training stage, in which detection, recognition, and spotting each constitute one-third of the total.
# Experiments

## Training Details

To implement UniDoc, we employed a one-cycle learning rate policy [smith2019super](http://arxiv.org/pdf/1708.07120v3). During the pre-training phase, the maximum learning rate was set to 1e-3, and for the fine-tuning phase, it was reduced to 1e-5. Moreover, the batch size was 128 for the pre-training and 32 for the fine-tuning phase, respectively. The AdamW [loshchilov2017decoupled](http://arxiv.org/pdf/2311.11446v2) optimizer was chosen for weight updates. Both the pre-training and fine-tuning phases were executed using eight A100 GPUs. Each of these phases consisted of a single epoch. In this study, for both the training and inference phases, the default input image resolution is set at 224$\times$224. It is noteworthy that larger input resolutions are almost certain to yield better results due to the presence of more discernible text [zhang2023LLaVAR](zhang2023LLaVAR), [ye2023mplug-doc](http://arxiv.org/pdf/2403.14252v1). Unless otherwise specified, the performance reported in this study is based on image inputs with an input resolution of 224$\times$224.
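A minimal sketch of this optimization setup (AdamW with a one-cycle schedule and the reported peak learning rates); the number of steps and any hyperparameters not stated in the text are placeholders, not values from the paper.

```python
import torch

def build_optimizer(model, stage, steps_per_epoch):
    # Peak learning rates follow the text: 1e-3 for pre-training, 1e-5 for fine-tuning.
    max_lr = 1e-3 if stage == "pretrain" else 1e-5
    optimizer = torch.optim.AdamW(model.parameters(), lr=max_lr)
    # One-cycle policy over a single epoch, as each phase runs for one epoch.
    scheduler = torch.optim.lr_scheduler.OneCycleLR(
        optimizer, max_lr=max_lr, epochs=1, steps_per_epoch=steps_per_epoch
    )
    return optimizer, scheduler
```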
| Method | CTW1500 | TotalText | TD500 |
|:---:|:---:|:---:|:---:|
| **UniDoc** | 38.27 | 12.60 | 17.36 |

Quantitative performance of UniDoc (F-score) on several scene text detection benchmark datasets. Here the input instruction is “Output all the text locations in this photo”.
## Evaluation Metrics

We evaluate our UniDoc in a series of text-rich scenes from three perspectives (*i.e.,* detection, recognition, and multimodal understanding). For the task of text detection, we employed the F-score metric. For text recognition and visual question answering tasks, we adopted the accuracy metric, where a response generated by the model is considered correct if it contains the string present in the ground truth [liu2023hidden](http://arxiv.org/pdf/2305.07895v5). In this paper, F-score and accuracy are respectively denoted as $\mathcal{F}$ and $\mathcal{A}$.
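The containment-based accuracy described above fits in a few lines; the lower-casing and whitespace stripping here are normalization assumptions rather than details given in the paper.

```python
def contains_accuracy(predictions, ground_truths):
    """A prediction is correct if it contains the ground-truth string."""
    correct = sum(
        gt.strip().lower() in pred.strip().lower()
        for pred, gt in zip(predictions, ground_truths)
    )
    return correct / len(ground_truths)
```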
| Pre-train | Fine-tune | Detection ($\mathcal{F}$) | Recognition ($\mathcal{A}$) | Understanding ($\mathcal{A}$) |
|:--:|:--:|:--:|:--:|:--:|
| | | 0.00 | 20.01 | 35.78 |
| | | 0.00 | 84.13 | **41.28** |
| | | 27.89 | 88.93 | 40.46 |
| | | **38.27** | **90.60** | 40.72 |

Ablation studies on the training task settings. A check mark in the Pre-train / Fine-tune columns indicates that the corresponding training phase includes the detection, recognition, and spotting tasks.
| Experiment | Setting | Detection ($\mathcal{F}$) | Recognition ($\mathcal{A}$) | Understanding ($\mathcal{A}$) |
|:---:|:---:|:---:|:---:|:---:|
| index tokens | w/ | 31.28 | \- | \- |
| | w/o | **38.27** | \- | \- |
| instruction type | detection | 38.27 | \- | \- |
| | spotting | **43.33** | \- | \- |
| instruction type | recognition | \- | 90.60 | \- |
| | spotting | \- | **91.30** | \- |

Ablation studies about variations in detection task configurations, and the impacts of the instruction type on text detection and recognition during inference.
## Comparison with Other LMMs

We perform an exhaustive evaluation of publicly accessible large multimodal models (LMMs) and our UniDoc, assessing their efficacy across various benchmarks. In the following, we compare and analyze the experimental results.

**Text Detection.** Compared with the existing large multimodal models (LMMs), a unique capability of our UniDoc is its text detection ability. This stems from our approach of incorporating text detection as part of the unified multimodal instruction tuning. In Table 2, we present the quantitative performance of our method on multiple scene text detection datasets, including CTW1500 [liu2019curved](http://arxiv.org/pdf/1712.02170v1), TotalText [ch2017total](http://arxiv.org/pdf/1710.10400v1), and TD500 [yao2012detecting](http://arxiv.org/pdf/1703.01086v3). Moreover, as illustrated in Fig. [fig_spotting], we provide examples showcasing UniDoc’s text detection performance on the CTW1500 dataset [liu2019curved](http://arxiv.org/pdf/1712.02170v1). It can be seen that the text is consistently detected in these images. Notably, the words in these images are located irregularly instead of in a straight horizontal line, and our training phase also does not involve text detection tasks for such scene images. These findings validate our learning strategy and underscore the substantial generalization ability of LLMs.

Quantitative comparison on multiple recognition datasets based on the recognition instructions and spotting instructions. The x-axis represents the datasets. Spotting instruction consistently performs better.
A case study illustrating the impact of detection (left) and spotting (right) instructions on the response. Spotting effectively mitigates recognition omissions.
**Text Recognition.** Furthermore, we extend our evaluation to assess the text recognition capacity of UniDoc. To commence, as shown in Table [tab:text_reco], UniDoc achieves a series of state-of-the-art scores across numerous benchmark datasets for text recognition. It is noteworthy that these datasets encompass a diverse array of text-rich images, including document text, artistic text, handwritten text, scene text, and more. Moreover, as depicted in Fig. [fig_spotting] and Fig. [fig_recognize], we showcase recognition results of UniDoc on the CTW1500 [liu2019curved](http://arxiv.org/pdf/1712.02170v1), WordArt [xie2022toward](http://arxiv.org/pdf/1812.05824v3) and TotalText [ch2017total](http://arxiv.org/pdf/1710.10400v1) datasets. Although these images involve varying fonts, styles, image blurriness, and non-horizontal text distributions, UniDoc consistently manifests a remarkable ability to accurately recognize the embedded text within them.
**Multimodal Understanding.** We conduct both quantitative and qualitative assessments of UniDoc’s multimodal understanding performance. Specifically, as presented in Table [tab:text_reco_vqa_kie_res], UniDoc achieves state-of-the-art and comparable performance on several benchmark datasets. Besides, as illustrated in Fig. [fig_understanding], we provide examples of multimodal question-answering focused on text-based scenarios. It can be seen that UniDoc effectively integrates the visual cues from the input image and the textual cues from both the image and instructions. Leveraging the inherent world knowledge of the large language model (LLM), it then engages in coherent reasoning to generate corresponding responses.

## Ablation Studies

In this section, we conduct ablation studies to validate the efficacy of core settings and components in our UniDoc. In all experiments, for the tasks of text detection, recognition, and multimodal understanding, we report the performance on the CTW1500 [liu2019curved](http://arxiv.org/pdf/1712.02170v1), IIIT5K [mishra2012scene](http://arxiv.org/pdf/1907.09705v1), and TextVQA [singh2019towards](http://arxiv.org/pdf/1811.11903v1) benchmark datasets, respectively.

**Impact of Unified Multimodal Instruct Tuning.** During the pre-training phase, the instruction-following data we trained on encompasses text detection, recognition, and spotting tasks. In the fine-tuning phase, the instruction-following data was further augmented with tasks concerning multimodal understanding. We investigate the impact of learning these tasks (i.e., text detection, recognition, and spotting) on the final performance. As illustrated in Table 3, incorporating the learning of them in individual phases led to enhancements not only in detection and recognition performance, but also in multimodal understanding. Furthermore, incorporating these tasks in both stages yielded the best performance. These results demonstrate that there exists a beneficial interplay and synergy among these tasks. We argue that such a multi-task learning strategy not only endows Large Multimodal Models (LMMs) with comprehensive capabilities, but also bolsters their inherent abilities.

**Impact of the Formulation of the Detection Task.** In our default setting, we directly predict the integer coordinates of the text region bounding boxes. Given that our input images are all of size 224$\times$224, these coordinates are normalized to the range \[0, 223\]. An alternative approach is to set up an additional 224 tokens to represent both the horizontal and vertical coordinates in the range \[0, 223\] [chen2021pix2seq](http://arxiv.org/pdf/2305.18279v1). As shown in Table 4, in terms of text detection capabilities, the introduction of additional positional index tokens did not yield a performance gain.

**Impact of Instruction Template Type.** In our UniDoc, the detection results can originate from either the detection or the spotting instructions. Similarly, our recognition outcomes can be sourced from either the recognition or the spotting instructions. Consequently, we evaluate the impact of using different types of instructions on the performance of detection and recognition. As shown in Table 4, the text detection and recognition performance based on the spotting instruction works better. This is likely because, in autoregressive generation, the spotting instruction template makes the model provide explicit location information in its responses, enhancing the recognition performance.
The same applies to detection tasks. The two tasks are mutually complementary. In Fig. 2, we perform quantitative comparisons on a broader range of recognition benchmarks. Besides, as shown in Fig. 3, we further provide a case to illustrate this finding.

# Conclusion

In this work, we introduce UniDoc, a universal large multimodal model for simultaneous text detection, recognition, spotting, and understanding. Through our proposed unified multimodal instruct tuning, UniDoc effectively leverages the beneficial interactions among text-based tasks, not only addressing the shortcomings of existing large multimodal models, but also enhancing their original capabilities. To implement UniDoc, we contribute a large-scale multimodal instruction-following dataset. Experiments show that our UniDoc sets state-of-the-art scores across multiple benchmarks. Besides, we perform extensive studies to validate its effectiveness. Currently, UniDoc is unable to extract fine-grained visual features for detection and recognition, and the resolution of input images remains a limitation. In the future, we will consider addressing these issues.

[^1]: https://openai.com/blog/chatgpt
[^2]: https://commoncrawl.org/
mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding 2023-07-04 Jiabo Ye, Anwen Hu, Haiyang Xu, Qinghao Ye, Ming Yan, Yuhao Dan, Chenlin Zhao, Guohai Xu, Chenliang Li, Junfeng Tian, Qian Qi, Ji Zhang, Fei Huang
Document understanding refers to automatically extracting, analyzing and comprehending information from various types of digital documents, such as a web page. Existing Multimodal Large Language Models (MLLMs), including mPLUG-Owl, have demonstrated promising zero-shot capabilities in shallow OCR-free text recognition, indicating their potential for OCR-free document understanding. Nevertheless, without in-domain training, these models tend to ignore fine-grained OCR features, such as sophisticated tables or large blocks of text, which are essential for OCR-free document understanding. In this paper, we propose mPLUG-DocOwl based on mPLUG-Owl for OCR-free document understanding. Specifically, we first construct an instruction tuning dataset featuring a wide range of visual-text understanding tasks. Then, we strengthen the OCR-free document understanding ability by jointly training the model on language-only, general vision-and-language, and document instruction tuning datasets with our unified instruction tuning strategy. We also build an OCR-free document instruction understanding evaluation set, LLMDoc, to better compare models' capabilities on instruction compliance and document understanding. Experimental results show that our model outperforms existing multi-modal models, demonstrating its strong ability of document understanding. Besides, without specific fine-tuning, mPLUG-DocOwl generalizes well on various downstream tasks. Our code, models, training data and evaluation set are available at https://github.com/X-PLUG/mPLUG-DocOwl.
# Introduction

Large language models (LLMs) like ChatGPT [chatgpt](https://openai.com/blog/chatgpt), BLOOM [bloom](None), and LLaMA [llama](http://arxiv.org/pdf/2402.08075v1) have undergone rapid development to enable the realization of general artificial intelligence, boasting impressive zero-shot capabilities across diverse linguistic applications. With the LLM as the language decoder, Multimodal Large Language Models (MLLMs) such as MiniGPT-4 [minigpt4](http://arxiv.org/pdf/2402.17510v1), LLaVA [llava](http://arxiv.org/pdf/2402.11690v1), and mPLUG-Owl [mplugowl](http://arxiv.org/pdf/2405.00390v2) have demonstrated remarkable zero-shot performance in various open-ended vision-and-language tasks. These models are trained to align text and images during the pre-training phase, and then to promote diverse abilities during the instruction tuning phase. Interestingly, these MLLMs exhibit superficial OCR-free text recognition abilities without explicit training on visual text understanding datasets [mplugowl](http://arxiv.org/pdf/2405.00390v2), [llmocr](http://arxiv.org/pdf/2305.07895v5). Nevertheless, due to lacking specific training, these models still face the challenge of comprehending intricate relationships between visual text and objects in diverse types of images, such as charts, documents and webpages.

By performing unified instruction tuning for document understanding upon mPLUG-Owl [mplugowl](http://arxiv.org/pdf/2405.00390v2), we further propose a modularized MLLM [mplug](None), [mplug2](None), namely mPLUG-DocOwl. Our approach utilizes a modularized framework similar to mPLUG-Owl [mplugowl](http://arxiv.org/pdf/2405.00390v2), which incorporates a visual abstractor module to link a pre-trained LLM with a visual knowledge module, achieving the alignment of text and images. To enhance diverse document understanding capabilities, we reorganize various downstream document understanding tasks into the same instruction form. To maintain general uni/multi-modal abilities, we also include the language-only and general vision-and-language instruction datasets used by mPLUG-Owl to train mPLUG-DocOwl. During training, both the visual knowledge module and the LLM decoder are frozen; only the visual abstractor and the Low-Rank Adaptation (LoRA) [lora](https://openreview.net/forum?id=nZeVKeeFYf9) modules in the LLM are fine-tuned. mPLUG-DocOwl achieves OCR-free state-of-the-art performance on multiple commonly used document understanding datasets. Furthermore, our experiments on a carefully built document instruction understanding evaluation set, LLMDoc, show that mPLUG-DocOwl achieves significantly better visual text understanding performance across various domains than existing MLLMs. Our main contributions can be highlighted as follows:

- We propose a modularized MLLM, **mPLUG-DocOwl**, which is the first one to balance language-only, general vision-and-language, and document understanding based on unified instruction tuning.
- We carefully construct an instruction understanding test set with human evaluation, dubbed **LLMDoc**, to assess diverse document understanding capabilities.
- Empirical results demonstrate that our mPLUG-DocOwl surpasses existing methods on OCR-free document understanding, including multiple standard benchmarks and LLMDoc.

# Related Work

## Visual Text Understanding

There are two types of models for understanding images that contain rich textual information.
The first kind of approaches [layoutlm](https://doi.org/10.1145/3394486.3403172), [layoutlmv3](None), [qctextcap](http://arxiv.org/pdf/2302.02124v2), [udop](http://arxiv.org/pdf/2212.02623v3), [tap](None) utilize off-the-shelf OCR models or APIs to recognize text from images, and then design pretraining tasks to facilitate cross-modality alignment between visual and textual inputs. On the other hand, end-to-end approaches [dessurt](http://arxiv.org/pdf/2203.16618v3), [donut](http://arxiv.org/pdf/2305.09520v1), [pix2struct](None) utilize a high-resolution image encoder to learn text recognition during the pretraining stage. Both types of models rely on specific finetuning on different downstream datasets and can’t achieve open-domain instruction understanding performance like Multimodal Large Language Models.

## Multimodal Large Language Model

Large Language Models (LLMs) have demonstrated impressive zero-shot abilities across various open-ended tasks. Recent research has also explored the application of LLMs for multi-modal generation, utilizing two different paradigms: systematic collaboration and end-to-end trained models. Systematic collaboration approaches, such as Visual ChatGPT [visualchatgpt](None) and MM-REACT [mmreact](None), leverage various vision experts or tools to express visual information with text descriptions. Subsequently, LLMs, such as ChatGPT [chatgpt](https://openai.com/blog/chatgpt), can act as agents and select appropriate experts and tools for visual understanding. Finally, LLMs would summarize the output of these experts to answer user queries. On the other hand, some approaches, such as MiniGPT-4 [minigpt4](http://arxiv.org/pdf/2402.17510v1), LLaVA [llava](http://arxiv.org/pdf/2402.11690v1), and mPLUG-Owl [mplugowl](http://arxiv.org/pdf/2405.00390v2), leverage LLMs to build unified models for multi-modality with limited connected parameters. These methods show superficial OCR-free text recognition abilities under the zero-shot setting. However, for complicated document understanding, due to lacking in-domain training, they encounter challenges in handling diverse image types, recognizing rich texts and comprehending relationships between visual semantics and text information. In this work, through unified instruction tuning, mPLUG-DocOwl achieves much better document understanding performance and maintains general uni/multi-modal abilities.

# Conclusion

In this work, we infuse diverse OCR-free document understanding capabilities into mPLUG-Owl by incorporating document understanding data into instruction finetuning. Experiment results demonstrate that our mPLUG-DocOwl achieves comparable or even better performance than existing OCR-free methods. Besides, benefiting from language-only and general vision-and-language instruction tuning, mPLUG-DocOwl can better comprehend user instructions and intentions, enabling more complex interactions. Moreover, human evaluation on LLMDoc reveals that mPLUG-DocOwl still struggles with document-related commonsense reasoning, mathematical calculations, and creative generation. This provides valuable insights about developing stronger document understanding abilities with the LLM in the future.

[^1]: Equal contribution
[^2]: Corresponding author

# Experiment

## LLMDoc

Existing benchmarks can hardly evaluate the open-ended instruction understanding results given by MLLMs. To better compare instruction understanding performance in the document domain, we further construct a test set with human evaluation, namely LLMDoc.
#### Data Collection

To comprehensively evaluate the model’s abilities, we consider five scenarios to construct our evaluation dataset, including table (TabFact [TabFact](http://arxiv.org/pdf/2311.06592v1)), chart (ChartQA [chartqa](None)), document (DocVQA [docvqa](None)), natural image (TextVQA [textvqa](None)) and webpage (VisualMRC [visualmrc](http://arxiv.org/pdf/2101.11272v2)). Specifically, for each dataset, we sample 20 images from the test split. For 10 of these images, we adopt a raw question as the instruction, while for the other 10, we ask annotators to write instructions requiring stronger capabilities like summarization, inference, and calculation. In total, we obtain 100 test samples.

#### Human Evaluation

Following the rating criteria proposed in Self-Instruct [self-instruct](https://doi.org/10.48550/arXiv.2212.10560), we perform human evaluation to score the model’s responses, where A \> B \> C \> D and A represents ‘correct and satisfying response’, B means ‘acceptable response with minor imperfections’, C refers to ‘response to the instruction but has significant errors’ and D means ‘irrelevant or invalid response’.
We compare mPLUG-DocOwl with other popular multi-modal large language models, including mPLUG-Owl [mplugowl](http://arxiv.org/pdf/2405.00390v2) and Mini-GPT4 [minigpt4](http://arxiv.org/pdf/2402.17510v1), on LLMDoc. As shown in [fig:llm_comp], mPLUG-DocOwl achieves significantly better performance, with 37 responses being scored as “A”, demonstrating its stronger understanding ability in diverse document scenarios. Besides, it’s worth noting that all models have some responses scored as “C” or “D”, showing that instruction understanding performance in the document domain is still far from promising and needs more endeavor.

## Benchmark Evaluation

Besides human evaluation, we also compare mPLUG-DocOwl with OCR-free state-of-the-art document understanding models on public datasets. [tab:due_eval] shows the comparison with Dessurt [dessurt](http://arxiv.org/pdf/2203.16618v3), Donut [donut](http://arxiv.org/pdf/2305.09520v1) and Pix2Struct [pix2struct](None) on DUE-Benchmark [due](None), which mainly requires text recognition and layout understanding abilities on documents and tables. Besides, [tab:other_eval] presents the evaluation on the chart, natural image and webpage datasets, which require a stronger ability to relate visual semantics and text information. Without finetuning on each dataset, mPLUG-DocOwl achieves comparable or even better performance.

## Qualitative Analysis
Qualitative results of mPLUG-DocOwl. The crucial regions and corresponding words are annotated with the same colors for clearer visualization. Wrong answers are colored red.
#### Benchmark Results

Qualitative results on different types of images are shown in Figure 1. Crucial regions and corresponding responses are annotated with the same colors. Case (a) shows that mPLUG-DocOwl can accurately find the answer from a webpage screenshot with complex contents. Case (b) shows that it is even able to understand hand-drawn tables and correctly recognize handwritten fonts. In case (c), it can summarize key points from a chart. It successfully understands that the table is about internet usage and infers that “Never” means “Never used internet”. However, it also generates illusory outputs, such as "in the United States". The question in case (d) requires the model to understand the “Result” column, compare the points and return the date with the best results. Case (e) demonstrates that our model is capable of processing scanned documents and distinguishing company and person names. Case (f) shows that it can not only recognize small and blurry text but also perform simple calculations following the user intent.
Qualitative comparison between mPLUG-DocOwl and Mini-GPT4 on LLMDoc. Part one.
Qualitative comparison between mPLUG-DocOwl and Mini-GPT4 on LLMDoc. Part two.
#### Results

Figures 2 and 3 present the comparison between mPLUG-DocOwl and Mini-GPT4 on LLMDoc. Figure 2 (a) requires models to convert a table into JSON format. Our model correctly understands the instruction and returns a string in JSON format, but misses the last row. Mini-GPT4 fails to comprehend the instruction and doesn’t understand the content within the table. In Figure 2 (b), both mPLUG-DocOwl and Mini-GPT4 correctly recognize the name of the shop. However, Mini-GPT4 overlooks a smaller sign indicating that clothes in this shop are medical uniforms. As for chart understanding in Figure 3 (c), Mini-GPT4 gives a wrong answer and a redundant response, while our model gives a concise and correct response. In Figure 3 (d), Bernadette’s actual purpose is to confirm with Suzy if she would like to have the copy sent overnight. This not only requires the model to accurately recognize the text, but also to understand the relationships between the involved persons. mPLUG-DocOwl recognizes the phrase "request a copy of chapter," but misunderstands the subject and object. Mini-GPT4 only comprehends that this image is a mail scenario and provides a vague and hallucinatory response. In Figure 3 (e), mPLUG-DocOwl gives a correct summary of the two latest news items but Mini-GPT4 generates news irrelevant to the webpage screenshot.
Failure cases on LLMDoc. Part one.
Failure cases on LLMDoc. Part two.
LLMDoc contains many challenging instruction understanding cases in the document domain. Figures 4 and 5 show some wrong responses given by mPLUG-DocOwl. In Figure 4 (a), the model only takes note of the three names in the picture, but ignores the fact that the user itself is also a speaker. In Figure 4 (b), it fails to perform multi-step calculations on multiple elements in the image. In Figure 5 (c), the model can understand the scene and the text in it, but fantasizes about non-existent characters. In Figure 5 (d), it fails to understand the instruction for writing news and only reads the texts in the tablet.
LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding 2023-06-29 Yanzhe Zhang, Ruiyi Zhang, Jiuxiang Gu, Yufan Zhou, Nedim Lipka, Diyi Yang, Tong Sun
Instruction tuning unlocks the superior capability of Large Language Models (LLM) to interact with humans. Furthermore, recent instruction-following datasets include images as visual inputs, collecting responses for image-based instructions. However, visual instruction-tuned models cannot comprehend textual details within images well. This work enhances the current visual instruction tuning pipeline with text-rich images (e.g., movie posters, book covers, etc.). Specifically, we first use publicly available OCR tools to collect results on 422K text-rich images from the LAION dataset. Moreover, we prompt text-only GPT-4 with recognized texts and image captions to generate 16K conversations, each containing question-answer pairs for text-rich images. By combining our collected data with previous multi-modal instruction-following data, our model, LLaVAR, substantially improves the LLaVA model's capability on text-based VQA datasets (up to 20% accuracy improvement) while achieving an accuracy of 91.42% on ScienceQA. The GPT-4-based instruction-following evaluation also demonstrates the improvement of our model on both natural images and text-rich images. Through qualitative analysis, LLaVAR shows promising interaction (e.g., reasoning, writing, and elaboration) skills with humans based on the latest real-world online content that combines text and images. We make our code/data/models publicly available at https://llavar.github.io/.
# Introduction

Instruction tuning [ouyang2022training](https://arxiv.org/pdf/2203.02155), [chung2022scaling](https://arxiv.org/pdf/2210.11416) improves generalization to unseen tasks by formulating various tasks into instructions. Such open-ended question-answering capability fosters the recent chatbot boom since ChatGPT. Recently, visual instruction-tuned models [liu2023visual](https://arxiv.org/pdf/2304.08485), [li2023otter](http://arxiv.org/pdf/2311.00233v2), [Li2023LargeMM](http://arxiv.org/pdf/2306.14895v1) further augment conversation agents with visual encoders such as CLIP-ViT [dosovitskiy2020image](https://arxiv.org/pdf/2010.11929), [radford2021learning](https://arxiv.org/pdf/2103.00020), enabling human-agent interaction based on images. However, possibly due to the dominance of natural images in training data (e.g., Conceptual Captions [changpinyo2021conceptual](https://arxiv.org/pdf/2102.08981) and COCO [lin2015microsoft](https://arxiv.org/pdf/1405.0312)), they struggle with understanding texts within images [liu2023hidden](https://arxiv.org/pdf/2305.07895).

However, textual understanding is integral to visual perception in everyday life. Fortunately, tools such as Optical Character Recognition (OCR, `\citealp{156468}`{=latex}) allow us to recognize text in images. One naive way to utilize this is to add recognized texts to the input of visual instruction-tuned models [gao2023llamaadapterv2](https://arxiv.org/pdf/2304.15010). However, such an approach significantly increases the computation (longer context lengths), and might not fully leverage the encoding capability of visual encoders.

To this end, we propose to enhance the end-to-end visual instruction-tuned model by collecting instruction-following data that require understanding texts within images. Specifically, we first collect 422K noisy instruction-following data using text-rich[^2] images by combining manually written instructions (e.g., “Identify any text visible in the provided image.”) and the OCR results. Such large-scale noisy-aligned data effectively improve feature alignment between visual features and the language decoder. Furthermore, we prompt text-only GPT-4 [openai2023gpt4](https://arxiv.org/pdf/2303.08774) with OCR results and image captions to generate 16K conversations, where each conversation can be multiple turns of question & answer pairs, as high-quality instruction-following examples. This process requires GPT-4 to de-noise the OCR results and develop specific questions to create complex instructions based on the input (Figure [fig:highquality]).

To evaluate the effectiveness of the collected data, we use noisy and high-quality examples to augment the pretraining and fine-tuning stages of LLaVA [liu2023visual](https://arxiv.org/pdf/2304.08485) accordingly. We name our model **LLaVAR**, signifying the LLaVA (Large Language and Vision Assistant) that can **R**ead. Compared to the original LLaVA, we also conducted experiments scaling the input resolution from $224^2$ to $336^2$ to better encode small textual details. Empirically, we report the results on four text-based VQA datasets following the evaluation protocol from [liu2023hidden](https://arxiv.org/pdf/2305.07895). Moreover, we apply GPT-4-based instruction-following evaluation to 30 natural images from COCO [lin2015microsoft](https://arxiv.org/pdf/1405.0312), [liu2023visual](https://arxiv.org/pdf/2304.08485) and 50 text-rich images from LAION [schuhmann2022laion](http://arxiv.org/pdf/2312.15897v1).
We also provide qualitative analysis (e.g., on posters, website screenshots, and tweets) to test more complex instruction-following skills. To sum up, our contributions are as follows:

- We collect 422K noisy instruction-following data and 16K high-quality instruction-following data. Both are shown to be effective in augmenting visual instruction tuning.
- Our model, LLaVAR, significantly enhances text understanding within images while slightly improving the model’s performance on natural images.
- The enhanced capability enables our model to provide end-to-end interactions based on various forms of online content that combine text and images.
- We open source the training and evaluation data together with the model checkpoints.
# Related Work

#### Instruction Tuning

Following natural language instructions is the key capability for an agent to interact with real-world users. Instruction tuning starts from collecting human-preferred feedback for human-written instructions [ouyang2022training](https://arxiv.org/pdf/2203.02155) or formulating multi-task training in a multi-task instruction-following manner [chung2022scaling](https://arxiv.org/pdf/2210.11416), [wang2022supernaturalinstructions](https://arxiv.org/pdf/2204.07705). However, large, capable instruction-tuned models are usually closed-sourced and serve as commercial APIs only. Recently, Alpaca [wang2022selfinstruct](https://arxiv.org/pdf/2212.10560), [alpaca](https://github.com/tatsu-lab/stanford_alpaca), Vicuna [vicuna2023](https://lmsys.org/blog/2023-03-30-vicuna/), and Baize [xu2023baize](https://arxiv.org/pdf/2304.01196) start the trend of generating high-quality instruction-following data based on LLMs such as GPT-3.5 / ChatGPT / GPT-4 and finetuning the open source LLaMA model [touvron2023llama](https://arxiv.org/pdf/2302.13971). However, evaluating the ability to follow instructions remains a challenge. While GPT-4 has demonstrated superior evaluation capabilities [liu2023geval](https://arxiv.org/pdf/2303.16634), there are still a number of concerns, such as bias toward response length [xu2023baize](https://arxiv.org/pdf/2304.01196) and lack of robustness to the order of examples [wang2023large](https://arxiv.org/pdf/2305.17926). Following [vicuna2023](https://lmsys.org/blog/2023-03-30-vicuna/), [liu2023visual](https://arxiv.org/pdf/2304.08485), [dubois2023alpacafarm](https://arxiv.org/pdf/2305.14387), we use GPT-4-based instruction-following evaluation in this work.

#### Multimodal Instruction Tuning

Recently, instruction tuning has been expanded to the multimodal setting, including image, video [zhang2023video](http://arxiv.org/pdf/2311.12919v2), [maaz2023videochatgpt](https://arxiv.org/pdf/2306.05424), and audio [Huang2023AudioGPTUA](http://arxiv.org/pdf/2108.04325v2), [zhang2023speechgpt](https://arxiv.org/pdf/2305.11000). For image-based instruction tuning, MiniGPT-4 [zhu2023minigpt4](https://arxiv.org/pdf/2304.10592) employs ChatGPT to curate and improve detailed captions for high-quality instruction-following data. LLaVA [liu2023visual](https://arxiv.org/pdf/2304.08485) generates multimodal instruction-following data by prompting text-only GPT-4 with captions and objects’ bounding boxes. LLaMA-Adapter [zhang2023llamaadapter](https://arxiv.org/pdf/2303.16199), [gao2023llamaadapterv2](https://arxiv.org/pdf/2304.15010) uses COCO data for text-image feature alignment and utilizes textual data only for instruction tuning. mPLUG-Owl [ye2023mplugowl](https://arxiv.org/pdf/2304.14178) combines more than 1000M image-text pairs for pretraining and a 400K mixture of text-only/multimodal instruction-following data for finetuning. However, according to [liu2023hidden](https://arxiv.org/pdf/2305.07895), most of these models struggle to accomplish tasks requiring OCR capability. InstructBLIP [dai2023instructblip](https://arxiv.org/pdf/2305.06500) transforms 13 vision-language tasks (including OCR-VQA [mishra2019ocrvqa](http://arxiv.org/pdf/2010.02582v1)) into the instruction-following format for instruction tuning. Cream [kim2023cream](https://arxiv.org/pdf/2305.15080) applies multi-task learning that includes predicting masked texts in images. A more comprehensive survey can be found in [li2023multimodal](li2023multimodal).
In this work, we select LLaVA as our baseline, which is the most data-efficient and powerful model, and demonstrate the effectiveness of our proposed pipeline.

# Data Collection

Starting from the LAION-5B [schuhmann2022laion](http://arxiv.org/pdf/2312.15897v1) dataset [^3], our goal is only to keep images that are text-rich. Considering that documents usually contain plenty of text, we first obtained a binary classification dataset by combining natural images and document data. Subsequently, we trained an image classifier using a DiT [2022DIT](https://doi.org/10.1145/3503161.3547911)-base backbone, which was fine-tuned on the RVL-CDIP dataset [harley2015evaluation](https://arxiv.org/pdf/1502.07058). Hopefully, such a classifier can predict whether an image contains text or not. We first build a subset by selecting images with a predicted probability greater than 0.8 while also satisfying $p($watermark$) < 0.8$ and $p($unsafe$) < 0.5$ [^4]. The derived subset is noisy due to the limitation of the classifier. To further clean up the data and incorporate human judgment,
we randomly sampled 50K images and clustered them into 100 clusters based on `CLIP-ViT-B/32` visual features. After inspecting the clustering results, we carefully select 14 clusters (see Figure [clusters] in the Appendix for examples) containing diverse text-rich images ranging from posters, covers, advertisements, infographics, educational materials, and logos. The cluster model is then used as the filter to collect images for constructing our instruction-following examples. As a reference, we provide a CLIP [radford2021learning](https://arxiv.org/pdf/2103.00020)-based categorization (see Appendix for details) to illustrate the distribution of images for both types of data we collected in Figure [fig:Data Collection]. We summarize our collected data and LLaVA’s data in Table 1.

#### Noisy Instruction-following Data

Using the clustering model as a filter, we collect 422K deduplicated images that belong to the 14 preferred clusters. To balance the examples from different categories, we keep at most 52K examples for one cluster. We run all images through PaddleOCR [^5]. Note that running OCR at the original resolution (e.g., $1024^2$) might recognize small fonts that are not visible by visual encoders like CLIP ViT (`\citealp{dosovitskiy2020image, radford2021learning}`{=latex}, resolution up to $336^2$). To ensure the recognition of visible fonts while maintaining OCR accuracy, we perform OCR on the image after downsampling (the short edge is resized to 384 pixels if longer than that) to extract the text. Then, based on the geometric relationships between the recognized words, we merge them into paragraphs before concatenating them. As a robust instruction-following model should react similarly to instructions with similar meanings, we reword “Identify any text visible in the provided image.” into ten distinct instructions (Table 3 in Appendix). We then create a single-turn conversation for a given image by **(i)** randomly sampling an ***input instruction*** and **(ii)** using recognized texts as the desired ***output response***. Such instruction-following data is noisy because of the relatively limited performance of OCR tools on diverse fonts and colorful backgrounds.
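A hedged sketch of this collection pipeline: classifier-score thresholds for pre-selection, k-means over `CLIP-ViT-B/32` features with manually chosen clusters, and the single-turn noisy conversation format. The thresholds mirror the text; the function names and record format are illustrative assumptions.

```python
import random
import numpy as np
from sklearn.cluster import KMeans

def prefilter(images, p_text, p_watermark, p_unsafe):
    # Keep candidates the document classifier marks as text-rich while filtering
    # likely watermarked / unsafe images (thresholds as stated in the text).
    return [im for im, pt, pw, pu in zip(images, p_text, p_watermark, p_unsafe)
            if pt > 0.8 and pw < 0.8 and pu < 0.5]

def cluster_filter(clip_features, keep_cluster_ids, n_clusters=100, seed=0):
    # Cluster CLIP visual features and keep only the manually selected text-rich clusters.
    kmeans = KMeans(n_clusters=n_clusters, random_state=seed).fit(clip_features)
    keep = np.isin(kmeans.labels_, list(keep_cluster_ids))
    return np.where(keep)[0], kmeans  # indices of retained images, fitted cluster model

def make_noisy_conversation(image_path, ocr_text, instruction_pool):
    # Single-turn sample: a randomly reworded "read the text" instruction paired
    # with the (noisy) merged OCR transcript as the target response.
    return {"image": image_path,
            "instruction": random.choice(instruction_pool),
            "response": ocr_text}
```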
| **Data** | **Image** | **Instruction** | **\# Conv** | **Avg Ins Len** | **Avg Res Len** |
|:---|:--:|:--:|:--:|:--:|:--:|
| LLaVA pretraining | CC3M | CC3M | 595K | 15.9 | 15.4 |
| R~pretraining~ (Ours) | LAION | PaddleOCR | 422K | 17.2 | 48.8 |
| LLaVA finetuning | COCO | GPT-4 | 158K | 15.9 | 93.1 |
| R~finetuning~ (Ours) | LAION | GPT-4 | 16K | 15.1 | 40.5 |

Summary of data statistics. R~pretraining~ and R~finetuning~ denote the additional pre-training / finetuning data we collected. The average instruction and response length are calculated after LLaMA tokenization.
#### GPT-4-based Instruction-following Data

Compared to high-quality instruction-following data, there are mainly two issues for the noisy data collected above. **(i)** Responses should contain organized sentences instead of raw OCR results with missing words and grammar errors. **(ii)** Instructions should be diverse, suitable and specific to the given image instead of monotonously asking for all visible texts. To address these issues, we follow [liu2023visual](https://arxiv.org/pdf/2304.08485) to generate instruction-following data by prompting text-only GPT-4 [openai2023gpt4](https://arxiv.org/pdf/2303.08774) with OCR results and captions.

It is challenging to prompt GPT-4 with fragmented OCR results in a few words to generate non-trivial instructions. To this end, we carefully select 4 of the 14 previously mentioned clusters (the 3rd, 4th, 6th and 9th clusters in Figure [clusters]) to collect images with enough visible and coherent sentences. As shown in Figure [fig:Data Collection], such filtering dramatically increases the percentage of book covers and quote images. We randomly selected 4K examples from each cluster (no overlap with images used for noisy instruction-following data), yielding a total of 16K images. Following prior work [wang2022selfinstruct](https://arxiv.org/pdf/2212.10560), [alpaca](https://github.com/tatsu-lab/stanford_alpaca), [liu2023visual](https://arxiv.org/pdf/2304.08485), we provide the visualization of verb-noun pairs for instructions generated by GPT-4 in Appendix Figure 7. For those instructions without a verb-noun pair, we demonstrate the frequency of objects being asked in Appendix Figure 8.

Furthermore, based on the system message and two in-context few-shot examples (shown in the Appendix), we ask GPT-4 to generate conversational data based on OCR results and image captions (Figure [fig:highquality]). The generated questions are used as ***input instructions***, and answers are used as ***output responses***. Concretely, for a given image, we first provide two OCR results from EasyOCR and PaddleOCR, which can complement each other. To illustrate visual elements other than texts within the image, we also provide the result of BLIP-2 image captioning [li2023blip2](https://arxiv.org/pdf/2301.12597). To prevent the caption from focusing on the text, we use OCR bounding boxes to mask the text and then use inpainting [telea2004image](telea2004image) to refill the mask before generating captions. Note that captioning models might suffer from hallucinations [rohrbach2018object](rohrbach2018object). We mention this unreliability in our system message and ask GPT-4 only to generate questions with sure answers. We leave the generation of more detailed captions [rotstein2023fusecap](https://arxiv.org/pdf/2305.17718), [hu2022promptcap](https://arxiv.org/pdf/2211.09699) for future work.
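The prompt assembly could look roughly like the sketch below, combining the two OCR outputs and the text-masked BLIP-2 caption. The wording of the preamble is an assumption (the paper's actual system message and few-shot examples are not reproduced here), and `call_gpt4` stands in for whatever API client is used.

```python
def build_prompt(easyocr_text, paddleocr_text, caption):
    # Illustrative prompt text only; not the paper's exact system message.
    return (
        "You are given noisy OCR results and a caption of an image.\n"
        "The caption may be unreliable; only ask questions whose answers you are sure about.\n\n"
        f"OCR (EasyOCR): {easyocr_text}\n"
        f"OCR (PaddleOCR): {paddleocr_text}\n"
        f"Caption: {caption}\n\n"
        "Generate a multi-turn conversation of question-answer pairs about this image."
    )

def generate_conversation(example, call_gpt4):
    # `example` holds the per-image signals; `call_gpt4` is a hypothetical API wrapper.
    prompt = build_prompt(example["easyocr"], example["paddleocr"], example["caption"])
    return call_gpt4(prompt)  # the generated Q&A turns become instructions and responses
```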
# Model Architecture and Training #### Architecture In most of our study, we use the same model architecture as LLaVA. For the visual encoder $V$, we use `CLIP-ViT-L/14` for $224^2$ resolution and `CLIP-ViT-L/14-336` for $336^2$ resolution. The grid features before the last transformer layer are then transformed into the word embedding space of the language decoder through a trainable projection matrix $W$. We use Vicuna-13B [vicuna2023](https://lmsys.org/blog/2023-03-30-vicuna/), a LLaMA-based [touvron2023llama](https://arxiv.org/pdf/2302.13971) instruction-tuned language model, as the language decoder $D$, except for the ablation study in Table [table: ablation on encoder/image]. In Section 5.1.0.3 and Appendix , we extend the current architecture by adding an extra high-resolution (high-res) visual encoder. Such a high-res encoder outputs thousands of patch features, which means that the transformed features and instruction tokens cannot fit in the context length of the language decoder. To this end, we propose to add cross-attention modules to the decoder, which attend to key-value pairs transformed from the high-res patch features. #### Training We follow the two-stage training design of LLaVA (Figure [fig:training]). The training objectives of both stages are the same: generate ***output responses*** (*$<$res$>$*) for the ***input instructions*** (*$<$ins$>$*). The transformed image tokens (*$<$img$>$*) are added before or after the first input instruction. **(i)** During the first pre-training stage, only the projection matrix $W$ is trained for feature alignment. Since the decoder $D$ is frozen, training tolerates noisy data. In the pre-training stage, we combine the 595K pre-training data from LLaVA with our 422K noisy instruction-following data. **(ii)** Both the projection matrix $W$ and the language decoder $D$ are trained during the finetuning stage, where we merge our 16K instruction-following data into the 158K instruction-following data from LLaVA as the training set. Note that the visual encoder is frozen throughout the training period, which might restrict text recognition performance, as CLIP is trained for general-purpose text-image alignment. Better choices of the visual encoder [tschannen2022clippo](https://arxiv.org/pdf/2212.08045) or CLIP-ViT finetuning [ye2023mplugowl](https://arxiv.org/pdf/2304.14178) may further benefit the visual understanding capability, which we leave for future work. # Experiments We use the same training hyperparameters as LLaVA[^6], except that **(i)** we set the maximum sequence length to 1024 during pre-training, and **(ii)** we first pad any given image to a square shape before resizing it to the desired input size, preventing some image content from being cropped during preprocessing. For both resolutions ($224^2$, $336^2$), we reproduce the original LLaVA for a fair comparison. The GPT-4 model used in this work refers to the `gpt-4-0314` version, while the cost to collect finetuning data is around \$300. The temperature used to sample GPT-4 is set to $1.0$ for the generation of training data, $0.7$ for the generation of evaluation data, and $0.2$ for the evaluation based on GPT-4. All experiments are run on NVIDIA A100 80GB GPUs. During the evaluation, the temperature used to sample from our model is set at $0.9$ for text-based VQA, $0.7$ for GPT-4-based instruction-following evaluation, and $0.2$ for other qualitative demonstrations.
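As a rough sketch of the connector described in the architecture above (a single trainable projection between CLIP grid features and the decoder's embedding space), consider the following PyTorch module. The class name is ours, and the feature widths (1024 for CLIP-ViT-L/14, 5120 for Vicuna-13B) are assumed common configurations rather than values taken from the paper's code.

```python
import torch
import torch.nn as nn

class GridFeatureProjector(nn.Module):
    """Sketch of the connector: grid features from the CLIP ViT (taken before the
    last transformer layer) are mapped by a single trainable matrix W into the
    word-embedding space of the language decoder and placed next to the
    instruction token embeddings."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 5120):
        super().__init__()
        # 1024 = CLIP-ViT-L/14 width, 5120 = Vicuna-13B hidden size (assumed values)
        self.W = nn.Linear(vision_dim, llm_dim, bias=False)

    def forward(self, grid_features: torch.Tensor, instruction_embeds: torch.Tensor) -> torch.Tensor:
        # grid_features: (batch, num_patches, vision_dim); instruction_embeds: (batch, n_text, llm_dim)
        img_tokens = self.W(grid_features)
        # image tokens go before (or after) the first instruction
        return torch.cat([img_tokens, instruction_embeds], dim=1)

# A 336^2 input with 14x14 patches yields 24 * 24 = 576 image tokens
inputs = GridFeatureProjector()(torch.randn(1, 576, 1024), torch.randn(1, 32, 5120))
print(inputs.shape)  # torch.Size([1, 608, 5120])
```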
| | **Res.** | **ST-VQA** | **OCR-VQA** | **TextVQA** | **DocVQA** |
|:---|:--:|:--:|:--:|:--:|:--:|
| BLIP-2 $\dagger$ | | 21.7 | 30.7 | 32.2 | 4.9 |
| OpenFlamingo $\dagger$ | | 19.3 | 27.8 | 29.1 | 5.1 |
| MiniGPT4 $\dagger$ | | 14.0 | 11.5 | 18.7 | 3.0 |
| LLaVA $\dagger$ | | 22.1 | 11.4 | 28.9 | 4.5 |
| mPLUG-Owl $\dagger$ | | 29.3 | 28.6 | 40.3 | 6.9 |
| LLaVA $\ddagger$ | $224^2$ | 24.3 | 10.8 | 31.0 | 5.2 |
| LLaVAR | $224^2$ | 30.2 | 23.4 | 39.5 | 6.2 |
| LLaVA $\ddagger$ | $336^2$ | 28.9 | 11.0 | 36.7 | 6.9 |
| LLaVAR | $336^2$ | 39.2 | 23.8 | 48.5 | 11.6 |
| | **ST-VQA** | **OCR-VQA** | **TextVQA** | **DocVQA** |
|:---|:--:|:--:|:--:|:--:|
| (1) LLaVA | 28.9 | 11.0 | 36.7 | 6.9 |
| (2) LLaVA + R~pretraining~ | 36.7 | 26.1 | 46.5 | 9.6 |
| (3) LLaVA + R~finetuning~ | 34.1 | 21.6 | 43.6 | 9.5 |
| (4) LLaVA + C~pretraining~ | 35.4 | 27.0 | 45.6 | 9.2 |
| (5) LLaVA + N~finetuning~ | 34.1 | 25.9 | 43.3 | 10.2 |
| (6) LLaVAR | 39.2 | 23.8 | 48.5 | 11.6 |
| | **CLIP Res.** | **Extra Enc.** | **R~pretraining~** | **ST-VQA** | **OCR-VQA** | **TextVQA** | **DocVQA** |
|:---|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
| (a) | $224^2$ | | Low | 28.9 | 25.6 | 37.8 | 6.2 |
| (b) | $336^2$ | | Low | 37.4 | 31.0 | 45.7 | 11.4 |
| (c) | $224^2$ | | High | 28.9 | 24.9 | 35.8 | 6.2 |
| (d) | $336^2$ | | High | 36.9 | 30.6 | 46.5 | 11.7 |
| (e) | $224^2$ | ✓ | Low | 31.8 | 30.9 | 41.2 | 8.1 |
| (f) | $224^2$ | ✓ | High | 33.5 | 29.8 | 42.2 | 10.2 |
| (g) | $224^2$ | ✓ | High ($2\times$) | 35.8 | 30.7 | 45.6 | 15.3 |
Question: Who wrote this book?
GT Sandra Boynton
(1) The book was written by Sandra Boynton.
(2) The book was written by Sandra Boyton. It is a children’s book that features a family calendar with an accompanying activity guide. The book provides information (...) The cover of the book displays a picture of the family from the calendar, which helps to visually represent the topic of the book.
(3) The book is written by Sandra Byington.
(4) The book was written by Sandra Boynton.
(5) With 500+ Nity StickerS &aSwEllMagnetic Phone List! Moms 291573016171800 FAMII (Pocket) Calendar (...) 2016 SANDRA BOYNTON
(6) The book was written by Sandra Boynton.
Ablation study based on an example from OCR-VQA. GT refers to ground truth, and (1) - (6) are different model variants from Table [table: VQA ablation]. We replace the excessively long response with (...).
## Quantitative Analysis #### Text-based VQA Following the evaluation protocol in [liu2023hidden](https://arxiv.org/pdf/2305.07895), we evaluate the performance of LLaVAR on four text-based VQA datasets: ST-VQA [STVQA](https://doi.org/10.1109/icdar.2019.00251), OCR-VQA [mishra2019ocrvqa](http://arxiv.org/pdf/2010.02582v1), TextVQA [textvqa](https://doi.org/10.1109/cvpr.2019.00851), and DocVQA [mathew2020docvqa](https://arxiv.org/pdf/2007.00398), representing various domains (see Appendix for more details and Appendix for more datasets). We present the results of the baseline models and our models in Table [table: VQA result]. Note that InstructBLIP includes OCR-VQA in its training sets, making it incomparable with our settings. In both resolution settings and all four datasets, LLaVAR substantially improves the LLaVA baseline, demonstrating that our collected data can bring about a robust improvement. Furthermore, the improvement is more significant in $336^2$ resolution compared to $224^2$, indicating that the collected data might bring a greater improvement at even higher resolutions. Our best model, $336^2$-based LLaVAR, performs best in 3 out of 4 evaluated datasets. Note that this is not a fair comparison. Some key factors include different language decoders, resolutions, and magnitudes of text-image training data. We provide more discussions on the comparison with mPLUG-Owl and the result of finetuning mPLUG-Owl using our data in Appendix . #### Ablation Study on pretraining/finetuning data We report the result in Table [table: VQA ablation] and Figure 1. **(i)** Based on variants (2) and (3), we find that the collected data can benefit the pretraining stage (R~pretraining~) and finetuning stage (R~finetuning~) separately while being complementary to each other in most cases [^7]. More importantly, enhancing the pretraining stage alone achieves the second-best overall performance, indicating the potential to boost textual detail understanding without dependence on GPT-4-generated high-quality data. **(ii)** Using pretraining images, we obtain C~pretraining~ by replacing the pretraining instructions with questions & captions, the same pattern as LLaVA. As variant (4) is not as good as (2), we can conclude that OCR is more advantageous than captions. **(iii)** We further validate the value of GPT-4 generated data by generating noisy finetuning data (N~finetuning~), similar to pretraining data. Variant (5) achieves comparable accuracy as variant (3). However, as shown in Figure 1, such noisy finetuning data hurts the instruction-following capability: (5) responds with all recognized texts while ignoring the questions. #### Ablation Study on encoders/image resolution [Ablation: encoders/res] While keeping finetuning data the same, we report the quantitative results of adding an extra visual encoder and varying the pretraining data in Table [table: ablation on encoder/image]. **(i)** Take `Pix2Struct-base` as an example, we find that adding an extra high-res visual encoder with cross-attention indeed improves the performance ((g) vs. (a)), especially achieving the best zero-shot performance on DocVQA (15.3% accuracy). The performance gain on other datasets is relatively limited, probably due to the extra encoder we use being pretrained on web pages instead of natural images. On the other hand, the performance of (e) and (f) remains poor, without doubling the number of high-res examples in R~pretraining~. 
Given the larger number of parameters initialized in the cross-attention module, they may be underfitting when trained on the same data as the projection matrix $W$ (e.g., (e) vs. (b)), similar to the finding in [zeng2023matters](zeng2023matters). **(ii)** Considering (c) vs. (a) and (d) vs. (b), while the images are resized to the same size after preprocessing, high-res OCR results turn out to be not necessarily better than the low-resolution version, suggesting the capability of the visual encoder is almost saturated in (a) and (b). For more details and results on the extra high-res encoder, please refer to Appendix . #### GPT-4-based instruction-following evaluation We also report the GPT-4 evaluation results on instruction-following questions in Table 2. **(i)** **Natural Images**: 90 questions based on 30 COCO validation images from [liu2023visual](https://arxiv.org/pdf/2304.08485), including three aspects: conversation, detail description, and complex reasoning. This aims to test whether our collected data will hurt, maintain, or improve the model’s performance on natural images. First of all, using a higher resolution brings improvement (+2.9) in the performance of detail description, which is intuitive. Furthermore, LLaVAR achieves a better trade-off and increases the performance of all three aspects (+1.6 on average). More details are in Appendix . **(ii)** **Text-Rich Images**: Similar to collecting the finetuning data, we leverage 50 text-rich images from LAION to collect instruction-following questions based on OCR results and human-annotated captions. We then collect responses from our trained model and use GPT-4 to calculate the relative score w.r.t GPT-4 responses. We add this as an extra dimension “**Read**” to Table 2, where our model demonstrates a significant (+3.8) improvement. The Appendix provides an example in Table 11. ## Qualitative Analysis We use a recent movie poster [^8] to demonstrate the difference between LLaVA and LLaVAR when interacting with humans based on text-rich images. LLaVA, without augmenting textual understanding within images, suffers from hallucination when answering these questions. Some mentioned movies, like “A Man Called Ove” and “The Ugly Truth”, are real movies, suggesting that the language decoder is hallucinating its internal knowledge, while the visual encoder cannot encode helpful information. Alternatively, LLaVAR can correctly answer many of the provided questions with **faithful** information, which is clearly grounded in the image. However, some limitations remain, such as the spelling error “ottol” (We provide more statistics related to such spelling errors in Appendix ). Also, the final question asks for information that is not observable from the given poster, where an expected response should express such uncertainty instead of giving concrete answers. Nevertheless, neither model correctly answers the question.
| | **Res** | **Conversation** | **Detail** | **Complex** | **Read** |
|:-----------------|:-------:|:----------------:|:----------:|:-----------:|:--------:|
| LLaVA (Original) | $224^2$ | 83.1 | 75.3 | 96.5 | \- |
| LLaVA | $336^2$ | 83.9 | 78.2 | 95.3 | 87.9 |
| LLaVAR | $336^2$ | 84.5 | 78.9 | 96.5 | 91.7 |

Relative scores (w.r.t. text-only GPT-4) for instruction-following questions, where the first three dimensions are based on natural images and the last dimension (“Read”) is based on text-rich images. In the first row, we show the original results ($224^2$-based) fetched from [liu2023visual](https://arxiv.org/pdf/2304.08485). We report our reproduced LLaVA at $336^2$ resolution for a fair comparison.
H: When will this movie be released?
L: The movie "The Call of the Ottoman" starring Daniel Craig is expected to be released on February 14th, 2023.
R: The movie will be released on January 13.
H: Who starred in the movie?
L: The movie stars Tom Cruise and Cobie Smulders.
R: Tom Hanks starred in the movie.
H: What is this movie about?
L: The movie "A Man Called Ove" is a comedy-drama film directed by Hannes Holm. (...)
R: The movie is about Tom Hanks, who is called Ottol, and it is an international bestseller based on a true story. (...)
H: Can you describe the given image?
L: The image is a movie poster featuring a man and a cat. The man, who appears to be a character from the movie “The Ugly Truth,” (...)
R: The image features a man standing in front of a house with a cat beside him.
H: Who directed the film?
L: The film was directed by Florian Henckel von Donnersmarck.
R: The film was directed by Tom Hanks, who made his directorial debut with this movie.
Qualitative Comparison. We color the human inputs (H) in blue, LLaVA’s responses (L) in pink, and LLaVAR’s responses (R) in green. Both are $336^2$-based models. We replace the excessively long response with (...).
Case study of the recognizable font size, in which the x-axis refers to the height of ground truth answers in the image and the y-axis stands for the answer accuracy of models. We plot the results for both $224^2$-based models and $336^2$-based models.
## Case Study: Recognizable Font Size We first collect 825 examples from OCR-VQA, which have answers directly presented in the image and detectable by the OCR tool. By rescaling the images, we test the model’s performance in answering these questions while the vertical heights of the answers range from 3 pixels to 19 pixels. We report the result in Figure 3. **(i)** The baseline model LLaVA struggles to provide correct answers in all scenarios, for both the $224^2$-based and $336^2$-based versions. **(ii)** Our model LLaVAR achieves significantly better results at all scales. We observe a threshold for recognizable texts for both the $224^2$-based and $336^2$-based versions, as the accuracy sharply decreases when the height is smaller than 7 pixels. More interestingly, the $224^2$-based version achieves better performance on small texts with a height of 3 pixels, while the $336^2$-based version achieves better performance on texts taller than 7 pixels. We assume the extra training stage of CLIP $336^2$ makes it better at the larger scale but worse at the smaller scale. ## Transferred Instruction-following Capability According to the dataset statistics (Table 1) and the visualization (Figure 7), our collected instruction-following data is not as diverse and substantial as LLaVA’s. This can be attributed to the relatively limited information given to GPT-4 compared to the five different human-written captions used in LLaVA. The content of text-rich images is also less diverse than that of natural images. While using more complex in-context examples can certainly help generate more complex instruction-following examples, it can also multiply the cost. In Appendix Figure 6, we demonstrate the transferred instruction-following capability of LLaVAR, potentially from both the LLaVA data and the Vicuna backbone. While the extra data we add mainly focuses on understanding the visible texts within images, LLaVAR manages to build its reasoning, writing, and elaboration skills on top of its text recognition capability in an end-to-end manner. This allows users to interact with various online content based on simple screenshots. # Conclusion In this work, we enhance visual instruction-tuned models in terms of their capability to read texts in images. Using text-rich images from the LAION dataset, we collect 422K noisy instruction-following examples using OCR results only and 16K high-quality instruction-following examples based on text-only GPT-4. These two sets of data are leveraged to augment the pretraining stage and finetuning stage of LLaVA accordingly. Our model, LLaVAR, demonstrates superior performance in understanding texts within images and following human instructions on both prior benchmarks and real-world online content. Moreover, our analysis shows that the same augmented data is more effective with higher resolution. Additionally, using noisy instruction-following examples to augment pretraining essentially boosts the model performance without prompting GPT-4. For future work, we encourage exploration of **(i)** better image selection criteria or domain reweighting strategies [xie2023doremi](https://arxiv.org/pdf/2305.10429) and **(ii)** more data-efficient and computation-efficient ways to augment instruction-following models with multimodal capability, especially in the high-res scenario.
| **Instructions** |
|:-----------------------------------------------------------------------|
| Identify any text visible in the image provided. |
| List all the text you can see in the given image. |
| Enumerate the words or sentences visible in the picture. |
| Describe any readable text present in the image. |
| Report any discernible text you see in the image. |
| Share any legible words or sentences visible in the picture. |
| Provide a list of texts observed in the provided image. |
| Note down any readable words or phrases shown in the photo. |
| Report on any text that can be clearly read in the image. |
| Mention any discernable and legible text present in the given picture. |

Ten instructions asking for OCR results.
# A [CLIP-based categorization] #### CLIP-based categorization Based on the observation of the selected clusters, we divide the images used into 8 categories. For each category, we use one or multiple words as labels.

- **Quote & Meme**: “quote”, “internet meme”.
- **Poster**: “movie poster”, “podcast poster”, “TV show poster”, “event poster”, “poster”.
- **Book Cover**: “book cover”, “magazine cover”.
- **Game Cover**: “game cover”.
- **Ad & Product Packaging**: “ad”, “advertisement”, “food packaging”, “product packaging”.
- **Infographic**: “chart”, “bar chart”, “pie chart”, “scatter plot”.
- **Educational Material**: “exam paper”, “quiz”, “certificate”, “book page”.
- **Logo**: “logo”.

For each word, we use the following templates to achieve embedding-space ensembling [radford2021learning](https://arxiv.org/pdf/2103.00020):

- “a photo of a {}.”
- “a blurry photo of a {}.”
- “a black and white photo of a {}.”
- “a low contrast photo of a {}.”
- “a high contrast photo of a {}.”
- “a bad photo of a {}.”
- “a good photo of a {}.”
- “a photo of a small {}.”
- “a photo of a big {}.”

For each image, we calculate the similarity between the image and all words mentioned above using `CLIP-ViT-L/14`. If the highest similarity is less than $0.15$, we classify the image into **Other**; otherwise, we classify it into the “super class” (e.g., **Poster**) of the word (e.g., “movie poster”) with the highest similarity.

# B [context prompt] #### System Message adapted from [liu2023visual](https://arxiv.org/pdf/2304.08485).
You are an AI visual assistant, and you are seeing a single image. What you see is provided with two OCR results and one image caption describing the information within the same image you are looking at. Image captions might include hallucinations, while OCR results are more accurate. Answer all questions with definite answers as you are seeing the image. Design a conversation between you and a person asking about this photo. The answers should be in a tone that a visual AI assistant is seeing the image and answering the question. Ask diverse questions and give corresponding answers. Include questions asking about the visual content of the image (e.g., the man, the sunset, the ocean.) and the texts contained in the image. Only include questions that have definite answers: \(1\) one can see the content in the image that the question asks about and can answer confidently; \(2\) one can determine confidently from the image that it is not in the image. Do not ask any questions that cannot be answered confidently; \(3\) DO NOT mention OCR or image caption in your questions and answers; \(4\) DO NOT ask about information from captions while it looks unrelated to or contradicts OCR results. Also include complex questions that are relevant to the content in the image, for example, asking about background knowledge of the texts in the image, asking to discuss about the design of the image, etc. Again, do not ask about uncertain details. Provide detailed answers when answering complex questions. For example, give detailed examples or reasoning steps to make the content more convincing and well-organized. You can include multiple paragraphs if necessary.
#### Few-shot Examples
**Human:** WHAT F YOUR BLESSINGS COME THROUGH RAINDROPS PDf Induded LAURASTOR Y DEVOHONAL BASED ON THE GRAAAMY VARD WIN ISONG mAT HAS INSEIRED MILLION WHATIFYOUR BLESSINGS COMETHROUGHRAINDROPS PDF Included LAURASTORY A DEVOTIONAL BASEDON THE GRAMMY AWARD WINNI PISONGTHATHASINSPIREDMILLIONSE a girl is standing in a field with a rainbow **GPT:** Question: What is the name of the devotional mentioned in the image? Answer: The devotional is called “What If Your Blessings Come Through Raindrops” by Laura Story. Question: What is special about the song mentioned in the image? Answer: The song is special because it’s a Grammy Award-winning song that has inspired millions of people. **Human:** One of the hardest things in life to accept is a called third strike Robert Frost te One of the hardest things in life to accept is a called third strike Robert Frost quotefancy a close up of a baseball glove **GPT:** Question: Why is the third strike hard to accept? Explain the quote to me. Answer: The quote “One of the hardest things in life to accept is a called third strike”, attributed to Robert Frost, is a metaphor for life situations. In baseball, a called third strike is when the umpire determines that a pitch was a strike, but the batter did not swing. This can be frustrating for the batter because they missed an opportunity to hit the ball due to either a lack of confidence or misjudgment. In life, this metaphor refers to difficult moments where we might miss opportunities or face setbacks because of our own inaction or hesitation. The quote suggests that it is hard to accept these missed opportunities or challenges, especially when they result from our own choices or lack of action.
# C [Evaluation Dataset] Details of evaluation datasets used in the main paper:

- ST-VQA [STVQA](https://doi.org/10.1109/icdar.2019.00251) contains 31,791 questions that require understanding the scene text, based on images from COCO [lin2015microsoft](https://arxiv.org/pdf/1405.0312), Visual Genome [krishna2016visual](https://arxiv.org/pdf/1602.07332), ImageNet [imagenet](https://doi.org/10.1109/CVPR.2009.5206848), etc.
- TextVQA [textvqa](https://doi.org/10.1109/cvpr.2019.00851) contains 45,336 questions that need reading and reasoning about the text in images to answer, based on images from OpenImages [openimages](http://arxiv.org/pdf/1809.05929v7).
- OCR-VQA [mishra2019ocrvqa](http://arxiv.org/pdf/2010.02582v1) contains more than 1 million questions asking about information from book cover images [iwana2016judging](https://arxiv.org/pdf/1610.09204).
- DocVQA [mathew2020docvqa](https://arxiv.org/pdf/2007.00398) contains 50,000 questions based on document images.

Details of extra datasets in the Appendix:

- CT80 [risnumawan2014robust](risnumawan2014robust) contains 80 images for curved text OCR evaluation. The formats of the questions are: (1) “What is written in the image?” for English words; (2) “What is the number in the image?” for digit strings.
- POIE [kuang2023visual](kuang2023visual) contains 3,000 camera images collected from the Nutrition Facts labels of products, together with 111,155 text instances. The format of the questions is “What is {entity name} in the image?”.
- ChartQA [masry2022chartqa](masry2022chartqa) includes 4,804 charts with 9,608 human-written questions.

# D [other metrics] #### Results of other metrics
| | Res. | METEOR | ROUGE-L | CIDEr |
|:-------|:--------|:------:|:-------:|:-----:|
| LLaVA | $224^2$ | 7.0 | 8.2 | 15.3 |
| LLaVAR | | 10.0 | 11.4 | 24.5 |
| LLaVA | $336^2$ | 8.4 | 9.9 | 19.1 |
| LLaVAR | | 12.8 | 14.3 | 30.9 |

Results on ST-VQA using text-matching metrics.
| | Res. | METEOR | ROUGE-L | CIDEr |
|:-------|:--------|:------:|:-------:|:-----:|
| LLaVA | $224^2$ | 8.7 | 10.5 | 12.2 |
| LLaVAR | | 12.5 | 14.9 | 21.4 |
| LLaVA | $336^2$ | 9.9 | 12.1 | 15.3 |
| LLaVAR | | 14.8 | 17.4 | 27.0 |

Results on TextVQA using text-matching metrics.
| | Res. | METEOR | ROUGE-L | CIDEr |
|:-------|:--------|:------:|:-------:|:-----:|
| LLaVA | $224^2$ | 0.2 | 0.1 | 0.0 |
| LLaVAR | | 0.3 | 0.1 | 0.0 |
| LLaVA | $336^2$ | 0.3 | 0.1 | 0.0 |
| LLaVAR | | 0.2 | 0.1 | 0.0 |

Results on OCR-VQA using text-matching metrics.
| | Res. | METEOR | ROUGE-L | CIDEr |
|:-------|:--------|:------:|:-------:|:-----:|
| LLaVA | $224^2$ | 3.8 | 4.8 | 6.3 |
| LLaVAR | | 5.6 | 6.9 | 12.7 |
| LLaVA | $336^2$ | 4.6 | 5.6 | 8.7 |
| LLaVAR | | 8.6 | 10.0 | 21.5 |

Results on DocVQA using text-matching metrics.
The metric used for text-based VQA in the main paper is the standard practice in VQA benchmarks [VQA](VQA). For ST-VQA and DocVQA, previous works use ANLS (Average Normalized Levenshtein Similarity) as the metric [STVQA](https://doi.org/10.1109/icdar.2019.00251), [mathew2020docvqa](https://arxiv.org/pdf/2007.00398), which calculates the average normalized edit distance and only works for supervised models trained to output short and precise answers. It works poorly for instruction-following models that usually output long sequences instead of brief answers. For reference, we provide more text-matching metrics (METEOR [banerjee-lavie-2005-meteor], ROUGE-L [lin-2004-rouge], CIDEr [vedantam2014cider]) to demonstrate the improvement of our model (Tables 4, 5, 6, 7); these work well except for OCR-VQA. We assume these metrics are not valuable for OCR-VQA since the ground truth answers are usually too short. # E [Extra datasets] #### Results on extra datasets
| | **Res.** | **CT80** | **POIE** | **ChartQA** |
|:---|:--:|:--:|:--:|:--:|
| BLIP-2 $\dagger$ | | 80.9 | 2.5 | 7.2 |
| OpenFlamingo $\dagger$ | | 67.7 | 2.1 | 9.1 |
| MiniGPT4 $\dagger$ | | 57.3 | 1.3 | 4.3 |
| LLaVA $\dagger$ | | 61.1 | 2.1 | 7.3 |
| mPLUG-Owl $\dagger$ | | 81.9 | 3.3 | 9.5 |
| LLaVA $\ddagger$ | $224^2$ | 61.5 | 1.9 | 9.2 |
| LLaVAR | $224^2$ | 81.6 | 5.7 | 10.2 |
| LLaVA $\ddagger$ | $336^2$ | 64.9 | 2.5 | 10.2 |
| LLaVAR | $336^2$ | 83.0 | 8.7 | 13.5 |
In Table [table: extra VQA result], we provide results on three extra datasets: CT80 (OCR; [risnumawan2014robust](risnumawan2014robust)), POIE (Information Extraction; [kuang2023visual](kuang2023visual)), and ChartQA [masry2022chartqa](masry2022chartqa). We use the same VQA metric as for the other text-based VQA datasets. We observe trends similar to the main results: LLaVAR data significantly improves over the LLaVA baseline, usually more considerably at a higher resolution. # F [Finetune mPLUG] #### Comparison with mPLUG-Owl We find that LLaVAR usually performs comparably to mPLUG-Owl at the same $224^2$ resolution. We further clarify the setting differences between mPLUG-Owl and ours: mPLUG-Owl is trained on 1000M+ text-image pairs, while the original LLaVA is trained on about 0.6M text-image pairs. Our model, LLaVAR, is trained on about 1M text-image pairs. Within the same resolution, LLaVAR demonstrates good performance with decent data efficiency. We presume that training on large-scale non-OCR data improves OCR performance, as many of the captions in the LAION dataset are equivalent to incomplete OCR results (texts in an online image will sometimes appear in its caption). At the scale of our experiments, we observe a similar effect: just training on captions of text-rich images can help text recognition capability; in Table [table: VQA ablation], variant (4) is better than variant (1). However, training on captions only (variant (4)) is not as good as training on OCR-based data (variants (2)(6)), at least at the scale of our experiments. #### Results of finetuning mPLUG-Owl To further validate the effectiveness of our collected data, we provide the results of finetuning mPLUG-Owl using our 16K GPT-4-based instruction-following data in Table 8. Though the mPLUG-Owl checkpoint is extensively trained on 1000M+ text-image pairs, we find that our data can boost performance in most cases, demonstrating the effectiveness of our data.
| | ST-VQA | OCR-VQA | TextVQA | DocVQA | CT80 | POIE | ChartQA |
|:----------------|:------:|:-------:|:-------:|:------:|:----:|:----:|:-------:|
| mPLUG-Owl | 29.3 | 28.6 | 40.3 | 6.9 | 81.9 | 3.3 | 9.5 |
| mPLUG-Owl~ours~ | 29.6 | 31.2 | 40.8 | 7.0 | 84.7 | 3.7 | 10.2 |

Results (accuracy %) of finetuning mPLUG-Owl. mPLUG-Owl~ours~ denotes mPLUG-Owl finetuned on our 16K GPT-4-based instruction-following data.
# G [ScienceQA section] #### ScienceQA Results Starting from our pretrained LLaVAR ($336^2$-based, without finetuning), we also report the results of further finetuning on the ScienceQA dataset [lu2022learn](https://arxiv.org/pdf/2209.09513) in Table [table:scienceqa], which is a multimodal multi-choice QA dataset covering diverse domains. Our motivation is that some images in this dataset contain text descriptions and tables that require textual understanding within images. The LLaVAR model finetuned on ScienceQA achieves an average accuracy of 91.42%, better than LLaVA (90.92%), while the most considerable improvement comes from natural science questions (+1.43%). # H [High-Res section]
Illustration of the dual visual encoder system. Given an image, it is simultaneously processed by visual encoders V1 and V2. V1 features are transformed by the transformation matrix W and directly used as input embeddings to the language model. V2 features are transformed by the transformation matrices K and V and used as keys and values to calculate the cross-attention in every transformer layer (assuming there are N layers), which uses the transformed hidden states (through Q) from the self-attention module as queries. For the language decoder D, the input is image tokens (<img>) and instruction tokens (<ins>), while the target is response tokens (<res>).
The original version of LLaVAR only supports up to $336^2$ resolution, while our case study has also shown the threshold for the recognizable font size. Both suggest the difficulty of processing real-world high-res images without scaling and cutting. To this end, we test a dual visual encoder system for the high-res variant of LLaVAR, where a high-res visual encoder is added to work with the standard one. Ideally, the standard visual encoder extracts general, high-level information, while the high-res one specifically helps with detailed information. #### Architecture A high-res visual encoder usually outputs thousands of visual features. Simply following LLaVA to feed the transformed visual features into the context of the language decoder is impractical, as the maximum sequence length of the language decoder is usually 2048/4096. To this end, we propose to handle high-res visual features by cross-attention module and standard visual features by feature transformation. We depict the proposed system in Figure 4. Specifically, given a standard visual encoder $V_1$, the extracted features are transformed into the word embedding space of the language decoder through a trainable projection matrix $W$. These transformed features are then concatenated with the word embeddings to build the input embeddings of the language decoder $D$. $$\begin{aligned} \begin{split} & \mathrm{emb}(\langle \mathrm{img}_1\rangle), \cdots, \mathrm{emb}(\langle \mathrm{img}_m \rangle) = WV_1(I) \\ \mathrm{input}\_\mathrm{emb} = \mathbf{concat}([ & \mathrm{emb}(\langle \mathrm{img}_1\rangle), \cdots, \mathrm{emb}(\langle \mathrm{img}_m \rangle), \mathrm{emb}(\langle \mathrm{ins}_1\rangle), \cdots, \mathrm{emb}(\langle \mathrm{ins}_n \rangle)]) \end{split} \end{aligned}$$ where $I$ is the input image, $V_1$ denotes extracting the grid features before the last transformer layer. At the same time, we use the high-res visual encoder $V_2$ to extract high-res visual features, which are then transformed into keys/values as the inputs of the cross-attention module in transformer layers. Given $h^j$ as the hidden state before the cross-attention module in layer $j$, $$\begin{aligned} \begin{split} & \mathrm{CrossAttention}(h, V_2, I) = \mathrm{softmax}(\frac{Q^jh^j(K^jV_2(I))^T}{\sqrt{d}})V^jV_2(I) \end{split} \end{aligned}$$ where $Q^j, K^j, V^j$ denotes the query/key/value projection matrix in the $j$-th transformers layer. In practice, there is a pre-attention LayerNorm before calculating the attention and another output projection matrix $O^j$ to project the aggregated values back to the hidden space. As the pretrained language decoder $D$ might only have self-attention modules, we manually add another cross-attention module after the original self-attention module in every transformer layer. Considering the random initialization of cross-attention modules might hurt the original language generation capability, we initialize the value projection matrix $V^j$ as a zero matrix and the output projection matrix $O^j$ as an identity matrix. #### Implementation We use `CLIP-ViT-L/14` as the standard visual encoder. For the high-resolution encoder, we test two models: **(i)** `Pix2Struct-base` [lee2022pix2struct](https://arxiv.org/pdf/2210.03347) is a visual encoder trained on screenshot to HTML transformation. It supports up to 2048 patches with size $16^2$, equivalent to $1024 * 512$. 
**(ii)** `ConcatCLIP` refers to using 16 `CLIP-ViT-L/14` models to encode the $4 * 4$ grids of images separately and then concatenate the extracted features together. In other words, it supports $896^2$ resolution. We use Vicuna-7B as the language decoder for the high-res version of LLaVAR. #### Training Only cross-attention modules and the projection matrix $W$ are trained during pretraining, while visual encoders and the language decoder are frozen. Cross-attention modules, the projection matrix $W$, and the language decoder $D$ are trained during finetuning. #### Data To fully unlock the potential of the augmented visual encoder, we also double the number of pretraining examples using the same criteria mentioned in Section 3.0.0.1. This corresponds to the variant (g) in Table [table: ablation on encoder/image]. #### Discussion
| | **ST-VQA** | **OCR-VQA** | **TextVQA** | **DocVQA** |
|:---|:--:|:--:|:--:|:--:|
| `Pix2Struct` + LLaVA | 21.9 | 11.8 | 28.7 | 4.4 |
| `Pix2Struct` + LLaVAR | 35.8 | 30.7 | 45.6 | 15.3 |
| `ConcatCLIP` + LLaVA | 23.1 | 14.2 | 30.5 | 5.1 |
| `ConcatCLIP` + LLaVAR | 42.1 | 30.8 | 52.1 | 18.5 |
We report the performance of the augmented architecture, using either LLaVA or LLaVAR data, in Table [table: pix2struct and clip]. By comparing the relative improvement in Tables [table: VQA result] and [table: pix2struct and clip], we find that higher-resolution models benefit more from our collected data, suggesting our data is underutilized in the original LLaVA architecture. # I [OCR error analysis] #### The impact of OCR spelling errors
| | **Res.** | **Correct %** | **Partially Correct%** | |:-------|:--------:|:-------------:|:----------------------:| | LLaVA | $224^2$ | 1.6% | 8.7% | | LLaVAR | | 6.8% | 22.8% | | LLaVA | $336^2$ | 2.2% | 11.2% | | LLaVAR | | 9.0% | 26.8% | Statistics of correct answers and partially correct answers on OCR-VQA.
We study such OCR errors using 1,673 examples from OCR-VQA, which have ground truth answers with more than ten characters. We (i) define “correct” as the ground truth answer appearing exactly in the prediction, and (ii) define “partially correct” as there existing a substring in the prediction that has high enough similarity with the ground truth but is not identical to it. Specifically, we look at all substrings with the same length as the ground truth in the prediction to calculate ANLS (Average Normalized Levenshtein Similarity) and regard the prediction as “partially correct” if the highest ANLS is greater than or equal to 0.5 but smaller than 1. We find that many predictions can be considered partially correct, indicating that the actual performance of the tested models is better than the reported accuracy numbers. However, the percentage of partially correct predictions is highly correlated with the percentage of correct predictions. Therefore, we believe that the current metrics can effectively compare the performance of different models.
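Here is a minimal sketch of this "partially correct" check, following the procedure just described (slide a ground-truth-length window over the prediction and keep the highest normalized Levenshtein similarity); the helper names and lowercasing choice are ours.

```python
def edit_distance(a: str, b: str) -> int:
    """Standard Levenshtein distance with a rolling row."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def classify_prediction(prediction: str, ground_truth: str, low: float = 0.5) -> str:
    """Label a prediction as correct / partially correct / incorrect by sliding a
    ground-truth-length window over the prediction and keeping the highest
    normalized Levenshtein similarity (ANLS)."""
    pred, gt = prediction.lower(), ground_truth.lower()
    if gt in pred:
        return "correct"
    n = len(gt)
    best = 0.0
    for i in range(max(len(pred) - n, 0) + 1):
        window = pred[i : i + n]
        best = max(best, 1 - edit_distance(window, gt) / n)
    return "partially correct" if low <= best < 1 else "incorrect"

print(classify_prediction("The book is written by Sandra Byington.", "Sandra Boynton"))
# -> "partially correct"
```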
# J [ablation study on LLaVA benchmark] #### Ablation Study on Instruction-following Evaluation

| | **Conversation** | **Detail** | **Complex** |
|:--------------------|:----------------:|:----------:|:-----------:|
| LLaVA | 83.6 | 78.1 | 95.2 |
| LLaVA + R~pretrain~ | 86.7 | 79.3 | 95.1 |
| LLaVA + R~finetune~ | 79.0 | 79.4 | 98.0 |
| LLaVAR | 84.4 | 78.9 | 96.6 |

Ablation Study on GPT-4-based Instruction-following Evaluation from LLaVA [liu2023visual](https://arxiv.org/pdf/2304.08485).
To provide a more detailed understanding of LLaVAR’s data, we provide the ablation study results on 90 instruction-following questions from LLaVA [liu2023visual](https://arxiv.org/pdf/2304.08485). We find that including pretraining data improves the conversation capability, probably because longer training data leads to generating longer responses (Table 1), which is favored by GPT-4 [xu2023baize](https://arxiv.org/pdf/2304.01196). On the other hand, including only finetuning data hurts the conversation capability but increases complex reasoning. Combining pretraining and finetuning data improves the trade-off between conversation and complex reasoning. Such results prove that incorporating our data will at least not harm the performance of interacting with natural images.
| | | **Relative Score** |
|:---|:---|:--:|
| GPT-4 (Text-Only) | From the title and image, it can be inferred that "Bo’s Lasting Lessons" is a book about leadership lessons taught by a legendary coach, likely in the context of sports. The image of men running suggests that the book might be focused on a specific sport, such as football or track and field. The target audience could be individuals interested in coaching, sports, or leadership in general, looking to learn from the experiences and teachings of a successful coach. | - |
| LLaVA | The book "Bo’s Lasting Lessons: The Timeless Teachings of Bo Schembechler" is written by Bo Schembechler and Jim Harshaw. Given the title and the image on the cover, which shows a football coach running across the field with his players in pursuit, it can be inferred that the content of the book is likely focused on football and the strategies, techniques, or principles that can help players improve their performance and achieve success in the sport. The potential target audience for this book could be football players, coaches, and fans who are interested in learning more about the game and the experiences and insights of a seasoned coach like Bo Schembechler. | 75% |
| LLaVAR | "Bo’s Lasting Lessons" appears to be a book that shares the legacy coach Bo Schembechler’s teachings on leadership and timeless fundamentals. The image on the cover, featuring a football coach leading his team, further supports this inference. The book might be targeted towards sports enthusiasts, particularly fans of football, and people interested in learning about leadership principles and fundamentals from a renowned coach’s perspective. The audiobook, read by John H. Mayer and featuring a foreword and conversation with John U. Bacon, suggests that it could also appeal to audiences who prefer listening to books or want to gain more insights from the author and other experts. | 129% |

An example from our collected evaluation data, corresponding to the “**Read**” dimension of the GPT-4-based instruction-following evaluation. The corresponding image is Figure 5 in the Appendix.


Transferred instruction-following capability of LLaVAR.
Visualization of collected instructions.
Visualization of collected instructions.
[^1]: Collaborations through Adobe University Gift Program.
[^2]: In this work, we use the phrase “text-rich images” to describe images with text in them, such as posters and book covers. In contrast, we refer to images without text as “natural images”.
[^3]:
[^4]: Both probabilities are from the LAION dataset’s metadata.
[^5]:
[^6]:
[^7]: Since the metric only considers the recall, it might favor variants (2)(4)(5) due to their longer outputs.
[^8]:
Classic VLMs combined with OCR tools to get fine-grained details (mainly text) To tackle traditional VLMs' lack of fine-grained details (mainly textual ones) from document images, some models add the fine-grained details of the document image using an OCR tool. *An OCR (Optical Character Recognition) tool is a technology that extracts text and its bounding box positions from images.* The visual tokens from the visual encoder of the VLM and the textual tokens (from the OCR and from the input: question, instruction, ...) are put together (e.g., concatenated) and given to the LLM decoder of the VLM. The OCR tool adds not only textual elements, but also their position in the document. Some models like [Lyrics](https://arxiv.org/pdf/2312.05278) employ a "Visual Refiner" to extract fine-grained details from the image, which extracts not only the text and its bounding boxes within the image, but also local visual features. Adding the OCR output of the document to a VLM requires modality alignment during pretraining to ensure coherent learning and integration of different data types (image tokens from the visual encoder and textual and positional tokens from the OCR). Alignment ensures that the model can correlate the text with its corresponding visual elements and its position on the page, which is essential for a cohesive understanding of multimodal content. A first alignment to be done is between the textual and the positional tokens. The Masked Visual Language Modeling (MVLM) pretraining method is used by [Hi-VT5](https://arxiv.org/pdf/2212.05935) and [LayoutLMv3](https://arxiv.org/pdf/2204.08387). MVLM is similar to Masked Language Modeling (MLM) from the NLP field, where some words in the text are masked (replaced by a special token) and the model is trained to predict these masked words. The difference lies in the fact that in MVLM, the tokens are masked but not their positions (bounding boxes). This approach allows the model to leverage the spatial context provided by the bounding box coordinates. By preserving the positional information, MVLM facilitates the alignment of the textual and positional modalities, enabling the model to learn the relationship between text and its location on the page. Some methods select a random subset of tokens to mask (Hi-VT5, LayoutLMv3). Alignment must also be done between the textual and visual tokens. This alignment can be done by the Word-Region Alignment (WRA) and Word Patch Alignment (WPA) pretraining tasks. The former was introduced by [UNITER](https://arxiv.org/pdf/1909.11740). WRA uses optimal transport (OT) to encourage precise alignment between specific words and the corresponding image regions (extracted by a CNN) of the document, by minimizing the coupling cost between the embedding distributions of words and image regions, encouraging the model to learn which image regions correspond to which words in the text. WPA, introduced by [LayoutLMv3](https://arxiv.org/pdf/2204.08387), does the same with patches from ViTs rather than CNN regions: it aims at determining the matching patches and text tokens. It works by masking all patches except one; the text token corresponding to the unmasked patch is labeled "aligned", while the others are labeled "unaligned", and for each textual token the model predicts whether it is aligned with the unmasked patch. Finally, alignment can be done between all types of data: the textual, positional, and visual tokens.
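To make the MVLM objective concrete, here is a minimal sketch of the input preparation, assuming token ids and bounding boxes are already tokenized and aligned; the function name and the 15% masking ratio are illustrative assumptions rather than any specific paper's recipe.

```python
import torch

def mask_for_mvlm(input_ids: torch.Tensor, bboxes: torch.Tensor,
                  mask_token_id: int, mask_prob: float = 0.15):
    """Prepare one MVLM training example: mask a random subset of text tokens
    while leaving their bounding boxes untouched, so the model must use the
    surrounding text *and* layout to recover the masked words.
    input_ids: (seq_len,) token ids; bboxes: (seq_len, 4) box coordinates."""
    labels = input_ids.clone()
    masked = torch.rand(input_ids.shape) < mask_prob
    labels[~masked] = -100                 # loss is computed only on masked positions
    corrupted = input_ids.clone()
    corrupted[masked] = mask_token_id      # BERT-style keep/replace variants also possible
    # bboxes are returned unchanged: positional information stays visible to the model
    return corrupted, bboxes, labels
```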
Masked Image Modeling was implemented by [LayoutLMv3](https://arxiv.org/pdf/2204.08387) and aims at randomly masking some patches of the image and predicting these masked patches conditioned on the surrounding text and position tokens. [DoCo: Document Object Contrastive Learning](https://arxiv.org/pdf/2402.19014) pretrains a ViT visual encoder using a Document Object Contrastive learning method. Traditional contrastive learning methods work by comparing pairs of images and their summaries, encouraging the model to learn similar representations for related pairs and different representations for unrelated pairs. DoCo, on the other hand, aligns texts and bounding boxes from documents with the visual features produced by the vision encoder, rather than comparing images and summaries. This method helps the model to understand and integrate textual and spatial information within documents. [InstructDr](https://arxiv.org/pdf/2401.13313) implements a specialized connection layer called Document-Former to integrate the outputs from a Vision Encoder (pretrained CLIP) and OCR into a Language Model (FlanT5); it is composed of cross-attention layers within Transformer blocks to align and merge the visual embeddings from the VLM with the textual and positional data from the OCR. To sum up, using an OCR tool to extract fine-grained details requires additional pretraining for modality alignment. Some recent works ([LaRA](https://arxiv.org/pdf/2406.06730)) have opted for no alignment. Instead, they simply add the OCR words to the input given to the LLM. Thus, the LLM decoder takes as input an instruction, alongside the OCR results and the transformed image tokens (image encoder followed by a projection layer), the latter introduced randomly before or after the instruction.

## 2. Models using a fine-grained vision model and an LLM as decoder

The computational complexity of LLMs in terms of the input sequence length \( n \) can be expressed as \( O(n^2) \), the complexity arising from pairwise comparisons between elements in the sequence. The more fine-grained (higher-resolution) the image given to the visual encoder, the longer its representation becomes, so the input sequence length passed to the LLM grows and inference takes more time, not to mention that LLMs have a limited input sequence length. However, the higher the resolution of the image, the better the vision-language model performs on fine-grained tasks. Here is an illustration from [Pix2Struct's paper](https://arxiv.org/pdf/2210.03347) of vision-language model performance on the DocVQA task (ANLS score) depending on the input sequence length (and thus on the number of pixels of the image): ![](/literature_review/doc_ai/vlm/pix2struct.png) This is why models using LLMs as decoder often use a visual encoder producing a smaller representation than the vision-rich models do. However, some papers use tricks to combine an LLM decoder with a fine-grained representation of the input image (document). Here is a summary of those tricks: Slicing the high-resolution image into several crops By dividing an image into multiple crops, a low-resolution Vision Transformer (ViT) can be utilized, reducing the input size for the large language model (LLM) while still enabling fine-grained analysis. **[SPHINX](https://ar5iv.labs.arxiv.org/html/2311.07575)** crops high-resolution documents into four 224x224 pixel sub-images and includes a low-resolution version of the entire image. These slices and the full image are encoded separately with four visual encoders (CLIP-ViT, CLIP-ConvNeXt, DINOv2-ViT, BLIP2) and then concatenated for the large language model (LLM) input: ![](/literature_review/doc_ai/vlm/sphinx.png) **[UReader](https://arxiv.org/pdf/2310.05126)** employs an adaptive cropping module to divide high-res images into local images based on predefined grids, selecting the best grid via resolution coherence and shape similarity calculations: ![](/literature_review/doc_ai/vlm/ureader1.png) ![](/literature_review/doc_ai/vlm/ureader2.png) **[Monkey](https://arxiv.org/pdf/2311.06607)** uses a Swin Transformer-inspired sliding window to split images into crops: ![](/literature_review/doc_ai/vlm/monkey.png) **[TextMonkey](https://arxiv.org/pdf/2403.04473)** enhances this with shifted window attention and a token resampler for better slice connections within the Vision Transformer (ViT): ![](/literature_review/doc_ai/vlm/textmonkey.png) **[mPLUG-DocOwl1.5](https://arxiv.org/pdf/2403.12895)** adopts adaptive cropping and adds textual tokens to visual features, using an H-Reducer projection matrix to maintain slice positions: ![](/literature_review/doc_ai/vlm/docowl.png) **[LLaVA-UHD](https://arxiv.org/pdf/2403.11703)** introduces image modularization, slicing images into variable-sized crops and selecting the optimal partition to align with ViT's standard configuration: ![](/literature_review/doc_ai/vlm/llavauhd.png) **[InternLM-XComposer2-4KHD](https://arxiv.org/pdf/2404.06512)** dynamically partitions images into non-overlapping 336x336 pixel slices and adds a learnable newline token to preserve global structure: ![](/literature_review/doc_ai/vlm/internlm-xcomposer2-4khd.png) **[TextHawk](https://arxiv.org/pdf/2404.09204)** utilizes adaptive cropping similar to UReader but includes a Scalable Positional
Embedding (SPE) module to adjust positional embeddings based on slice count, ensuring consistent positional information for downstream tasks. **[Idefics2](https://arxiv.org/pdf/2405.02246)** employs a pooling layer, so that the sequence of visual tokens is pooled into a shorter sequence. Reduction of Vision Embedding Sequence (post-processing) Another method involves reducing the sequence length of the high-resolution image embedding after it has been generated by the visual encoder, thereby providing a smaller input to the large language model (LLM). Reduction of Vision Embedding Sequence (post-processing) by transforming the image into the frequency domain One effective approach to reducing the sequence length of visual embeddings is transforming the image into the frequency domain. [DocPedia](https://arxiv.org/pdf/2311.11810) implements this technique by converting high-resolution visual encoder outputs into the frequency domain, similar to a Fourier transformation. This method separates high-level information, such as object structures and contours crucial for semantic understanding, from low-level details like texture and noise. By emphasizing important features and minimizing noise, this approach streamlines visual data processing. For instance, [FrequencyViT](https://openaccess.thecvf.com/content/WACV2023/papers/Li_Discrete_Cosin_TransFormer_Image_Modeling_From_Frequency_Domain_WACV_2023_paper.pdf) uses the Discrete Cosine Transform (DCT) to convert images into the frequency domain before feeding them into the Vision Transformer. This process breaks the image down into blocks and measures the energy in each block. These measurements, or "DCT coefficients," provide a concise representation of the image's energy distribution across luminance and chrominance channels, allowing for high-resolution image processing with a shorter representation sequence. Reduction of Vision Embedding Sequence (post-processing) by implementing a sampler-based module Another effective approach involves implementing a sampler-based module to reduce the dimensionality of visual embeddings. Many papers use a pooling layer to reduce the sequence length of the visual embedding ([Kosmos 2.5](https://arxiv.org/pdf/2309.11419), [Idefics2](https://arxiv.org/pdf/2405.02246) and [TextHawk](https://arxiv.org/pdf/2404.09204)). Another method to reduce the image representation dimension involves the use of learnable tokens added to image patches. These tokens serve as summarizers of visual information, allowing the model to obtain higher-level semantic visual representations while reducing computational load. In the Q-Former layer, implemented in [BLIP-2](https://arxiv.org/pdf/2301.12597), [MiniGPT-4](https://arxiv.org/pdf/2304.10592) and [InstructDr](https://arxiv.org/pdf/2401.13313) as a "Document-Former", these learnable tokens are used to capture and distill essential visual features through a process of cross-attention with the image patches, helping to extract the most relevant visual information that aligns with the textual input. Similarly, the Perceiver Resampler layer, implemented in [Flamingo](https://arxiv.org/pdf/2204.14198), [Kosmos 2.5](https://arxiv.org/pdf/2309.11419), and [Monkey](https://arxiv.org/pdf/2311.06607) as a "shared resampler", employs learnable tokens that directly interact with image patches via cross-attention mechanisms, summarizing the visual content into a smaller set of embeddings.
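As a rough illustration of such sampler-based modules, here is a minimal PyTorch sketch of a learnable-query cross-attention resampler. The class name, dimensions, and depth are illustrative and do not correspond to any specific paper's configuration.

```python
import torch
import torch.nn as nn

class LearnableQueryResampler(nn.Module):
    """Q-Former / Perceiver-Resampler-style sketch: a small set of learnable query
    tokens cross-attends to the (long) sequence of image patch features and
    returns a fixed, much shorter sequence for the LLM."""

    def __init__(self, dim: int = 1024, num_queries: int = 64, num_heads: int = 8, depth: int = 2):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.layers = nn.ModuleList([
            nn.ModuleDict({
                "attn": nn.MultiheadAttention(dim, num_heads, batch_first=True),
                "norm_q": nn.LayerNorm(dim),
                "norm_kv": nn.LayerNorm(dim),
                "ffn": nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)),
            })
            for _ in range(depth)
        ])

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, dim), e.g. thousands of high-res patches
        q = self.queries.unsqueeze(0).expand(patch_features.size(0), -1, -1)
        for layer in self.layers:
            kv = layer["norm_kv"](patch_features)
            attn_out, _ = layer["attn"](layer["norm_q"](q), kv, kv)  # queries attend to patches
            q = q + attn_out
            q = q + layer["ffn"](q)
        return q  # (batch, num_queries, dim): a compressed visual representation

# 2,304 high-res patch features compressed down to 64 visual tokens
out = LearnableQueryResampler()(torch.randn(1, 2304, 1024))
print(out.shape)  # torch.Size([1, 64, 1024])
```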
Another method for dimensionality reduction of the visual embedding is the use of convolution techniques. [mPLUG-DocOwl1.5](https://arxiv.org/pdf/2403.12895) employs H-Reducer as its projection layer, which uses convolutional techniques to shorten the sequence length while preserving horizontal semantic coherence, making it particularly effective for text-heavy images, as presented in ![](/literature_review/doc_ai/vlm/hreducer.png). Reduction of Vision Embedding Sequence (post-processing) by implementing token selection Another effective way to reduce the dimensionality of visual embeddings is through token selection techniques. [Tinychart](https://arxiv.org/pdf/2404.16635) implements a visual token merging method, which is particularly useful for charts containing many similar color blocks and blank spaces. This method merges the $r$ most similar token pairs, reducing the vision feature sequence length by $r$ in each layer. Following the token merging strategy from [ToMe](https://arxiv.org/pdf/2210.09461), similarity between tokens is measured using the cosine similarity between the Keys from self-attention. Tokens are divided into two sets, with each token in one set paired with its most similar token in the other set, and the features of merged pairs are combined through average pooling. When tokens representing multiple patches are merged, the attention mechanism is adjusted to account for the new token 'size'. This is done by adding $\log s$ to the attention scores, where $s$ is a vector indicating the size of each token. This adjustment ensures the attention mechanism reflects the actual information each token represents, maintaining a balanced and accurate attention distribution (a minimal sketch of this merging step is shown after this paragraph). Dual approach: high and low resolution images handled in parallel (the former by a small decoder, the latter by an LLM) **[CogAgent](https://arxiv.org/pdf/2312.08914)** employs this by using a large pretrained Vision-and-Language Model (VLM) and a high-resolution small Vision Transformer (ViT). The document image is resized to high-resolution (1120×1120) and low-resolution (224×224) versions, processed in parallel by different-sized image encoders. The low-resolution encoder is part of the pretrained large VLM, CogVLM, which includes an EVA2-CLIP-E encoder with an MLP adapter and uses Vicuna-7b as the decoder. In parallel, the high-resolution input is handled by a smaller ViT and cross-attention layers. Only the high-resolution module is trained, leveraging the small ViT's ability to process higher-resolution images due to its quadratic memory complexity advantage.
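Before continuing with the dual-resolution designs, here is the promised sketch of the ToMe-style merging step described above. The function name, tensor layout, and the sequential Python loop (real implementations use batched scatter operations) are our simplifications.

```python
import torch

def merge_r_tokens(x: torch.Tensor, keys: torch.Tensor, sizes: torch.Tensor, r: int):
    """Merge the r most similar token pairs (TinyChart / ToMe style sketch).
    x: (N, C) token features; keys: (N, C) self-attention Keys;
    sizes: (N,) float tensor, number of patches each token already represents."""
    k = keys / keys.norm(dim=-1, keepdim=True)
    sim = k[0::2] @ k[1::2].T                      # cosine similarity, set A vs set B
    best_sim, best_dst = sim.max(dim=-1)           # most similar B token for each A token
    merge_ids = best_sim.argsort(descending=True)[:r]

    xa, xb = x[0::2].clone(), x[1::2].clone()
    sa, sb = sizes[0::2].clone(), sizes[1::2].clone()
    for i in merge_ids.tolist():                   # sequential updates also handle collisions
        j = best_dst[i].item()
        total = sa[i] + sb[j]
        xb[j] = (sa[i] * xa[i] + sb[j] * xb[j]) / total   # size-weighted average pooling
        sb[j] = total

    keep = torch.ones(xa.size(0), dtype=torch.bool)
    keep[merge_ids] = False                        # drop the merged A tokens
    return torch.cat([xa[keep], xb]), torch.cat([sa[keep], sb])

# 576 tokens -> 544 tokens after merging r=32 pairs; the resulting sizes vector is then
# used as a log-bias on the attention scores: attn_logits += torch.log(sizes)[None, :]
feats, keys = torch.randn(576, 64), torch.randn(576, 64)
merged, sizes = merge_r_tokens(feats, keys, torch.ones(576), r=32)
print(merged.shape)  # torch.Size([544, 64])
```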

**[Mini-Gemini](https://arxiv.org/pdf/2403.18814)** uses a pretrained CNN as the high-resolution encoder and a CLIP-pretrained ViT as the low-resolution encoder. It combines low- and high-resolution embeddings through a "Patch Info Mining" module using cross-attention: ![](/literature_review/doc_ai/vlm/minigemini.png) ![](/literature_review/doc_ai/vlm/minigemini2.png) **[LLaVA-HR](https://arxiv.org/pdf/2403.03003)** adopts a similar approach, with MR-Adapters embedding high-resolution visual information into the low-resolution modeling to capture fine-grained semantics, as presented in this picture: ![](/literature_review/doc_ai/vlm/llavahr.png) In the Mixture-of-Resolution Adapter (MR-Adapter), the high-resolution visual features are integrated into the low-resolution visual features in each block of the Vision Transformer (ViT). In doing so, the MR-Adapter enriches the semantic content of the low-resolution features. This integration of high-resolution features into the low-resolution ones is achieved using the formula \(F'_{vl} = F_{vl} + f_l(F_{vl}) + g \cdot f_h(F_{vh})\), where $F_{vl}$ is the low-resolution features and $F_{vh}$ the high-resolution features, $f_l$ and $f_h$ are mapping modules ($f_l$ is a convolutional block that processes the low-resolution features, and $f_h$ is an MLP (Multi-Layer Perceptron) layer that processes the high-resolution features), and $g$ is a dynamic score that adjusts the influence of the high-resolution information. $g$ is determined by the pooled visual features $f_v$, calculated from both the low- and high-resolution features of the image, capturing the essential visual information from both pathways, averaged across the entire image: \(f_v = \frac{1}{h \times w} \sum_{i}^{h} \sum_{j}^{w} [f_l(F_{vl})^{i,j}, f_h(F_{vh})^{i,j}]\). Then, this $g$ score is controlled by the activation functions $\mathrm{Gelu}$ and $\mathrm{Tanh}$ and the projection matrices $W_1$ and $W_2$: \(g = \mathrm{Tanh}(W_2\,\mathrm{Gelu}(W_1 f_v))\).
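The formulas above translate fairly directly into code. Below is an illustrative PyTorch sketch of the gating computation; the layer sizes and tensor layouts are assumptions, and the high-resolution features are assumed to already be pooled to the same spatial grid as the low-resolution ones.

```python
import torch
import torch.nn as nn

class MRAdapterSketch(nn.Module):
    """Illustrative implementation of F'_vl = F_vl + f_l(F_vl) + g * f_h(F_vh),
    with the gate g = Tanh(W2 Gelu(W1 f_v)) computed from spatially pooled
    low-/high-resolution features."""

    def __init__(self, dim: int = 1024):
        super().__init__()
        # f_l: convolutional block on the low-resolution pathway
        self.f_l = nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1), nn.GELU(),
                                 nn.Conv2d(dim, dim, 3, padding=1))
        # f_h: MLP on the high-resolution pathway
        self.f_h = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        # gate: g = Tanh(W2 Gelu(W1 f_v))
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(),
                                  nn.Linear(dim, dim), nn.Tanh())

    def forward(self, F_vl: torch.Tensor, F_vh: torch.Tensor) -> torch.Tensor:
        # F_vl: (B, C, h, w) low-res features; F_vh: (B, h, w, C) high-res features,
        # assumed already pooled to the same h x w grid as the low-res pathway.
        low = self.f_l(F_vl)                                   # f_l(F_vl): (B, C, h, w)
        high = self.f_h(F_vh).permute(0, 3, 1, 2)              # f_h(F_vh): (B, C, h, w)
        f_v = torch.cat([low, high], dim=1).mean(dim=(2, 3))   # pooled [f_l, f_h]: (B, 2C)
        g = self.gate(f_v)                                     # dynamic score: (B, C)
        return F_vl + low + g[:, :, None, None] * high         # F'_vl

out = MRAdapterSketch()(torch.randn(2, 1024, 24, 24), torch.randn(2, 24, 24, 1024))
print(out.shape)  # torch.Size([2, 1024, 24, 24])
```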

**[Vary](https://arxiv.org/pdf/2312.06109)** adopts a similar method, with two components: a "vision vocabulary network" made of a high-resolution visual encoder (a pretrained ViTDet) and a tiny decoder (OPT-125M), and a traditional MLLM comprising a low-resolution visual encoder (CLIP) and an LLM (Vicuna-7b). The fine-grained encoder and tiny decoder are first trained autoregressively for next-token prediction, enhancing the high-resolution visual embedding. This high-resolution visual embedding is then integrated alongside the frozen encoder of the MLLM, a technique Vary calls "vocabulary expansion". This ensures that improvements to CLIP do not introduce noise when processing natural images, thereby expanding the model's capabilities in fine-grained perception tasks. ![](/literature_review/doc_ai/vlm/vary.png)

**Papers we have cited in this section**

Now that we have summed up the existing methods that combine a fine-grained vision encoder with an LLM decoder in a VLM, you can find below the different papers discussed in this section:
ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models 2024-05-24 Chunjiang Ge, Sijie Cheng, Ziming Wang, Jiale Yuan, Yuan Gao, Jun Song, Shiji Song, Gao Huang, Bo Zheng
High-resolution Large Multimodal Models (LMMs) encounter the challenges of excessive visual tokens and quadratic visual complexity. Current high-resolution LMMs address the quadratic complexity while still generating excessive visual tokens. However, the redundancy in visual tokens is the key problem as it leads to more substantial compute. To mitigate this issue, we propose ConvLLaVA, which employs ConvNeXt, a hierarchical backbone, as the visual encoder of LMM to replace Vision Transformer (ViT). ConvLLaVA compresses high-resolution images into information-rich visual features, effectively preventing the generation of excessive visual tokens. To enhance the capabilities of ConvLLaVA, we propose two critical optimizations. Since the low-resolution pretrained ConvNeXt underperforms when directly applied on high resolution, we update it to bridge the gap. Moreover, since ConvNeXt's original compression ratio is inadequate for much higher resolution inputs, we train a successive stage to further compress the visual tokens, thereby reducing redundancy. These optimizations enable ConvLLaVA to support inputs of 1536x1536 resolution generating only 576 visual tokens, capable of handling images of arbitrary aspect ratios. Experimental results demonstrate that our method achieves competitive performance with state-of-the-art models on mainstream benchmarks. The ConvLLaVA model series are publicly available at https://github.com/alibaba/conv-llava.
[^1]: Corresponding author. # Introduction Large Multimodal Models (LMMs; [gpt4v](https://cdn.openai.com/papers/GPTV_System_Card.pdf), [gemini](http://arxiv.org/pdf/2405.12107v1), [claude3](https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf)) have achieved notable advancements in recent years, demonstrating superior performance in diverse domains, including image and video understanding [ureader](http://arxiv.org/pdf/2311.13165v1), [xc2-4k](http://arxiv.org/pdf/2404.06512v1), digital agent development [appagent](http://arxiv.org/pdf/2312.13771v2), and robotics [roboflamingo](http://arxiv.org/pdf/2311.01378v3). The imperative to comprehend a wide range of tasks and intricate scenes underscores the critical role of the visual encoder, which is mostly a Vision Transformer (ViT; [vit](http://arxiv.org/pdf/2105.15075v2)). However, ViT’s quadratic spatial complexity and output of excessive visual tokens limit its application in diverse and high-resolution tasks [ureader](http://arxiv.org/pdf/2311.13165v1), [li2023otterhd](http://arxiv.org/pdf/1102.1442v1), [xc2-4k](http://arxiv.org/pdf/2404.06512v1), [cheng2023can](http://arxiv.org/pdf/1505.06659v1). The excessive visual tokens lead to a significant computational burden in the Large Language Model (LLM; [llama](None), [llama2](https://doi.org/10.48550/arXiv.2307.09288)), far exceeding the computational cost imposed by the quadratic spatial complexity in the visual encoder. Such redundancy in the visual tokens not only sacrifices efficiency but also impedes the effective extraction of visual information [llava-v1-6](https://llava-vl.github.io/blog/2024-01-30-llava-next/), [xc2-4k](http://arxiv.org/pdf/2404.06512v1). While a range of methods (Tab. [tab:table-1]; [llava-v1-6](https://llava-vl.github.io/blog/2024-01-30-llava-next/), [li2023monkey](http://arxiv.org/pdf/2103.15488v1), [vary](http://arxiv.org/pdf/2312.06109v1)) have been proposed to remedy the quadratic spatial complexity of ViT, they fail to mitigate the key problem, the redundancy in the visual tokens [fastv](http://arxiv.org/pdf/2403.06764v2), [lin2023vila](http://arxiv.org/pdf/2306.16774v1). Hierarchical visual backbones [resnet](http://arxiv.org/pdf/1608.05895v1), [senet](http://arxiv.org/pdf/2209.08294v1), [davit](http://arxiv.org/pdf/2108.01778v1), which can be considered as counterparts to ViT, can well address the problem of excessive visual tokens due to their inherent ***Information Compression*** process. Specifically, features are sequentially compressed across stages in hierarchical backbones. They compress visual features by *32$\times$* [resnet](http://arxiv.org/pdf/1608.05895v1), [liu2022convnet](http://arxiv.org/pdf/2007.00649v1) compared to ViT with only *14$\times$* [vit](http://arxiv.org/pdf/2105.15075v2). Therefore, at the same resolution they generate fewer than *1/4* visual tokens compared to ViT, significantly alleviating computational burdens on the LLM. Moreover, hierarchical visual encoders, typically designed with linear spatial complexity [liu2022convnet](http://arxiv.org/pdf/2007.00649v1), [davit](http://arxiv.org/pdf/2108.01778v1), [resnet](http://arxiv.org/pdf/1608.05895v1), effectively tackle both the issue of excessive visual tokens and the quadratic visual complexity. 
We choose to employ ConvNeXt among the hierarchical visual encoders due to its excellent performance [convnext-vs-vit](https://arxiv.org/pdf/2311.09215), [fc-clip](http://arxiv.org/pdf/2308.02487v2) and the availability of off-the-shelf contrastive language-image pretrained weights (CLIP; [clip](http://arxiv.org/pdf/2404.19696v1)), which mainstream visual encoders of LMMs adopt [blip2](http://arxiv.org/pdf/2301.12597v3), [llava-v1](http://arxiv.org/pdf/2402.11690v1), [qwen-vl](http://arxiv.org/pdf/2308.12966v3), [mm1](http://arxiv.org/pdf/2403.01757v1). However, directly replacing ViT with ConvNeXt leads to inferior performance on general capabilities benchmarks (Section [sec:updating]). This can be attributed to the fact that ConvNeXt is pretrained on low resolution, whereas we directly apply it to high-resolution [openclip](https://doi.org/10.5281/zenodo.5143773), [laion5b](http://arxiv.org/pdf/2312.15897v1). Moreover, the pretraining data for ConvNeXt is considered to be of low quality [metaclip](http://arxiv.org/pdf/2309.16671v4), [openclip](https://doi.org/10.5281/zenodo.5143773), [laion5b](http://arxiv.org/pdf/2312.15897v1) compared to ViT’s pretraining data [clip](http://arxiv.org/pdf/2404.19696v1). To address these issues, we propose to update the visual encoder rather than freezing it. Surprisingly, updating the visual encoder enables ConvNeXt to perform comparably to ViT on general benchmarks. On fine-grained benchmarks, we observe that ConvNeXt outperforms ViT. These findings indicate that even when compressing visual tokens to an equal quantity, the higher resolution model’s features still contain more fine-grained information. This observation inspires us to further scale up the resolution. However, further scaling the resolution beyond 1024 leads to the generation of excessive visual tokens. To mitigate this issue, we further compress the visual information with an additional ConvNeXt stage to enhance the inherent *information compression* of hierarchical backbones. The visual inputs would be compressed by *64$\times$* rather than *32$\times$* to further reduce the redundancy. Hence, ConvLLaVA generates only 576 visual tokens when processing 1536 resolution inputs, which is equivalent to the number of visual tokens generated by ViT when processing 336 resolution inputs (Section [sec:add-stage]). In summary, we introduce ConvLLaVA whose visual encoder is a five-stage ConvNeXt. ConvLLaVA compresses high-resolution images into information-rich visual features, effectively avoiding the generation of excessive visual tokens (in Tab. [tab:table-1]; [llava-v1-6](https://llava-vl.github.io/blog/2024-01-30-llava-next/), [li2023monkey](http://arxiv.org/pdf/2103.15488v1), [minigemini](http://arxiv.org/pdf/2305.16318v2), [llava-hr](http://arxiv.org/pdf/2403.03003v1)). Furthermore, thanks to the translation equivalence of convolution, ConvLLaVA can be trained on low-resolution and evaluated on higher resolutions, and it can also handle images of arbitrary aspect ratio. Extensive experiments have demonstrated the effectiveness of our method. 
ConvLLaVA 7B outperforms LLaVA-1.5-13B across various benchmarks, including MME [mme](http://arxiv.org/pdf/2306.05179v2), MMBench [liu2023mmbench](http://arxiv.org/pdf/2005.12661v2), SEEDBench [li2023seed](http://arxiv.org/pdf/2311.15759v1), RealWorldQA [grok1_5](https://x.ai/blog/grok-1.5v), TextVQA [textvqa](http://arxiv.org/pdf/2003.12462v2), DocVQA [docvqa](http://arxiv.org/pdf/2111.05547v1), POPE [pope](http://arxiv.org/pdf/2402.15721v1), and MMVet [mmvet](http://arxiv.org/pdf/2402.15896v1). # Related Work **Large Multimodal Models.** To harness the potential of Large Language Models and incorporate visual information, BLIP series models [blip2](http://arxiv.org/pdf/2301.12597v3), [dai2023instructblip](https://arxiv.org/pdf/2305.06500) propose the Q-former, which generates visual tokens for LLMs to interpret visual data. Meanwhile, LLaVA [llava-v1](http://arxiv.org/pdf/2402.11690v1) employs a single linear layer to map visual features to the word embedding space, allowing LLMs to perceive vision features. These approaches utilize the ViT as the visual encoder [clip](http://arxiv.org/pdf/2404.19696v1), [vit](http://arxiv.org/pdf/2105.15075v2), [honeybee](http://arxiv.org/pdf/2312.06742v2), [lin2023vila](http://arxiv.org/pdf/2306.16774v1), [minigpt](http://arxiv.org/pdf/2402.17510v1), primarily tailored for low-resolution visual data (e.g., 224 or 336 resolution). Moreover, Qwen-VL [qwen-vl](http://arxiv.org/pdf/2308.12966v3) and mPLUG-owl2 [mplug-owl2](http://arxiv.org/pdf/2311.04257v2) scale the resolution of ViT to 448 by updating the weights of ViT. However, these methods fail to further scale up resolution due to the quadratic spatial complexity of ViT, while ConvNeXt can scale up the resolution with the linear cost increase. Qwen-VL [qwen-vl](http://arxiv.org/pdf/2308.12966v3) and mPLUG-owl2 [mplug-owl2](http://arxiv.org/pdf/2311.04257v2) also explore to reduce the visual tokens via resampler. However, recent studies [honeybee](http://arxiv.org/pdf/2312.06742v2), [xc2-4k](http://arxiv.org/pdf/2404.06512v1) show that convolution or simply concatenation performs better than resampler. **High-resolution LMMs with Cropping.** The representative cropping method for high-resolution LMMs is introduced in LLaVA-NExT [llava-v1-6](https://llava-vl.github.io/blog/2024-01-30-llava-next/), which partitions an image into four patches, each encoded separately by ViT and subsequently concatenated for LLM processing. A collection of methods have adopted cropping to scale up resolution [ureader](http://arxiv.org/pdf/2311.13165v1), [lin2023sphinx](http://arxiv.org/pdf/2311.07575v1), [li2023monkey](http://arxiv.org/pdf/2103.15488v1), [xc2-4k](http://arxiv.org/pdf/2404.06512v1). While effective in reducing ViT complexity, cropping compromises the structural integrity of the image, thus potentially impacting overall performance. Moreover, the proliferation of visual tokens introduced by cropping poses significant complexity on LLMs and challenges the retrieval capabilities of LLMs [xc2-4k](http://arxiv.org/pdf/2404.06512v1). **High-resolution LMMs with Extra Visual Encoders.** Incorporating an auxiliary visual encoder for high-resolution image understanding would not significantly increase the number of visual tokens. Vary [vary](http://arxiv.org/pdf/2312.06109v1) and Deepseek-VL [deepseek-vl](http://arxiv.org/pdf/2402.17510v1) utilize SAM [sam](http://arxiv.org/pdf/2305.01275v1) as a high-resolution visual encoder to augment the feature of ViT. 
MiniGemini-HD [minigemini](http://arxiv.org/pdf/2305.16318v2) and LLaVA-HR [llava-hr](http://arxiv.org/pdf/2403.03003v1) employ ConvNeXt [openclip](https://doi.org/10.5281/zenodo.5143773) to process high-resolution images and use cross-attention or adapters to extract features from the high-resolution input. However, these methods introduce additional complexity through supplementary visual encoders and associated hyperparameters. Furthermore, extracting features from low-quality representations (e.g., LAION-CLIP-ConvNeXt) may potentially compromise LMMs’ performance [gadre2024datacomp](http://arxiv.org/pdf/2004.12070v2), [metaclip](http://arxiv.org/pdf/2309.16671v4). # ConvLLaVA [sec:method] We present ConvLLaVA, as illustrated in Fig. 1 (b), whose visual encoder is a five-stage ConvNeXt. We first introduce the overall architecture and the advantages of our ConvLLaVA in Section 1.1. The two major optimizations: updating the visual encoder and training an additional stage are introduced in Section 1.2 and Section 1.3. ## ConvNeXt as Standalone Visual Encoder [sec:convllava]
We show the structure for LLaVA and ConvLLaVA in (a) and (b). ConvNeXt has a hierarchical structure which compresses visual tokens between stages. The training procedure is composed of three training stages and the trainable parameters for each stage are shown in (c).
The architecture of ConvLLaVA is identical to most popular general LMMs, *e.g.*, LLaVA [llava-v1](http://arxiv.org/pdf/2402.11690v1), [llava-v1-5](http://arxiv.org/pdf/2310.19145v1), Qwen-VL [qwen-vl](http://arxiv.org/pdf/2308.12966v3), and VILA [lin2023vila](http://arxiv.org/pdf/2306.16774v1). These models comprise three components as shown in Fig. 1 (a): a vision encoder $g(\cdot)$, a large language model $f(\cdot)$, and a vision-language projector $h(\cdot)$. Specifically, the vision model encodes the visual inputs $\mathbf{x}$ into latent visual embeddings $g(\mathbf{x})$. The vision-language projector then maps the latent visual embeddings into the embedding space of the language model $\mathbf{z} = h(g(\mathbf{x}))$. Given the visual embeddings $\mathbf{z}$ and text embeddings $\mathbf{t}$ encoded by the language tokenizer, these embeddings are concatenated along the sequence dimension and then passed to the language model. Finally, the vision language model is trained with language modeling loss [gpt](http://arxiv.org/pdf/2310.01427v1). Considering that our study mainly focuses on the visual encoder, we employ a two-layer MLP and Vicuna-7B [vicuna](http://arxiv.org/pdf/2306.05685v4) as the projector and language model following LLaVA-1.5 [llava-v1-5](http://arxiv.org/pdf/2310.19145v1). Rather than using CLIP-ViT [clip](http://arxiv.org/pdf/2404.19696v1), we introduce CLIP-ConvNeXt [liu2022convnet](http://arxiv.org/pdf/2007.00649v1), [openclip](https://doi.org/10.5281/zenodo.5143773) as the standalone visual encoder.
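To make this three-component design concrete, here is a minimal PyTorch sketch of the forward pass described above. Every module (the stride-32 convolution standing in for the visual encoder, the toy transformer standing in for the LLM, the dimensions and vocabulary size) is a placeholder of ours, not the actual ConvLLaVA or LLaVA-1.5 code:

```python
# Minimal sketch of an LMM forward pass: vision encoder g, projector h, language model f.
import torch
import torch.nn as nn

class ToyLMM(nn.Module):
    def __init__(self, vis_dim=256, lm_dim=512, vocab=1000):
        super().__init__()
        self.g = nn.Conv2d(3, vis_dim, kernel_size=32, stride=32)       # stand-in visual encoder (stride 32, ConvNeXt-like)
        self.h = nn.Sequential(nn.Linear(vis_dim, lm_dim), nn.GELU(),   # two-layer MLP projector
                               nn.Linear(lm_dim, lm_dim))
        self.embed = nn.Embedding(vocab, lm_dim)                        # LM token embeddings
        self.f = nn.TransformerEncoder(                                 # stand-in LM (causal masking omitted for brevity)
            nn.TransformerEncoderLayer(lm_dim, nhead=8, batch_first=True), num_layers=2)
        self.lm_head = nn.Linear(lm_dim, vocab)

    def forward(self, image: torch.Tensor, text_ids: torch.Tensor) -> torch.Tensor:
        z = self.g(image).flatten(2).transpose(1, 2)   # (B, N_visual, vis_dim)
        z = self.h(z)                                  # project into the LM embedding space
        t = self.embed(text_ids)                       # (B, N_text, lm_dim)
        x = torch.cat([z, t], dim=1)                   # concatenate along the sequence dimension
        return self.lm_head(self.f(x))                 # logits, trained with a language-modeling loss

model = ToyLMM()
logits = model(torch.randn(1, 3, 768, 768), torch.randint(0, 1000, (1, 16)))
print(logits.shape)   # (1, 592, 1000): 768/32 = 24 -> 576 visual tokens, plus 16 text tokens
```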
**ConvNeXt.** The basic block of ConvNeXt comprises a depth-wise convolution and a feed-forward network [liu2022convnet](http://arxiv.org/pdf/2007.00649v1). The depth-wise convolution has a *7$\times$7* kernel size, and the computation complexity is $\mathcal{O}(k^2CN)$, where $k$, $C$, and $N$ are the kernel size, number of channels, and number of visual tokens, respectively. In contrast, the complexity of self-attention in ViT is $\mathcal{O}(4C^2N+2CN^2)$. Consequently, the spatial complexity of ConvNeXt is significantly lower than ViT. The input is initially processed by a *4$\times$4* non-overlapping convolution downsampling layer. Subsequently, the features are successively fed into the four stages of ConvNeXt, while each stage comprises several ConvNeXt blocks. Feature maps are downsampled by *2$\times$*, and dimensions are expanded by *2$\times$* between stages. The output of the ConvNeXt is downsampled by *32$\times$*, rather than *14$\times$* of ViT-L. Hence, ConvNeXt produces less than *1/4* visual tokens compared to ViT, which alleviates the computation load of the language model. Benefiting from the linear spatial complexity and fewer visual tokens, the computation reduction of LMMs from ViT-L (red line) to ConvNeXt (blue line) is almost *8$\times$* as illustrated in Fig. [fig:quality]. **Five-stage ConvNeXt$\dag$.** Leveraging ConvNeXt as the visual encoder is efficient for encoding 768 resolution images, while scaling resolutions to higher than 768 produces excessive visual tokens. Previous studies [llava-v1-6](https://llava-vl.github.io/blog/2024-01-30-llava-next/), [minigemini](http://arxiv.org/pdf/2305.16318v2) neglect to explore compressing visual tokens, while compressing visual tokens has been proven to be reasonable since there is redundancy in the visual representation [lin2023vila](http://arxiv.org/pdf/2306.16774v1), [fastv](http://arxiv.org/pdf/2403.06764v2). These studies suggest that we can further downsample visual features using ConvNeXt. We propose to compress visual features by incorporating ConvNeXt blocks for stage 5 into the original four-stage model. We prefer using ConvNeXt blocks over other structures due to the following three reasons: (1) The five-stage ConvNeXt, as a whole, could be transferred as a visual encoder for other LMMs, whereas downsampling in the projector does not offer such flexibility. (2) ConvNeXt blocks maintain translation equivariance, allowing them to effectively process images of any aspect ratio, unlike attention blocks. (3) The impact on performance from the downsampling stage is minimal, except that the resampler consistently underperforms compared to other methods, as evidenced by [honeybee](http://arxiv.org/pdf/2312.06742v2), [xc2-4k](http://arxiv.org/pdf/2404.06512v1), [mm1](http://arxiv.org/pdf/2403.01757v1). Finally, we denote the overall five-stage ConvNeXt as ConvNeXt$\dag$. At 1536 resolution, ConvNeXt$\dag$ reduces the number of visual tokens to 576, equivalent to that of ViT at 336 resolution. This would reduce the total computation by *6$\times$* *w.r.t.* the original ConvNeXt (blue line) to ConvNeXt$\dag$ (green line) as shown in Fig. [fig:quality]. Our approach is more computationally efficient than cropping methods, which often produce an excessive number of visual tokens [mm1](http://arxiv.org/pdf/2403.01757v1), [llava-v1-6](https://llava-vl.github.io/blog/2024-01-30-llava-next/), [li2023monkey](http://arxiv.org/pdf/2103.15488v1).
Furthermore, by eliminating the need for cropping and merging, ConvLLaVA avoids the global view, thereby further reducing the number of visual tokens. ## Updating ConvNeXt is Essential [sec:updating] The mainstream optimization approach [llava-v1](http://arxiv.org/pdf/2402.11690v1), [lin2023vila](http://arxiv.org/pdf/2306.16774v1) freezes the vision encoder during training, as it has better performance and is more efficient than updating the visual encoder [prismatic](http://arxiv.org/pdf/2402.07865v1). However, freezing ConvNeXt during training is sub-optimal. Hence, we conduct an in-depth analysis to show that freezing the visual encoder (i.e., ConvNeXt) makes the model inherit the defects of pretraining, and that updating ConvNeXt may both improve the quality of representations and adapt them to high-resolution inputs. **Setups of Freezing ConvNeXt.** The optimization procedure is the same as LLaVA-1.5 [llava-v1-5](http://arxiv.org/pdf/2310.19145v1). For training the projector and instruction tuning, we use the same 558k caption dataset and 665k instruction data, respectively. Our visual encoder CLIP-ConvNeXt-L is pretrained on 256 resolution and fine-tuned with 320 resolution based on LAION-2B [liu2022convnet](http://arxiv.org/pdf/2007.00649v1), [openclip](https://doi.org/10.5281/zenodo.5143773). We directly increase the resolution to 512 and 768 when applying ConvNeXt as the vision encoder. As for the baseline, we use ViT which is pretrained on 336 resolution with OpenAI WIT dataset [clip](http://arxiv.org/pdf/2404.19696v1). The training and inference speed for ConvNeXt on 768 resolution is on par with ViT on 336 resolution. Hence, we consider the comparison between 768-resolution ConvNeXt and 336-resolution ViT to be fair. Detailed training procedure is shown in Tab. [tab:hy-llava]. **Benchmarks.** We use four standard benchmarks to evaluate the results: two general capability benchmarks, MMbench [liu2023mmbench](http://arxiv.org/pdf/2005.12661v2), SEEDBench [li2023seed](http://arxiv.org/pdf/2311.15759v1), and two fine-grained OCR benchmarks, TextVQA [textvqa](http://arxiv.org/pdf/2003.12462v2) and DocVQA [docvqa](http://arxiv.org/pdf/2111.05547v1). It is worth noting that our evaluation procedure for TextVQA differs slightly from LLaVA-1.5 [llava-v1-5](http://arxiv.org/pdf/2310.19145v1), as we use VLMEVALKIT which does not include OCR tokens in the question. **Results for Freezing the Visual Encoder.** As shown in Tab. [tab:freezing-encoder], we observe the following results: \(1\) ConvNeXt has significant advantages over ViT on OCR benchmarks. On TextVQA and DocVQA, both 512 and 768 resolution ConvNeXt outperform ViT due to their higher resolution [prismatic](http://arxiv.org/pdf/2402.07865v1), [mplug-owl2](http://arxiv.org/pdf/2311.04257v2). Even with fewer visual tokens, the 512-resolution ConvNeXt still outperforms the 336-resolution ViT. \(2\) The overall general capability of ConvNeXt is inferior to ViT. For general benchmarks, on SEEDBench, 768-resolution ConvNeXt performs comparably with ViT. On MMBench, however, ConvNeXt underperforms ViT. We hypothesize that there are two reasons for the performance gap on MMbench: First, ConvNeXt is pretrained on low resolution but directly applied on high resolution. Such a mismatch affects the quality of visual features. Second, the pretrained representation for ConvNeXt may be inferior to OpenAI’s ViT [clip](http://arxiv.org/pdf/2404.19696v1).
The results imply that increasing resolution without training could affect the quality of representation and hamper the performance of LMMs. However, studies have shown that simply updating the visual encoder during instruction tuning can hinder performance [prismatic](http://arxiv.org/pdf/2402.07865v1). To mitigate this issue, ShareGPT4V [sharegpt4v](http://arxiv.org/pdf/1809.10312v1) provides an effective training protocol and a high-quality dataset for updating the visual encoder. Therefore, we adopt this effective method to update the visual encoder. **Setups of Updating ConvNeXt.** To update the visual encoder, we first leverage the 558k caption dataset for projector initialization [llava-v1-5](http://arxiv.org/pdf/2310.19145v1). Then, we apply a high-quality caption dataset, ShareGPT4V-PT [sharegpt4v](http://arxiv.org/pdf/1809.10312v1), to train the entire vision-language model including the visual encoder. Finally, the LLaVA 665k instruction tuning dataset is used for visual instruction tuning. The detailed training procedure is shown in Tab. [tab:hy-sharegpt4v]. The last 12 layers of ViT-L are trainable (according to ShareGPT4V [sharegpt4v](http://arxiv.org/pdf/1809.10312v1)). For ConvNeXt, we update the last 18 blocks (ConvNeXt-L has a total of 36 blocks). **Results for Updating the Visual Encoder.** As shown in Tab. [tab:ShareGPT4V], we observe the following results: \(1\) ConvNeXt has significant advantages over ViT on the OCR benchmark. The improvement for 768 resolution ConvNeXt is larger than 336 resolution ViT (6.3/10.4 *v.s.* 4.6/5.2). These results demonstrate the idea of compressing high-resolution visual inputs to a small number (*e.g.*, 576) of information-rich visual tokens is feasible. Compressing does not lead to great information loss. Even with the same number of tokens, ConvNeXt preserves more fine-grained visual information and significantly outperforms ViT. \(2\) For general benchmarks, ConvNeXt performs on par with ViT. Specifically, ConvNeXt outperforms ViT on SEEDBench and performs on par with ViT on MMBench. Notably, the performance gap between the 768 resolution ConvNeXt and the 336 resolution ViT on MMBench is narrowed from 3.3 to 0.3 compared with freezing the visual encoder. This implies that updating the visual encoder is essential. To further support this, we show the results of updating the visual encoder with more data in Appendix [app:more-data]. Generally, the updated ConvNeXt performs better than ViT on these 4 benchmarks. This evidences that updating the ConvNeXt significantly enhances the performances, underscoring its critical importance. Previous methods employ ConvNeXt as an auxiliary visual encoder and directly increase the resolution to 1024 [llava-hr](http://arxiv.org/pdf/2403.03003v1) or 1536 [minigemini](http://arxiv.org/pdf/2305.16318v2). They fail to identify the problem that scaling up the resolution without updating ConvNeXt would compromise the performance. Our method, delving deeper into the root of the issue, provides a simple yet effective solution to scaling up the resolution. ## Training with Stage 5 Scales up Resolution to 1536 [sec:add-stage] As we mentioned in Section 1.1, scaling resolution to higher than 768 would generate excessive visual tokens. To reduce the redundancy and mitigate the excessive computational demands on the large language model (LLM), we propose training stage 5 for the ConvNeXt model to compress the visual information (training protocol shown in Fig. 1 (c)). 
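To make the token-count arithmetic behind this compression concrete, here is a quick back-of-the-envelope sketch, assuming square inputs and the downsampling strides quoted in the text (14 for ViT-L/14, 32 for the four-stage ConvNeXt, 64 for the five-stage ConvNeXt$\dag$):

```python
# Back-of-the-envelope visual token counts for a square input of side `res`
# encoded with a patch/downsampling stride `stride`.
def num_visual_tokens(res: int, stride: int) -> int:
    return (res // stride) ** 2

print(num_visual_tokens(336, 14))    # ViT-L/14 @ 336            -> 576 tokens
print(num_visual_tokens(768, 32))    # 4-stage ConvNeXt @ 768    -> 576 tokens
print(num_visual_tokens(768, 64))    # 5-stage ConvNeXt @ 768    -> 144 tokens
print(num_visual_tokens(1024, 64))   # 5-stage ConvNeXt @ 1024   -> 256 tokens
print(num_visual_tokens(1536, 64))   # 5-stage ConvNeXt @ 1536   -> 576 tokens
```

These counts match the #V Tokens column of the grounding table further below (144 / 256 / 576 for ConvLLaVA at 768 / 1024 / 1536).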
**Implementation Details.** We employ a three-stage training protocol. In the projector initialization stage, we train the fifth stage layers and the projector with the ShareGPT4V-PT data [sharegpt4v](http://arxiv.org/pdf/1809.10312v1). In the second stage, we train the entire model with the ShareGPT4V-PT data. For instruction tuning, we utilize the 665k LLaVA instruction data to train the LLM and the projector. The training protocol is similar to the protocol for updating the visual encoder. The only difference is that we train the fifth stage and projector with ShareGPT4V-PT data, while experiments in Section 1.2 train the projector with the 558k caption data in the first training stage. We add 6 layers in stage 5 and tune the last three stages in the second training phase. Ablation studies on these hyper-parameters are included in Appendix [app:stage-add-layers]. **Results for ConvNeXt$\dag$.** We present the results of adding stage 5 to ConvNeXt in Tab. [tab:add-stage]. Scaling up the resolution consistently improves performance on SEEDBench, TextVQA, and DocVQA, which require fine-grained understanding and benefit from the higher resolution. These results highlight the effectiveness of our method of training stage 5. However, on MMBench, the performance of ConvNeXt$\dag$ exhibits a slight drop when scaling the resolution from 1024 to 1536. The resolution of 1536 is approximately six times higher than the pretraining resolution (256). Adapting the pretrained visual encoder to effectively extract global information from such a significant increase in resolution requires a substantial amount of training data. In Section [sec:exp], we verify this hypothesis by providing sufficient data to the visual encoder in the second training stage.
Comparisons of ConvNeXt and ConvNeXt$\dag$ on SEEDBench and DocVQA. The marked number above the line shows the resolution of the model.
**On Scaling Resolution.** When we increase the resolution, the number of visual tokens also increases. These two factors are entangled, and there has been a lack of in-depth investigation into the relationship between them. Previous work claims that raw resolution matters more than the number of visual tokens [lin2023vila](http://arxiv.org/pdf/2306.16774v1). We experiment on the general benchmark SEEDBench and OCR benchmark DocVQA to investigate these assumptions. Our method provides control experiments to reveal the relationship between resolution and the number of visual tokens. We compare the results of ConvNeXt (trained in Section 1.2) and ConvNeXt$\dag$ (trained in Section 1.3) as the visual encoder for LMMs under the same number of visual tokens. The two series of models are pretrained with ShareGPT4V-PT and instruction-tuned with 665k LLaVA instruction data. ConvNeXt$\dag$ has an additional stage to compress the number of visual tokens to 1/4. Hence, the differences between these two series of models have been largely reduced. Our control experiments reveal novel findings: \(1\) When the number of visual tokens is the same, the higher resolution model exhibits better performance on SEEDBench and DocVQA. In Fig. 2, the green line consistently outperforms the blue line. This is because the high-resolution model provides finer-grained and higher-quality visual features even if the output number of visual tokens is the same. Previous work [llava-v1-6](https://llava-vl.github.io/blog/2024-01-30-llava-next/), [li2023monkey](http://arxiv.org/pdf/2103.15488v1), [xc2-4k](http://arxiv.org/pdf/2404.06512v1), which scales up the resolution by splitting the image into patches, generates excessive visual tokens. Such cropping methods significantly sacrifice efficiency and challenge the retrieval capability of the LLM. Our core discovery presents a promising approach to enrich the information contained in visual features without compromising efficiency. Compressing high-resolution images into information-rich visual tokens is more efficient than the cropping method. Training a stage to further compress visual features provides a way to increase resolution while maintaining a moderate computational cost. \(2\) The importance of the number of visual tokens varies across different benchmarks at equivalent resolution. For general benchmarks like SEEDBench, the performance drop brought by compressing visual tokens for the 768-resolution models is marginal (0.9 on SEEDBench). However, for OCR benchmarks like DocVQA, the performance drop for the model with fewer visual tokens is substantial (9.1 on DocVQA). Overall, these results demonstrate that compressing visual tokens causes only slight information loss on general benchmarks but significant information loss on fine-grained OCR benchmarks. # Experiments [sec:exp] Our results demonstrate that scaling up the resolution of ConvNeXt and updating the visual encoder are two effective approaches to training an advanced, high-resolution Language-Multimodal Model. However, we found that the available training data was insufficient to fully unleash the potential of these approaches. Consequently, we scaled up the high-quality training data to address this limitation. ## Training Setups **Training Stages.** We adopt a three-stage training protocol to train ConvLLaVA as shown in Fig. [fig:structure] (c).
The training process is categorized into three stages: (1) *Projector Initialization.* We train the fifth stage of the ConvNeXt model and the vision-language projector. We utilize caption data including ShareGPT4V-PT [sharegpt4v](http://arxiv.org/pdf/1809.10312v1), ShareGPT4V [sharegpt4v](http://arxiv.org/pdf/1809.10312v1), and ALLaVA captions [allava](http://arxiv.org/pdf/2112.07133v2), totaling approximately 2M examples. (2) *Vision-Language Pretraining.* We employ caption data including ShareGPT4V-PT [sharegpt4v](http://arxiv.org/pdf/1809.10312v1), ShareGPT4V [sharegpt4v](http://arxiv.org/pdf/1809.10312v1), ALLaVA [allava](http://arxiv.org/pdf/2112.07133v2), and a 190k open-sourced subset of VFLAN [vflan](http://arxiv.org/pdf/2403.04343v1), amounting to 2.9M examples. (3) *Visual Instruction Tuning.* We fine-tune the model with the 665k LLaVA instruction dataset [llava-v1-5](http://arxiv.org/pdf/2310.19145v1). In each stage, we train the model for 1 epoch with the AdamW optimizer. The cosine learning rate schedule is also applied. **Implementation Details.** We utilize the LAION-2B pretrained ConvNeXt-L model as our visual encoder [openclip](https://doi.org/10.5281/zenodo.5143773). In the three training stages, the resolution is kept at a fixed value. We train ConvLLaVA at 768, 1024, and 1536 resolutions. The learning rates in the three training stages are 3e-4, 2e-5, and 2e-5, respectively. Meanwhile, the batch sizes are 256, 256, and 128. Training the ConvLLaVA 768 resolution model takes approximately 18 hours on 2 A800 machines. The instruction tuning costs 20 hours for LLaVA-NExT 7B on an A100 machine [llava-v1-6](https://llava-vl.github.io/blog/2024-01-30-llava-next/), while it takes only 9 hours for our 1536 resolution ConvLLaVA on a single machine. **Evaluation Benchmarks.** To systematically investigate the performance of our model, we include more benchmarks for evaluation, including MME [mme](http://arxiv.org/pdf/2306.05179v2), MMBench [liu2023mmbench](http://arxiv.org/pdf/2005.12661v2), SEEDBench [li2023seed](http://arxiv.org/pdf/2311.15759v1), MMMU [yue2023mmmu](http://arxiv.org/pdf/2311.16502v3), MMVet [mmvet](http://arxiv.org/pdf/2402.15896v1), RealWorldQA [grok1_5](https://x.ai/blog/grok-1.5v), TextVQA [textvqa](http://arxiv.org/pdf/2003.12462v2), DocVQA [docvqa](http://arxiv.org/pdf/2111.05547v1), and POPE [pope](http://arxiv.org/pdf/2402.15721v1). Our results are measured with VLMEVALKIT. We also assess the performance on grounding benchmarks, including RefCOCO [refcoco](http://arxiv.org/pdf/1808.08754v1), RefCOCO+, and RefCOCOg [refcocog](http://arxiv.org/pdf/1511.02283v3).
| Method | Res. | \#V Tokens | LLM | RefCOCO val | RefCOCO test-A | RefCOCO test-B | RefCOCO+ val | RefCOCO+ test-A | RefCOCO+ test-B | RefCOCOg val | RefCOCOg test | Avg |
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
| LLaVA-1.5 | 336 | 576 | 7B | 76.3 | 83.2 | 67.9 | 66.8 | 77.0 | 56.8 | 70.4 | 70.0 | 71.1 |
| LLaVA-1.5 | 336 | 576 | 13B | 84.0 | 89.5 | 77.1 | 76.3 | 84.3 | 66.1 | 78.8 | 78.3 | 79.3 |
| ConvLLaVA | 768 | 144 | 7B | 84.5 | 89.0 | 79.2 | 77.7 | 84.9 | 69.7 | 79.8 | 79.7 | 80.6 |
| ConvLLaVA | 1024 | 256 | 7B | 85.5 | 89.6 | 78.8 | 79.3 | 86.1 | 70.3 | 80.6 | 81.2 | 81.4 |
| ConvLLaVA | 1536 | 576 | 7B | **86.5** | **90.6** | **80.5** | **80.0** | **86.8** | **71.5** | **82.0** | **82.4** | **82.3** |

Results on referring expression comprehension tasks. The models in this table are trained with the same grounding data. We mark the best performance in **bold**.
## Quantitative Results We perform a comprehensive comparison with state-of-the-art models on 7 different benchmarks (Tab. [tab:main]). Our model achieves consistent improvements compared to LLaVA-1.5. Our 7B model even exhibits comparable performance with LLaVA-1.5 13B and LLaVA-NExT 7B [llava-v1-6](https://llava-vl.github.io/blog/2024-01-30-llava-next/). On OCR benchmarks like TextVQA and DocVQA, our model outperforms the LLaVA-1.5 7B and 13B models. Since OCR benchmarks are sensitive to resolution, our ConvLLaVA series models demonstrate consistent improvement on TextVQA and DocVQA with higher resolution, showcasing the effectiveness of scaling up resolution. Notably, our model surpasses Qwen-VL-Chat, which is trained on millions of document examples, on DocVQA, while our training set contains only a limited amount of document data. This shows the benefits of the high-resolution design of our model. ConvLLaVA outperforms LLaVA-NExT on MMBench, TextVQA, POPE, and MMVet. For grounding benchmarks, our model and LLaVA are trained with the same set of grounding data. The comparison between them is fair. On RefCOCO, RefCOCO+, and RefCOCOg, ConvLLaVA exhibits consistent improvement when increasing resolution (Tab. 1). ConvLLaVA outperforms the LLaVA 7B and 13B models on all 8 test splits. This demonstrates the benefits of higher resolution for grounding tasks. Our 7B model also surpasses the 13B LLaVA model on all 8 benchmarks. ## Understanding Any Aspect Ratio Images and Higher Resolutions Thanks to the translation equivariance of convolutional neural networks, our model can be trained at a fixed resolution but run inference at higher resolutions and with arbitrary aspect ratios. We test this ability on our 1536-resolution ConvLLaVA model.
| Input Shape | SEEDBench | TextVQA | DocVQA |
|:---------------:|:--------:|:--------:|:--------:|
| (1536, 1536) | **70.2** | **65.8** | 59.0 |
| short side = 1536 | 68.9 | 64.6 | 65.0 |
| short side = 1664 | 67.3 | 64.2 | **65.7** |
The original image preprocessing pipeline pads the image to a square, resizes it to 1536, and center-crops it [llava-v1-5](http://arxiv.org/pdf/2310.19145v1). We remove the padding and center cropping, so the short side of the image is simply resized to 1536 while the original aspect ratio is kept. This is the setting under which we test images of arbitrary aspect ratio. The results are shown in Tab. [tab:shape]. We observe that on the general benchmark, SEEDBench, the performance slightly decreases. On OCR benchmarks, especially on DocVQA, the performance is improved. We think the reason is that the image aspect ratio in DocVQA is not 1:1, and forcibly transforming such images into squares lowers their effective resolution. We also test ConvLLaVA when resizing the short side of images to 1664, which is higher than its training resolution of 1536. We observe that on DocVQA the performance could be further improved to 65.7. ## Discussions [sec:discussions] **Architectures and data.** While we have demonstrated the effectiveness of our method, there remains room for further improvement. The ConvNeXt architecture we use is tailored for low-resolution image understanding (e.g., 256), with a kernel size of 7 optimized for such resolutions. However, as the resolution increases to 1536, the relatively small kernel size may limit the model capacity when the resolution is extremely high. Besides, the number of layers in the ConvNeXt four stages (3, 3, 27, 3) is designed for a 4-stage model and may not be optimal for our 5-stage model. Therefore, a potential future direction could involve designing a five-stage, linear spatial complexity, hierarchical high-resolution vision encoder. We emphasize the critical role of the five-stage visual encoder since it is well suited to high-resolution LMMs. It compresses visual features by *64$\times$*, greatly reducing the redundancy in its visual tokens. In contrast, four-stage visual encoders, designed for traditional computer vision tasks, output excessive tokens when resolution is high. **Linear spatial complexity and information compression.** We identify *linear spatial complexity* and the *information compression* procedure as two critical properties for future visual encoders of LMMs. These properties ensure the efficiency of both the visual encoder and the LLM, respectively. Furthermore, they are crucial for multi-image, interleaved image and text, and video understanding tasks, as these tasks commonly result in numerous visual tokens. We anticipate that future research will focus more on these two directions to further advance the research of LMMs. **Trade-off between compression and retrieval for high-resolution understanding.** Our method, ConvLLaVA, compresses a 1536-resolution image to 576 visual tokens with a 64$\times$ compression ratio. Concurrent work [xc2-4k](http://arxiv.org/pdf/2404.06512v1), [internvl1.5](http://arxiv.org/pdf/2404.16821v2), in contrast, explores retrieving fine-grained image information from long visual token sequences. In the context of high-resolution image understanding, compressing visual information maintains computational efficiency, but excessive compression may lead to information loss. Conversely, retaining a large number of visual tokens avoids information loss but sacrifices efficiency and challenges the retrieval capabilities of LLMs. Consequently, a trade-off emerges between visual information compression and retrieval capabilities for high-resolution understanding.
Future research should explore an optimal balance between these two factors. # Conclusion In this paper, we have critically examined the limitations of the visual encoder for current LMMs: quadratic spatial complexity and numerous visual tokens. The excessive visual tokens are the more fundamental problem. These drawbacks hinder LMMs from efficiently understanding high-resolution images. Consequently, we propose ConvLLaVA, whose visual encoder is a hierarchical backbone, ConvNeXt, to mitigate this issue. ConvLLaVA compresses high-resolution visual information into an information-rich visual representation rather than preserving all the redundancy in the visual representation. Extensive experimental results have demonstrated the efficacy of our proposed method. Our 7B parameter model exhibits superior performance compared to the LLaVA-1.5 13B model. Furthermore, our method is flexible in encoding images with arbitrary shapes and resolutions. Our work highlights the advantages of hierarchical visual backbones for LMMs, addressing critical challenges while maintaining simplicity and efficiency. # Acknowledgments [acknowledgments] This work is supported in part by the National Natural Science Foundation of China under Grants 62321005 and 62276150. # Training Visual Encoder with More Data [app:more-data] In Section [sec:updating], we observe that updating the visual encoder is essential for ConvNeXt as the standalone encoder. We compare the two visual encoders with more training data in Tab. [tab:allava-sharegpt4v]. For the visual language training stage, we use ALLaVA and ShareGPT4V-PT. We train the last two stages for ConvNeXt and the last 12 layers for ViT. With more training data, ConvNeXt outperforms ViT on all 4 benchmarks. These results validate the advantages of ConvNeXt over ViT. This ConvNeXt model even outperforms the 768-resolution ConvLLaVA model on some benchmarks due to its higher number of visual tokens. However, the training and inference speed is much slower than the 768-resolution ConvLLaVA model due to the increased number of visual tokens. The 1536-resolution ConvLLaVA, which outputs the same number of visual tokens, outperforms this model. This suggests that a higher-resolution model may have a higher capacity to learn from data. # Hyperparameters for 5-stage ConvNeXt [app:stage-add-layers] We discuss the choice of hyperparameters in this section. **Number of Trained Stages.** We conduct an ablation study to determine the optimal number of stages for vision-language pretraining at 768 resolution. We find that fine-tuning from stage 3 yields better results than fine-tuning from stage 4 (Tab. [tab:stages-high]). While the performances of fine-tuning from stage 2 and stage 3 are comparable, we opt for fine-tuning from stage 3 due to its fewer trainable parameters. **Number of Layers in Stage 5.** We ablate on the number of ConvNeXt layers in stage 5. Given that the number of layers in each stage is a multiple of 3 in ConvNeXt-L, we experiment with 3, 6, and 9 layers in stage 5. For simplicity, we perform the experiments on ConvNeXt 768. We observe a slight decrease in performance when adding 9 layers in stage 5 (Tab. [tab:ablation-layers]). However, it’s hard to determine whether adding 3 or 6 layers is more beneficial for these four benchmarks. Hence, we conduct an experiment at 1536 resolution to further investigate this hyperparameter (Tab. [tab:add-layers-1536]). The results show that adding 6 layers could be better. We opt for 6 layers in our experiments.
# Training protocol for each experiment [app:implementations] The detailed training hyper-parameters are shown in the following tables.
| Training Stage | 1 | 2 | |:---------------:|:--------------:|:--------------:| | Visual Encoder | | | | Projector | | | | LLM | | | | data | LLaVA LCS-558K | LLaVA SFT 665k | | lr | 1e-3 | 2e-5 | | batch size | 256 | 128 | | lr schedule | cosine decay | cosine decay | | lr warmup ratio | 0.03 | 0.03 | | epoch | 1 | 1 | | optimizer | AdamW | AdamW | The training protocol for Tab. [tab:freezing-encoder].
| Training Stage | 1 | 2 | 3 | |:---------------:|:--------------:|:-------------:|:--------------:| | Visual Encoder | | | | | Projector | | | | | LLM | | | | | data | LLaVA LCS-558K | ShareGPT4V-PT | LLaVA SFT 665k | | lr | 1e-3 | 2e-5 | 2e-5 | | batch size | 256 | 256 | 128 | | lr schedule | cosine decay | cosine decay | cosine decay | | lr warmup ratio | 0.03 | 0.03 | 0.03 | | epoch | 1 | 1 | 1 | | optimizer | AdamW | AdamW | AdamW | The training protocol for Tab. [tab:ShareGPT4V].
| Training Stage | 1 | 2 | 3 | |:---------------:|:-------------:|:-------------:|:--------------:| | ConvNeXt | | | | | Stage 5 | | | | | Projector | | | | | LLM | | | | | data | ShareGPT4V-PT | ShareGPT4V-PT | LLaVA SFT 665k | | lr | 3e-4 | 2e-5 | 2e-5 | | batch size | 256 | 256 | 128 | | lr schedule | cosine decay | cosine decay | cosine decay | | lr warmup ratio | 0.03 | 0.03 | 0.03 | | epoch | 1 | 1 | 1 | | optimizer | AdamW | AdamW | AdamW | The training protocol for Tab. [tab:add-stage], Tab. [tab:stages-high], and Tab. [tab:ablation-layers]
| Training Stage | 1 | 2 | 3 | |:---------------:|:--------------:|:-------------:|:--------------:| | ConvNeXt | | | | | Stage 5 | | | | | Projector | | | | | LLM | | | | | data | ShareGPT4V-PT | ShareGPT4V-PT | LLaVA SFT 665k | | | ShareGPT4V | ShareGPT4V | | | | ALLaVA Caption | ALLaVA, VFLAN | | | lr | 3e-4 | 2e-5 | 2e-5 | | batch size | 256 | 256 | 128 | | lr schedule | cosine decay | cosine decay | cosine decay | | lr warmup ratio | 0.03 | 0.03 | 0.03 | | epoch | 1 | 1 | 1 | | optimizer | AdamW | AdamW | AdamW | The training protocol for Tab. [tab:main], and Tab. [tab:grounding]
Focus Anywhere for Fine-grained Multi-page Document Understanding 2024-05-23 Chenglong Liu, Haoran Wei, Jinyue Chen, Lingyu Kong, Zheng Ge, Zining Zhu, Liang Zhao, Jianjian Sun, Chunrui Han, Xiangyu Zhang
Modern LVLMs still struggle to achieve fine-grained document understanding, such as OCR/translation/caption for regions of interest to the user, tasks that require the context of the entire page, or even multiple pages. Accordingly, this paper proposes Fox, an effective pipeline, hybrid data, and tuning strategy, that catalyzes LVLMs to focus anywhere on single/multi-page documents. We introduce a novel task to boost the document understanding by making LVLMs focus attention on the document-level region, such as redefining full-page OCR as foreground focus. We employ multiple vision vocabularies to extract visual hybrid knowledge for interleaved document pages (e.g., a page containing a photo). Meanwhile, we render cross-vocabulary vision data as the catalyzer to achieve a full reaction of multiple visual vocabularies and in-document figure understanding. Further, without modifying the weights of multiple vision vocabularies, the above catalyzed fine-grained understanding capabilities can be efficiently tuned to multi-page documents, enabling the model to focus anywhere in both format-free and page-free manners. Besides, we build a benchmark including 9 fine-grained sub-tasks (e.g., region-level OCR/summary, color-guided OCR) to promote document analysis in the community. The experimental results verify the superiority of our model.
# Introduction [intro] Recently, research on Large Vision-Language Models [GPT4](https://arxiv.org/pdf/2303.08774), [minigpt4](http://arxiv.org/pdf/2402.17510v1), [Flamingo](http://arxiv.org/pdf/2205.07065v1) has been an attractive direction. These models not only easily handle some conventional vision tasks (*e.g.*, Image Caption [coco_text](http://arxiv.org/pdf/1707.08831v1), OCR [OCRVQA](http://arxiv.org/pdf/2010.02582v1)), but also demonstrate powerful reasoning capabilities like humans.
(a) Multiple vision vocabularies are catalyzed using synthetic cross-vocabulary data to handle interleaved pages. (b) Fox achieves fine-grained document-level understanding by focusing anywhere, such as region-level OCR/translation and in-page figure caption. (c) Fox impressively supports the entire 8-page input and can focus on multiple cross-page RoIs in a single-turn conversation.
The LVLMs mostly give responses by leveraging large language models [OPT](http://arxiv.org/pdf/2405.04515v2), [vicuna](https://lmsys.org/blog/2023-03-30-vicuna/), [T5](http://arxiv.org/pdf/1910.10683v4) to follow language instructions while referring to the vision vocabulary to understand the input image. Some researchers attempt to adopt LVLMs to advance the understanding of large-resolution (*e.g.*, 833$\times$1132) document pages. For example, UReader [ye2023ureader](http://arxiv.org/pdf/2311.13165v1) crops the input image into smaller patches to align with a CLIP-style vision vocabulary of input size 224$\times$224. Later, TextMonkey [liu2024textmonkey](http://arxiv.org/pdf/2403.14252v1) divides the input image into 448$\times$448 patches and uses Openclip’s ViT-bigG [openclip_ilharco_2024_10469088](openclip_ilharco_2024_10469088) along with a resampling strategy to retain useful image tokens. LLaVA-NeXT [liu2024llavanext](https://llava-vl.github.io/blog/2024-01-30-llava-next/) adopts CLIP-ViT-L-336px to perform visual perception and splits the input image into smaller patches to encode independently. InternVL-V1.5 [chen2024far_intervl1.5](http://arxiv.org/pdf/2404.16821v2) proposes a stronger vision vocabulary InternViT-6B with the input size of 448$\times$448. Similarly, to capture more details of the input image, InternVL-V1.5 [chen2024far_intervl1.5](http://arxiv.org/pdf/2404.16821v2) dynamically divides the input image into 1 to 12 tiles. Different from the methods above, without cropping patches, Vary [wei2023vary](http://arxiv.org/pdf/2312.06109v1) writes an extra SAM-style [SAM](http://arxiv.org/pdf/2305.01275v1) vision vocabulary specific to document and chart data, running in parallel with the CLIP branch. Vary can directly encode a 1024$\times$1024 page into 256 image tokens with a high compression ratio. The patch-based models [ye2023ureader](http://arxiv.org/pdf/2311.13165v1), [liu2024textmonkey](http://arxiv.org/pdf/2403.14252v1), [liu2024llavanext](https://llava-vl.github.io/blog/2024-01-30-llava-next/), [chen2024far_intervl1.5](http://arxiv.org/pdf/2404.16821v2) mostly employ CLIP-style vision vocabulary with small resolution, so a large-scale document needs to be decomposed into many patches/tiles. A patch/tile is independently encoded to 256 image tokens, and InternVL-V1.5 [chen2024far_intervl1.5](http://arxiv.org/pdf/2404.16821v2) even produces 3,328 image tokens during training. However, numerous image tokens are difficult to extend to multi-page documents for contextual understanding. More importantly, there may still be dense characters on these cropped patches, but CLIP-style vision vocabulary compresses limited sparse information of small input images via global contrastive learning, preventing these models from losslessly recovering the content of the original document (*e.g.*, full-page OCR). Although Vary [wei2023vary](http://arxiv.org/pdf/2312.06109v1) enjoys a high compression ratio and avoids cropping patches by directly encoding the document page, the lack of full collaboration across multiple vision vocabularies limits the performance. For example, given an input document page, Vary [wei2023vary](http://arxiv.org/pdf/2312.06109v1) tends to only activate the SAM-style ViT branch due to the specific-vocabulary visual bias. In addition, the above models are sensitive to document format (*e.g.*, multi-column) and do not support fine-grained user interaction on specific regions on documents.
Another key point for document understanding is how to carry out fine-grained interaction, such as OCR/summarizing/captioning of a region of interest. Actually, LVLMs with human-like referential dialogue capability for natural scenes have been investigated, such as Shikra [chen2023shikra](http://arxiv.org/pdf/2306.15195v2) and ChatSpot [zhao2023chatspot](http://arxiv.org/pdf/2307.09474v1). They introduce referring spatial coordinates to refer to a specific region of the input natural image, lifting the user experience and leading to more precise conversations. But these models cannot handle document images because their vision vocabulary, CLIP-ViT [CLIP_radford2021learning](http://arxiv.org/pdf/2404.19696v1), is specific to natural scenes and has a low input resolution. Besides, the CLIP-style pre-training method based on Laion-COCO [schuhmann2021laion](http://arxiv.org/pdf/2111.02114v1) (image-phrase pairs) only weakly writes sparse visual knowledge, leading to a gap in understanding dense documents. Thus, we may ask: *Can we devise an effective and efficient pipeline for LVLMs to achieve the fine-grained multi-page document understanding?* In this paper, we propose Fox, an effective pipeline, hybrid data, and tuning strategy, giving a pleasing answer to the above question. The proposed Fox efficiently catalyzes the LVLM’s attention to anywhere on single/multi-page documents in a user-friendly manner. Our solution has three highlights: 1) *Focusing anywhere:* We introduce a novel task that boosts document understanding by focusing on the region of interest via fine-grained position-aware prompts, *i.e.*, click points, dragged bounding boxes, and drawn color boxes. Notably, the dense full-page OCR sub-task can be further optimized by being redefined as foreground focus. 2) *Full reaction across multiple vision vocabularies:* To fully interpret hybrid visual knowledge on interleaved document pages, we synthesize cross-vocabulary vision data to activate multiple visual vocabularies simultaneously to break down the specific-vocabulary bias of visual content, catalyzing multiple vision vocabularies to a full reaction. 3) *Supporting multi-column format and multiple pages:* With the position-aware prompts, the pipeline of focusing anywhere can yield robust performance regardless of document format. Moreover, benefiting from the high compression ratio (one 1024$\times$1024 page to 256 image tokens), we demonstrate that Fox can be efficiently tuned to achieve the above fine-grained capabilities on multi-page documents without modifying the parameters of the vision vocabularies. As a result of the focusing catalytic process, the proposed Fox can not only give specific-vocabulary responses (*e.g.*, page foreground OCR, region/line-level OCR/translation) but also gain the noticeable ability to utilize the cross-vocabulary visual knowledge (*e.g.*, color-guided OCR, in-document figure caption). Furthermore, for more impressive multi-page document features, Fox can give the OCR results of $region_1$ on $page_1$ and $region_n$ on $page_n$ by only one question. Note that tasks like this with reference to cross-page content are of great research significance. We encourage researchers to rethink the framework design for LVLM-based document understanding and not be limited to conventional single-page sparse QA tasks. Our contributions can be summarized as follows: - We introduce a series of novel tasks to boost document understanding by enabling LVLMs to focus on document-level regions of interest.
- We propose an effective and efficient solution named Fox to focus anywhere on single/multi-page documents. - To catalyze multiple vision vocabularies for figure-text interleaved documents, we provide methods for generating hybrid data containing cross-vocabulary visual elements. - Fox is robust to documents of various formats due to the flexible position-aware prompts. Without training the vision vocabulary, our Fox can be easily tuned to multi-page documents and gain cross-page parsing capabilities. - We build a fine-grained document benchmark, including 9 sub-tasks, such as dense page OCR, region-level OCR/translation/summary, color-guided OCR, multi-page OCR/VQA. Experimental results show that our Fox outperforms other LVLMs by a large margin. # Related Works ## Visual Document Understanding Visual document understanding is widely investigated in the research field of computer vision. Optical Character Recognition (OCR) is a basic task, which plays a key role in document digitalization [smith2007overview](http://arxiv.org/pdf/1003.5893v1), [moysset2017full](http://arxiv.org/pdf/1704.08628v1). The layout analysis task [zhong2019publaynet](http://arxiv.org/pdf/1908.07836v1) aims to detect various document elements and facilitate understanding of the spatial relationships between them. We believe that OCR is a good task to test whether LVLMs can compress documents losslessly. Besides, for translation and summary [vaswani2017attention](http://arxiv.org/pdf/2107.08000v1), [dong2019unified](http://arxiv.org/pdf/2212.06742v2) tasks, the proposed Fox can directly give answers for document images via the multimodal framework. ## Large Language Models In recent times, the success of LLMs has ignited the fields of natural language processing (NLP) and artificial general intelligence (AGI). The LLMs are built with the popular transformer framework which is explored by earlier NLP research, *e.g.*, BERT [Bert](http://arxiv.org/pdf/1810.04805v2), GPT-2 [GPT-2](http://arxiv.org/pdf/2203.12926v1), T5 [T5](http://arxiv.org/pdf/1910.10683v4), and so on. Afterward, it is discovered that when the model parameters are expanded to a certain size, the language model will be greatly boosted due to the so-called "emergent ability" [wei2022emergent](http://arxiv.org/pdf/2403.15796v2). Further, the "GPT time" comes with amazing dialogue robots optimized by Reinforcement Learning with Human Feedback [RLHF_christiano2017deep](http://arxiv.org/pdf/2007.12904v2), *e.g.*, InstructGPT [InstructGPT](http://arxiv.org/pdf/2302.05206v1) and ChatGPT [ChatGPT](https://openai.com/blog/chatgpt/). Following that, OPT [OPT](http://arxiv.org/pdf/2405.04515v2), LLaMA [llama](http://arxiv.org/pdf/2402.08075v1), and GLM [GLM](http://arxiv.org/pdf/2004.13270v1) are accessible to the community to pursue performance like the GPT family. Based on the open-source LLMs, for more specific requirements, some fine-tuned models have emerged, such as Alpaca [alpaca](https://github.com/tatsu-lab/stanford_alpaca) and Vicuna [vicuna](https://lmsys.org/blog/2023-03-30-vicuna/), which also play critical roles in later Large Vision-Language Models. ## Large Vision-Language Models For vision-centric tasks, Large Vision-Language Models (LVLMs) [llava](http://arxiv.org/pdf/2402.11690v1), [Flamingo](http://arxiv.org/pdf/2205.07065v1), [lu2024deepseek](http://arxiv.org/pdf/2402.17510v1) have been developed by connecting the vision networks to LLMs.
CLIP-ViT [CLIP_radford2021learning](http://arxiv.org/pdf/2404.19696v1) is a mature pre-trained vision vocabulary widely used to inject the visual modality into language models. To ensure that LLMs can understand the visual context, LLaVA [llava](http://arxiv.org/pdf/2402.11690v1) uses linear layers to project visual tokens into the text space. Later, beyond natural scenes, LVLMs for large-resolution documents have emerged. UReader [ye2023ureader](http://arxiv.org/pdf/2311.13165v1), built on the LVLM mPLUG-Owl [ye2023mplug](http://arxiv.org/pdf/2405.00390v2), devises a shape-adaptive approach to crop input images into 224$\times$224 patches and feed them into the CLIP vision encoder. Following Qwen-VL [Qwen-VL](http://arxiv.org/pdf/2308.12966v3), TextMonkey [liu2024textmonkey](http://arxiv.org/pdf/2403.14252v1) uses a more powerful vision vocabulary, OpenCLIP's ViT-bigG [openclip_ilharco_2024_10469088](openclip_ilharco_2024_10469088), with a 448$\times$448 input size to encode each cropped patch. With the same patch-cropping strategy, LLaVA-NeXT [liu2024llavanext](https://llava-vl.github.io/blog/2024-01-30-llava-next/) adopts CLIP-ViT-L-336px to perform visual perception. Similarly, to capture more details, InternVL-V1.5 [chen2024far_intervl1.5](http://arxiv.org/pdf/2404.16821v2) dynamically divides the input image into 1 to 12 tiles of 448$\times$448. In contrast, without cropping patches, Vary [wei2023vary](http://arxiv.org/pdf/2312.06109v1) trains an extra SAM-style [SAM](http://arxiv.org/pdf/2305.01275v1) 1024$\times$1024 vision vocabulary specific to document and chart data, running in parallel with the CLIP branch. Compared to the above models, we believe that document understanding should move towards more fine-grained (*e.g.*, region-level OCR/translation) and multi-page tasks. Imagine how cool it would be if we could use the LVLM like a reading pen! In this paper, we introduce Fox, which achieves these fine-grained features by focusing anywhere on multi-page documents.

# Methods
Overall framework of the proposed Fox. All image tokens of multiple pages are unified into a sequence to achieve multi-page understanding. We devise position-aware prompts (point, color, and box) to make the model focus anywhere on single/multi-page documents. We catalyze multiple vision vocabularies into a full reaction of hybrid visual knowledge for interleaved pages.
In this section, we elaborate on the details of the proposed Fox. First, we introduce the flexible pipeline that supports single/multi-page document understanding. Second, we provide the strategy to produce data containing hybrid visual elements to activate multiple vocabularies concurrently. Last, we unify multi-task data with position-aware prompts to conduct the focusing process.

## Framework for Focusing Anywhere

As illustrated in Figure 2, the architecture of the proposed Fox is built with two vision vocabularies, a large language model, and embedding linear layers. Specifically, to better handle figure-text interleaved large-resolution documents, there are two vision vocabularies, including the natural content-aware CLIP-ViT [CLIP_radford2021learning](http://arxiv.org/pdf/2404.19696v1) and the artificial content-aware Vary-tiny [wei2023vary](http://arxiv.org/pdf/2312.06109v1). The overall framework is neat and provides user-friendly fine-grained interactions, which can focus on the entire page and on more specific regions of interest (RoI). Impressively, the proposed Fox also allows users to select RoIs on multiple pages at the same time, enabling cross-page contextual understanding.

Given a set of input document pages $I=\{p_i\}_{i=1}^N$, users can further indicate regions of interest $r_i$ on each page by clicking a point, dragging boxes, or drawing color boxes, and then give some language instructions $L^{instruct}$ about the questioned RoIs. $N$ is the number of input pages. The spatial coordinates or color information of $\{r_i\}_{i=1}^N$ is transformed into position-aware prompts and combined with $L^{instruct}$ to produce complete referential instructions. Meanwhile, the two vision vocabularies produce 256 image tokens $v^C_i \in \mathbb{R}^{256\times1024}$ and $v^S_i \in \mathbb{R}^{256\times1024}$ for each page $p_i$. These image tokens $\{v^C_i\}_{i=1}^N$ and $\{v^S_i\}_{i=1}^N$ are sent into linear layers $W^C$ and $W^S$ to align with the linguistic space. Then, the final image tokens $v_i \in \mathbb{R}^{256\times2048}$ are obtained by concatenation. Note that $v_i$ is a compressed cross-vocabulary representation covering both dense characters and figures. Finally, with the projected image tokens and referential instructions, the LLM generates the response sequence $Q$ in an auto-regressive manner. The above process can be formulated as follows:

$$\{v_i\}_{i=1}^N = \left[ W^C \circ \{v^C_i\}_{i=1}^N \,||\, W^S \circ \{v^S_i\}_{i=1}^N\right]$$

$$Q = \mathcal{LLM} \left( \{v_i\}_{i=1}^N, \left(L^{instruct}, \Psi \left(\{r_i\}_{i=1}^N \right)\right) \right)$$

where $\left[\cdot || \cdot \right]$ is the concatenation operation and $\Psi(\cdot)$ denotes the normalization of spatial coordinates. Note that the multi-page ($N$ pages) image tokens $\{v_i\}_{i=1}^N$ are unified into a single sequence for cross-page contextual understanding. With causal masked sequence modeling, the training objective can be expressed as:

$$\mathcal{L}_t=-E_{(Q, V)\sim D}\sum_{m=1}^{M}\operatorname{log} P_{\theta} \left( q_m \mid q_{<m}, V \right)$$

where $Q=\{q_m\}_{m=1}^{M}$ is the target response sequence, $V$ denotes the unified multi-page image tokens, and $D$ is the multimodal training dataset.
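To make the token-fusion step concrete, here is a minimal PyTorch-style sketch of the connector described above: each vocabulary's 256 tokens per page are projected by $W^C$ and $W^S$, concatenated channel-wise into 256$\times$2048 page tokens, and all pages are unified into one sequence. The class name and the usage example are our own illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn


class CrossVocabularyConnector(nn.Module):
    """Illustrative sketch: project CLIP-ViT and SAM-style (Vary-tiny) tokens
    with W^C and W^S, concatenate them channel-wise per page, and unify all
    pages into a single token sequence for the LLM."""

    def __init__(self, clip_dim: int = 1024, sam_dim: int = 1024, out_dim: int = 1024):
        super().__init__()
        self.w_c = nn.Linear(clip_dim, out_dim)  # W^C
        self.w_s = nn.Linear(sam_dim, out_dim)   # W^S

    def forward(self, v_clip: torch.Tensor, v_sam: torch.Tensor) -> torch.Tensor:
        # v_clip, v_sam: [N_pages, 256, 1024]
        v_c = self.w_c(v_clip)                   # [N, 256, 1024]
        v_s = self.w_s(v_sam)                    # [N, 256, 1024]
        v = torch.cat([v_c, v_s], dim=-1)        # [N, 256, 2048] per-page tokens
        # Unify the multi-page tokens into one sequence for cross-page understanding.
        return v.reshape(-1, v.shape[-1])        # [N * 256, 2048]


if __name__ == "__main__":
    connector = CrossVocabularyConnector()
    clip_tokens = torch.randn(8, 256, 1024)      # 8 pages, 256 tokens each
    sam_tokens = torch.randn(8, 256, 1024)
    print(connector(clip_tokens, sam_tokens).shape)  # torch.Size([2048, 2048])
```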
## Activating Multiple Vision Vocabularies with Hybrid Data

#### Figure-text interleaved data.

To activate both vocabularies at once, we synthesize figure-text interleaved pages by rendering natural images onto prepared document pages. Given a document page of size $W^d \times H^d$ and a natural image of size $W^n \times H^n$, the natural image is first rescaled to a suitable size:

$$\label{eq1} \left\{ \begin{aligned} W_{new}^n & = \operatorname{randint}\left(\left[\alpha \cdot W^d \right], \left[\beta \cdot W^d\right] \right), H_{new}^n = \left[W_{new}^n/W^n \cdot H^n \right], & \text{if} \ W^n/H^n > W^d/H^d \\ H_{new}^n & = \operatorname{randint}\left(\left[\eta \cdot H^d \right], \left[\gamma \cdot H^d\right] \right), W_{new}^n = \left[H_{new}^n/H^n \cdot W^n \right], & \text{if} \ W^n/H^n \leq W^d/H^d\\ \end{aligned} \right.$$

where $W_{new}^n$/$H_{new}^n$ denote the desired width/height of the scaled natural image. $\left[\cdot\right]$ denotes rounding to an integer. $\alpha$, $\beta$, $\eta$, and $\gamma$ are hyperparameters that control the scaling ratio, and they are set to 0.3, 0.9, 0.4, and 0.9, respectively. Then, we randomly pick a suitable location $(x^n_1, y^n_1, x^n_2, y^n_2)$ on the page to place the scaled natural image. Moreover, to keep the interleaved data reasonable and remove the occluded text on the page, we calculate the intersection over union (IoU) between $(x^n_1, y^n_1, x^n_2, y^n_2)$ and the original text boxes $\left\{ (x^d_{i,1}, y^d_{i,1}, x^d_{i,2}, y^d_{i,2}) \right\}_{i=1}^{N_d}$, and fill the text boxes overlapped by the natural image with white. $N_d$ is the number of text boxes on the page. In this way, we obtain cross-vocabulary image-text pairs for the in-document figure caption task. The text for each interleaved page includes the remaining optical characters and the description of the pasted natural image.

#### Color-text hybrid data.

CLIP encodes knowledge for recognizing colors, while Vary-tiny does not. We produce color-text hybrid data to further activate multiple vocabularies, which is the key to enabling Fox to support conversations about users' color-guided RoIs. We randomly select three text boxes and paint them directly on the document page in red, blue, and green. The proposed Fox is expected to directly give the OCR results for the area indicated by the queried color.

## Triggering Focusing Process via Fine-grained Instruction-following Tasks

We devise fine-grained instructions based on several position-aware text prompts, such as points, boxes, and colors, to catalyze Fox to focus on any fine-grained region of single/multi-page documents.

#### Fine-grained document understanding.

We define several novel sub-tasks to drive the model to focus on fine-grained regions for flexible document-level understanding: 1) Foreground OCR. We redefine the page OCR task as foreground focus to further boost dense perception. The instruction can be "*Give the OCR results of the box $(x^f_{i,1}, y^f_{i,1}, x^f_{i,2}, y^f_{i,2})$*". The foreground box can be obtained by some simple operations. 2) Region-level OCR. Based on the obtained text boxes, we transform the content of one page into multiple region-level OCRs via multi-turn conversations. An example can be "*Give the OCR results of the box $(x^d_{i,1}, y^d_{i,1}, x^d_{i,2}, y^d_{i,2})$*". 3) Line-level OCR. We pick a point near the left side of each line as the position prompt. Then, we construct line-level multi-turn conversations; an example can be "*OCR the line $(x^d_{j}, y^d_{j})$*". 4) Color-guided OCR. Using the color-text hybrid data in Section 3.2, we define the corresponding cross-vocabulary task with color-guided questions, such as "*OCR red box*" and "*OCR blue box*". 5) Region-level translation and summary. We filter and retain the boxes with text lengths over 400 on each page. Then, we employ GPT-3.5 [ChatGPT](https://openai.com/blog/chatgpt/) to generate the translation and summary for each long in-box text as the corresponding annotations. The instruction can be "*Translate/Summarize the content of the box $(x^d_{i,1}, y^d_{i,1}, x^d_{i,2}, y^d_{i,2})$*". 6) Document layout: We convert the 330K high-quality annotations of PubLayNet [zhong2019publaynet](http://arxiv.org/pdf/1908.07836v1) to the unified conversation format.
Further, we sample 1M extra PDF pages and use the PaddleOCRv2 [paddleocrv2_du2021pp](http://arxiv.org/pdf/2109.03144v2) tools to generate pseudo layout annotations.

#### In-document figure understanding.

Based on the synthetic interleaved data, we organize the cross-vocabulary image-text pairs into two sub-tasks: 1) In-document figure caption. With the added position-aware prompts, an example language instruction is: "*Give a brief description for the region $(x^n_1, y^n_1, x^n_2, y^n_2)$ of the image*". The box denotes the boundary of the figure. 2) In-document in-figure chat. The RegionChat [zhao2023chatspot](http://arxiv.org/pdf/2307.09474v1) dataset is built for referential dialogue on natural images. After rendering its images onto PDF pages, we can use the spatial coordinates of the referring region to ask the proposed Fox: "*What can you see in this region? $(x^n_1, y^n_1, x^n_2, y^n_2)$*". At a more fine-grained level, the RoI can be a box within the figure on the document page.

#### Extension for multi-page documents.

The proposed Fox can be easily tuned to focus on multiple regions of multi-page documents using simple instructions. As a forerunner, we define two basic yet interesting multi-page sub-tasks and give position-aware instruction examples. 1) Multi-page region-level OCR: "*OCR boxes on multiple pages. Page 1: $(x^1_1, y^1_1, x^1_2, y^1_2)$, Page 2: $(x^2_1, y^2_1, x^2_2, y^2_2)$, $\dots$ Page N: $(x^N_1, y^N_1, x^N_2, y^N_2)$*". 2) Cross-page VQA: "*Which page's box contains more characters? Page 1: $(x^1_1, y^1_1, x^1_2, y^1_2)$, Page 2: $(x^2_1, y^2_1, x^2_2, y^2_2)$, $\dots$ Page N: $(x^N_1, y^N_1, x^N_2, y^N_2)$*".

It is worth noting that all the above methods are independent of document format. PDF data with any format or layout, such as single-column, double-column, or interleaved, can be parsed to extract positional prompts and formulated into the corresponding conversations. With the fine-grained position-aware instructions, the vision query pipeline enjoys high human-AI interactivity and is robust to different formats (multi-column) and multi-page documents.

## Catalyzing Fox by Multi-page and Multi-grained Data Engine

The data engine is a key part of the proposed Fox. To ensure performance on multiple tasks, we carefully control the quantity and ratio of the training data; more details are reported in Table [tab:data].

#### Pre-training data.

In the pre-training stage, we formulate a large amount of multimodal task-driven data. Specifically, for hybrid images of the in-document caption and chat sub-tasks, we render the BLIP558K [llava](http://arxiv.org/pdf/2402.11690v1) data, 1M natural images sampled from Laion-COCO [schuhmann2021laion](http://arxiv.org/pdf/2111.02114v1), and the RegionChat100K [zhao2023chatspot](http://arxiv.org/pdf/2307.09474v1) data onto an equal number of document pages sampled from the prepared PDF data. For fine-grained optical character understanding, we formulate 4.6M document image-text pairs across 6 task types, covering box/line/color position-aware prompts and OCR/translation/summary interactive task forms. Further, we generate 800K multi-page samples, including multi-page multi-region OCR and cross-page QA.
In addition, to maintain the general conversational capabilities of our model, we sample 1M natural-image data from Laion-COCO [schuhmann2021laion](http://arxiv.org/pdf/2111.02114v1) and NLP dialogue data from Alpaca [alpaca](https://github.com/tatsu-lab/stanford_alpaca), Baize [xu2023baize](http://arxiv.org/pdf/2404.02406v1), and ShareGPT.

#### SFT data.

In the supervised fine-tuning stage, to make the conversation experience more natural, we sample 10K image-text pairs for each data type of the above pre-training data and adopt GPT-3.5 [ChatGPT](https://openai.com/blog/chatgpt/) to rewrite the prompts into roughly ten times more diversified forms. Besides, LLaVA80K [llava](http://arxiv.org/pdf/2402.11690v1) is also added to further tune our model to generate pleasing answers.
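For readers who want a concrete picture of the figure-text interleaved synthesis described in Section 3.2 (randomized scaling, random placement, and whitening of occluded text boxes), the following rough sketch shows one way it could be implemented with PIL. The helper names, the overlap test, and the rendering details are our own assumptions for illustration, not the authors' data engine.

```python
import random
from PIL import Image

ALPHA, BETA, ETA, GAMMA = 0.3, 0.9, 0.4, 0.9  # scaling hyperparameters from the text


def _iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    if inter == 0:
        return 0.0
    area = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / area


def paste_figure_on_page(page: Image.Image, figure: Image.Image, text_boxes):
    """Scale `figure` with the randomized rule, paste it on `page`, and
    whiten any text box it overlaps. Returns the page and the kept boxes."""
    Wd, Hd = page.size
    Wn, Hn = figure.size
    if Wn / Hn > Wd / Hd:  # wide figure: bound its width by the page width
        new_w = random.randint(int(ALPHA * Wd), int(BETA * Wd))
        new_h = int(new_w / Wn * Hn)
    else:                  # tall figure: bound its height by the page height
        new_h = random.randint(int(ETA * Hd), int(GAMMA * Hd))
        new_w = int(new_h / Hn * Wn)
    figure = figure.resize((new_w, new_h))

    # Random placement of the scaled figure on the page.
    x1 = random.randint(0, Wd - new_w)
    y1 = random.randint(0, Hd - new_h)
    fig_box = (x1, y1, x1 + new_w, y1 + new_h)

    kept = []
    for box in text_boxes:
        if _iou(box, fig_box) > 0:                          # occluded text is removed
            page.paste((255, 255, 255), tuple(map(int, box)))  # fill with white
        else:
            kept.append(box)
    page.paste(figure, (x1, y1))
    return page, kept
```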
| **Task** | **Region-level Dataset** | **Sample** | **Task** | **Page-level Dataset** | **Sample** |
|:--:|:--:|:--:|:--:|:--:|:--:|
| In-document Cap. | PDF $\times$ BLIP558K [llava](http://arxiv.org/pdf/2402.11690v1) | 558K | Layout | PubLayNet [zhong2019publaynet](http://arxiv.org/pdf/1908.07836v1) | 33K |
| | PDF $\times$ Laion-COCO [schuhmann2021laion](http://arxiv.org/pdf/2111.02114v1) | 1M | | Annots. by PaddleOCRv2 [paddleocrv2_du2021pp](http://arxiv.org/pdf/2109.03144v2) | 1M |
| In-document Chat | PDF $\times$ RegionChat [zhao2023chatspot](http://arxiv.org/pdf/2307.09474v1) | 22K | Cap. | Laion-COCO [schuhmann2021laion](http://arxiv.org/pdf/2111.02114v1) | 500K |
| Doc. Understanding | foreground OCR | 1M | NLP | Alpaca [alpaca](https://github.com/tatsu-lab/stanford_alpaca) | 52K |
| | region-level OCR | 1M | | Baize [xu2023baize](http://arxiv.org/pdf/2404.02406v1) | 112K |
| | line-level OCR | 600K | | ShareGPT | 125K |
| | color-guided OCR | 1M | | | |
| | region-level translation | 500K | PDF | Page OCR | 1M |
| | region-level summary | 500K | | Page Markdown | 1M |
| Multi-page Doc. | multi-region OCR | 400K | \- | \- | \- |
| | cross-page VQA | 400K | \- | \- | \- |

Training data of the multi-page and multi-grained data engine.
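The position-aware prompts that appear throughout the data above can be assembled mechanically. Below is a small illustrative helper for the multi-page region-level OCR instruction; the function names are hypothetical, and the normalization of coordinates to a 0-1000 grid is our assumption (consistent with the foreground box $[2, 2, 998, 998]$ used later in the experiments), since the exact convention of $\Psi(\cdot)$ is not spelled out here.

```python
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def normalize_box(box: Box, page_w: int, page_h: int, grid: int = 1000) -> Tuple[int, ...]:
    """Psi(.): map pixel coordinates onto a fixed grid (assumed 0-1000 here)."""
    x1, y1, x2, y2 = box
    return (round(x1 / page_w * grid), round(y1 / page_h * grid),
            round(x2 / page_w * grid), round(y2 / page_h * grid))


def multipage_region_ocr_prompt(boxes: Dict[int, Box],
                                page_sizes: Dict[int, Tuple[int, int]]) -> str:
    """Build a multi-page region-level OCR instruction from one box per page."""
    parts = [f"Page {p}: {normalize_box(b, *page_sizes[p])}"
             for p, b in sorted(boxes.items())]
    return "OCR boxes on multiple pages. " + ", ".join(parts)


# Example: two pages with one region of interest each.
prompt = multipage_region_ocr_prompt(
    boxes={1: (120, 80, 560, 240), 2: (90, 400, 700, 520)},
    page_sizes={1: (1024, 1024), 2: (1024, 1024)},
)
print(prompt)
```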
#### Input and Conversation Format

For each input image, we resize it to a fixed resolution of 1024$\times$1024 before feeding it into the SAM-style [SAM](http://arxiv.org/pdf/2305.01275v1) ViT branch, and we separately resize it to 224$\times$224 as the input of the CLIP vision network. We choose Qwen-1.8B [qwen](http://arxiv.org/pdf/2309.16609v1), with its rich linguistic vocabulary, as our language model. Following the LLaVA-MPT [llava](http://arxiv.org/pdf/2402.11690v1), [team2023introducing](http://arxiv.org/pdf/2311.16429v1) dialogue style, the input conversation format can be summarized as follows: \<\|im_start\|\>user: {image tokens} "*human question \[position-aware prompts\]*" \<\|im_end\|\> \<\|im_start\|\>assistant: "*AI responses*" \<\|im_end\|\>.

# Experiments

## Implementation Details

During the multi-task pre-training and SFT phases, the multiple vision vocabularies (CLIP and the SAM-style ViT) are frozen and only the parameters of the embedding linear layers and the language model are optimized. We train our model using the AdamW optimizer [AdamW](http://arxiv.org/pdf/2311.11446v2) and a cosine annealing scheduler [loshchilov2016sgdr](http://arxiv.org/pdf/1608.03983v5). The learning rate is set to 1e-4 in pre-training and then to 2e-5 in SFT. In both stages, we use 48 A800 GPUs with a per-device batch size of 4, and the number of data epochs is set to 1.

## Multi-grained Benchmark and Metrics

To advance fine-grained document understanding, we build a bilingual benchmark including 9 sub-tasks. We collect 112 English pages and 100 Chinese pages, covering single- and multi-column formats. The number of words per page exceeds 1,000. These images are used to evaluate page OCR, line-level OCR, color-guided OCR, region-level OCR/translation/summary, multi-page multi-region OCR, and cross-page VQA. Besides, to measure performance on interleaved data, we render 200 natural images sampled from Laion-COCO [schuhmann2021laion](http://arxiv.org/pdf/2111.02114v1) onto 200 PDF pages to evaluate the document-level in-figure caption task. The comprehensive evaluation metrics include normalized edit distance, F1-score, BLEU [papineni2002bleu](http://arxiv.org/pdf/2202.11027v1), METEOR [banerjee2005meteor](http://arxiv.org/pdf/2312.00536v1), ROUGE [lin2004rouge](http://arxiv.org/pdf/2209.06517v2), *etc.*
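As a reference for the first benchmark metric, normalized edit distance is typically the Levenshtein distance divided by a string length; the sketch below normalizes by the longer of the two strings, which is an assumption on our part since the exact convention is not specified here.

```python
def normalized_edit_distance(prediction: str, reference: str) -> float:
    """Levenshtein distance divided by the longer string length (assumed convention)."""
    m, n = len(prediction), len(reference)
    if max(m, n) == 0:
        return 0.0
    # Classic dynamic-programming table for the Levenshtein distance.
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if prediction[i - 1] == reference[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / max(m, n)


print(normalized_edit_distance("Fox focuses anywhere", "Fox focuses anywhere!"))  # ~0.048
```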
| **Method** | Params | Edit Distance $\downarrow$ | F1-score $\uparrow$ | Precision $\uparrow$ | Recall $\uparrow$ | BLEU $\uparrow$ | METEOR $\uparrow$ | |:---|:--:|:--:|:--:|:--:|:--:|:--:|:--:| | LLaVA-NeXT [liu2024llavanext](https://llava-vl.github.io/blog/2024-01-30-llava-next/) | 34B | 0.430 | 0.647 | 0.573 | 0.881 | 0.478 | 0.582 | | InternVL-ChatV1.5 [chen2024far_intervl1.5](http://arxiv.org/pdf/2404.16821v2) | 26B | 0.393 | 0.751 | 0.698 | 0.917 | 0.568 | 0.663 | | Nougat [blecher2023nougat](http://arxiv.org/pdf/2308.13418v1) | 250M | 0.255 | 0.745 | 0.720 | 0.809 | 0.665 | 0.761 | | Vary [wei2023vary](http://arxiv.org/pdf/2312.06109v1) | 7B | 0.092 | 0.918 | 0.906 | 0.956 | 0.885 | 0.926 | | Vary-toy [wei2024small_varytoy](http://arxiv.org/pdf/2401.12503v1) | 1.8B | 0.082 | 0.924 | 0.919 | 0.938 | 0.889 | 0.929 | | Qwen-VL-Plus [Qwen-VL](http://arxiv.org/pdf/2308.12966v3) | \>100B | 0.096 | 0.931 | 0.921 | 0.950 | 0.893 | 0.936 | | Qwen-VL-Max [Qwen-VL](http://arxiv.org/pdf/2308.12966v3) | \>100B | 0.057 | **0.964** | 0.955 | **0.977** | **0.942** | **0.971** | | Fox (foreground focus) | **1.8B** | **0.046** | 0.952 | **0.957** | 0.948 | 0.930 | 0.954 | Dense English text recognition on the single document page.
| **Method** | Params | Edit Distance $\downarrow$ | F1-score $\uparrow$ | Precision $\uparrow$ | Recall $\uparrow$ | BLEU $\uparrow$ | METEOR $\uparrow$ | |:---|:--:|:--:|:--:|:--:|:--:|:--:|:--:| | InternVL-ChatV1.5 [chen2024far_intervl1.5](http://arxiv.org/pdf/2404.16821v2) | 26B | 0.265 | 0.816 | 0.784 | 0.866 | 0.622 | 0.717 | | Vary-toy [wei2024small_varytoy](http://arxiv.org/pdf/2401.12503v1) | 1.8B | 0.142 | 0.914 | 0.928 | 0.907 | 0.718 | 0.832 | | Qwen-VL-Plus [Qwen-VL](http://arxiv.org/pdf/2308.12966v3) | \>100B | 0.121 | 0.895 | 0.903 | 0.890 | 0.684 | 0.828 | | Vary [wei2023vary](http://arxiv.org/pdf/2312.06109v1) | 7B | 0.113 | 0.952 | 0.961 | 0.944 | 0.754 | 0.873 | | Qwen-VL-Max [Qwen-VL](http://arxiv.org/pdf/2308.12966v3) | \>100B | 0.091 | 0.931 | 0.917 | 0.946 | 0.756 | 0.885 | | Fox (foreground focus) | **1.8B** | **0.061** | **0.954** | **0.964** | **0.946** | **0.842** | **0.908** | Dense Chinese text recognition on the single document page.
| **Metric** | **English** | | | **Chinese** | | |
|:--|:--:|:--:|:--:|:--:|:--:|:--:|
| | color | region | line | color | region | line |
| Edit Distance $\downarrow$ | 0.064 | 0.059 | 0.116 | 0.114 | 0.042 | 0.084 |
| F1-score $\uparrow$ | 0.940 | 0.957 | 0.879 | 0.884 | 0.955 | 0.918 |
| Precision $\uparrow$ | 0.942 | 0.962 | 0.879 | 0.902 | 0.966 | 0.931 |
| Recall $\uparrow$ | 0.942 | 0.955 | 0.883 | 0.873 | 0.947 | 0.909 |
| BLEU $\uparrow$ | 0.868 | 0.914 | 0.845 | 0.778 | 0.885 | 0.825 |
| METEOR $\uparrow$ | 0.938 | 0.955 | 0.878 | 0.848 | 0.934 | 0.886 |

Fine-grained OCR performance under the color-guided, region-level, and line-level settings on English and Chinese document pages.
## Evaluation Results

#### Foreground focus for dense text recognition on a single page.

For dense text recognition on the entire page, we directly input the normalized box $\left[2, 2, 998, 998\right]$ as the foreground prompt. As shown in Tables 1 and 2, Fox showcases strong English and Chinese dense OCR ability, achieving almost lossless compression of the document page. Specifically, Fox achieves the best edit distances of 0.046 and 0.061 in English and Chinese, respectively. Compared to Vary-toy, which uses image-level prompts, the proposed Fox lifts the English F1-score by 2.8% by redefining the task as foreground focus. Note that the performance of LLaVA-NeXT and InternVL-ChatV1.5, which use the CLIP-style vocabulary, is bottlenecked, indicating that the dense text of each patch is not completely encoded.

#### Region focusing performance on in-document fine-grained tasks.

As shown in Table [tab:boxline], Fox yields excellent OCR results on various metrics under the color-guided, region-level, and line-level settings, indicating that our model can accurately recognize the content in these randomly sampled RoIs. In Table 3, for region-level translation, Fox yields an acceptable METEOR of 0.366, limited by its smaller language model of 1.8B parameters. In addition, we evaluate our model on the fine-grained summary task and obtain a decent ROUGE-L F-score of 0.282. It is worth mentioning that this reading-pen-like usage is exactly what users need.
| **Fine-grained Translation** | | **Fine-grained Summary** | | | **Fine-grained Caption** | |
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
| BLEU | METEOR | ROUGE-L R | ROUGE-L P | ROUGE-L F | METEOR | ROUGE-L F |
| 0.138 | 0.366 | 0.261 | 0.316 | 0.282 | 0.359 | 0.396 |

The performance of in-document fine-grained understanding tasks. The fine-grained translation/summary/caption tasks are targeted at interpreting in-document text/figure regions.
| **Method** | **Multi-page (8 pages) multi-region OCR** | | | | **Cross-page (8 pages) VQA** |
|:---|:--:|:--:|:--:|:--:|:--:|
| | Edit Distance $\downarrow$ | F1-score $\uparrow$ | BLEU $\uparrow$ | METEOR $\uparrow$ | Accuracy $\uparrow$ |
| Fox (Ours) | 0.084 | 0.946 | 0.836 | 0.805 | 0.827 |

The performance of fine-grained tasks on multi-page (8 pages) documents.
#### Cross-vocabulary focusing tasks on interleaved pages.

The color-guided task requires cross-vocabulary visual knowledge, *i.e.*, CLIP for recognizing colors and Vary-tiny for capturing texts. Table [tab:boxline] shows decent results (F1-scores of 0.940 and 0.884 on English and Chinese, respectively), which meet our expectations thanks to the collaboration across multiple vision vocabularies. For the in-document figure caption task, we render natural images onto document pages and ask our model "*What is this in the box $(x_1, y_1, x_2, y_2)$?*", where the box is the boundary of the natural image pasted onto the document page. As shown in Table 3, when handling interleaved data, Fox reaches a METEOR of 0.359 and a ROUGE-L F-score of 0.396 due to the full reaction of activating multiple vocabularies.

#### Exploration of focusing on multiple pages.

To verify the focusing capability of Fox on multi-page documents, we report two relevant results in Table 4. For the multi-page OCR task, we ask the model to output the OCR results of 8 boxes on 8 complex pages (in mixed English/Chinese and mixed single/multi-column formats) in a single-turn conversation. Our Fox still achieves an impressive F1-score of 0.946, realizing true focus-anywhere capability by parsing the entire 8-page document simultaneously. For the cross-page visual question-answering task, which requires the model to determine which of several cross-page boxes contains the most characters, Fox yields a high accuracy of 0.827, demonstrating that VQA reasoning becomes easier once the dense text of multiple pages is successfully perceived.
Visualization results. Fox can focus anywhere through fine-grained features such as in-document figure captioning, color-guided OCR, and VQA in cartoon books.
#### Visualization.

Figure 3 shows that Fox delivers impressive features with high human-AI interactivity. For the figure on the academic page, Fox gives the response "global seismic hazards", which is relevant to the content of the document. Fox can also give precise OCR results thanks to its dense text perception. For the cartoon book, Fox can recognize the interesting "lion" and read the story text for users. This indicates that our Fox enjoys fine-grained focusing capabilities in various scenarios.

# Conclusion and Limitations

This paper proposes a user-friendly LVLM named Fox, which enjoys fine-grained capabilities for focusing anywhere on single/multi-page documents. Further, after catalyzing the multiple vision vocabularies into a full reaction, Fox gains impressive cross-vocabulary features on figure-text interleaved pages. To advance fine-grained document understanding, we provide a benchmark containing comprehensive sub-tasks. Our Fox achieves promising scores in these experiments, taking a successful step toward high human-AI interactivity on dense-content documents. We believe that the proposed method still has considerable room for improvement (*e.g.*, the low-resolution CLIP branch), and we encourage more researchers to focus on more reasonable multi-page, document-level tasks.

# Appendix

We show more output results of our model Fox. All test images are from the Internet.
Fox can give precise responses when focusing on the 8-page document. These pages contain bilingual content, have well over a thousand characters per page, and have a variety of single and multi-column layouts. This extreme case demonstrates powerful focusing capabilities.
The left case shows Fox can handle the cross-page VQA task on the multi-page (8 pages as an example) document. The right case shows Fox can perform the dense Chinese text recognition by foreground focus and obtain precise results.
The proposed Fox easily performs dense English text recognition by foreground focus.
Fox can achieve text-associative in-page figure caption and fine-grained document understanding. Fox enjoys high flexibility and robustness when performing fine-grained region-level translation/summary/OCR tasks in multi-column documents.
Of course, Fox can yield interesting results in cartoon and natural scenes.
[^1]: This work was done when the first author was interning at Megvii Technology Inc.