Document AI encompasses several tasks, such as Document Classification, Document Information Extraction, Document Reconstruction, Document Captioning, Document Summarization, and Document Question Answering. Three trends are notable: (1) multipage VrDU datasets have recently emerged and are steadily increasing in number, indicating a shift in the field towards this type of task. These datasets encompass a wide variety of document types, including traditional PDF documents, charts, tables, web pages, arXiv papers, diagrams, and application pages, showcasing the diversity of the field. (2) There is a growing emphasis on datasets covering tasks that require abstract or numerical reasoning, which demands higher levels of cognitive processing. (3) There is an increasing focus on incorporating multiple types of data through varied instructions in these datasets.

**Question Answering**

Question answering is a prevalent task in natural language processing (NLP) where a model must provide a natural language response to a question based on a given passage. This concept extends to images, evolving into Visual Question Answering (VQA): answering questions posed in natural language about the content of images, combining computer vision and NLP to interpret and respond to specific queries about visual elements. VQA datasets come with a caveat: it is not always clear what is actually answering the question, the LLM alone or the LLM aided by the visual representation it receives as input. It is therefore important to choose datasets whose questions can only be answered with the image(s) they contain. One way to assess this is to compare performance on such datasets with and without the visual representation given to the LLM. [Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs](https://arxiv.org/pdf/2406.16860) presents such an overview, reporting performance with and without the visual encoder, which separates the benchmarks that genuinely require vision from those that do not:

![](/literature_review/doc_ai/vlm/bench_cat.png)

For instance, AI2D (a dataset of questions about figures) displays less than a 5% gap between vision enabled and disabled, suggesting that such benchmarks may not significantly depend on visual input and instead rely heavily on the base LLM. While VQA typically involves a single question and answer per image, Conversational VQA (Conv VQA) features a series of questions and answers within a single conversation, enabling multiple interactions about the same image. In VQA, images can also vary widely in format:

- Conventional VQA focuses on realistic or synthetic photos, with questions about object recognition, attributes, and spatial relationships within the image.
- Scene Text VQA deals with realistic photos that include textual elements within the scene, such as a restaurant sign, focusing on recognizing and understanding text associated with objects in the image.
- Chart VQA addresses images of charts, such as bar graphs, line charts, or pie charts, with questions often involving trend recognition, value comparisons, and identifying specific chart properties.
- Diagram VQA, also known as Infographic VQA, involves interpreting diagrammatic images that explain processes or relationships. Diagrams are complex graphics comprising a graphic space, various elements, and their interrelationships. Questions in this category typically ask about components and connections depicted in the diagrams.
- Document VQA relates to images of various document types, including business PDFs, web pages, forms, receipts, and invoices. VQA tasks on documents resemble Machine Reading Comprehension (MRC) tasks, involving questions about textual content that are answered using text extracted from the document. Document-based VQA adds image processing to analyze the visual input alongside textual queries.
- Multipage VQA involves answering questions that require understanding and integrating information across multiple pages of a document, such as a presentation or report, rather than focusing on a single page per question, as is typical for traditional VQA tasks.
- Open-domain VQA involves answering questions based on a broad database of documents, images, tables, texts, and charts. The model includes a retriever component that identifies the specific document likely to contain the answer before generating a response.

**Image Classification**

Image classification is a core task in computer vision that involves sorting images into predefined categories based on their content. When applied to documents, this process narrows down to identifying various types of document images, such as charts, diagrams, and other visual elements. For instance, chart image classification focuses on recognizing and categorizing charts extracted from documents into specific predefined categories. This task is complex due to the wide range of chart types, each with distinct features and structures. Likewise, diagrams and other document visuals are classified based on their characteristics and the information they convey, thereby improving the automation and comprehension of document content across numerous applications.

**Information Extraction**

Information extraction (IE) is a prevalent task in natural language processing (NLP). It involves the automatic extraction of structured information from unstructured or semi-structured sources like text documents, web pages, or databases. This process includes identifying and extracting specific types of entities (such as names of people, organizations, and locations) and their relationships from the text. When applied to documents, this task begins by identifying relevant elements within the document, such as text blocks, images, charts, or specific visual patterns. These identified elements are then annotated with meaningful labels or metadata that describe their content or function within the document.

**Document Reconstruction**

Document reconstruction is a specialized task that shares similarities with optical character recognition (OCR). Its aim is to reconstruct an image of a document page in a manner that preserves both its physical appearance and informational content. For textual content, this process involves assembling the text blocks of the image and converting the image document into a textual format. For VrDs, which contain a specific layout, the reconstruction task aims at converting the image document into structured formats like Markdown or HTML, facilitating clear organization and presentation of the content. In other words, the input data are a set of images (screenshots of a scientific paper, a web app, a webpage, ...) and the output data are the corresponding Markdown / LaTeX / HTML code for these screenshots. In the case of graphical documents such as charts, reconstruction extends to extracting and reformatting the underlying data into non-image formats, such as tables or Markdown.
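To make the document reconstruction setup concrete, here is a minimal sketch of what a training pair could look like; the screenshot path and target Markdown below are hypothetical placeholders, not samples from a specific dataset:

```python
# Illustrative document-reconstruction sample: the input is a page screenshot and the
# target is structured text (Markdown here) preserving both layout and content.
# The file path and target string are hypothetical.
reconstruction_sample = {
    "image": "screenshots/arxiv_paper_page_3.png",   # hypothetical input screenshot
    "target": (
        "## 3. Method\n\n"
        "We propose a two-stage pipeline...\n\n"
        "| Model | Accuracy |\n"
        "|-------|----------|\n"
        "| Ours  | 87.2     |\n"
    ),
}

def to_training_text(sample: dict) -> str:
    """Format the sample as an instruction-following pair for supervised fine-tuning."""
    instruction = "Convert this page to Markdown."
    return f"<image>\n{instruction}\n{sample['target']}"

print(to_training_text(reconstruction_sample))
```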
**Captioning**

Captioning is a common task in computer vision that involves creating descriptive text for images, charts, diagrams, and tables. This process provides context and summarizes the visual content, making it accessible and understandable.
**MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering**
2024-05-20
Jingqun Tang, Qi Liu, Yongjie Ye, Jinghui Lu, Shu Wei, Chunhui Lin, Wanqing Li, Mohamad Fitri Faiz Bin Mahmood, Hao Feng, Zhen Zhao, Yanjie Wang, Yuliang Liu, Hao Liu, Xiang Bai, Can Huang
Text-Centric Visual Question Answering (TEC-VQA) in its proper format not only facilitates human-machine interaction in text-centric visual environments but also serves as a de facto gold proxy to evaluate AI models in the domain of text-centric scene understanding. However, most TEC-VQA benchmarks have focused on high-resource languages like English and Chinese. Despite pioneering works to expand multilingual QA pairs in non-text-centric VQA datasets using translation engines, the translation-based protocol encounters a substantial “Visual-textual misalignment” problem when applied to TEC-VQA. Specifically, it prioritizes the text in question-answer pairs while disregarding the visual text present in images. Furthermore, it does not adequately tackle challenges related to nuanced meaning, contextual distortion, language bias, and question-type diversity. In this work, we address the task of multilingual TEC-VQA and provide a benchmark with high-quality human expert annotations in 9 diverse languages, called MTVQA. To our knowledge, MTVQA is the first multilingual TEC-VQA benchmark to provide human expert annotations for text-centric scenarios. Further, by evaluating several state-of-the-art Multimodal Large Language Models (MLLMs), including GPT-4V, on our MTVQA dataset, it is evident that there is still room for performance improvement, underscoring the value of our dataset. We hope this dataset will provide researchers with fresh perspectives and inspiration within the community. The MTVQA dataset will be available at https://huggingface.co/datasets/ByteDance/MTVQA.
# Introduction In the era of burgeoning AI, especially in LLMs/MLLMs [gpt4v](http://arxiv.org/pdf/2311.15732v2), [gpt4v_explore](http://arxiv.org/pdf/2312.15011v1), [team2023gemini](http://arxiv.org/pdf/2405.12107v1), [anthropic2024claude](http://arxiv.org/pdf/2007.04626v3), [reid2024gemini](http://arxiv.org/pdf/2312.17661v1), [bai2023qwen](http://arxiv.org/pdf/2309.16609v1), [lu2024deepseek](http://arxiv.org/pdf/2402.17510v1), [young2024yi](http://arxiv.org/pdf/2304.11090v4), [feng2023docpedia](http://arxiv.org/pdf/2311.11810v3), [feng2023unidoc](http://arxiv.org/pdf/2308.11592v2), [hu2024mplug](None), [liu2024textmonkey](http://arxiv.org/pdf/2403.14252v1), [tang2024textsquare](http://arxiv.org/pdf/2307.04087v3), [chen2024far](http://arxiv.org/pdf/2404.16821v2), [dong2024internlm](http://arxiv.org/pdf/2404.06512v1), [li2024mini](http://arxiv.org/pdf/2305.16318v2), [liu2024llavanext](https://llava-vl.github.io/blog/2024-01-30-llava-next/), **Te**xt-**C**entric **V**isual **Q**uestion **A**nswering (**TEC-VQA**) [biten2019scene](http://arxiv.org/pdf/2304.01603v1), [singh2019towards](http://arxiv.org/pdf/1811.11903v1), [feng2023unidoc](http://arxiv.org/pdf/2308.11592v2), [feng2023docpedia](http://arxiv.org/pdf/2311.11810v3), [tang2024textsquare](http://arxiv.org/pdf/2307.04087v3), [liu2024textmonkey](http://arxiv.org/pdf/2403.14252v1), [hu2024mplug](None) has served as a *de facto* gold proxy to evaluate AI models in the domain of text-centric scene understanding. Compared with general VQA [biten2019scene](http://arxiv.org/pdf/2304.01603v1), [mathew2021docvqa](http://arxiv.org/pdf/2111.05547v1), [pham2024viocrvqa](http://arxiv.org/pdf/2404.18397v1), [singh2019towards](http://arxiv.org/pdf/1811.11903v1), [mishra2019ocr](http://arxiv.org/pdf/2010.02582v1), [mathew2022infographicvqa](http://arxiv.org/pdf/2104.12756v2), [masry-etal-2022-chartqa](https://doi.org/10.18653/v1/2022.findings-acl.177), [zhu2016visual7w](http://arxiv.org/pdf/2306.04938v1), [krishna2017visual](http://arxiv.org/pdf/1602.07332v1), [antol2015vqa](http://arxiv.org/pdf/1309.1125v1), [marino2019ok](http://arxiv.org/pdf/1906.00067v2), [sheng2021human](http://arxiv.org/pdf/1810.02358v2), [liu2024visual](http://arxiv.org/pdf/2402.11690v1), [gao2015you](http://arxiv.org/pdf/1505.05612v3), [gan2020large](http://arxiv.org/pdf/2302.02502v2), [liu-etal-2021-visually](https://doi.org/10.18653/v1/2021.emnlp-main.818), TEC-VQA places greater emphasis on answering questions that require understanding textual information within images. It provides a streamlined avenue for individuals without specialized expertise to articulate their requirements and access applications in text-centric visual environments. However, the majority of advancements in TEC-VQA have predominantly concentrated on high-resource languages, *e.g.*, English [biten2019scene](http://arxiv.org/pdf/2304.01603v1), [singh2019towards](http://arxiv.org/pdf/1811.11903v1), [mathew2021docvqa](http://arxiv.org/pdf/2111.05547v1), [mathew2022infographicvqa](http://arxiv.org/pdf/2104.12756v2), Chinese [qi-etal-2022-dureadervis](https://doi.org/10.18653/v1/2022.findings-acl.105), [gao2015you](http://arxiv.org/pdf/1505.05612v3), Japanese [shimizu2018visual](http://arxiv.org/pdf/1810.02358v2), [nguyen2023vlsp2022](http://arxiv.org/pdf/1810.02358v2) and *etc.*, thus restricting the applicability of AI models to the global community, particularly populations speaking low-resource languages. 
To tackle the problem of language diversity, several seminal studies [raj-khan-etal-2021-towards-developing](https://doi.org/10.18653/v1/2021.findings-emnlp.151), [pfeiffer-etal-2022-xgqa](https://doi.org/10.18653/v1/2022.findings-acl.196), [changpinyo-etal-2023-maxm](https://doi.org/10.18653/v1/2023.findings-emnlp.176) in the general VQA field leverage off-the-shelf translation engines to expand existing question-answer pairs from high-resource languages to their multilingual counterparts, including low-resource ones. However, when applied to TEC-VQA, this translation-based approach may fall prey to the “*Visual-textual misalignment*” problem, as only the text in question-answer pairs can be processed, while the visual text present in the images is overlooked. Moreover, issues such as nuanced meaning, contextual distortion, language bias, and question-type diversity further render the transferability of the translation protocol infeasible for TEC-VQA. The *status quo* begs the question: “*How can we address the visual-textual misalignment problem for multilingual TEC-VQA, and where do we stand in the MLLM era?*”
In this work, to answer the question above, we establish MTVQA, a novel and high-quality multilingual TEC-VQA benchmark, where all images are collected from real-world scenarios and meticulously annotated by human experts in nine languages: Arabic (AR), Korean (KO), Japanese (JA), Thai (TH), Vietnamese (VI), Russian (RU), French (FR), German (DE), and Italian (IT). More concretely, to ensure visual-textual alignment as much as possible, the annotation process follows a raise-then-correct paradigm, where a group of human annotators raises several distinct questions, ranging from simple content extraction to text-related reasoning, and subsequently provides answers. These QA pairs are then double-checked by another group to ensure accuracy and consistency. Consequently, as illustrated in Fig. [fig:leng_statistics], 6,678 training images and 21,829 question-answer pairs, as well as 2,116 test images and 6,778 question-answer pairs are obtained, covering several fine-grained scenarios, such as menus, logos, maps, bills, PPTs, research papers, *etc*. To the best of our knowledge, MTVQA is the first TEC-VQA dataset to provide native human annotations for multilingual text-rich scenarios, especially for low-resource languages. Furthermore, we investigate recent representative MLLMs, including GPT-4V, Gemini, QwenVL, *etc*., by juxtaposing experimental results regarding their performance on our newly proposed MTVQA. For both general MLLMs and document-focused ones, the results unequivocally demonstrate that opportunities for improvement persist within these MLLMs when applied in multilingual text-rich scenarios. In summary, the main contributions of this paper can be categorized into three points:

- We introduce the MTVQA dataset, which is, to the best of our knowledge, the first multilingual TEC-VQA benchmark to provide human expert annotations for text-centric scenarios.
- We benchmark state-of-the-art MLLMs on our new dataset and show there is still room for performance improvement for these models under multilingual text-rich scenarios.
- We propose a set of baselines for multilingual TEC-VQA tasks.

# Related Work

## LLMs/MLLMs for text-centric VQA

Recent advancements in LLMs/MLLMs [gpt4v](http://arxiv.org/pdf/2311.15732v2), [gpt4v_explore](http://arxiv.org/pdf/2312.15011v1), [team2023gemini](http://arxiv.org/pdf/2405.12107v1), [anthropic2024claude](http://arxiv.org/pdf/2007.04626v3), [reid2024gemini](http://arxiv.org/pdf/2312.17661v1), [bai2023qwen](http://arxiv.org/pdf/2309.16609v1), [lu2024deepseek](http://arxiv.org/pdf/2402.17510v1), [young2024yi](http://arxiv.org/pdf/2304.11090v4), [feng2023docpedia](http://arxiv.org/pdf/2311.11810v3), [feng2023unidoc](http://arxiv.org/pdf/2308.11592v2), [hu2024mplug](None), [liu2024textmonkey](http://arxiv.org/pdf/2403.14252v1), [tang2024textsquare](http://arxiv.org/pdf/2307.04087v3), [chen2024far](http://arxiv.org/pdf/2404.16821v2), [dong2024internlm](http://arxiv.org/pdf/2404.06512v1), [li2024mini](http://arxiv.org/pdf/2305.16318v2), [liu2024llavanext](https://llava-vl.github.io/blog/2024-01-30-llava-next/) have revolutionized VQA tasks, as demonstrated by the remarkable zero-shot performance of these models.
Notably, the high generalizability of LLMs/MLLMs, when explicitly trained on visual text understanding datasets and fine-tuned with instructions, has significantly enhanced their application in text-centric VQA scenarios [feng2023unidoc](http://arxiv.org/pdf/2308.11592v2), [feng2023docpedia](http://arxiv.org/pdf/2311.11810v3), [tang2024textsquare](http://arxiv.org/pdf/2307.04087v3), [liu2024textmonkey](http://arxiv.org/pdf/2403.14252v1), [hu2024mplug](None). For example, LLaVAR [zhang2023llavar](http://arxiv.org/pdf/2306.17107v2) and UniDoc [feng2023unidoc](http://arxiv.org/pdf/2308.11592v2) extend LLaVA [liu2024visual](http://arxiv.org/pdf/2402.11690v1) into the realm of document understanding, pioneering text-centric VQA for MLLMs by training them to predict texts and coordinates from document images. Furthermore, DocPedia [feng2023docpedia](http://arxiv.org/pdf/2311.11810v3) processes visual input in the frequency domain rather than in pixel space, which enables higher input resolution without increasing the input sequence length. Lately, mPLUG-DocOwl [mPLUG-DocOwl](None), Qwen-VL [bai2023qwen](http://arxiv.org/pdf/2309.16609v1), and TextMonkey [liu2024textmonkey](http://arxiv.org/pdf/2403.14252v1) leverage publicly available document-related VQA datasets to further enhance their text-centric VQA capabilities. Despite the promising results achieved by existing LLMs/MLLMs in text-centric VQA tasks, their focus on high-resource languages such as English or Chinese has posed challenges in achieving reasonable performance for low-resource languages. This is primarily due to the lack of data or benchmarks for these low-resource languages.

## Multilingual text-centric VQA Benchmarks

VQA has garnered significant attention in recent years, with numerous studies, datasets, and benchmarks being proposed to advance the field [biten2019scene](http://arxiv.org/pdf/2304.01603v1), [mathew2021docvqa](http://arxiv.org/pdf/2111.05547v1), [pham2024viocrvqa](http://arxiv.org/pdf/2404.18397v1), [singh2019towards](http://arxiv.org/pdf/1811.11903v1), [mishra2019ocr](http://arxiv.org/pdf/2010.02582v1), [mathew2022infographicvqa](http://arxiv.org/pdf/2104.12756v2), [masry-etal-2022-chartqa](https://doi.org/10.18653/v1/2022.findings-acl.177), [zhu2016visual7w](http://arxiv.org/pdf/2306.04938v1), [krishna2017visual](http://arxiv.org/pdf/1602.07332v1), [antol2015vqa](http://arxiv.org/pdf/1309.1125v1), [marino2019ok](http://arxiv.org/pdf/1906.00067v2), [sheng2021human](http://arxiv.org/pdf/1810.02358v2), [liu2024visual](http://arxiv.org/pdf/2402.11690v1), [gao2015you](http://arxiv.org/pdf/1505.05612v3), [gan2020large](http://arxiv.org/pdf/2302.02502v2), [liu-etal-2021-visually](https://doi.org/10.18653/v1/2021.emnlp-main.818). Many datasets have been created that encompass scene text from various domains, including natural images [biten2019scene](http://arxiv.org/pdf/2304.01603v1), [singh2019towards](http://arxiv.org/pdf/1811.11903v1), scanned documents [mathew2021docvqa](http://arxiv.org/pdf/2111.05547v1), [mathew2022infographicvqa](http://arxiv.org/pdf/2104.12756v2), and book and movie covers [mishra2019ocr](http://arxiv.org/pdf/2010.02582v1).
One notable limitation of these datasets is their predominant focus on English [biten2019scene](http://arxiv.org/pdf/2304.01603v1), [singh2019towards](http://arxiv.org/pdf/1811.11903v1), [mathew2021docvqa](http://arxiv.org/pdf/2111.05547v1), [mathew2022infographicvqa](http://arxiv.org/pdf/2104.12756v2) or other high-resource languages such as Chinese [qi-etal-2022-dureadervis](https://doi.org/10.18653/v1/2022.findings-acl.105), [gao2015you](http://arxiv.org/pdf/1505.05612v3) and Japanese [shimizu2018visual](http://arxiv.org/pdf/1810.02358v2), [nguyen2023vlsp2022](http://arxiv.org/pdf/1810.02358v2), which restricts the applicability of VQA systems for low-resource languages such as Thai and Vietnamese. There is a recent effort toward extending VQA tasks to a wider range of languages [gupta2020unified](http://arxiv.org/pdf/2204.14264v2), [pfeiffer-etal-2022-xgqa](https://doi.org/10.18653/v1/2022.findings-acl.196), [vivoli2022must](http://arxiv.org/pdf/1902.05660v1), [changpinyo-etal-2023-maxm](https://doi.org/10.18653/v1/2023.findings-emnlp.176), [li2023empirical](http://arxiv.org/pdf/1810.02358v2), [raj-khan-etal-2021-towards-developing](https://doi.org/10.18653/v1/2021.findings-emnlp.151) by providing multilingual VQA datasets. For example, [gao2015you](http://arxiv.org/pdf/1505.05612v3) created a free-form bilingual VQA dataset (FM-IQA) containing over 150,000 images and 310,000 freestyle Chinese question-answer pairs and their English translations. [raj-khan-etal-2021-towards-developing](https://doi.org/10.18653/v1/2021.findings-emnlp.151) developed a large-scale multilingual and code-mixed VQA dataset (MuCo-VQA) supporting five languages. Of more relevance are the works xGQA (8 languages) [pfeiffer-etal-2022-xgqa](https://doi.org/10.18653/v1/2022.findings-acl.196) and MaXM (7 languages) [changpinyo-etal-2023-maxm](https://doi.org/10.18653/v1/2023.findings-emnlp.176), which apply translation-based protocols to expand VQA data beyond English. However, translation-based multilingual VQA datasets inherently face issues such as the “Visual-textual misalignment” problem, where only the text in question-answer pairs is processed, while the visual text in images is overlooked. Additionally, nuanced meaning and context are often distorted, language bias is introduced by machine translation models, and the coverage of certain question types is limited, as highlighted by [changpinyo-etal-2023-maxm](https://doi.org/10.18653/v1/2023.findings-emnlp.176). Moreover, none of the previous multilingual datasets focus on text-centric scenarios where multilingual text frequently occurs. Our benchmark distinguishes itself by focusing on multilingual text-centric VQA scenarios using human expert annotations. To the best of our knowledge, the MTVQA benchmark is the first dataset to provide native human annotations for such scenarios. It covers 9 languages, thereby facilitating the training and evaluation of multilingual models in diverse linguistic contexts. Additionally, our dataset can gauge the ability of VQA systems not only on high-resource languages but also on those that are typically underrepresented in current datasets [biten2019scene](http://arxiv.org/pdf/2304.01603v1), [singh2019towards](http://arxiv.org/pdf/1811.11903v1), [mathew2021docvqa](http://arxiv.org/pdf/2111.05547v1), [mathew2022infographicvqa](http://arxiv.org/pdf/2104.12756v2), [gao2015you](http://arxiv.org/pdf/1505.05612v3).
The MTVQA benchmark addresses a significant gap in existing datasets by catering to the crucial needs of low-resource languages through annotations from native speakers across multiple languages. Our pioneering efforts distinctly position the MTVQA benchmark as a unique multilingual VQA resource, advancing the frontier of machine learning research. # MTVQA Benchmark The MTVQA Benchmark covers 9 languages: Arabic (AR), Korean (KO), Japanese (JA), Thai (TH), Vietnamese (VI), Russian (RU), French (FR), German (DE), and Italian (IT). In this section, we describe in detail how we establish the MTVQA benchmark, including the collection of raw image data and two-round human expert annotations, which are independent of each other. ## Data Collection Our purpose is to develop a multilingual VQA benchmark capable of evaluating the QA performance of MLLMs in multilingual text-centric scenarios, thus the raw data collection process is mainly oriented towards text-centric images from natural scenarios and document scenarios. To ensure the diversity and quality of data, we collect not only the raw image data from publicly available datasets, including the multilingual scene text recognition images from MLT2019 [nayef2019icdar2019](http://arxiv.org/pdf/1909.07145v1) and PowerPoint slides (PPTs) sourced from the internet, but also the data from countries of each language. Furthermore, the collected data includes multiple fine-grained scenarios (Fig. [fig:data_distribution]), such as menus, logos, maps, bills, PPTs, research papers, and *etc*. As a result, we gather a total of 1,220 images from document scenarios and 876 images from natural scenarios in the test set of the MTVQA benchmark. To ensure the visual-textual alignment, for text-rich images lacking text and language annotations, we subject them to a standardized data cleaning process, which includes text recognition and language classification. Afterward, we organize all the text-rich images we have obtained into language-specific groups, preparing them for the subsequent stage of data annotation. ## Human Expert Annotation In order to obtain informative and accurate text-related QA pairs on the language-specific grouped images, we recruit a group of annotators with expertise from local regions of each language. It is worth noting that all these annotators are native speakers of their respective languages, ensuring their deep understanding and proficiency in the linguistic nuances and cultural context necessary for precise annotations. Considering the subjective nature of the text-image understanding task, we have implemented a further division within the annotation team. This division involves separating the team into two independent groups, with one group dedicated to generating and responding to questions based on the provided images, while the other group focuses on evaluating and correcting the QA pair results. This raise-then-correct paradigm ensures a comprehensive and reliable assessment of the text-image understanding process. Additionally, each language’s annotation results undergo a 10% sampling inspection by a quality inspector. If the QA pairs fail to meet the criteria, they are sent back for re-annotation. Prior to commencing the formal human expert annotation task, all annotators undergo unified training and receive annotation examples. The brief diagram of the two-round annotation process is shown in Figure [fig:anno_process] and we elaborate on it in the following subsections.
| | **AR** | **DE** | **FR** | **IT** | **JA** | **KO** | **RU** | **TH** | **VI** | |:---------------|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:| | *Training Set* | | | | | | | | | | | Question | 8.29 | 8.72 | 9.73 | 12.05 | 12.43 | 11.74 | 11.56 | 11.35 | 11.21 | | Answer | 9.66 | 6.96 | 7.34 | 11.24 | 12.70 | 13.56 | 12.00 | 11.26 | 13.31 | | *Test Set* | | | | | | | | | | | Question | 8.08 | 8.29 | 9.76 | 11.93 | 12.48 | 12.2 | 11.65 | 10.98 | 10.99 | | Answer | 7.95 | 6.67 | 6.61 | 11.04 | 12.55 | 13.61 | 14.42 | 12.08 | 13.00 | Mean lengths of question-answer pairs in different languages of training set and test set, using GPT-4o tokenizer.
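The token statistics above can be reproduced in spirit with a short script; a minimal sketch using the `tiktoken` library, assuming `o200k_base` is the GPT-4o encoding and that the MTVQA questions and answers are available as plain strings (the per-language examples below are toy data):

```python
import tiktoken

# GPT-4o's tokenizer is exposed in tiktoken as the "o200k_base" encoding (assumption noted above).
enc = tiktoken.get_encoding("o200k_base")

def mean_token_length(texts: list[str]) -> float:
    """Average number of tokens over a list of questions or answers."""
    return sum(len(enc.encode(t)) for t in texts) / len(texts)

# Toy question lists grouped by language; in practice these come from the MTVQA files.
questions_by_lang = {
    "FR": ["Quel est le prix du menu ?", "Quelle est la date sur l'affiche ?"],
    "DE": ["Wie heißt das Restaurant?", "Wann beginnt die Veranstaltung?"],
}

for lang, questions in questions_by_lang.items():
    print(lang, round(mean_token_length(questions), 2))
```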
**First Round Questioning and Answering.** For the first round of annotation tasks, we assign 3 annotators per language to manually generate the original QA results. Given a text-centric image from our collection, annotators are first required to read the texts in the image and analyze the other contents of the image in a comprehensive and detailed manner. They must then raise 4 meaningful and distinct questions based on the content of the image and give the answers. All annotators adhere to the following criteria: (1) the first three questions must be answerable by directly reading the textual information in the image, (2) the fourth question must require reasoning about the text in the image to answer, (3) the questions and answers must be reasonably correct and consistent with the content of the image, and (4) the answer should be as concise as possible and free of filler (*e.g.*, when the question is “When is the volunteer recruitment period”, the answer should be “9:00-16:00” rather than “The volunteer recruitment period is 9:00-16:00”). It is worth mentioning that we require concise answers to make the evaluation process friendlier and more reliable, because we aim to keep the evaluation metrics unaffected by extraneous content in the answer sentence. **Second Round Evaluation and Correction.** To reduce the effect of human subjective cognitive bias on our MTVQA benchmark and obtain high-quality question-answer pairs, we assign 2 annotators per language to the annotation evaluation and correction process. Based on the provided images and the first-round annotation results, the annotators must follow these rules of judgment and steps for the annotation: (1) whether the question is related to the text in the image; if not, discard the current question-answer pair, (2) whether the answer is correct; if not, modify the answer, and (3) whether the answer repeats content from the question; if so, remove the repeated content to ensure a concise answer.
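Of the three second-round rules, only the last (removing answer text that merely restates the question) lends itself to automation; the sketch below is a rough heuristic illustration of that check, not the authors' actual tooling:

```python
def strip_question_restatement(question: str, answer: str) -> str:
    """Heuristically drop leading answer words that restate the question.

    Example from the annotation guidelines: for the question
    "When is the volunteer recruitment period", the answer should be
    "9:00-16:00" rather than "The volunteer recruitment period is 9:00-16:00".
    """
    q_words = {w.lower().strip("?.,") for w in question.split()}
    kept, restating = [], True
    for word in answer.split():
        # Once we hit a word that does not appear in the question, keep the rest verbatim.
        if restating and word.lower().strip("?.,") in q_words | {"is", "are", "the", "a"}:
            continue
        restating = False
        kept.append(word)
    return " ".join(kept) if kept else answer

print(strip_question_restatement(
    "When is the volunteer recruitment period",
    "The volunteer recruitment period is 9:00-16:00",
))  # -> "9:00-16:00"
```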
## Data Statistics We instruct the annotators to complete the above human expert annotation work towards the text-centric VQA tasks and construct the MTVQA benchmark consisting of 8,794 images and 28,607 question-answer pairs that cover the 9 languages. The MTVQA benchmark is divided into a training set containing 6,678 images and 21,829 question-answer pairs, and a test set containing 2,116 images and 6,778 question-answer pairs. The detailed data distribution can be seen in Figure [fig:data_distribution]. To visualize the vocabulary richness of our benchmark, we calculate the word frequencies for each language and present them in the form of word clouds as shown in Figure [fig:word_cloud]. In Figure [fig:leng_statistics] we demonstrate the statistics of the question and answer lengths using GPT-4o tokenizer.
# Experiments

## Baseline Models

For the MTVQA benchmark, we evaluate the following instruction-tuned MLLMs: (1) **Open-source MLLMs:** InternVL-V1.5 [chen2023internvl](http://arxiv.org/pdf/2312.14238v3), InternLM-Xcomposer2-4KHD [dong2024internlm](http://arxiv.org/pdf/2404.06512v1), Mini-Gemini-HD-34B [li2024mini](http://arxiv.org/pdf/2305.16318v2), Llava-Next-34B [liu2024llavanext](https://llava-vl.github.io/blog/2024-01-30-llava-next/), DeepSeek-VL [lu2024deepseek](http://arxiv.org/pdf/2402.17510v1), YI-VL-34B [young2024yi](http://arxiv.org/pdf/2304.11090v4), TextSquare [tang2024textsquare](http://arxiv.org/pdf/2307.04087v3), TextMonkey [liu2024textmonkey](http://arxiv.org/pdf/2403.14252v1) and mPLUG-DocOwl 1.5 [hu2024mplug](None); (2) **Closed-source MLLMs:** GPT-4V, Gemini Ultra, QwenVL Max, QwenVL Plus, Claude3 Opus, Claude3 Sonnet and GLM4V. For the closed-source MLLMs, we use the chat versions through the official APIs, while for the open-source MLLMs, we utilize the instruct versions. Note that all the model weights of the open-source MLLMs evaluated in our experiments can be downloaded from the HuggingFace Model Hub. For the open-source MLLMs, the model size varies from 7B to 34B.

## Implementation Details

We conduct the evaluation experiments on the baseline MLLMs with their default settings, ignoring the effect of the generation configuration on the results. To make the output of the MLLMs more evaluation-friendly, we design the following prompt format to limit the output length: “Answer the question using a word or phrase in the language of the question.” + `<Question>`, where `<Question>` represents the corresponding question for the input image. This extra prefix added to the raw question keeps the answer as concise as possible. Besides, we utilize InternLM-Xcomposer2-4KHD [dong2024internlm](http://arxiv.org/pdf/2404.06512v1) as the backbone for the instruction fine-tuning experiment on the MTVQA training set. In the instruction fine-tuning process, we follow the default training settings [dong2024internlm](http://arxiv.org/pdf/2404.06512v1) with “HD-16” and train on the MTVQA training set for 1 epoch.

## Evaluation Results

**Zero-shot testing** To compare the above MLLMs quantitatively, we follow TextMonkey [liu2024textmonkey](http://arxiv.org/pdf/2403.14252v1) and use accuracy as the evaluation metric: the model output is only counted as correct if it contains the ground truth. The complete evaluation results are shown in Table 2, where Claude3 Opus achieves the highest average accuracy of 25.7$\%$ across the 9 languages. This indicates that multilingual text-centric VQA tasks remain a significant challenge, even for state-of-the-art open-source and closed-source MLLMs. Looking at the metrics across languages, both open-source and closed-source models perform significantly better on Indo-European languages using the Latin alphabet, namely DE, FR, and IT in our benchmark, compared to other languages, which likely results from the distribution of realistically available training data and the genetic relationships between languages. In addition, all closed-source models except GLM4V outperform the open-source models overall across the nine languages, which may be due to the contribution of pre-training on multilingual data.
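A minimal sketch of this evaluation protocol, i.e. the answer-shortening prompt prefix and the containment-based accuracy borrowed from TextMonkey; the lowercasing and whitespace normalization are assumptions rather than the authors' exact implementation:

```python
def build_prompt(question: str) -> str:
    """Prefix used to keep model answers short, as described in the paper."""
    return ("Answer the question using a word or phrase in the language "
            "of the question. " + question)

def is_correct(prediction: str, ground_truth: str) -> bool:
    """Count a prediction as correct if it contains the ground-truth answer.

    Lowercasing and stripping are robustness assumptions, not the paper's exact rule.
    """
    return ground_truth.strip().lower() in prediction.strip().lower()

def accuracy(predictions: list[str], ground_truths: list[str]) -> float:
    hits = sum(is_correct(p, g) for p, g in zip(predictions, ground_truths))
    return 100.0 * hits / len(ground_truths)

# Toy usage with hypothetical model outputs:
preds = ["Le prix est 12 euros", "9:00-16:00"]
gts = ["12 euros", "10:00-17:00"]
print(accuracy(preds, gts))  # -> 50.0
```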
We also find that document-focused MLLMs, such as TextSquare [tang2024textsquare](http://arxiv.org/pdf/2307.04087v3) and TextMonkey [liu2024textmonkey](http://arxiv.org/pdf/2403.14252v1), do not significantly outperform other open-source models on the metrics for these 9 languages. **Instruction tuning** As shown in Table 2, the instruction tuning experiment on the MTVQA benchmark brings an 8.5$\%$ improvement in average accuracy. With respect to specific languages, French sees the largest improvement of 14.2$\%$ in accuracy, while Russian has the smallest improvement of 1.7$\%$. The results demonstrate that MLLMs vary in their ability to understand and learn from text-centric data in different languages, leaving great potential for future research on multilingual text-centric MLLM pre-training.
|   | AR | DE | FR | IT | JA | KO | RU | TH | VI | Avg. | |:---|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:| | *Closed-Source MLLMs* | | | | | | | | | | | | GPT-4V | 11.5 | 31.5 | 40.4 | 32.3 | 11.5 | 16.7 | 10.3 | 15.0 | 28.9 | 22.0 | | Gemini Ultra | 14.7 | 32.3 | 40.0 | 31.8 | 12.3 | 17.2 | 11.8 | 20.3 | 28.6 | 23.2 | | QwenVL Max | 7.7 | 31.4 | 37.6 | 30.2 | 18.6 | 25.4 | 10.4 | 4.8 | 23.5 | 21.1 | | QwenVL Plus | 4.8 | 28.8 | 33.7 | 27.1 | 12.8 | 19.9 | 9.4 | 5.6 | 18.1 | 17.8 | | Claude3 Opus | **15.1** | **33.4** | **40.6** | **34.4** | **19.4** | **27.2** | **13.0** | **19.5** | **29.1** | **25.7** | | Claude3 Sonnet | 10.5 | 28.9 | 35.6 | 31.8 | 13.9 | 22.2 | 11.0 | 15.2 | 20.8 | 21.1 | | GLM4V | 0.3 | 30.0 | 34.1 | 30.1 | 3.4 | 5.7 | 3.0 | 3.5 | 12.3 | 13.6 | | *Open-Source MLLMs* | | | | | | | | | | | | InternVL-V1.5 [chen2023internvl](http://arxiv.org/pdf/2312.14238v3) | 3.4 | 27.1 | 31.4 | 27.1 | 9.9 | 9.0 | 4.9 | 8.7 | 12.4 | 14.9 | | Mini-Gemini-HD-34B [li2024mini](http://arxiv.org/pdf/2305.16318v2) | 2.2 | 25.0 | 29.2 | 25.5 | 6.1 | 8.6 | 4.1 | 4.3 | 11.8 | 13.0 | | Llava-Next-34B [liu2024llavanext](https://llava-vl.github.io/blog/2024-01-30-llava-next/) | 3.3 | 24.0 | 28.0 | 22.3 | 3.6 | 6.1 | 2.6 | 0.4 | 9.8 | 11.1 | | DeepSeek-VL [lu2024deepseek](http://arxiv.org/pdf/2402.17510v1) | 0.6 | 14.2 | 15.3 | 15.2 | 2.9 | 3.8 | 1.6 | 0.9 | 5.2 | 6.6 | | YI-VL-34B [young2024yi](http://arxiv.org/pdf/2304.11090v4) | 1.7 | 13.5 | 15.7 | 12.1 | 4.8 | 5.2 | 0.8 | 3.5 | 4.1 | 6.8 | | TextSquare [tang2024textsquare](http://arxiv.org/pdf/2307.04087v3) | 3.7 | 27.0 | 30.8 | 26.7 | 3.2 | 7.2 | 6.7 | 5.2 | 12.4 | 13.6 | | TextMonkey [liu2024textmonkey](http://arxiv.org/pdf/2403.14252v1) | 2.0 | 18.1 | 19.9 | 22.1 | 4.6 | 7.2 | 3.2 | 0.9 | 11.1 | 9.9 | | mPLUG-DocOwl 1.5 [hu2024mplug](None) | 1.0 | 13.9 | 14.9 | 18.2 | 2.9 | 5.0 | 2.0 | 0.9 | 6.4 | 7.2 | | Xcomposer2-4KHD [dong2024internlm](http://arxiv.org/pdf/2404.06512v1) | 2.0 | 20.6 | 23.2 | 21.6 | 5.6 | 7.7 | 4.1 | 6.1 | 10.1 | 11.2 | | Xcomposer-SFT | 11.8 | 31.7 | 37.4 | 29.3 | 14.5 | 12.9 | 5.8 | 13.9 | 20.2 | 19.7 | Performance of the leading closed-source and open-source MLLMs on the MTVQA benchmark.
# Limitation The current iteration of MTVQA exhibits certain constraints that warrant attention. Primarily, the linguistic diversity incorporated is not exhaustive; several lesser-spoken languages remain unrepresented. Future enhancements will aim to broaden the multilingual scope of the dataset. Additionally, the dataset currently offers a singular canonical response for each question. Recognizing the multifaceted nature of the inquiry, subsequent versions will endeavor to include a spectrum of plausible answers to reflect the varied perspectives inherent to each question. # Conclusion In this paper, we introduce MTVQA, a multilingual TEC-VQA benchmark featuring high-quality human expert annotations in 9 diverse languages. We believe that MTVQA is the first benchmark of its kind to provide fully manual annotations specifically tailored to text-centric scenarios. The results obtained from both closed- and open-source MLLMs on our MTVQA dataset indicate that there is still room for improving their performance in multilingual text-centric scenarios. Although the current version of MTVQA has constraints regarding linguistic diversity and singular responses per question, we are confident that this dataset can still inspire researchers within the TEC-VQA community with new perspectives and ideas.
**TinyChart: Efficient Chart Understanding with Visual Token Merging and Program-of-Thoughts Learning**
2024-04-25
Liang Zhang, Anwen Hu, Haiyang Xu, Ming Yan, Yichen Xu, Qin Jin, Ji Zhang, Fei Huang
Charts are important for presenting and explaining complex data relationships. Recently, multimodal large language models (MLLMs) have shown remarkable capabilities in various chart understanding tasks. However, the sheer size of these models in terms of parameters and computational requirements limits their use in resource-constrained environments. In this paper, we present TinyChart, an efficient MLLM for chart understanding with only 3B parameters. TinyChart overcomes two key challenges in efficient chart understanding: (1) reducing the burden of learning numerical computations through a Program-of-Thoughts (PoT) learning strategy, which trains the model to generate Python programs for numerical calculations, and (2) reducing the lengthy vision feature sequences produced by the vision transformer for high-resolution images through a Vision Token Merging module, which gradually merges the most similar vision tokens. Extensive experiments demonstrate that our 3B TinyChart achieves SOTA performance on a variety of chart understanding benchmarks including ChartQA, Chart-to-Text, Chart-to-Table, OpenCQA, and ChartX. It outperforms several chart understanding MLLMs with up to 13B parameters, such as ChartLlama and ChartAst, as well as the closed-source general-purpose MLLM GPT-4V on ChartQA. It also demonstrates its superior efficiency with higher throughput during inference due to a smaller model scale and more efficient vision encoding. Our code and model are available at https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/TinyChart.
# Introduction
Figure 1: Our TinyChart-3B outperforms several 13B MLLMs on a variety of chart understanding benchmarks (a), while achieving larger inference throughput (b).
As an important information source, charts can intuitively visualize data in various visual presentation forms and have become an indispensable part of information dissemination, business decision-making, and academic research [chartsurvey](chartsurvey). With the rapid growth of multimodal data, automatically comprehending charts has become a pressing need and received increasing attention from the research community [chartllama](chartllama), [chartast](chartast), [chartinstruct](chartinstruct), [onechart](onechart). Recently, Multimodal Large Language Models (MLLMs) have shown strong capability in comprehending images and following instructions [gpt4](gpt4), [llava](llava), [mplugowl](mplugowl), [llava1.5](llava1.5), [sphinx](sphinx), [mplugowl2](mplugowl2), [xcomposer](xcomposer), [xcomposer2](xcomposer2), [xcomposer2-4k](xcomposer2-4k). Based on these MLLMs, some recent works [chartllama](chartllama), [chartast](chartast), [chartinstruct](chartinstruct), [paperowl](paperowl) further build chart understanding models by collecting and constructing versatile chart comprehension datasets and performing supervised fine-tuning. However, despite their remarkable success, current chart understanding models still face three main limitations: (1) The considerable number of parameters makes training and deployment challenging. For example, ChartLlama [chartllama](chartllama) is a model with 13 billion parameters, which is hard to deploy on a single consumer GPU with less than 26 GB of VRAM. (2) They are prone to errors when tackling questions involving numerical calculations [chartast](chartast), which are difficult to answer directly without any reasoning steps. (3) They struggle to encode high-resolution images efficiently, since the standard vision transformer produces lengthy feature sequences. To overcome these limitations in chart understanding, we propose an efficient and powerful MLLM, namely **TinyChart**. As shown in Figure 1, through efficient visual encoding and a Program-of-Thoughts learning strategy, TinyChart achieves state-of-the-art performance on various chart understanding benchmarks with only 3B parameters, while also achieving faster inference throughput. For efficient visual encoding, we propose to merge visual tokens based on the observation that chart images often contain large areas of uniform color and white space. Inspired by [tome](tome), we adopt a parameter-free Visual Token Merging module inside each vision transformer layer, which aggregates the most similar visual tokens and gradually reduces the length of the visual feature sequence, thus making it possible to efficiently encode high-resolution chart images. This enables the model to maintain high-resolution chart image input while controlling the computational load. Moreover, inspired by [pot](pot), we propose Program-of-Thoughts learning, which enhances the model’s ability to solve mathematical problems. According to statistics on ChartQA [chartqa](chartqa), 42% of the questions about charts require numerical answers, and most existing models struggle to perform numerical question answering [matcha](matcha), [chartast](chartast). To learn chart understanding more efficiently, we train the model to generate Python programs for computational problems step by step. The programs are then passed to a Python interpreter to produce the final answer. To support Program-of-Thoughts learning, we further construct the ChartQA-PoT dataset based on ChartQA [chartqa](chartqa).
The QA pairs in our ChartQA-PoT are constructed in two ways: (1) Template-based PoT construction, which generates questions and programs by filling in manually written templates based on chart data; (2) GPT-based PoT construction, which leverages `gpt-3.5-turbo` [gpt3.5](gpt3.5) to generate programs based on human-written questions. Experimental results show that Program-of-Thoughts learning significantly improves the question-answering ability of TinyChart, especially on numerical questions. The main contributions of this work are as follows:

- We introduce TinyChart, an efficient multimodal chart understanding model, which outperforms several 13B MLLMs and achieves state-of-the-art performance on a variety of chart understanding benchmarks, while also offering faster inference speed.
- We propose a Program-of-Thoughts (PoT) learning strategy to enhance the model’s ability to learn numerical calculation, and carefully build a PoT dataset, ChartQA-PoT.
- We adopt Visual Token Merging for efficient vision encoding, which significantly reduces the length of vision feature sequences and enables the model to encode high-resolution chart images with constrained computing resources.

# Related Work

## Chart Understanding

Chart understanding requires the model to comprehend chart contents and accomplish related tasks specified by the instructions. This field encompasses low-level recognition tasks, such as data extraction [deplot](deplot), and high-level tasks, such as question answering (QA) [chartqa](chartqa), [plotqa](plotqa), [dvqa](dvqa), summarization [chart2text](chart2text), [chart2text-8k](chart2text-8k), and re-rendering [chartllama](chartllama). As charts often contain OCR text pivotal for data interpretation, and many instructions require the model to perform numerical calculations, chart understanding demands robust text recognition capabilities and computational reasoning from the model. Early approaches [lorra](lorra), [plotqa](plotqa), [deplot](deplot), [chartstamp](chartstamp), [mpmqa](mpmqa), [qc_cap](qc_cap) rely on pipeline methods that use off-the-shelf OCR tools or component detectors to transform charts into data tables and other textual representations. They then employ language models to complete the specified tasks. These pipeline approaches, limited by their inability to be optimized jointly, are hampered by error accumulation. Recent studies [unichart](unichart), [matcha](matcha), [chartllama](chartllama), [chartast](chartast), [chartinstruct](chartinstruct), [mmc](mmc) have shifted towards end-to-end methods based on multimodal large language models. These studies adopt the structure of multimodal large language models [llava](llava), [llava1.5](llava1.5), [mplugowl](mplugowl), [mplugowl2](mplugowl2), [sphinx](sphinx) and enhance chart understanding abilities through supervised fine-tuning [instructgpt](instructgpt) with substantial chart instruction data [chartllama](chartllama), [chartast](chartast), [chartinstruct](chartinstruct). Although these models demonstrate improved performance, their extensive parameter size prevents them from being easily trained or deployed under resource-constrained scenarios. In this paper, we demonstrate that a 3B MLLM is enough to achieve state-of-the-art performance on several chart understanding tasks. Meanwhile, it has been well observed that these models are prone to numerical errors [matcha](matcha), [chartinstruct](chartinstruct), [chartast](chartast).
Though [chartast](chartast) tries to construct executable command lines in JSON format based on a template to eliminate numerical errors, we argue that this is insufficient to fully address the issue for two reasons: 1) the executable command lines in JSON format produced by [chartast](chartast) rely on a specific computational backend, which limits their versatility; 2) template-based programs can only cover rather limited scenarios. Instead, we construct the Program-of-Thoughts learning dataset with a combination of both templates and GPT-generated programs. This allows the model to learn more effectively how to solve numerical problems.

## Multimodal Large Language Model

Multimodal large language models (MLLMs) exhibit strong capabilities in visual understanding and instruction following [gpt4](gpt4), [gemini](gemini). They typically comprise transformer-based visual encoders, large language models, and vision-language connectors [llava](llava), [llava1.5](llava1.5), [tinyllava](tinyllava), [mplugowl](mplugowl), [mplugowl2](mplugowl2), [xcomposer](xcomposer), [xcomposer2](xcomposer2), [mplug-octopus](mplug-octopus). These models are generally trained on extensive general image-text data for cross-modal alignment and instruction fine-tuning. Although some studies have showcased a degree of OCR capability in these multimodal large language models [ocr_mllm](ocr_mllm), [trie](trie), their performance on document and chart understanding benchmarks remains suboptimal due to their low input resolution [ureader](ureader), [xcomposer2-4k](xcomposer2-4k). Efforts in the general document domain have attempted to improve the fine-grained understanding capabilities of MLLMs by increasing resolution [qwenvl](qwenvl), segmenting images [ureader](ureader), [sphinx](sphinx), [docowl1.5](docowl1.5), [xcomposer2-4k](xcomposer2-4k), utilizing frequency-domain signals [docpedia](docpedia), and introducing additional high-resolution encoders [cogagent](cogagent). However, these models often suffer from low efficiency, primarily due to the excessive length of high-resolution visual sequences. The visual token merging method adopted in this paper significantly reduces the length of visual feature sequences and relaxes the computational requirements of high-resolution input.
# TinyChart

## Model Architecture

Figure [fig:overview] shows the overall framework of our proposed TinyChart. It follows the typical architecture of a multimodal large language model (MLLM), which consists of a vision transformer encoder, a vision-language connector, and a large language model. To encode high-resolution visual input effectively, we insert a visual token merging module inside each vision transformer layer.

### Vision Transformer Encoder

The vision transformer encoder aims to encode chart images into vision features. A standard vision transformer [vit](vit) first resizes the input image $I$ into a fixed resolution and crops the image into patches. Then the patches are treated as vision tokens and processed with transformer encoder layers [transformer](transformer). Suppose the input image $I^{N\times N}$ has resolution $N \times N$ and the patch size is $P \times P$; the length of the vision token sequence is then $(N // P)^2$. Since the standard transformer layer does not reduce the sequence length, the vision transformer finally produces a vision feature sequence of length $(N // P)^2$. In practice, when $N$ is large, the vision feature sequence can be very long and inefficient for the language model to handle.
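To make the sequence-length issue concrete, here is a quick back-of-the-envelope sketch; the patch size and layer count are illustrative assumptions, and the per-layer reduction by $r$ anticipates the Visual Token Merging module described next (the paper uses $r=20$ and $r=84$ for 512- and 768-pixel inputs, see the Implementation Details):

```python
def num_vision_tokens(resolution: int, patch_size: int) -> int:
    """Length of the vision token sequence for an N x N image, i.e. (N // P) ** 2."""
    return (resolution // patch_size) ** 2

def tokens_after_merging(initial_tokens: int, r: int, num_layers: int) -> int:
    """Sequence length after merging r token pairs in each of the encoder layers."""
    return max(initial_tokens - r * num_layers, 1)

# Illustrative numbers only: a patch size of 16 and a 24-layer encoder are assumptions.
for resolution in (384, 512, 768):
    n = num_vision_tokens(resolution, patch_size=16)
    print(resolution, n, tokens_after_merging(n, r=20, num_layers=24))
```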
Since key information (such as OCR words) in a chart can be unrecognizable in low-resolution images [docowl1.5](docowl1.5), high-resolution input is essential for chart understanding. However, charts typically contain a large number of color blocks and blank spaces, where patches are visually similar. To achieve efficient and effective chart understanding, we apply Visual Token Merging [tome](tome) in each transformer layer. The process of Visual Token Merging is shown in Figure [fig:tokenmerge]. By merging the $r$ most similar token pairs, it reduces the length of the vision feature sequence by $r$ in each layer. We measure the similarity between two tokens using the cosine distance between their Keys from self-attention, following [tome](tome). As shown in the lower part of Figure [fig:tokenmerge], the Vision Token Merger finds the top-$r$ similar token pairs through bipartite graph matching. It first divides the vision tokens into two disjoint sets. Then, for each token in one set, it finds the most similar token in the other set and draws an edge between the two tokens. After that, it only keeps the top-$r$ most similar edges and merges the features of the two endpoints through average pooling. Note that not only spatially adjacent visual tokens are subject to merging: non-adjacent tokens can also be merged if they belong to different subsets and are similar enough. The visual token merging operation aggregates tokens with similar features into one. It therefore reduces the proportion of that visual feature in the attention calculation of the following transformer layer, since the number of tokens carrying it has decreased. To address this, we let the attention operation consider the actual number of patches $s$ represented by each token as follows:

$$\begin{aligned} \mathrm{Attention}=\mathrm{softmax}\left( \frac{QK^\top}{\sqrt{d}} + \log s \right) V \end{aligned}$$

where $Q$, $K$, and $V$ denote the query, key, and value of self-attention, which are linearly projected from the hidden states [transformer](transformer). By adding $\log s$ inside the $\mathrm{softmax}$, a token merged from $s$ patches is effectively duplicated $s$ times in the attention calculation [tome](tome).

### Vision-Language Connector

The vision-language connector aims to project the vision features into the embedding space of the large language model. Following [llava1.5](llava1.5), [tinyllava](tinyllava), we implement the vision-language connector as a multi-layer perceptron with one hidden layer and GeLU [gelu](gelu) activation.

### Large Language Model

The large language model aims to comprehend both visual features and language instructions, and then generate responses to accomplish chart understanding tasks. It is implemented as a transformer decoder [transformer](transformer) with a causal attention mask. The training objective of the model is language modeling. Assuming the visual features are $V$, the language instruction is $L$, and the response is $R$, the loss function is defined as follows:

$$\begin{aligned} \mathcal{L}=\frac{1}{T}\sum_{i=1}^T\mathrm{LLM}(R_i \mid V, L, R_{<i}) \end{aligned}$$

where $R_{<i}$ denotes the response tokens before position $i$ and $T$ is the length of the response.

## Program-of-Thoughts Learning

### Template-based PoT

As shown in Figure [fig:pot_construction], the Template-based PoT is constructed based on human-written templates containing placeholders for both questions and code. The template questions involve common numerical operations such as calculating the sum, average, minimum, and maximum values. We adopt the 40 template questions proposed by PlotQA [plotqa](plotqa) and manually write their corresponding template Python code to solve them.
As shown in the top-left part of Figure [fig:pot_construction], the template code consists of several variable assignment operations with NumPy [numpy](numpy) functions to perform calculations. The beginning steps usually involve extracting the relevant data from the chart and assigning them to variables. The final computed result is stored in a variable named "Answer". For each placeholder in the template, we identify all possible values from the data table of the chart and randomly select one to fill in the placeholder. After removing incorrect or unreasonable fill-ins using rule-based methods, we construct 119,281 (question, PoT) pairs over 17,498 images from ChartQA.

### GPT-based PoT

Although the template-based method allows for the construction of a large number of question-answer pairs, the diversity of these pairs is limited due to the fixed templates. To improve the generalization ability of PoT learning, we additionally build GPT-generated PoT data by leveraging the powerful instruction-following and code-generation capabilities of large language models. Specifically, we prompt `gpt-3.5-turbo` [gpt3.5](gpt3.5) to generate PoT answers, in a format similar to the template PoT, for the questions annotated in ChartQA, using in-context examples. As shown in Figure [fig:pot_construction], since `gpt-3.5-turbo` does not accept image input, we also provide the data table corresponding to the chart as text input to `gpt-3.5-turbo`. We screen the quality of the generated PoT answers by running them through a Python interpreter. If a generated PoT answer cannot run in the Python interpreter, or if the result it produces differs from the answer annotated in ChartQA, the corresponding PoT answer is deleted. In the end, we construct 21,303 (question, PoT answer) pairs over 15,521 chart images.
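For illustration, a PoT answer in the format described above might look like the following; the question and chart values are hypothetical, not an actual ChartQA-PoT sample, but the structure follows the description: variable assignments, NumPy operations, and a final variable named "Answer":

```python
import numpy as np

# Question (hypothetical): "What is the average revenue over the three years shown?"
# Step 1: extract the relevant data from the chart and assign it to variables.
years = ["2020", "2021", "2022"]
revenue = np.array([12.5, 14.8, 16.1])  # values read off the chart, in billions

# Step 2: perform the numerical computation with NumPy; store the result in `Answer`.
Answer = np.mean(revenue)

# A Python interpreter executes the program and the value of `Answer` is returned
# as the final answer (here roughly 14.47).
print(Answer)
```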
| Dataset | Benchmark | Samples | |:---------------------------------------------|:---------:|--------------:| | ***Chart question answer*** | | | | ChartQA [chartqa](chartqa) | | 28,299 | | ChartQA-PoT | | 140,584 | | PlotQA [plotqa](plotqa) | | 157,070 | | DVQA [dvqa](dvqa) | | 200,000 | | OpenCQA [opencqa](opencqa) | | 5,407 | | ***Chart-to-text generation*** | | | | Pew [chart2text](chart2text) | | 7,892 | | Statista [chart2text](chart2text) | | 29,589 | | OpenCQA [opencqa](opencqa) | | 5,407 | | Vistext [vistext](vistext) | | 11,171 | | ChartSumm [chartsumm](chartsumm) | | 75,255 | | Chart2Text-8k [chart2text-8k](chart2text-8k) | | 7,862 | | ***Chart-to-table generation*** | | | | ChartQA [chartqa](chartqa) | | 19,373 | | PlotQA [plotqa](plotqa) | | 190,720 | | Chart2Text-8k | | 8,305 | | DVQA [dvqa](dvqa) | | 300,000 | | Statista [chart2text](chart2text) | | 29,589 | | ***Chart instruction following*** | | | | ChartLlama [chartllama](chartllama) | | 148,398 | | **Total** | | **1,364,921** | Datasets used for training TinyChart. The benchmark datasets consist of basic chart understanding evaluations including QA, summary, and chart-to-table generation. Note that in ablation studies, we only use the benchmark datasets for training due to limited computational resources.
## Multitask Learning We perform multitask learning to train our TinyChart model. We collect a chart understanding dataset containing 1.36M samples for supervised fine-tuning. It covers various chart understanding tasks, including chart question answering, chart-to-text generation, chart-to-table generation, and chart instruction following. Table 1 summarizes our training data. We mix the data from the different tasks for joint training and use task-specific instructions so that the model can differentiate between them. The training objective is language modeling on the response tokens, as presented in Eq. [eq:loss]. Note that in ablation studies, we train solely on the benchmark datasets due to limited computational resources.
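The following is an illustrative sketch of how such task-specific instructions and data mixing might look; the instruction wording and helper names are hypothetical, not taken from the paper.

```python
import random

# Hypothetical task-specific instructions used to let one model distinguish the mixed tasks.
TASK_INSTRUCTIONS = {
    "chart_qa":       "Answer the question based on the chart.",
    "chart_to_text":  "Generate a summary of the chart.",
    "chart_to_table": "Extract the underlying data table of the chart.",
    "chart_pot":      "Write Python code to answer the question about the chart.",
}

def build_sample(task: str, image_path: str, question: str, response: str) -> dict:
    """Wrap one raw example into an instruction-following training sample."""
    prompt = TASK_INSTRUCTIONS[task] + (f"\nQuestion: {question}" if question else "")
    return {"image": image_path, "prompt": prompt, "response": response}

def mix_datasets(datasets: dict) -> list:
    """Pool samples from all tasks and shuffle them for joint training."""
    mixed = [build_sample(task, s["image"], s.get("question", ""), s["response"])
             for task, samples in datasets.items() for s in samples]
    random.shuffle(mixed)
    return mixed
```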
| Model | #Params | Resolution | Throughput | ChartQA Aug. | ChartQA Hum. | ChartQA Avg. | Chart-to-Text BLEU4 | Chart-to-Table RMS$_{F1}$ | OpenCQA BLEU4 |
|:---|:--:|:--:|:--:|--:|--:|--:|--:|--:|--:|
| GPT-4V [gpt4](gpt4) | - | - | - | - | - | 78.50 | - | - | - |
| Gemini-Ultra [gemini](gemini) | - | - | - | - | - | 80.80 | - | - | - |
| Qwen-VL-Max [qwenvl](qwenvl) | - | - | - | - | - | 79.80 | - | - | - |
| Deplot+Codex [deplot](deplot) | 1.3B+175B | - | - | 91.00 | 67.60 | 79.30 | - | 87.22 | - |
| Llava1.5 [llava1.5](llava1.5) | 13B | 336$\times$336 | 1.94 it/s | 72.96 | 37.68 | 55.32 | 7.16 | 48.95 | - |
| Qwen-VL [qwenvl](qwenvl) | 9.6B | 448$\times$448 | 1.65 it/s | 78.90 | 44.30 | 61.60 | - | - | - |
| UReader [ureader](ureader) | 7B | 224$\times$224 ($\times$20) | 1.67 it/s | 79.42 | 39.12 | 59.30 | - | - | - |
| DocOwl1.5 [docowl1.5](docowl1.5) | 8B | 448$\times$448 ($\times$9) | 1.56 it/s | 91.38 | 49.62 | 70.50 | - | - | - |
| ChartInstruct [chartinstruct](chartinstruct) | 7B | - | - | 87.76 | 45.52 | 66.64 | 13.83 | - | 15.59 |
| ChartLlama [chartllama](chartllama) | 13B | 336$\times$336 | 1.94 it/s | 90.36 | 48.96 | 69.66 | 14.23 | 90.00 | - |
| ChartAst [chartast](chartast) | 13B | 448$\times$448 | 1.47 it/s | **93.90** | 65.90 | 79.90 | 15.50 | 91.60 | 15.50 |
| TinyChart@512 | 3B | 512$\times$512 | **3.65** it/s | 93.60 | 72.16 | 82.88 | **17.93** | 92.93 | 19.62 |
| TinyChart@768 | 3B | 768$\times$768 | 3.14 it/s | 93.86 | **73.34** | **83.60** | 17.18 | **93.78** | **20.39** |
# Experiment ## Implementation Details TinyChart is initialized from TinyLlava [tinyllava](tinyllava), which uses SigLIP [siglip](siglip) as the vision encoder and Phi-2 [phi1.5](phi1.5) as the large language model. The original input resolution of the vision encoder is 384$\times$384. We extend the input resolution to 512$\times$512 and 768$\times$768 and apply visual token merging with $r=20$ and $r=84$ in each transformer layer, respectively. We train the entire model for 3 epochs with a batch size of 512. The learning rate is set to $1e-4$, with a warmup over the first 3% of steps, and then decays to 0 at the end of training. The full training run takes 3 days on 32 Tesla V100 GPUs with 32 GB of VRAM each. ## Evaluation Benchmarks ChartQA [chartqa](chartqa) requires generating a short answer to a question based on the chart content. It includes many questions that require numerical calculation. We report relaxed accuracy, which allows a numerical error within 5%, as the metric, following [chartqa](chartqa), [chartllama](chartllama), [chartast](chartast). Note that TinyChart with Program-of-Thoughts learning can perform ChartQA in the following four settings (a rule-based routing sketch is given at the end of this subsection):
- **Direct**: the model produces short answers directly.
- **PoT**: the model produces Python code. The answer is then calculated by running the code through the Python interpreter.
- **Combine**: the model produces Python code for questions that require calculation, and Direct answers otherwise. We determine whether a question requires calculation with a simple rule-based keyword detector. If the question contains one of the calculative keywords[^1], the detector treats it as a computational question and prompts the model to generate a PoT answer. Otherwise, the model is instructed to produce a Direct answer. Additionally, if the generated program for a calculative question raises a syntax error, we let the model produce a Direct answer for this question in the Combine setting.
- **Oracle**: we further introduce the Oracle setting for ChartQA evaluation. Under this setting, we always choose the correct one between the Direct and PoT answers after evaluating under both. It is the upper bound of combining the two answers.

We evaluate TinyChart under the Combine setting by default. Chart-to-Text aims to generate a chart summary based on the chart content. We evaluate the model on the Pew benchmark [chart2text](chart2text) and report BLEU4 [bleu](bleu) as the metric. Chart-to-Table aims to extract the underlying data table presented by the chart. We evaluate Chart-to-Table with the data table annotations provided by ChartQA [chartqa](chartqa), following [chartllama](chartllama), [chartast](chartast), and report RMS$_{F1}$ [deplot](deplot) as the metric. Different from ChartQA, OpenCQA [opencqa](opencqa) evaluates the ability to generate free-form answers to chart-related questions. We report BLEU4 [bleu](bleu) as the metric, following [chartinstruct](chartinstruct), [chartast](chartast). ChartX [chartx](chartx) is a recently proposed benchmark that contains more chart types. We evaluate on the ChartX cognition tasks since they are more challenging. They cover Question Answering, Chart Description Generation, Chart Summary Generation, and Chart Redrawing. We report GPT-Accuracy for QA and GPT-score for the remaining three tasks as the metrics, following ChartX [chartx](chartx).
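Below is a minimal sketch of the rule-based routing used in the Combine setting, together with the relaxed-accuracy check. The keyword list is taken from the paper's footnote; the helper functions and the answer callables are hypothetical placeholders.

```python
# Calculative keywords from the paper's footnote.
CALC_KEYWORDS = [
    "sum", "mean", "average", "ratio", "mode", "divide", "dividing", "differ",
    "subtract", "add", "division", "times", "absolute", "minus", "exceed", "below",
    "less", "fewer", "bigger", "biggest", "greater", "higher", "longer", "tallest",
    "lowest", "number", "how many colors", "what is the value",
]

def needs_calculation(question: str) -> bool:
    q = question.lower()
    return any(kw in q for kw in CALC_KEYWORDS)

def combine_answer(question, direct_answer_fn, pot_answer_fn):
    """Route to PoT for calculative questions, falling back to Direct on errors."""
    if needs_calculation(question):
        try:
            return pot_answer_fn(question)   # run the generated Python; may raise
        except Exception:
            pass                             # syntax/runtime error -> fall back to Direct
    return direct_answer_fn(question)

def relaxed_match(pred: str, gold: str, tol: float = 0.05) -> bool:
    """Exact match for strings; 5%-tolerance match for numeric answers."""
    try:
        p, g = float(pred), float(gold)
        return abs(p - g) <= tol * max(abs(g), 1e-8)
    except ValueError:
        return pred.strip().lower() == gold.strip().lower()
```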
## Main Results Table [tab:main_result] shows an extensive comparison between TinyChart and existing multimodal large language models on four chart understanding benchmarks. Our TinyChart model achieves state-of-the-art performance on ChartQA, Chart-to-Text, Chart-to-Table, and OpenCQA, while also delivering higher inference throughput. Specifically, with the input resolution set to 768$\times$768, TinyChart achieves an accuracy of 83.60 on ChartQA [chartqa](chartqa), surpassing several closed-source models including GPT-4V, Gemini-Ultra, and Qwen-VL-Max [qwenvl](qwenvl). It also outperforms the previous open-source SOTA, ChartAst [chartast](chartast), on chart understanding. We find that previous models performed poorly on the ChartQA human subset, with none of them exceeding 70%, whereas performance on the ChartQA augmented subset has approached 93.9%. This is because the questions posed by human annotators involve more computational problems [chartqa](chartqa) and are therefore more challenging. By leveraging Program-of-Thoughts learning, TinyChart reaches 73.34% on ChartQA-human, an improvement of 7.44% over the previous state of the art, ChartAst [chartast](chartast). This demonstrates the effectiveness of the proposed Program-of-Thoughts learning method. We observe that models with higher input resolutions generally perform better on chart understanding tasks. However, encoding high-resolution charts reduces inference speed (e.g., Qwen-VL vs. Llava1.5, DocOwl1.5 vs. UReader, ChartAst vs. ChartLlama). By leveraging visual token merging, TinyChart can accept higher-resolution input images with only a limited increase in computation, thus achieving better performance. Owing to its smaller model size and the efficient visual token merging strategy, TinyChart achieves significantly higher inference throughput than previous models. In summary, these results demonstrate that TinyChart achieves efficient chart understanding with enhanced performance and faster inference. The table below compares performance under the different ChartQA settings. Note that the performance of ChartAst under the Combine setting is taken from [chartast](chartast), which combines the Direct answer with an executable JSON to produce the final answer. The results indicate that TinyChart achieves SOTA performance with Direct answers alone, and combining them with PoT answers brings further improvements. In addition, since the combination of Direct and PoT answers is very simple, performance under the Combine setting still falls well short of the Oracle setting; further work could better combine the two answers. We divide the questions in the ChartQA test set [chartqa](chartqa) into calculative questions (761 of 2,500) and non-calculative questions (1,739 of 2,500) by checking whether they contain the calculative keywords mentioned above. Table [tab:cal_questions] shows the performance of TinyChart@768 on these two types of questions under different settings. We observe that PoT significantly improves performance on calculative questions compared to the Direct setting (78.98 vs. 56.64), which yields an overall performance gain (80.84 vs. 76.36). The simple combination of the Direct and PoT strategies improves results further.
| Model | Direct | PoT | Oracle | Combine |
|:---|:--:|:--:|:--:|:--:|
| ChartLlama [chartllama](chartllama) | 69.66 | - | - | - |
| ChartAst [chartast](chartast) | 75.10 | - | - | 79.90 |
| TinyChart@512 | **76.92** | 79.64 | 88.76 | 82.88 |
| TinyChart@768 | 76.36 | **80.84** | **89.12** | **83.60** |

Performance on ChartQA under different settings.
To further assess the generalizability of TinyChart, we compare it with end-to-end general MLLMs and chart MLLMs on the ChartX cognition benchmark [chartx](chartx), since it covers visually diverse chart types. We use TinyChart@768 to perform inference on ChartX without additional fine-tuning. As shown in the table below, benefiting from our Program-of-Thoughts learning method, TinyChart achieves a GPT-Accuracy of 33.35 on the QA task, even surpassing GPT-4V. Although it falls behind GPT-4V on the Summary, Description, and Redrawing tasks, TinyChart still performs better than open-source chart MLLMs including ChartLlama and ChartAst. This indicates that TinyChart generalizes well across various chart types.
| Model | QA | Summary | Description | Redrawing |
|:---|:--:|:--:|:--:|:--:|
| ***General MLLM*** | | | | |
| Llava1.5 | 17.19 | 1.48 | 1.29 | 0.75 |
| GPT-4V | 33.04 | **3.17** | **3.12** | **2.63** |
| ***Chart MLLM*** | | | | |
| ChartLlama | 13.80 | 1.04 | 1.02 | 0.94 |
| ChartAst | 30.99 | 0.33 | 1.03 | 0.82 |
| TinyChart@768 | **33.35** | 1.53 | 1.64 | 1.89 |

Evaluation results on ChartX [chartx](chartx) cognition tasks.
## Ablation Studies To verify the effectiveness of visual token merging and Program-of-Thoughts learning, we conduct ablation studies in Table [tab:ablation]. The upper block of Table [tab:ablation] shows the performance of the model with and without Program-of-Thoughts training data. Comparing Row 2 with Row 1, we observe that training solely with template-based PoT improves the model's ability to generate direct answers (71.12 vs. 70.72), which we attribute to PoT learning enhancing the model's reasoning abilities. At this point, the PoT answers produced by the model are less accurate than the direct answers (55.44 vs. 71.12), which may be because template-based PoT cannot cover all questions. However, when we ask the model to generate PoT answers only for questions that require calculation and combine them with direct answers, the result outperforms direct answers alone (73.00 vs. 71.12), indicating that PoT answers have an advantage on computational problems. After incorporating GPT-based PoT into training, PoT answering surpasses direct answering (76.88 vs. 72.44), and both direct (72.44 vs. 71.12) and combined answering (79.48 vs. 73.00) show further improvements. These results confirm the effectiveness of the proposed Program-of-Thoughts learning method, suggesting that it not only strengthens the model's computational capabilities but also enhances its overall problem-solving ability. The middle block of Table [tab:ablation] compares the performance with and without visual token merging at an input resolution of 512$\times$512, and with different numbers of tokens merged in each layer. Comparing Row 4 and Row 3, increasing the input resolution from 384 to 512 significantly improves performance on three chart understanding benchmarks, demonstrating that high resolution is crucial for comprehending chart images. However, a direct increase in resolution substantially reduces inference throughput (2.38 it/s vs. 3.73 it/s): given high-resolution images, the standard vision transformer produces a long visual feature sequence that must then be processed by the large language model, incurring considerable computational expense. By adopting visual token merging, we can control the length of the visual feature sequence by regulating the number of tokens merged at each layer, thereby achieving efficient high-resolution encoding. Setting $r=20$, we attain an inference throughput nearly equal to that at an input resolution of 384$\times$384 (3.65 it/s vs. 3.73 it/s), while retaining the performance benefits of higher resolution. To further highlight the advantages of visual token merging, we increase the input resolution to 768 in the bottom block of Table [tab:ablation]. At this point, the visual feature sequence has length 2,916, which cannot be trained on 32 GB V100 GPUs due to insufficient VRAM. After employing the visual token merging module with $r=84$, the input sequence length is reduced to 732 and training proceeds normally. In this setting, the model's inference throughput is 3.14 it/s, and it shows a performance advantage on ChartQA (81.04 vs. 80.76) and Chart-to-Table (88.90 vs. 87.81). This illustrates that visual token merging lets us leverage higher-resolution chart images under constrained resources, thereby improving performance.
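As a concrete reference for the visual token merging and size-aware attention ablated above, here is a minimal single-image, single-head PyTorch sketch. It is an illustrative reimplementation under simplifying assumptions (naive handling of duplicate merge destinations, no batching), not the authors' code; the function names are ours.

```python
import torch
import torch.nn.functional as F

def bipartite_soft_matching(keys: torch.Tensor, r: int):
    """Rank candidate merges between two alternating token subsets by Key similarity."""
    a, b = keys[::2], keys[1::2]                                # two disjoint sets
    sim = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).T     # cosine similarity
    best_val, best_idx = sim.max(dim=-1)                        # best partner in B for each token in A
    order = best_val.argsort(descending=True)                   # most similar edges first
    src_merge, src_keep = order[:r], order[r:]                  # r tokens of A are merged away
    return src_merge, src_keep, best_idx[src_merge]             # ... into these tokens of B

def merge_visual_tokens(x: torch.Tensor, sizes: torch.Tensor, keys: torch.Tensor, r: int):
    """Merge the r most similar token pairs by size-weighted average pooling.

    x: (N, d) token features; sizes: (N,) number of patches each token already represents.
    Duplicate merge destinations are handled naively here; the reference ToMe
    implementation uses a scatter-based reduction instead.
    """
    src_merge, src_keep, dst = bipartite_soft_matching(keys, r)
    a, b = x[::2], x[1::2].clone()
    sa, sb = sizes[::2], sizes[1::2].clone()
    w_a, w_b = sa[src_merge, None], sb[dst, None]
    b[dst] = (a[src_merge] * w_a + b[dst] * w_b) / (w_a + w_b)  # average pooling, weighted by size
    sb[dst] = sb[dst] + sa[src_merge]                           # track how many patches each token covers
    return torch.cat([a[src_keep], b]), torch.cat([sa[src_keep], sb])

def proportional_attention(q, k, v, sizes):
    """Single-head attention with the log(s) bias from the equation in the model section."""
    logits = (q @ k.T) / q.shape[-1] ** 0.5 + torch.log(sizes.float())[None, :]
    return torch.softmax(logits, dim=-1) @ v
```

In TinyChart this merging is applied in every transformer layer of the vision encoder, with $r=20$ at 512$\times$512 input and $r=84$ at 768$\times$768 input.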
## Visualization To investigate the effect of visual token merging, we visualize the token merging results at the final layer of the vision transformer. In Figure [fig:vis_tokenmerge], we show the ten groups containing the largest numbers of tokens, each outlined in a different color. The visualization reveals that these largest groups typically correspond to blank or uniformly colored areas. By compressing such areas down to a single token, the visual token merging module reduces the length of the encoded sequence without losing much information, thereby achieving efficient visual encoding. ## Case study We conduct case studies with TinyChart on chart question answering, chart-to-table, chart-to-text, and chart redrawing in Figures [fig:vis_cases], [fig:table_cases], [fig:summary_cases], and [fig:redraw_cases]. In Figure [fig:vis_cases], we present a case study on ChartQA. As shown in Figure [fig:vis_cases] (a-c), much of the key information in a chart is conveyed by visually situated text within the image, which requires the model to process high-resolution images. Since ChartLlama only supports a resolution of 336, it struggles to retrieve accurate information from these charts. In contrast, thanks to visual token merging, TinyChart accepts higher-resolution inputs without introducing excessive computation, and thus successfully finds the clues related to the questions. Meanwhile, ChartLlama suffers from numerical errors on the calculative questions in Figure [fig:vis_cases] (d-e), while our Program-of-Thoughts learning method solves these problems accurately. These examples further illustrate the advantages of our methods. For chart-to-table extraction, TinyChart successfully extracts values from several visually diverse charts in Figure [fig:table_cases] (a-c), thanks to its strong text recognition ability with high-resolution input. However, as shown in Figure [fig:table_cases] (d), the model struggles to estimate the values of data points in the absence of OCR words. It can make reasonable predictions based on surrounding points, but hardly provides accurate values by reading the coordinate axes, indicating that the model still lacks the ability to understand spatial relationships across large areas. From Figure [fig:summary_cases], we observe that the model understands the data presented in the chart and generates descriptions and summaries in natural language. Although it retrieves data values correctly, it sometimes produces content that does not match the chart, as shown in Figure [fig:summary_cases] (c-d). This may be due to the inherent hallucination tendencies of MLLMs [chair](chair), [pope](pope), [wang2023evaluation](wang2023evaluation), [amber](amber), and may be alleviated by methods that address hallucination [vcd](vcd), [opera](opera), [jiang2024hallucination](jiang2024hallucination), [less_eos](less_eos). We present four cases of chart redrawing in Figure [fig:redraw_cases]. As shown in Figure [fig:redraw_cases] (a-c), TinyChart can generate Python code to redraw visually diverse chart types including lines, heatmaps, and rings. However, it struggles to draw unseen chart types such as 3D bar charts (Figure [fig:redraw_cases] (d)). This may be mitigated by improving the coverage of chart types in the training data through automatic data construction techniques [chartllama](chartllama), [chartx](chartx).
[^1]: sum, mean, average, ratio, mode, divide, dividing, differ, subtract, add, division, times, absolute, minus, exceed, below, less, fewer, bigger, biggest, greater, higher, longer, tallest, lowest, number, how many colors, what is the value # Conclusion This paper introduces TinyChart, a chart understanding multimodal large language model with 3 billion parameters. To address the inefficiency of lengthy visual token sequences for high-resolution images, TinyChart adopts a visual token merging module that merges similar vision tokens, enabling efficient encoding of high-resolution images. To tackle the challenge of learning numerical computation, we propose a Program-of-Thoughts learning method that trains the model to generate Python programs to answer questions. TinyChart achieves state-of-the-art (SOTA) performance on multiple chart understanding benchmarks, surpassing existing 13-billion-parameter chart MLLMs and outperforming closed-source models such as GPT-4V on ChartQA. Extensive ablation studies confirm the effectiveness of our methods. Our code and model are released at .
# ChartQA-PoT Details ## Dataset Statistics We build ChartQA-PoT from the images and questions in the training split of ChartQA [chartqa](chartqa). ChartQA-PoT consists of two subsets: Template-based PoT and GPT-based PoT. We present statistics of ChartQA-PoT in Table [tab:chartqa_pot]. We find that the answers provided by `gpt-3.5-turbo` are longer than the template-based PoT answers, since they cover more diverse scenarios.
| Statistic | Template-based PoT | GPT-based PoT | Total |
|:---|--:|--:|--:|
| Num. of samples | 119,281 | 21,303 | 140,584 |
| Num. of images | 17,498 | 15,521 | 18,133 |
| Avg. answer characters | 319.38 | 381.23 | 328.75 |
| Avg. answer tokens | 117.70 | 136.01 | 120.48 |

Statistics of ChartQA-PoT.
We further present the leading 2-grams of the questions (after removing stop words) in Template-based PoT and GPT-based PoT in Figure 3. GPT-based PoT covers more diverse ‘what’-type questions, while the questions in Template-based PoT are more evenly distributed across question types.
First 2-grams of the questions in ChartQA-PoT after removing stop words, shown separately for Template PoT and GPT PoT.
## Instructions for GPT-based PoT Figure [fig:gpt_prompt] shows the instructions for constructing GPT-based PoT answers. Note that we prompt `gpt-3.5-turbo` to produce Python code consisting only of assignment statements and to avoid loops and conditional statements, which simplifies the programs and reduces syntax errors. We also provide meta information, including the chart title, type, and colors, to `gpt-3.5-turbo`, since some questions rely on this information.
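Since Figure [fig:gpt_prompt] is not reproduced here, the following is a paraphrased, hypothetical prompt template based only on the constraints described above; the authors' exact wording may differ.

```python
# Hypothetical prompt template paraphrasing the described constraints; not the authors' exact prompt.
POT_PROMPT = """You are given the data table and meta information of a chart.
Chart title: {title}
Chart type: {chart_type}
Colors: {colors}
Data table:
{data_table}

Question: {question}

Write Python code that answers the question. Use only variable assignment
statements (NumPy functions are allowed); do not use loops or if statements.
Store the final result in a variable named "Answer".
"""

def build_pot_prompt(title, chart_type, colors, data_table, question):
    return POT_PROMPT.format(title=title, chart_type=chart_type, colors=colors,
                             data_table=data_table, question=question)
```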
TextSquare: Scaling up Text-Centric Visual Instruction Tuning 2024-04-19 Jingqun Tang, Chunhui Lin, Zhen Zhao, Shu Wei, Binghong Wu, Qi Liu, Hao Feng, Yang Li, Siqi Wang, Lei Liao, Wei Shi, Yuliang Liu, Hao Liu, Yuan Xie, Xiang Bai, Can Huang
Text-centric visual question answering (VQA) has made great strides with the development of Multimodal Large Language Models (MLLMs), yet open-source models still fall short of leading models like GPT4V and Gemini, partly due to a lack of extensive, high-quality instruction tuning data. To this end, we introduce a new approach for creating a massive, high-quality instruction-tuning dataset, Square-10M, which is generated using closed-source MLLMs. The data construction process, termed Square, consists of four steps: Self-Questioning, Answering, Reasoning, and Evaluation. Our experiments with Square-10M led to three key findings: 1) Our model, TextSquare, considerably surpasses open-source previous state-of-the-art Text-centric MLLMs and sets a new standard on OCRBench(62.2%). It even outperforms top-tier models like GPT4V and Gemini in 6 of 10 text-centric benchmarks. 2) Additionally, we demonstrate the critical role of VQA reasoning data in offering comprehensive contextual insights for specific questions. This not only improves accuracy but also significantly mitigates hallucinations. Specifically, TextSquare scores an average of 75.1% across four general VQA and hallucination evaluation datasets, outperforming previous state-of-the-art models. 3) Notably, the phenomenon observed in scaling text-centric VQA datasets reveals a vivid pattern: the exponential increase of instruction tuning data volume is directly proportional to the improvement in model performance, thereby validating the necessity of the dataset scale and the high quality of Square-10M.
# Introduction Recent research on multimodal large language models (MLLMs) has achieved significant advances in the text-centric visual question answering (VQA) domain [Text-MLLM-1](None), [Text-MLLM-2](None), [Text-MLLM-3](None), [docpedia](None), with several closed-source state-of-the-art (SOTA) models leading the way. Two representative examples are GPT4V [gpt4v](http://arxiv.org/pdf/2312.04344v2) and Gemini [gemini-pro](http://arxiv.org/pdf/2312.17661v1), which have demonstrated remarkable performance and have even surpassed human-level capabilities in certain aspects. Nevertheless, as illustrated in Figure 1, the performance of open-source models still lags significantly behind that of these pioneering closed-source models. This gap can be attributed to various factors, including model architecture, the scale of model parameters, image resolution, the volume of pretraining and instruction tuning data, and training strategies, among others.
The performance of TextSquare in various VQA tasks compared to existing models. (a) shows the comparison with state-of-the-art closed-source models (Gemini and GPT4V), and (b) shows the comparison with the leading open-source models. The numbers in parentheses after the model names in the legend indicate the average performance ranking across 10 text-centric multimodal benchmarks. TextSquare is marginally superior to GPT4V. Best viewed on screen.
Many pioneering studies [allava](None), [bonito](None), [sharegpt4v](None), [llavar](None) have recently conducted data-centric research into the challenges of insufficient instruction tuning data. For instance, Monkey [monkey](None) initially employed expert models to generate descriptions of different aspects of images, which were then summarized by GPT-4 to produce high-quality and detailed image caption data. For better text-based knowledge injection, LLaVAR [llavar](None) and TG-Doc [tg-doc](None) used GPT-4 to generate conversations for text-rich images by integrating OCR results into the instructions. To improve the image captioning ability of MLLMs, ShareGPT4V [sharegpt4v](None) constructs a high-quality image caption dataset through GPT4V. While these efforts have achieved remarkable success, they also leave some challenges unresolved. Image caption data and VQA data belong to different domains, with inconsistencies in the granularity and scope of image content presentation. Furthermore, the scale of synthetic data remains relatively small, preventing MLLMs from fully realizing their potential. The exploration of methods that leverage large-scale text-centric VQA data for instruction tuning of existing open-source models remains limited. To bridge the gap, this paper proposes a strategy termed Square for obtaining massive, high-quality text-centric VQA data from sophisticated and versatile closed-source MLLMs, resulting in the construction of a dataset (Square-10M) comprising tens of millions of instances for instruction tuning. Specifically, the method consists of four steps: Self-Questioning, Answering, Reasoning, and Evaluation. The self-questioning step utilizes the MLLM's capabilities in text-image analysis and understanding to generate questions related to the textual content of images. The answering step answers these generated questions, leveraging various prompting techniques such as Chain-of-Thought and few-shot prompting. The reasoning step probes the model for the reasoning behind its answers, leveraging the powerful reasoning abilities of MLLMs. The evaluation step evaluates the question-answer pairs, assessing the validity of the questions and their relevance to the textual content of the images, as well as the correctness of the answers, thereby improving data quality and mitigating hallucinations. Overall, Square comprehensively leverages the capabilities of MLLMs in various aspects, significantly enhancing data quality. Besides, enriching the diversity of images is also crucial. We collect a diverse set of text-rich images from various public sources, including natural scenes, charts, tables, receipts, books, slides, PDFs, documents, products, and web images. Subsequently, deduplication is performed on this collection. By applying the Square method to these images, Square-10M is constructed. Based on Square-10M, we achieve several remarkable results with extensive and rigorous experiments. First, as shown in Figure 1, our model (TextSquare) achieves comparable or superior performance to advanced closed-source models and substantially outperforms recent state-of-the-art open-source models on various benchmarks. Notably, TextSquare uses an image resolution of $700$ and has $8.6$B parameters.
Second, our experiments validate the beneficial impact of reasoning data on VQA tasks, demonstrating its ability to enhance model performance while mitigating hallucinations. With reasoning data for instruction tuning, TextSquare has a strong reasoning capability and can provide elaborate explanations in VQA scenarios. Last but not least, by leveraging the dataset's massive scale, we unveil the relationships between instruction tuning data scale, training convergence loss, and model performance. While a small amount of instruction tuning data can already elicit good behavior from an MLLM, it is not sufficient: large amounts of high-quality data still significantly reduce the convergence loss and improve performance. The performance of TextSquare grows and its convergence loss decreases as the instruction tuning data is continuously scaled up, which also demonstrates the effectiveness of our dataset. In summary, the main contributions of this paper can be categorized into four points:
- A high-quality dataset (Square-10M) comprising tens of millions of instances for text-centric VQA instruction tuning is constructed by comprehensively collecting text-rich images from various scenarios and employing the Square (Self-Questioning, Answering, Reasoning, and Evaluation) strategy on closed-source MLLMs.
- Leveraging Square-10M, TextSquare significantly outperforms existing open-source models and even achieves comparable or superior performance to SOTA closed-source models on various benchmarks, e.g., +0.9% on ChartQA, +2.1% on WTQ, +4.3% on SROIE. Notably, TextSquare outperforms GPT4V in overall rankings across ten text-centric benchmarks (ranking 2.2 *vs.* 2.4).
- Reasoning data is demonstrated to be beneficial in improving model performance and mitigating hallucinations in text-centric VQA scenarios, as it can deliver rich question-specific contextual information.
- Through extensive experiments, we reveal the relationships between data scale, convergence loss, and model performance for text-centric VQA instruction tuning, which demonstrates the effectiveness and necessity of Square-10M.

# Related Work ## Multi-modal Large Language Models Recent work has increasingly focused on introducing visual knowledge into LLMs [MLLM-1](None), [MLLM-2](http://arxiv.org/pdf/2308.12966v3), [MLLM-3](None). General attempts connect a visual encoder and an LLM with intermediate modules like a Projector [llava](None), Q-Former [blip2](None), or Perceiver Resampler [flamingo](None), and go through pre-training alignment and instruction fine-tuning for vision-language understanding. Recently, several studies [Text-MLLM-1](None), [Text-MLLM-2](None), [docpedia](None), [structextv2](None), [vary](None), [omniparser](None), [layoutllm](None), [hrvda](None) propose to enhance MLLMs' capabilities in understanding textual elements (OCR, text-centric VQA, etc.). Among them, mPLUG-DocOwl [Text-MLLM-1](None) creates novel instruction-following datasets to enhance the tuning process. TextMonkey [MLLM-3](None) adopts shifted window attention and filters out significant tokens. DocPedia [docpedia](None) and HRVDA [hrvda](None) enlarge the input resolution to bridge the gap between MLLMs and visual document understanding. Despite the extraordinary progress of existing open-source MLLMs, they still suffer from a huge gap against SOTA closed-source models like GPT4V [gpt4v](http://arxiv.org/pdf/2312.04344v2) and Gemini Pro [gemini-pro](http://arxiv.org/pdf/2312.17661v1).
In this paper, we propose to mitigate this gap by training with large-scale and high-quality instruction-following data. ## Text-Centric Visual Question Answering Text-centric visual question answering aims to understand the interactions between the textual and the visual elements in an image. Donut [donut](None) first proposes an end-to-end training method based on a Transformer without OCR. Pix2Struct [pix2struct](None) introduces a variable-resolution input representation to adapt to document images. DoCo [doco](None) enhances the visual representation of the image encoder in LVLMs by aligning the document objects of multi-modal inputs. BLIVA [bliva](None) enlarges the input token space by concatenating learned query embeddings and encoded patch embeddings. Several studies [Text-MLLM-2](None), [tg-doc](None), [llavar](None) have made data-centric attempts in this regard. UniDoc [Text-MLLM-2](None) constructs 600k document-oriented image-text pairs from PowerPoint presentations. LLaVAR [llavar](None) and TG-Doc [tg-doc](None) prompt text-only GPT-4 to generate conversations for text-rich images by integrating OCR results into the instructions. These efforts are restricted to small-scale annotations or generation based on uni-modal inputs. ## Generating Instruction-Tuning Data via LLMs The success of LLMs has inspired recent work to employ them as training data generators [sharegpt4v](None), [allava](None), [self-instruct](None), [synthetic-prompting](None). In this regard, we focus on generating instruction-following data. Self-Instruct [self-instruct](None) took the initial step towards synthesizing instructions via language models to improve instruction-following capabilities. Llama-GPT4 [llama-gpt4](None) uses GPT-4 to generate instruction-following data for LLM fine-tuning. Synthetic Prompting [synthetic-prompting](None) leverages a few handcrafted examples to prompt LLMs to generate more examples. Bonito [bonito](None) converts unannotated text into task-specific training datasets for instruction tuning. Recently, ALLAVA [allava](None) employs GPT4V to generate reasoning instructions and detailed answers from unlabeled images. All of the above attempts suffer from the low quality of the generated data and are typically performed at a small scale. In contrast, we collect a massive set of text-centric images (*i.e.*, tens of millions) and devise comprehensive generation methods and filtering rules to ensure the quantity and quality of the instruction tuning dataset.
Overview of Square-10M: the distribution of images, the average tokens of the QAs, etc.
# Square-10M: A Massive and High-quality Text-Centric VQA Instruction Tuning Dataset Square-10M is synthesized by our proposed Square pipeline, *i.e.*, Self-Questioning, Answering, Reasoning, and Evaluation. ## Overview of Square Figure 3 presents an overview of our proposed Square. Square generally consists of three stages for synthesizing high-quality instruction tuning data for text-centric VQA: (1) Data Collection for collecting large-scale images with textual elements of diverse properties. (2) Data Generation involves self-questioning, answering, and reasoning of the collected data. In this phase, the MLLM is prompted to generate VQA pairs based on the given image, as well as the reasoning behind its answers. (3) Data Filtering for self-evaluation of the generated content, aiming to discard meaningless questions and erroneous answers by employing the evaluation capabilities of MLLMs. The above procedures result in our Square-10M dataset, standing out with its massive and high-quality text-centric VQA pairs and reasoning context. To be more specific, a total of 3.8 million images with rich textual elements are collected from diverse sources. After that 20 million question-answer pairs are obtained from Data Generation. Finally, 9.1 million QA pairs as well as the reasoning context are distilled with our Square strategy. A more precise analysis of Square-10M is depicted in Figure 2. ## Data Collection The data collection strategy is driven by the primary objective of encompassing a broad range of real-world text-rich scenarios. To this end, we collect 3.8 million unlabeled text-rich images (Figure 2). These images exhibit diverse properties. For instance, Chart and Table focus on textual elements with intense statistical information; Slide, Screenshot, and WebImage are designed for the interaction between text and prominent visual messages; Document/PDF, Receipt, and e-commerce contain images with fine and dense text; Street-View is derived from natural scenes. The collected images form a mapping of the textual elements in the real world and constitute the foundation of our research on text-centric VQA.
Pipeline for the proposed Square strategy. Gemini's versatile multi-modal comprehension capabilities are utilized to synthesize Square-10M, which consists of four stages: self-questioning, answering, reasoning, and evaluation.
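The following is a compact, hypothetical sketch of the Square loop as described in this summary. `query_mllm` stands in for a call to a closed-source MLLM API (e.g., Gemini Pro), the prompts are paraphrased placeholders, and exact string matching stands in for the semantic-consistency check.

```python
def query_mllm(image, prompt: str) -> str:
    raise NotImplementedError("wrap your MLLM API call here")   # placeholder, not a real API

def square_pipeline(image, ocr_text: str, n_prompt_variants: int = 2):
    samples = []
    # Stage 1: Self-Questioning, grounded on OCR text from an expert model.
    questions = query_mllm(
        image, f"Extracted text: {ocr_text}\nAsk meaningful, non-trivial questions about the text in this image."
    ).splitlines()
    for q in questions:
        # Stage 2: Answering (chain-of-thought / few-shot prompting could be prepended).
        answer = query_mllm(image, f"Question: {q}\nAnswer the question step by step.")
        # Stage 3: Reasoning behind the answer.
        reasoning = query_mllm(image, f"Question: {q}\nAnswer: {answer}\nExplain the reasoning behind this answer.")
        # Stage 4: Evaluation / filtering via self-judgement.
        verdict = query_mllm(
            image, f"Is this question meaningful and this answer correct?\nQ: {q}\nA: {answer}\nReply yes or no."
        )
        # Multi-prompt consistency: re-ask with paraphrased prompts and require agreement
        # (exact match is a crude stand-in for the semantic consistency check).
        rephrased = [query_mllm(image, f"{q} (answer concisely, variant {i})")
                     for i in range(n_prompt_variants)]
        consistent = all(a.strip().lower() == answer.strip().lower() for a in rephrased)
        if verdict.strip().lower().startswith("yes") and consistent:
            samples.append({"question": q, "answer": answer, "reasoning": reasoning})
    return samples
```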
## Data Generation: Self-Questioning, Answering, and Reasoning We build our Square-10M dataset by employing the multi-modal understanding capabilities of Gemini Pro, one of the most advanced MLLMs. For each image selected from a specific data source, Gemini Pro is instructed to generate VQA pairs and reasoning context through the following three stages: **Stage 1: Self-Questioning.** In this stage, Gemini Pro is prompted to generate profound, meaningful, and non-trivial questions about the given image. We ask Gemini Pro to first comprehensively analyze the image and then raise questions based on its understanding, as shown in Figure 3. Considering that advanced MLLMs typically have a weaker understanding of textual elements than of visual elements, we also prepend the text extracted by expert OCR models to the prompt. **Stage 2: Answering.** Gemini Pro is then instructed to give appropriate answers to the generated questions. We leverage various prompting techniques, such as Chain-of-Thought and few-shot prompting, to enrich the contextual information and improve the reliability of the generated answers. Figure 3 shows an example prompt for generating answers to a given question. **Stage 3: Reasoning.** We require Gemini Pro to elaborate on the detailed reasons behind its answers. Such an effort forces Gemini Pro to think more about the connections between the questions and the visual elements, thus reducing hallucinations and providing accurate answers. Moreover, the generated reasons can serve as extra contextual information specific to individual questions, favoring possible research on the mechanism behind in-context learning. We present an example prompt for self-reasoning in Figure 3. ## Data Filtering: Self-Evaluation and Answering Consistency Despite the effectiveness of Self-Questioning, Answering, and Reasoning, the generated image-text pairs can still contain hallucinated content, meaningless questions, and erroneous answers. We therefore devise filtering rules based on the evaluation capabilities of LLMs to select high-quality VQA pairs. The whole filtering system is built on three aspects. **Self-Evaluation of MLLMs.** We prompt Gemini Pro as well as other advanced MLLMs to judge whether the generated questions are meaningful and whether the answers are good enough to correctly address them. Figure 3 depicts an example prompt for self-evaluation. **Multi-Prompt Consistency.** Besides direct evaluation of the generated content, we manually augment the prompt and context space in Data Generation. A correct and meaningful VQA pair should be semantically consistent when provided with different prompts. Specifically, in the Answering stage we provide Gemini Pro with different but semantically similar prompts to answer the given question. We then discard the VQA pairs if the generated answers are not semantically stable. An example is given in Figure 3. **Multi-Context Consistency.** Similar to Multi-Prompt Consistency, we further validate the VQA pairs by prepending the question with varied context information. Given the generated question, three types of answers are produced by Gemini Pro with different contexts: (1) Answering with reasoning. Gemini Pro answers the question with a detailed explanation prepended (*i.e.*, the content generated in the Reasoning stage). (2) In-context answering. Gemini Pro answers the question with chain-of-thought or few-shot prompts prepended. (3) Naive answering.
Gemini Pro answers the question with no extra context. We then discard the VQA pairs if the generated answers are not semantically consistent. # TextSquare: A Text-Centric Multimodal Large Language Model ## Model Architecture The model architecture of TextSquare follows the paradigm established by InternLM-Xcomposer2 [internlm-xcomposer2](None) and includes three integral components: (1) a Vision Encoder modified from OpenAI CLIP ViT-L-14-336 [clip](None), where the resolution is increased to 700 for improved performance; (2) an LLM based on InternLM-2 [internlm2](None), using InternLM2-7B-ChatSFT as the practical variant; and (3) a Projector, which semantically aligns the vision tokens and the text tokens. ## Supervised Fine-Tuning with Square-10M TextSquare is obtained by performing Supervised Fine-Tuning (SFT) with Square-10M. The SFT process comprises three stages: in the first stage, we unfreeze all three components (*i.e.*, the Vision Encoder, the LLM, and the Projector) and train the model at a resolution of 490. In the second stage, the input resolution is increased to 700 and only the Vision Encoder is trained to adapt to the resolution change. In the third stage, we further perform full-parameter fine-tuning at the resolution of 700. TextSquare demonstrates that with our Square-10M dataset, a model with 8B parameters and a normal-size image resolution can achieve extraordinary performance on text-centric VQA, surpassing most available MLLMs and even closed-source SOTA models. # Experiment ## Implementation Details The training data contains Square-10M and in-domain datasets (consistent with Monkey's SFT data). The training process is divided into three phases, using the same data and the AdamW [adamw](None) optimizer on 64 A100-80G GPUs. In the first phase, we fine-tune InternLM-Xcomposer2 with full parameters, with the learning rate decreasing from 1e-5 to 1e-6, taking about 9,520 GPU hours; in the second phase, we scale up the image resolution to 700 and train only the ViT, with the learning rate decreasing from 1e-4 to 1e-5, taking about 7,280 GPU hours; in the third phase, we perform full-parameter fine-tuning at an image resolution of 700, with the learning rate dropping from 1e-5 to 1e-6, taking about 12,350 GPU hours. ## Benchmark Evaluation We report results on Scene Text-centric VQA, Document-oriented VQA, Table VQA, Text-centric KIE, OCRBench, and General VQA for a comprehensive comparison of our model with existing models. The metrics for each benchmark are listed in Table [benchmark] in the Supplementary Material. **Document-Oriented Benchmark.** While documents typically have a clean background, their dense text and complex typography pose distinct challenges. To effectively evaluate our model, we select representative benchmarks including DocVQA [docvqa](None), ChartQA [chartqa](None), and InfographicVQA [infographicvqa](None). The results, detailed in Table [table-text-bench], show that TextSquare outperforms all open-source models on these three document-oriented VQA tasks with an average improvement of $3.5$%; specifically, DocVQA $84.3$% *vs.* $81.6$% (Cogagent and mPLUG-DocOwl 1.5), ChartQA $79.4$% *vs.* $72.7$% (Intern-Xcomposer2), and InfographicVQA $51.5$% *vs.* $50.4$% (mPLUG-DocOwl 1.5). On the ChartQA dataset, TextSquare outperforms GPT4V and Gemini Pro by a slight margin. Note that TextSquare employs an image resolution of 700, which is smaller than that of most document-oriented MLLMs.
Our model benefits from comprehensive, high-quality VQA data specific to the text in documents, which improves its ability to recognize and understand various document elements such as text, diagrams, and infographics. If the image resolution were further increased, we believe model performance would improve further, as demonstrated by Monkey et al. **Scene Text-centric Benchmark.** Since textual information is usually present in real-world scenes, the ability to answer text-based questions about images is an important aspect of the task. For evaluation, we utilize two datasets: TextVQA [textvqa](None) and AI2D [ai2d](None). As shown in Table [table-text-bench], although TextSquare achieves SOTA performance on the AI2D dataset, there is no major improvement over our baseline Intern-Xcomposer2 in this scenario, which may be because Intern-Xcomposer2 has already been adequately optimized with high-quality in-domain data. **Table VQA Benchmark.** Due to the complex structure of tables and their dense text, understanding table content remains a challenging problem. To evaluate comprehension of table content and structure, we choose two widely used datasets, Wiki Table Questions (WTQ) [wtq](None) and Table Fact (TabFact) [tabfact](None), as shown in Table [table-text-bench]. On the Table VQA benchmarks, TextSquare achieves the best performance among the leading models, with an average improvement of $3.0$%. This demonstrates that our model reaches a new level of table understanding, where high-quality generated table VQA and reasoning data play a key role. **Text-centric KIE Benchmark.** Text-centric key information extraction tasks are frequently encountered when processing various types of products, certificates, and receipts. We select a receipt information extraction dataset (SROIE) [sroie](None) and a product information extraction dataset (POIE) [poie](None), and convert the KIE task into a VQA task. TextSquare achieves the best performance on both datasets, with a substantial average improvement of $14.8$% (shown in Table [table-text-bench]). It is worth noting that the POIE training set is not included in our training data, which contains little data from product scenarios. This illustrates the extensive textual comprehension capabilities of our model. **OCRBench.** OCRBench [ocrbench](None) is a comprehensive benchmark consisting of 29 OCR-related assessments, covering text recognition, formula recognition, text-centric VQA, KIE, and more. TextSquare achieves the best performance on OCRBench apart from the closed-source models and becomes the first MLLM with about $10$B parameters to exceed $600$ points. This indicates that the model performs well in both text-centric perception and comprehension tasks, especially text recognition, for which little in-domain data is included in the training set.
| Method | VizWiz | VQAv2 | GQA | POPE$^{adv}$ | Average |
|:---|:--:|:--:|:--:|:--:|:--:|
| Qwen-VL [MLLM-2](http://arxiv.org/pdf/2308.12966v3) | 35.2 | 79.5 | 59.3 | - | - |
| Monkey [monkey](None) | 61.2 | 80.3 | 60.7 | 80.3$^*$ | 70.6 |
| Cogagent [cogagent](None) | 36.7$^*$ | **83.7** | 62.3$^*$ | 85.9 | 67.2 |
| DocOwl 1.5 [docowl-1.5](None) | 43.5$^*$ | 68.0$^*$ | 48.5$^*$ | 79.7$^*$ | 59.9 |
| Llava Next 34B [llava-next](http://arxiv.org/pdf/2404.05465v1) | 63.8 | **83.7** | **67.1** | 83.4 | 74.5 |
| GPT4V [gpt4v](http://arxiv.org/pdf/2312.04344v2) | 64.9$^*$ | 77.2 | 48.4$^*$ | 79.6$^*$ | 67.5 |
| Gemini Pro [gemini-pro](http://arxiv.org/pdf/2312.17661v1) | 42.8$^*$ | 71.2 | 52.2$^*$ | 84.5$^*$ | 62.7 |
| Xcomposer2 [internlm-xcomposer2](None) | 58.9$^*$ | 81.8 | 64.5 | 78.5 | 70.9 |
| TextSquare (ours) | **71.4** | 78.0 | 64.5 | **86.6** | **75.1** |

Quantitative comparison of our model with existing MLLMs on representative General VQA and hallucination evaluation benchmarks. VizWiz and POPE are relevant to both VQA and hallucination. Following Cogagent, we evaluate the adversarial part of POPE.
**General VQA and Hallucination Evaluation Benchmark.** General VQA requires learning both visual and textual information and a deep understanding of their inter-relationships. For general VQA, we validate on four benchmarks: VizWiz [vizwiz](None), VQAv2 [vqav2](None), GQA [gqa](None), and POPE [pope](None). The VizWiz and POPE benchmarks are also relevant for hallucination evaluation. The results are shown in Table 1. On VQAv2 and GQA, TextSquare shows no significant degradation compared to InternLM-Xcomposer2 and maintains comparable performance. TextSquare exhibits superior capabilities on VizWiz and POPE, outperforming the closest competing method by an average of $3.6$%. These results highlight the effectiveness of our approach, which is also able to mitigate model hallucinations, in particular through large-scale instruction tuning. We attribute this partly to the high-quality reasoning data, which provides detailed explanations for VQA. ## Qualitative Analysis As illustrated in Figure 4, TextSquare has a formidable capability to provide plausible explanations of its answers in a variety of text-centric VQA scenarios. Figure 4(a) shows that TextSquare has simple arithmetic capabilities. Figure 4(b) shows the ability to understand textual content and provide the approximate location of text within a dense-text image. Figure 4(c) shows comprehension of table structure and the ability to extract contextual information relevant to the question.
Qualitative results of VQA and reasoning for various text-centric scenarios.
| Model | OCRBench | DocVQA | ChartQA | InfoVQA | WTQ | SROIE | Average | |:---------------|:--------:|:------:|:-------:|:-------:|:----:|:-----:|:-------:| | Xcomposer2$^*$ | 571 | 74.8 | 73.2 | 41.6 | 40.3 | 44.7 | 54.9 | | TextSquare | 622 | 84.3 | 79.4 | 46.2 | 49.7 | 53.2 | 62.6 | Ablation study on Incorporating Square-10M for Instruction Tuning.
## Ablation Study **The Effect of Incorporating Square-10M for Instruction Tuning.** To verify the effectiveness of Square-10M, we fine-tune the baseline model InternLM-Xcomposer2 on the public text-centric VQA instruction tuning data (consistent with Monkey's training data). As shown in the table above, TextSquare substantially outperforms the fine-tuned Xcomposer2$^*$ on various text-centric VQA benchmarks by $7.7$%, which corroborates that Square-10M can fully exploit an MLLM's ability in text-centric VQA scenarios and that a large amount of high-quality instruction tuning data brings a major improvement in performance. **The Effect of the Evaluation Step of the Square Strategy.** As shown in Table [Tab1], there is a distinct improvement in model performance after incorporating the evaluation of the generated VQA data, which verifies that the evaluation step of the Square strategy improves the quality of the VQA instruction tuning data. **The Effect of VQA Reasoning Data on Model Performance and Hallucination Evaluation.** From Table [Tab2], we find that VQA reasoning data helps both to improve VQA performance and to mitigate hallucinations. Specifically, in terms of enhancing VQA performance, there are gains of 1.4% on DocVQA and 1.3% on ChartQA. In terms of mitigating hallucinations, there are gains of $2.7$% on POPE and $3.2$% on VizWiz.
The relationship between instruction tuning dataset scale, convergence loss, and model performance in text-centric VQA scenarios. Figures (a) and (b) show the relationship between data scale and convergence loss; in (b) the horizontal axis is scaled with log10. Figures (c) and (d) show the relationship between data scale and model performance; in (d) the horizontal axis is scaled with log10.
## Relationships between Instruction Tuning Data Scale, Convergence Loss, and Model Performance To explore the relationship between instruction tuning data scale, convergence loss, and model performance on the merged large-scale Square-10M and in-domain instruction tuning data, we conduct 10 sets of experiments at different data scales. The average performance of the models is evaluated on DocVQA, ChartQA, InfoVQA, WTQ, and SROIE. As shown in Figure 5(a)(b), the convergence loss of the model continues to decrease as the data scale grows, whereas the rate of decrease becomes progressively slower; the relationship between convergence loss and instruction tuning data scale approximately follows a logarithmic function. Similarly, as shown in Figure 5(c)(d), model performance improves as the instruction tuning data grows, but the rate of improvement continues to slow down, and their relationship is also approximately logarithmic. Holistically, there is a scaling law in the instruction tuning phase for text-centric VQA scenarios, where model performance is approximately proportional to the logarithm of the data scale (a small curve-fitting sketch is given at the end of this summary). This can guide the construction of potentially larger datasets and help predict model performance. # Limitation Although our approach achieves remarkable results in various scenarios, there are some limitations. First, training on large-scale data requires many GPUs for a long time, which greatly increases the training cost. Second, while the Square strategy improves the quality of synthetic data, it still cannot reach human-level quality. # Conclusion In this paper, we present the Square strategy for constructing a high-quality text-centric instruction tuning dataset (Square-10M). Leveraging this dataset, TextSquare significantly surpasses recent open-source models and even achieves performance comparable to GPT4V across various benchmarks. Furthermore, we derive the relationship between instruction tuning dataset scale, convergence loss, and model performance in order to pave the way for constructing even larger datasets. Our approach provides a data-centric perspective that revisits the role of instruction-tuning data in text-centric VQA, confirming that both the quantity and quality of data are crucial to model performance. We believe that further improving data quantity and quality is a promising direction for closing the gap between open-source models and the leading closed-source ones. # Supplementary Material ## Summary of the Evaluation Benchmarks We summarize the evaluation benchmarks used in this paper in Table [benchmark].
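As referenced above, the reported log-linear trend can be summarized by fitting performance ≈ a · log10(N) + b to measured points. The sketch below uses placeholder numbers, not the paper's measurements; it only illustrates the functional form.

```python
import numpy as np

# Placeholder measurements (replace with your own); N = number of instruction tuning samples.
data_scale  = np.array([0.5e6, 1e6, 2e6, 4e6, 8e6])
performance = np.array([55.0, 58.0, 60.5, 62.5, 64.0])   # average benchmark score (illustrative)

# Fit performance ≈ a * log10(N) + b.
a, b = np.polyfit(np.log10(data_scale), performance, deg=1)
predict = lambda n: a * np.log10(n) + b
print(f"fit: perf ≈ {a:.2f} * log10(N) + {b:.2f}; predicted at 16M samples: {predict(16e6):.1f}")
```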
Best Practices and Lessons Learned on Synthetic Data for Language Models 2024-04-11 Ruibo Liu, Jerry Wei, Fangyu Liu, Chenglei Si, Yanzhe Zhang, Jinmeng Rao, Steven Zheng, Daiyi Peng, Diyi Yang, Denny Zhou, Andrew M. Dai
The success of AI models relies on the availability of large, diverse, and high-quality datasets, which can be challenging to obtain due to data scarcity, privacy concerns, and high costs. Synthetic data has emerged as a promising solution by generating artificial data that mimics real-world patterns. This paper provides an overview of synthetic data research, discussing its applications, challenges, and future directions. We present empirical evidence from prior art to demonstrate its effectiveness and highlight the importance of ensuring its factuality, fidelity, and unbiasedness. We emphasize the need for responsible use of synthetic data to build more powerful, inclusive, and trustworthy language models.
# Introduction
The rapid advancement of artificial intelligence (AI) technologies has led to their widespread adoption across numerous domains, from assistant agents (e.g., ACT-1, from Adept AI[^1]) and software development (e.g., Devin, from Cognition Lab[^2]) to healthcare [singhal2022large](https://arxiv.org/abs/2212.13138) and finance [zheng2022ai](http://arxiv.org/pdf/2106.01901v1). However, the success of AI models heavily relies on the availability of large, diverse, and high-quality datasets for training and evaluation. Acquiring such datasets can be a significant challenge due to data scarcity [babbar2019data](http://arxiv.org/pdf/2208.00147v1), privacy concerns [abay2019privacy](http://arxiv.org/pdf/1801.01594v2), and the sheer cost of data collection and annotation [gilardi2023chatgpt](http://arxiv.org/pdf/2303.15056v2). Pessimists predict that we will run out of fresh text data in 2050 and image data in 2060 [villalobos2022will](https://arxiv.org/abs/2211.04325). Synthetic data has emerged as a promising solution to address these challenges [nikolenko2021synthetic](http://arxiv.org/pdf/1909.11512v1). Synthetic data refers to artificially generated data that mimics the characteristics and patterns of real-world data, but is created through algorithms [saxton2019analysing](https://openreview.net/forum?id=H1gR5iR5FX), generative models [borisov2022language](https://arxiv.org/abs/2210.06280), [meng2022generating](http://arxiv.org/pdf/2004.13952v2), or even simulations [vezhnevets2023generative](https://arxiv.org/abs/2312.03664), [liu2023training](https://arxiv.org/abs/2305.16960), rather than being directly created by humans. By leveraging synthetic data, we can not only overcome the limitations of real-world data but also unlock the potential to develop more robust, reliable, and fair AI models [lucini2021real](http://arxiv.org/pdf/2208.07943v1), [lu2023machine](https://arxiv.org/abs/2302.04062). One of the many benefits of synthetic data is that it can be generated at scale, providing an abundant supply of training and testing data for AI models. This is particularly valuable in domains where real-world data is scarce or difficult to obtain (e.g., weather data covering all conditions [li2023seeds](https://arxiv.org/abs/2306.14066), [lam2023learning](http://arxiv.org/pdf/2402.00059v1)). Second, synthetic data can be tailored to specific requirements, such as ensuring a balanced representation of different classes by introducing controlled variations (e.g., up-weighting low-resource languages in multilingual language learning [przystupa2019neural](https://doi.org/10.18653/v1/W19-5431)). This level of control over data characteristics can improve model performance and generalization. Third, synthetic data can help mitigate privacy concerns by creating anonymized or de-identified datasets that do not contain sensitive personal information [howe2017synthetic](https://arxiv.org/abs/1710.08874), [el2020practical](http://arxiv.org/pdf/2401.06883v1). This is crucial in domains such as healthcare, where patient privacy is of utmost importance [dahmen2019synsys](http://arxiv.org/pdf/2304.03243v1), [wei2019generative](http://arxiv.org/pdf/1910.05827v1). Despite its promise, synthetic data also presents challenges that need to be addressed. 
One of them is ensuring the factuality and fidelity of synthetic data [wood2021fake](https://doi.org/10.1109/ICCV48922.2021.00366), [heusel2017gans](https://proceedings.neurips.cc/paper/2017/hash/8a1d694707eb0fefe65871369074926d-Abstract.html), as models trained on false, hallucinated, or biased synthetic data may fail to generalize to real-world scenarios [van2023synthetic](http://arxiv.org/pdf/2305.09235v2), [guarnera2020deepfake](http://arxiv.org/pdf/2004.10448v1). Researchers must develop sophisticated generative models and evaluation metrics to create synthetic data that accurately reflects the complex patterns and relationships found in real-world data. Another challenge is the potential for synthetic data to amplify existing biases or introduce new ones if not carefully designed and validated [barbierato2022methodology](http://arxiv.org/pdf/2203.04462v1), [gupta2021transitioning](https://arxiv.org/abs/2105.04144). We believe rigorous testing and fairness assessments are necessary to mitigate these risks.

In this paper, we track the current state of synthetic data research and discuss current best practices and lessons learned. The rest of the paper is organized as follows. Section [sec:training] provides an overview of synthetic data generation techniques and their applications in model training, presenting case studies and empirical evidence. Section [sec:evaluation] discusses the usefulness of synthetic data in evaluation. Section [sec:limitation_risks] discusses the challenges and limitations of synthetic data, and in Section [sec:future] we outline potential solutions and future research directions.

[^1]: ACT-1:
[^2]: Devin:

# Synthetic Data in Training [sec:training]

Synthetic data, which is generated by mimicking authentic data collected from the real world, has proven to be an effective and relatively low-cost alternative to real data. This section explores several notable domains that leverage synthetic training data.

# Synthetic Data in Evaluation [sec:evaluation]

Synthetic data is widely used in evaluation from several perspectives:

#### Factuality.

AI systems may generate information or responses that are not grounded in factual knowledge or data, leading to the creation of misleading or false content, formally known as *hallucination* [ji2023survey](http://arxiv.org/pdf/2311.05232v1). Factuality evaluation aims to ensure the consistency of the knowledge in the AI system's output with the knowledge provided by its training data and knowledge base [ji2023survey](http://arxiv.org/pdf/2311.05232v1), [zhang2023siren](https://arxiv.org/abs/2309.01219). Early statistical hallucination evaluation methods relied on n-grams to directly calculate the overlap of vocabulary between the input and output content [dhingra2019handling](https://doi.org/10.18653/v1/P19-1483), [wang2020towards](https://doi.org/10.18653/v1/2020.acl-main.101). However, these methods have limitations, as they only consider lexical overlap and do not account for semantics or sentence meaning [ji2023survey](http://arxiv.org/pdf/2311.05232v1), making them unsuitable for evaluating more complex forms of hallucination. Subsequent methods shifted from statistical approaches to model-based methods, which are more robust than token-difference-based methods [honovich2021q2](https://doi.org/10.18653/v1/2021.emnlp-main.619). While these model-based evaluation methods are more advanced than their predecessors, they still have limitations.
For example, the models can only output the degree of hallucination and may struggle to pinpoint specific errors [falke2019ranking](https://doi.org/10.18653/v1/P19-1213). [feng-etal-2023-factkb](https://doi.org/10.18653/v1/2023.emnlp-main.59) propose to combine LLM generation with random walks on knowledge graphs to generate synthetic evaluation data for factuality that is aware of the entities and relations on the graphs. [Wei2024LongformFI](https://api.semanticscholar.org/CorpusID:268724304) created a synthetic dataset called LongFact for long-form factuality evaluation, using Google Search as the grounding source and an LLM for automated judgment, achieving human-level accuracy at a significantly lower cost [min2023factscore](http://arxiv.org/pdf/2402.05629v3).

#### Safety.

Red teaming is a powerful technique for evaluating the safety and robustness of AI models [ganguli2022red](https://arxiv.org/abs/2209.07858), [casper2023explore](https://arxiv.org/abs/2306.09442). By generating diverse and realistic scenarios designed to elicit unaligned or harmful outputs [casper2023red](http://arxiv.org/pdf/2302.10894v3), red teaming can expose vulnerabilities and weaknesses in AI systems [perez2022red](https://aclanthology.org/2022.emnlp-main.225). For example, [perez2022discovering](http://arxiv.org/pdf/2211.04476v2) use LMs to generate datasets for evaluating the behavior of other LMs. They end up producing 154 high-quality datasets, verified by humans, and discover new cases of inverse scaling where LMs get worse with size. [hubinger2024sleeper](https://arxiv.org/abs/2401.05566) leverage synthetic data to trigger backdoor attacks on LMs at scale; they find that LMs can exhibit deceptive behavior and create a false impression of safety under such attacks, and that standard "safety training" cannot easily remove such deception. These methods demonstrate the feasibility of using AI assistance to scale up human oversight [bowman2022measuring](https://arxiv.org/abs/2211.03540) over complex problems and unseen domains.

#### Assisting human evaluation.

Recent studies have shown that in many cases, synthetic judgments from large language models (LLMs) can serve as qualified, fast, and low-cost alternatives to actual human evaluation [doi:10.1073/pnas.2305016120](https://doi.org/10.1073/pnas.2305016120). Using GPT-4 as the judge, Alpaca Eval [alpaca_eval](https://github.com/tatsu-lab/alpaca_eval) and MT Bench [zheng2023judging](https://arxiv.org/pdf/2306.05685) are two popular benchmarks that measure the comprehensive abilities of LM-based chatbots. In coding tasks, synthetic environments are a common choice to aid human evaluation, as humans can make assessments more efficiently via actual execution and analysis of running logs. [gu2024cruxeval](https://arxiv.org/abs/2401.03065) propose CRUXEval, a code execution reasoning benchmark consisting of 800 Python functions generated by CodeLLaMA-34B. Similarly, [liu2024codemind](https://arxiv.org/abs/2402.09664) introduce CodeMind, a framework to gauge the code reasoning abilities of LLMs on Independent Execution Reasoning (IER), Dependent Execution Reasoning (DER), and Specification Reasoning (SR). All these evaluations based on synthetic data show strong correlation with real human judgments.
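To make the execution-based evaluation concrete, here is a minimal sketch, in the spirit of CRUXEval-style output prediction, of verifying a model's predicted output by actually running the function. The example function and the "predicted" output are illustrative placeholders, not items from any benchmark.

```python
# Minimal sketch of an execution-reasoning check in the spirit of CRUXEval:
# a model predicts the output of a function for a given input, and the
# prediction is verified by actually executing the code.
# The function source and the predicted output below are illustrative.

def check_output_prediction(function_src, fn_name, args, predicted_output):
    """Return True if the predicted output matches the real execution result."""
    namespace = {}
    exec(function_src, namespace)              # define the candidate function
    actual_output = namespace[fn_name](*args)  # run it on the given input
    return actual_output == predicted_output

FUNCTION_SRC = """
def f(xs):
    return sorted(set(xs))[:2]
"""

# Suppose the evaluated model predicted [1, 2] for the input [3, 1, 2, 1].
print(check_output_prediction(FUNCTION_SRC, "f", ([3, 1, 2, 1],), [1, 2]))  # True
```

Running the code provides a deterministic ground truth for the model's reasoning, which is exactly what makes such synthetic coding environments cheap to judge at scale.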
# Challenges and Limitations of Synthetic Data [sec:limitation_risks]

While synthetic data offers numerous benefits and applications, it is crucial to acknowledge and address the potential challenges and limitations associated with its use. This section delves into three significant concerns surrounding synthetic data:

#### Misuse of synthetic data might proliferate misinformation.

The potential misuse of synthetic data is a significant concern that must be addressed to ensure the responsible development of AI systems. Current AI models are becoming increasingly capable of generating human-like data, ranging from text [reid2024gemini](https://arxiv.org/abs/2403.05530), [team2023gemini](https://arxiv.org/abs/2312.11805) and images [saharia2022photorealistic](http://arxiv.org/pdf/2205.11487v1), [ramesh2022hierarchical](https://arxiv.org/abs/2204.06125) to songs[^1] and even videos (e.g., OpenAI Sora[^2]). This can be particularly dangerous when synthetic data is used to impersonate real people, manipulate public opinion, or influence political processes. Moreover, the dissemination of synthetic data-driven misinformation can erode trust in legitimate information sources, making it increasingly difficult for people to distinguish between truth and falsehood [byman2023deepfakes](http://arxiv.org/pdf/2209.09111v1), [rid2020active](http://arxiv.org/pdf/2005.13466v2). To mitigate these risks, it is crucial for researchers, developers, and policymakers to establish clear guidelines and best practices for the ethical generation and use of synthetic data, including robust mechanisms for detecting and countering synthetic misinformation [groh2022deepfake](http://arxiv.org/pdf/2105.06496v2). By proactively addressing these challenges, we can harness the benefits of synthetic data while minimizing its potential for harm.

#### Synthetic data might cause ambiguity in AI alignment.

The increasing use of synthetic data in aligning AI models (e.g., Constitutional AI [bai2022constitutional](https://arxiv.org/abs/2212.08073)) can introduce significant ambiguity and uncertainty. The goal of AI alignment is to ensure that AI systems behave in ways that are aligned with human values and intentions. However, synthetic data, which is artificially generated rather than collected from real-world sources, may not accurately represent the nuances and complexities of human values and preferences [zhou2024real](https://arxiv.org/abs/2403.05020). This discrepancy can lead to AI models learning from data that is biased [feng2023pretraining](https://arxiv.org/abs/2305.08283), [liu2021mitigating](https://ojs.aaai.org/index.php/AAAI/article/view/17744), ungrounded [liu2022mind](https://arxiv.org/abs/2210.05359), [patel2021mapping](https://openreview.net/forum?id=gJcEM8sxHK), or misrepresentative of real-world scenarios [weidinger2021ethical](https://arxiv.org/abs/2112.04359), [ji2023survey](http://arxiv.org/pdf/2311.05232v1). As a result, AI systems trained on synthetic data may exhibit behaviors that are misaligned with human expectations, potentially leading to unintended consequences or even harmful actions [zou2023universal](https://arxiv.org/abs/2307.15043), [anderljung2023frontier](https://arxiv.org/abs/2307.03718). Moreover, the ambiguity introduced by synthetic data can make it challenging to interpret and understand the decision-making processes of AI models [lightman2023let](https://arxiv.org/abs/2305.20050), further complicating the task of ensuring alignment. To mitigate these risks, it is crucial for researchers to carefully consider the limitations and potential drawbacks of using synthetic data in alignment research and to develop robust methods for validating and testing AI models trained on such data.
#### Training with synthetic data makes evaluation decontamination harder.

The use of synthetic data in model training poses significant challenges to fair evaluation. Evaluation benchmarks are often created by referring to public text sources, such as coursework websites or forums. Consequently, it is arguable that all publicly available benchmark test cases might occasionally be included in the pre-training data of LLMs [hoffmann2022empirical](http://arxiv.org/pdf/2309.08777v2), [gao2020pile](https://arxiv.org/abs/2101.00027). The use of synthetic data exacerbates this issue rather than mitigating it. Although the community has proposed several techniques to detect such evaluation contamination, such as *min-$k$% prob* [shi2023detecting](https://arxiv.org/pdf/2310.16789), which checks the probabilities of the $k$% long-tail tokens (a minimal sketch is given below), these token-level decontamination methods are inadequate when the model is trained with synthetic data. Synthetic data might include rephrased versions of the benchmark data [oren2023proving](https://arxiv.org/abs/2310.17623), [mattern2023membership](https://arxiv.org/abs/2305.18462), rendering token-level decontamination ineffective. In addition to developing more advanced evaluation contamination detection techniques, we recommend that model developers invest in creating and maintaining in-house and protected evaluation benchmarks. These proprietary benchmarks should be carefully safeguarded to prevent leakage and ensure the integrity of the evaluation process.

[^1]: Make songs with Suno AI:
[^2]: OpenAI Sora:
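As a concrete illustration, here is a minimal sketch of a min-$k$% prob style check, assuming a Hugging Face causal language model. The model name and the way a score would be thresholded are illustrative assumptions, not part of the original method.

```python
# Minimal sketch of a min-k% prob style membership check (Shi et al., 2023):
# score a passage by the average log-probability of its k% least likely tokens;
# unusually high scores suggest the passage may have been seen during training.
# The model ("gpt2") and any decision threshold are illustrative, not prescriptive.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def min_k_percent_prob(text, model, tokenizer, k=0.2):
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits            # (1, seq_len, vocab)
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    target_ids = enc["input_ids"][0, 1:]        # actual next tokens
    token_log_probs = log_probs[torch.arange(target_ids.numel()), target_ids]
    # average over the k% lowest-probability ("long-tail") tokens
    n = max(1, int(k * token_log_probs.numel()))
    lowest = torch.topk(token_log_probs, n, largest=False).values
    return lowest.mean().item()

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
score = min_k_percent_prob("The quick brown fox jumps over the lazy dog.", model, tokenizer)
print(score)  # higher (less negative) scores hint at possible contamination
```

As the surrounding text notes, this token-level signal is easily defeated once the benchmark has been paraphrased into synthetic training data, which is the core of the decontamination concern.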
# Directions for Future Work [sec:future]

As the field of synthetic data continues to evolve, there are several promising directions for future research and development. This section outlines three key areas that warrant further exploration:

#### Synthetic data scaling.

The impressive performance of many over-trained small language models (e.g., the Mistral series [jiang2023mistral](https://arxiv.org/abs/2310.06825) and the Gemma series [team2024gemma](https://arxiv.org/abs/2403.08295), *inter alia*) demonstrates the value of training on a large number of tokens (even beyond the compute-optimal Chinchilla recommendation [rae2021scaling](https://arxiv.org/abs/2112.11446)). However, whether similar conclusions hold for training with synthetic data is still an open question, as the quality of synthetic data may not be as consistent as that of real-world data [yu2024large](http://arxiv.org/pdf/2306.15895v2). Future research should investigate the scaling laws for synthetic data and determine the optimal balance between the quantity and quality of synthetic samples. This exploration could help us understand the most effective strategies for leveraging synthetic data in training large-scale language models, potentially leading to more efficient and cost-effective approaches [muennighoff2024scaling](http://arxiv.org/pdf/2202.03371v1).

#### Further improving quality and diversity of synthetic data.

While existing methods for generating synthetic data have shown promise, there is still room for improvement in creating high-quality, attributed synthetic samples that closely mimic real-world data. Future research should focus on developing new advanced techniques (or building on existing ones such as Generative Adversarial Networks (GANs) [goodfellow2020generative](http://arxiv.org/pdf/1810.12576v1) or Diffusion Models [ho2020denoising](https://proceedings.neurips.cc/paper/2020/hash/4c5bcfec8584af0d967f1ab10179ca4b-Abstract.html), *inter alia*) that can control and manipulate specific attributes of the generated data, enabling the creation of diverse and customizable synthetic datasets. Additionally, researchers should explore methods that can incorporate domain-specific knowledge to ensure the generated data adheres to the underlying constraints and patterns present in the target domain (e.g., via Retrieval Augmented Generation (RAG) [lewis2020retrieval](https://proceedings.neurips.cc/paper/2020/hash/6b493230205f780e1bc26945df7481e5-Abstract.html), [borgeaud2022improving](https://proceedings.mlr.press/v162/borgeaud22a.html)) while maintaining data quality. By advancing the state of the art in attributed synthetic data generation, we can unlock new opportunities for privacy-preserving analysis [assefa2020generating](http://arxiv.org/pdf/2111.12984v1) and model training across various fields, from healthcare (e.g., synthetic medical images [frid2018synthetic](http://arxiv.org/pdf/1803.01229v1), [wei2019generative](http://arxiv.org/pdf/1910.05827v1)) and finance (e.g., simulated trading trajectories [zheng2022ai](http://arxiv.org/pdf/2106.01901v1)) to the social sciences [argyle2023out](http://arxiv.org/pdf/2209.06899v1), [park2023generative](http://arxiv.org/pdf/2208.04024v1) and beyond.

#### Towards high-fidelity and more efficient scalable oversight.

As AI models become increasingly complex and autonomous, it becomes challenging to monitor and assess their behavior using traditional oversight methods that rely on human supervision or real-world data [amodei2016concrete](https://arxiv.org/abs/1606.06565). Future research should explore the use of synthetic data for high-fidelity scalable oversight of these advanced systems. Existing methods typically simulate certain scenarios of social interaction, such as debate [leike2018scalable](https://arxiv.org/abs/1811.07871), reflection [zhang2023exploring](https://arxiv.org/abs/2310.02124), or revisions [liu2023training](https://arxiv.org/abs/2305.16960), to obtain synthetic data, while new approaches could cover more comprehensive scenarios and more modalities [sun2023aligning](https://arxiv.org/abs/2309.14525), as recent studies have found many issues with simulations that cover only narrow [cheng-etal-2023-compost](https://doi.org/10.18653/v1/2023.emnlp-main.669) or over-simplified [zhou2024real](https://arxiv.org/abs/2403.05020) scenes. Looking forward, another growing direction is how to achieve scalable oversight more efficiently: given full control over synthetic data generation, we can likely provide more targeted oversight with less synthetic data. As the need for effective AI governance and regulation grows, synthetic data will play an increasingly vital role in enabling more trustworthy scalable oversight mechanisms that promote robust, accountable, and safe deployment of AI technologies for the benefit of society [askell2021general](https://arxiv.org/abs/2112.00861), [bowman2022measuring](https://arxiv.org/abs/2211.03540).

#### The emergent self-improvement capability.

We typically choose the most capable model to generate synthetic data, as its generation is of higher quality.
However, an intriguing question arises: can a model generate synthetic data that is better than the data it was trained on, thus enabling it to improve itself? This concept of self-improvement through synthetic data generation is an exciting avenue for future research. If a model can generate higher-quality data than its original training set, it could potentially bootstrap its own performance by iteratively learning from the enhanced synthetic data [chen2024selfplay](https://arxiv.org/pdf/2401.01335). This self-improvement capability could lead to the emergence of more advanced AI systems that can autonomously refine their skills and knowledge over time [burns2023weak](https://arxiv.org/abs/2312.09390), [huang-etal-2023-large](https://doi.org/10.18653/v1/2023.emnlp-main.67). Although recent work shows encouraging progress in this direction [chen2024selfplay](https://arxiv.org/pdf/2401.01335), [yuan2024self](https://arxiv.org/abs/2401.10020), the upper bound of self-improvement and the underlying reasons for its effectiveness remain open questions. Future research should investigate the theoretical underpinnings and practical feasibility of self-improvement through synthetic data generation in more diverse scenarios, examining the necessary conditions, potential limitations, and associated risks. By unlocking the potential of emergent self-improvement capabilities, we could enable more adaptable, efficient, and autonomous learning processes [lecun2022path](http://arxiv.org/pdf/1409.8027v2).

# Conclusion

Synthetic data has emerged as a promising solution to address the challenges of data scarcity, privacy concerns, and high costs in AI development. By generating realistic and diverse datasets, synthetic data enables the training and evaluation of AI models at scale across various domains. As we approach human-level or even superhuman-level intelligence, obtaining synthetic data becomes even more crucial, given that models need better-than-average-human quality data to progress. However, ensuring the factuality, fidelity, and lack of bias in synthetic data remains a critical challenge. Future research directions on synthetic data could focus on improving the fidelity and controllability of generative models and developing standardized evaluation and contamination protocols and tools. We could also explore the integration of synthetic data with other techniques and its application in other domains. Despite the challenges, the potential benefits of synthetic data in advancing AI research are significant. By leveraging synthetic data responsibly and effectively, we can build more powerful, inclusive, and trustworthy AI systems that benefit society as a whole.
Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset 2024-03-14 Hugo Laurençon, Léo Tronchon, Victor Sanh
Using vision-language models (VLMs) in web development presents a promising strategy to increase efficiency and unblock no-code solutions: by providing a screenshot or a sketch of a UI, a VLM could generate the code to reproduce it, for instance in a language like HTML. Despite the advancements in VLMs for various tasks, the specific challenge of converting a screenshot into a corresponding HTML has been minimally explored. We posit that this is mainly due to the absence of a suitable, high-quality dataset. This work introduces WebSight, a synthetic dataset consisting of 2 million pairs of HTML codes and their corresponding screenshots. We fine-tune a foundational VLM on our dataset and show proficiency in converting webpage screenshots to functional HTML code. To accelerate the research in this area, we open-source WebSight.
Hugging Face
Comparison of an original web page (input) on the left, and the rendering of the code generated by our model - Sightseer - (output) on the right. To match the visual aspect of the original webpage, the model picked a suitable image background from www.unsplash.com
# Introduction

Current advancements in vision-language models (VLMs) have significantly improved their capabilities, enabling them to master a variety of tasks including image captioning, question answering, and optical character recognition (OCR) [openai2023gpt4](https://arxiv.org/pdf/2303.08774), [geminiteam2023gemini](https://arxiv.org/pdf/2312.11805), [hong2023cogagent](https://arxiv.org/pdf/2312.08914), [liu2024llavanext](https://llava-vl.github.io/blog/2024-01-30-llava-next/). Despite these achievements, the task of converting screenshots of websites or web components into usable HTML code, a process highly valuable to web developers, remains relatively unexplored, particularly in the open-source community. The development and open-source release of a model capable of such a conversion could unlock new AI-powered tools for UI developers, facilitating the creation of no-code modules and plugins for design tools like Figma. For instance, the ability to rapidly transform a design sketch into a functional UI component and code could significantly increase the iteration pace for UI developers.

We posit that the primary challenge for VLMs to achieve proficiency in this specific task does not stem from the inherent difficulty of the task itself. Rather, it is the lack of a large, high-quality dataset of pairs of HTML codes and their associated screenshots that poses the primary obstacle. In fact, VLMs are commonly trained on web-scale datasets of image-text pairs [schuhmann2022laion5b](https://arxiv.org/pdf/2210.08402), [gadre2023datacomp](https://arxiv.org/pdf/2304.14108) or multimodal web documents [laurencon2023obelics](https://openreview.net/forum?id=SKN2hflBIZ), [zhu2023multimodal](https://openreview.net/forum?id=tOd8rSjcWz). Having such a dataset of screenshot-HTML pairs as an open and accessible artifact would significantly accelerate research in this area by enabling the community to inspect the data and its limitations, and to improve upon the dataset. Consequently, our initial focus is on developing a dataset useful for the fine-tuning of VLMs for this task. To accomplish this, several strategies can be considered:

1. *Leveraging existing webpages and their HTML codes.* The vast repository of HTML files available online (and often captured in web crawls like Common Crawl) presents a tempting resource for generating pairs of screenshots and corresponding HTML codes by simply rendering the HTML and capturing the output. However, this approach poses significant challenges. HTML files found on the web are often laden with noise like comments, scripts, or data, and can be excessively lengthy, encompassing a very large number of tokens, sometimes even exceeding the maximum sequence length of most current models. This complexity hinders a model's ability to accurately learn the correlation between the contents of a screenshot and the underlying HTML syntax. Additionally, HTML codes frequently incorporate references to external JavaScript (JS) or Cascading Style Sheets (CSS) scripts, or rely on files located in separate directories. This dependency complicates the creation of a self-contained HTML file that faithfully reproduces the intended design in a screenshot. Given these obstacles, we opted to forego this method in favor of a more controlled approach.
2. *Synthesizing HTML codes with Large Language Models (LLMs).* The most recent large language models, especially those trained extensively on programming languages, show remarkable proficiency in generating high-quality code applicable to various domains, including website development. This capability opens the door to artificially creating a vast corpus of HTML codes using an LLM specialized in coding that has been further fine-tuned to follow instructions. By adapting the prompts, we can introduce specific constraints to the code generation process, such as controlling the topic, the text length, or the image placement in the websites. This level of control not only ensures the production of relevant HTML code but also makes it more suitable for VLMs by providing cleaner, more concise, and structured data that models can be effectively trained on. Our study adopts this approach.

In response to the identified gap, we develop WebSight, a comprehensive synthetic dataset comprising 2 million examples of HTML code paired with corresponding screenshots. Leveraging this dataset, we proceed to fine-tune our forthcoming foundational VLM of 8 billion parameters, notably enhanced by robust OCR capabilities, to obtain the specialized model Sightseer. This fine-tuning process yields promising outcomes, demonstrating the model's proficiency in converting webpage screenshots into functional HTML code. Remarkably, the model also exhibits the versatility to adapt to untrained scenarios, such as transforming handwritten sketches into functional HTML code. To accelerate advancements in this direction, we open-source WebSight.

# Related work

[Nguyen2015ReverseEM](https://api.semanticscholar.org/CorpusID:7499368) uses a classical pipeline of interface element recognition (images, texts, containers, etc.) with computer vision and optical character recognition, followed by heuristics to generate code from these detections. The authors show the effectiveness of this approach on mobile UIs. [beltramelli2017pix2code](http://arxiv.org/pdf/1705.07962v2) introduces an end-to-end method for generating computer code from graphical user interface (GUI) screenshots using deep learning. The model, trained end-to-end, can generate code for different platforms (iOS, Android, and web) from a single input image. It uses convolutional and recurrent neural networks to interpret GUI screenshots and generate corresponding code. In [pix2struct](https://proceedings.mlr.press/v202/lee23g.html), the authors pre-train a VLM to convert masked screenshots of web pages into simplified HTML, and show the effectiveness of this training objective for pre-training a foundational VLM that transfers well to a variety of downstream tasks. Similar to Sightseer, their model accepts images of varying resolutions as input.

In our recent beta release of WebSight-v0.1, we provided a dataset with 823K synthetic pairs of screenshots and associated HTML + traditional CSS code. In the current version of WebSight discussed in this paper (v0.2), we introduce significant improvements. First, WebSight-v0.2 replaces the colored rectangles used as image placeholders in WebSight-v0.1 with real images that match the website's content. Additionally, we adopt Tailwind CSS to streamline the code and facilitate the creation of visually appealing designs. Other notable upgrades include a 2.5x larger dataset, higher-resolution screenshots, and richer metadata. WebSight-v0.1 has already proven to be a helpful resource.
In Design2Code [si2024design2code](https://arxiv.org/pdf/2403.03163), the authors create a benchmark for evaluating VLMs at generating HTML code given a screenshot. They also fine-tune an 18B-parameter VLM on WebSight-v0.1, after observing that models trained on synthetic examples outperform those trained on longer, more complex real-world code data.

# Construction of the dataset


Examples of synthetic web pages present in WebSight.
#### Overview of the strategy

Our synthetic HTML code generation process involves two key steps to maximize diversity and quality. First, we employ a smaller language model to generate a variety of website themes and designs. These creative outputs serve as the foundation for the next stage, where they are fed into the prompts of a larger language model, mostly trained on code data. This LLM then generates the final HTML code, ensuring that our dataset encompasses a wide range of styles while maintaining high code quality.

#### Generating diverse website concepts

We employ Mistral-7B-Instruct [jiang2023mistral](https://arxiv.org/pdf/2310.06825) to generate several million unique website concepts and designs with the prompt: Here are 5 examples of generated concepts:

#### Opting for Tailwind CSS over traditional CSS

Generating visually diverse and appealing designs requires more than just pure HTML. However, to simplify the learning process of VLMs, employing standalone code is preferable to managing separate files. In this context, Tailwind CSS emerges as an ideal solution. This utility-first framework allows creating unique designs by providing a wide array of utility classes, enables direct styling within the HTML document, and eliminates the need for external style files. Tailwind CSS offers an extensive array of predefined classes that mirror various CSS properties. By integrating these utility classes into HTML elements, we can efficiently style web pages, resulting in concise code that is easier for VLMs to learn from.

#### Using a code-specialized LLM to generate the HTML codes

To generate the final HTML codes, we leverage Deepseek-Coder-33b-instruct [guo2024deepseekcoder](https://arxiv.org/pdf/2401.14196), a state-of-the-art language model mostly trained on code data and fine-tuned to follow instructions. We use the prompt: An initial challenge was the text-only nature of our outputs, contrasting with real websites that contain many images. Integrating images into HTML code is challenging, especially when searching for images relevant to the context of the web page. However, we discovered an effective solution through photo stocks like `https://source.unsplash.com/`, which offers the capability to dynamically generate images based on keywords, thus providing images of any size that are relevant to any specified topic. After a filtering step in which we discard web pages with insufficient text, generic content, or images not aligning with the website's topic, we ended up with 2 million web pages.

#### Screenshot capture process

We use Playwright[^1] to visualize and capture the output of our generated HTML codes. We ensure that screenshots encompass the entire web page, regardless of its length. As a result, our dataset features screenshots in a wide range of resolutions. This diversity in image size and format is useful for enhancing the robustness of our model. A minimal rendering-and-capture sketch is shown at the end of this section.

#### Visualization of WebSight examples

Five examples present in WebSight are shown in Figure 2.
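For illustration, here is a minimal sketch of the rendering-and-capture step using Playwright with Chromium. The HTML snippet, including its Tailwind CDN and Unsplash references, is an invented placeholder rather than an actual WebSight sample.

```python
# Minimal sketch of rendering a generated HTML page and capturing a full-page
# screenshot with Playwright, in the spirit of the WebSight capture step.
# The HTML below is an illustrative placeholder, not a real WebSight example.
from playwright.sync_api import sync_playwright

html = """
<!DOCTYPE html>
<html>
  <head><script src="https://cdn.tailwindcss.com"></script></head>
  <body class="bg-gray-50">
    <h1 class="text-3xl font-bold text-center mt-10">Hello, WebSight</h1>
    <img src="https://source.unsplash.com/400x200/?landscape" class="mx-auto mt-6 rounded" />
  </body>
</html>
"""

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page(viewport={"width": 1280, "height": 720})
    page.set_content(html, wait_until="networkidle")    # wait for the Tailwind CDN and the image
    page.screenshot(path="sample.png", full_page=True)  # capture the entire page, not just the viewport
    browser.close()
```

Because `full_page=True` captures the whole rendered document, the resulting screenshots naturally vary in resolution with page length, which matches the diversity described above.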
# Fine-tuning a foundation vision-language model on WebSight

#### Model prerequisites for webpage conversion

For a model to accurately convert webpage screenshots into HTML code, it needs several capabilities: advanced OCR to transcribe text from images, spatial understanding to arrange elements on the page, and object recognition to include images similar to those in the input, following the strategy explained above. We use our forthcoming foundation VLM as the base model. It is built upon Mistral-7B [jiang2023mistral](https://arxiv.org/pdf/2310.06825) and SigLIP-SO400M [zhai2023sigmoid](https://arxiv.org/pdf/2303.15343), and uses the Patch n’ Pack strategy [dehghani2023patch](https://arxiv.org/pdf/2307.06304) to preserve the original aspect ratio of the input images, with a resolution of up to 980 pixels per side. This base model was trained mostly on OBELICS [laurencon2023obelics](https://openreview.net/forum?id=SKN2hflBIZ), synthetic captions of image/text pair datasets, and a combination of OCR datasets [biten2022ocridl](https://arxiv.org/pdf/2202.12985). Further insights into the model's architecture and its training process will be detailed upon its release.

#### Fine-tuning on WebSight

For the fine-tuning, instead of unfreezing all the weights, which would require significantly lowering the learning rate for stable training, we use the parameter-efficient DoRA method [liu2024dora](https://arxiv.org/pdf/2402.09353) with rank 64 (a hedged configuration sketch is shown below). We use the same learning rate as during pre-training, $10^{-4}$, with 2,016 examples per iteration for a total of 1,100 iterations, representing slightly less than one epoch. We find that the validation loss is not a good indicator of the quality of the trained model, and in particular of the quality of the generated code in real-world cases. Consequently, we perform checkpoint selection by manually inspecting generated samples rather than relying on the validation loss. Although the validation loss continued to decrease significantly over several epochs, this did not translate into an improved ability to generalize to websites that differ from those in the training dataset.
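The following is a hedged sketch of what a DoRA fine-tuning setup at rank 64 could look like with the Hugging Face peft library, which exposes DoRA via `LoraConfig(use_dora=True)`. The base model name, `lora_alpha`, and the target modules are illustrative stand-ins, since the actual Sightseer base VLM is not released in this excerpt.

```python
# Hedged sketch of DoRA fine-tuning at rank 64 with the Hugging Face peft
# library. The base model ("mistralai/Mistral-7B-v0.1"), lora_alpha, and
# target_modules are illustrative placeholders, not the actual Sightseer setup.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

dora_config = LoraConfig(
    r=64,                     # rank 64, as reported for Sightseer
    lora_alpha=64,            # illustrative scaling choice
    use_dora=True,            # DoRA: weight-decomposed low-rank adaptation
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # illustrative
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, dora_config)
model.print_trainable_parameters()  # only the DoRA adapters are trainable
# Training would then proceed with the pre-training learning rate of 1e-4.
```

Keeping the base weights frozen and training only the adapters is what allows the pre-training learning rate to be reused without destabilizing the run.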
# Qualitative evaluation

## Results on different screenshots

Comparison of an original web page (input) on the left, and the rendering of the code generated by our model, Sightseer, (output) on the right.
Figure 3 showcases various outputs from Sightseer when fed with simple website designs. Notably, in instances where the input contains a limited amount of text, this text tends to be accurately preserved in the output. Remarkably, Sightseer sometimes exhibits the capability to generalize beyond its training dataset to websites that differ significantly in appearance, as evidenced by its conversion of a handwritten website sketch into functional HTML code.

## Failure cases
Examples where Sightseer-generated codes fall short in replicating the intended design.
In our analysis, Sightseer struggles with complex website layouts, excessive text, or designs significantly divergent from its training data. In some instances, the generated code includes elements such as images, text, or buttons that fail to appear upon rendering. This can result from issues like text colored identically to its background or incorrect syntax, suggesting that Sightseer has not fully mastered the HTML + Tailwind CSS syntax. While the model produces visually more attractive websites, it sometimes produces errors not observed in our initial model[^2] trained on WebSight-v0.1, which used traditional CSS instead of Tailwind CSS. Because Tailwind CSS is a more recent framework than traditional CSS, it occurs less frequently in the pre-training data of the base LLM, and we hypothesize that the LLM therefore has greater difficulty fully mastering its syntax. We posit that starting with a foundational VLM pre-trained on a data mixture that includes text-only HTML + Tailwind CSS could significantly enhance Sightseer's translation accuracy, and we are exploring related strategies to achieve this improvement.

# Conclusion

In this work, we introduce WebSight, a large synthetic dataset of 2 million pairs of HTML codes and corresponding renderings, and Sightseer, a vision-and-language model with OCR ability fine-tuned on WebSight, as contributions towards automating the conversion of webpage screenshots to HTML code. By leveraging synthetic data generation and fine-tuning a high-capacity base VLM on the dataset, we demonstrate a viable path to accelerate UI development tasks and enhance no-code solutions with increasingly more powerful AI-powered tools. By open-sourcing WebSight, we aim to foster further innovation and research in this area.

[^1]:
[^2]:
MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts 2023-10-03 Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, Jianfeng Gao
Large Language Models (LLMs) and Large Multimodal Models (LMMs) exhibit impressive problem-solving skills in many tasks and domains, but their ability in mathematical reasoning in visual contexts has not been systematically studied. To bridge this gap, we present MathVista, a benchmark designed to combine challenges from diverse mathematical and visual tasks. It consists of 6,141 examples, derived from 28 existing multimodal datasets involving mathematics and 3 newly created datasets (i.e., IQTest, FunctionQA, and PaperQA). Completing these tasks requires fine-grained, deep visual understanding and compositional reasoning, which all state-of-the-art foundation models find challenging. With MathVista, we have conducted a comprehensive, quantitative evaluation of 12 prominent foundation models. The best-performing GPT-4V model achieves an overall accuracy of 49.9%, substantially outperforming Bard, the second-best performer, by 15.1%. Our in-depth analysis reveals that the superiority of GPT-4V is mainly attributed to its enhanced visual perception and mathematical reasoning. However, GPT-4V still falls short of human performance by 10.4%, as it often struggles to understand complex figures and perform rigorous reasoning. This significant gap underscores the critical role that MathVista will play in the development of general-purpose AI agents capable of tackling mathematically intensive and visually rich real-world tasks. We further explore the new ability of self-verification, the application of self-consistency, and the interactive chatbot capabilities of GPT-4V, highlighting its promising potential for future research. The project is available at https://mathvista.github.io/.
# Introduction [sec:introduction]

The complexity of mathematical reasoning is observed not only in textual scenarios but also, significantly, in visual contexts. For instance, when assessing a child's mathematical and reasoning capabilities, problems are often designed to encompass visual contexts in addition to arithmetic calculations [stipek1989developmental](http://arxiv.org/pdf/2002.02184v2), [pollitt2020assessing](http://arxiv.org/pdf/1506.02794v1). At the same time, AI agents with strong mathematical reasoning capabilities in visual contexts have a wide range of real-world applications, such as solving complex problems in educational disciplines [seo2015solving](http://arxiv.org/pdf/2308.02823v1), [wang2017deep](http://arxiv.org/pdf/2302.03145v1), helping analysts with logical queries about statistical data [wu2023bloomberggpt](None), [yang2023fingpt](None), and assisting in theorem proving and scientific discovery in advanced research fields [taylor2022galactica](http://arxiv.org/pdf/2405.11265v1), [dong2023large](None), [trinh2024solving](http://arxiv.org/pdf/2404.06405v2).
Accuracies of leading foundation models on our proposed MathVista across mathematical reasoning and visual context types. PoT GPT-4 is a textual, program-aided LLM augmented with the Bard caption and OCR text.
Numerous datasets have been curated to assess the mathematical reasoning abilities of AI systems, with most presented purely in text form. Some datasets, such as ChartQA [lu2021inter](None), [dahlgren2022clevr](None), [masry2022chartqa](None), have explored mathematical reasoning in vision-language settings. However, these datasets tend to either focus on specific tasks, like math word problems, or particular visual contexts, such as geometry problems or bar charts. General-purpose visual question answering (VQA) datasets on natural scenes contain only a small portion of questions necessitating mathematical reasoning, leaving a comprehensive investigation of vision-language reasoning within a mathematical framework largely unexplored.

Large Language Models (LLMs) [openai2022chatgpt](https://openai.com/blog/chatgpt), [openai2023gpt4](None) and Large Multimodal Models (LMMs) [google2023bard](https://bard.google.com/), [openai2023gpt4v](https://openai.com/research/gpt-4v-system-card), [team2023gemini](http://arxiv.org/pdf/2405.12107v1) have exhibited impressive problem-solving skills in many tasks and domains. Recently, some studies have aimed to augment existing LLMs with mathematical and scientific reasoning capabilities using external tools [lu2023chameleon](http://arxiv.org/pdf/2305.00061v1), [wang2023scibench](None). However, the ability of these foundation models to perform mathematical reasoning in visual contexts has not been systematically examined.

In this paper, we present MathVista, a consolidated **Math**ematical reasoning benchmark in **Vis**ual contexts. We propose a task taxonomy to guide the development of MathVista: (1) we identify seven mathematical reasoning types: *algebraic reasoning*, *arithmetic reasoning*, *geometry reasoning*, *logical reasoning*, *numeric common sense*, *scientific reasoning*, and *statistical reasoning*; (2) we focus on five primary tasks: *figure question answering* (FQA), *geometry problem solving* (GPS), *math word problem* (MWP), *textbook question answering* (TQA), and *visual question answering* (VQA); and (3) we encompass a diverse array of visual contexts, including natural images, geometry diagrams, abstract scenes, synthetic scenes, as well as various figures, charts, and plots. MathVista incorporates 28 existing multimodal datasets, including 9 math-targeted question answering (MathQA) datasets and 19 VQA datasets. In addition, we have created three new datasets (*i.e.*, IQTest, FunctionQA, PaperQA) which are tailored to evaluating logical reasoning on puzzle test figures, algebraic reasoning over functional plots, and scientific reasoning with academic paper figures, respectively. Overall, MathVista consists of 6,141 examples, with 736 of them being newly curated (Table [tab:statistics]). To facilitate fine-grained evaluation, examples are annotated with metadata, including question type, answer type, task category, grade level, visual context, and required reasoning skills. Detailed descriptions of data collection can be found in §[sec:dataset], §[sec:collection_guideline], and §[app:collection_details].

We conduct extensive experiments on MathVista to evaluate the reasoning abilities of 12 foundation models known for their leading performance in mathematical and multimodal reasoning. This ensemble includes three LLMs (*i.e.*, ChatGPT, GPT-4, Claude-2), two proprietary LMMs (*i.e.*, GPT-4V, Bard), and seven open-source LMMs.
For LLMs, we examine zero-shot and few-shot settings using two prompting strategies: chain-of-thought (CoT) [wei2022chain](http://arxiv.org/pdf/2201.11903v6) and program-of-thought (PoT) [chen2022program](http://arxiv.org/pdf/2211.12588v4). These LLMs can also be augmented with off-the-shelf visual models for image captioning and OCR. We establish a human performance baseline by engaging qualified human annotators with a high school diploma or higher. We show that MathVista, featuring advanced topics such as college curricula and scientific reasoning, is a very challenging benchmark, with human performance reaching only 60.3% accuracy. Our results indicate that CoT GPT-4, the best-performing LLM without visual tool augmentations, achieves an overall accuracy of 29.2%. Multimodal Bard, the best-performing LMM, achieves 34.8% (§[sec:results]), which attains only 58% of human performance (34.8% vs 60.3%). When augmented with Bard captions and OCR text, PoT GPT-4 obtains 33.9%, closely matching Multimodal Bard (§[sec:fine_grained_results]). Further analysis indicates that Multimodal Bard's failures arise from incorrect calculations and hallucinations caused by visual perception and textual reasoning (§[sec:qualitative_analysis]).
Examples of our newly annotated datasets: IQTest, FunctionQA, and PaperQA.
# The Dataset [sec:dataset]

## Collection Guidelines

As discussed previously, there is a notable gap in existing benchmarks, which primarily evaluate mathematical reasoning in textual contexts, overlooking the intrinsic visual nature of many mathematical problems. Our dataset, MathVista, is therefore motivated to bridge this gap, offering a robust evaluation benchmark for mathematical reasoning intertwined with visual understanding, thus pushing AI assistants towards general-purpose capabilities. Our benchmark adheres to the following collection guidelines: (1) it covers multiple tasks and topics to mirror real-world applications; (2) it incorporates diverse visual contexts and mathematical skills to foster a well-rounded evaluation; (3) it offers varying levels of challenge to effectively probe and uncover the potential limitations of current models; and (4) it provides robust evaluation settings for deterministic evaluations.

The taxonomy for this work is introduced as follows. We identify seven types of mathematical reasoning: *algebraic reasoning*, *arithmetic reasoning*, *geometry reasoning*, *logical reasoning*, *numeric common sense*, *scientific reasoning*, and *statistical reasoning*, with detailed definitions provided in §[sec:math_reasoning] and examples shown in §[app:math_examples]. We focus on five primary tasks: *figure question answering* (FQA), which centers around statistical reasoning over multiple charts and plots; *geometry problem solving* (GPS), which deals with geometrical topics; *math word problem* (MWP), which involves arithmetic reasoning in everyday scenarios; *textbook question answering* (TQA), which usually entails knowledge-intensive reasoning on scientific topics and figures; and *visual question answering* (VQA). Furthermore, our objective is to account for a diverse array of visual contexts, including natural images, geometry diagrams, abstract scenes, synthetic scenes, multiple charts and plots, scientific figures, tables, function plots, puzzle test figures, and more, with examples shown in §[app:visual_context].

## Data Collection [sec:data_collection]

#### Collection of MathQA datasets.

We collected nine MathQA datasets in multimodal settings, including four for GPS, two for MWP with visual contexts of synthetic scenes, abstract diagrams, and tables, and two for TQA on college curricula (see §[sec:source_data]). Annotations such as solutions, programs, parsing results, and grounded theorems are also collected, providing demonstration examples for LLMs. Each source dataset is limited to up to 400 examples to ensure a balanced representation of each source in our final compiled benchmark. In total, we collected 2,666 examples.

#### Review and collection of VQA datasets.

Many existing VQA datasets feature instances requiring mathematical reasoning abilities, such as arithmetic operations or numeric common sense. Incorporating these datasets enhances problem diversity in terms of tasks, domains, visual contexts, and reasoning skills involved. We reviewed more than 70 datasets, collecting 19 of them that contain math-related instances and are publicly available, as listed in §[sec:source_data]. Since these datasets are not originally math-targeted, we initially designed heuristic rules to automatically select examples likely to involve mathematical reasoning from a large pool of candidates. Examples with numeric answers or those containing quantity words (as listed in §[sec:automatic_selection]) in the questions were selected.
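As an illustration of such a heuristic pre-filter, here is a minimal sketch. The quantity-word list below is a hypothetical stand-in for the actual list given in §[sec:automatic_selection].

```python
# Hedged sketch of the heuristic pre-filter for math-related VQA examples:
# keep a question if its answer is numeric or if the question contains
# quantity-related words. The word list is an illustrative stand-in for the
# actual list provided in the paper's appendix.
import re

QUANTITY_WORDS = {"how many", "how much", "number", "total", "sum", "difference",
                  "average", "percent", "ratio", "more", "less", "least", "most"}

def is_numeric(answer):
    return bool(re.fullmatch(r"-?\d+(\.\d+)?", answer.strip()))

def likely_math_related(question, answer):
    q = question.lower()
    return is_numeric(answer) or any(word in q for word in QUANTITY_WORDS)

print(likely_math_related("How many dogs are in the picture?", "three"))  # True
print(likely_math_related("What color is the bus?", "red"))               # False
```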
This automatic filtration yielded 4,949 VQA-format examples, though some false positive examples remained. Therefore, we engaged three expert annotators to manually label these examples to determine whether they involve mathematical reasoning (more details in §[sec:human_is_math]). Utilizing majority voting and limiting each source dataset to 400 examples, we finalized a collection of 2,739 examples.

#### Collection of three new datasets.

While the source datasets we collected encompass multiple visual contexts and mathematical reasoning abilities, certain scenarios remain unaddressed: logical reasoning on puzzle test diagrams, statistical reasoning on functional plots, and scientific reasoning on academic figures. To address these gaps, we introduced three new datasets: IQTest, FunctionQA, and PaperQA, with examples illustrated in Figure [fig:our_new_3_datasets]. IQTest comprises 228 examples requiring inductive reasoning, abstract thinking, pattern prediction, and calculations, sourced from puzzle test figures on online learning platforms. FunctionQA, with 400 examples, emphasizes subtle visual perception of functional plots and algebraic reasoning concerning variables, expressions, equations, and functions. PaperQA is a novel dataset featuring questions derived from informative academic illustrations, including tables, figures, and charts from online education resources, with 107 examples sourced from papers released in August 2023 on Huggingface[^1]. To ensure data quality, all questions were manually annotated by graduate students in STEM fields and further refined through a rigorous review process. The GUI of the annotation tool is shown in Figure [fig:gui_new_data_annotation] in §[sec:annotate_new_data].

## Metadata Annotation

Fine-grained metadata facilitates a comprehensive analysis of models' reasoning capabilities across various aspects. To this end, we annotate the examples in MathVista with information including question type, answer type, language, source, category, task, grade level, and visual context, which can be accurately obtained from the details provided in the source datasets. MathVista features seven different types of mathematical reasoning abilities, as categorized in Table [tab:math_definition] (§[sec:math_reasoning]). Coarse labels of mathematical reasoning can be automatically obtained from the details of the source datasets. To verify the quality of the automatic annotation, expert annotators manually label the mathematical reasoning categories from the seven candidates for 1,000 examples, using the annotation tool illustrated in §[sec:human_math_reasoning]. The results show that 94.1% of the examples from automatic and human annotations have the exact same set of reasoning types, while 98.79% of the individual labels are identical, indicating that the automatic annotation of mathematical reasoning labels is highly accurate.

## Data Preparation and Release

MathVista consists of 6,141 examples, divided into two subsets: *testmini* and *test*. *testmini* contains 1,000 examples, intended for model development validation or for those with limited computing resources. The *test* set features the remaining 5,141 examples for standard evaluation. Notably, the answer labels for *test* will not be publicly released to prevent data contamination, and we will maintain an online evaluation platform.
To ensure that each source dataset is well represented in *testmini* and to maintain a distribution in *testmini* closely resembling the whole set, we adopted the following sampling strategy: (1) first, randomly sample questions with a threshold number of 4 for each source dataset; (2) then, randomly sample the remaining questions from each source dataset in proportion to its share of the entire set. The KL divergence and total variation (TV) distance between the *testmini* set and the entire set are 0.008 and 0.035, respectively, suggesting that *testmini* is close to the distribution of the whole set. We also conducted several quality checks to address any unidentified errors.
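The following is a minimal sketch of this two-stage sampling together with a total-variation check, using toy source-dataset counts as stand-ins for the real 31 sources (the paper additionally reports KL divergence).

```python
# Minimal sketch of the two-stage testmini sampling and the distribution check.
# `pool` maps each source dataset to its examples; the toy sizes below are
# placeholders, not the real MathVista source-dataset sizes.
import random
from collections import Counter

def sample_testmini(pool, total=1000, floor=4, seed=0):
    rng = random.Random(seed)
    picked = {}
    # Stage 1: guarantee a floor of `floor` examples per source dataset.
    for src, items in pool.items():
        picked[src] = rng.sample(items, min(floor, len(items)))
    # Stage 2: fill the remainder proportionally to each source's share of the pool.
    n_total = sum(len(v) for v in pool.values())
    remaining = total - sum(len(v) for v in picked.values())
    for src, items in pool.items():
        quota = round(remaining * len(items) / n_total)
        leftover = [x for x in items if x not in picked[src]]
        picked[src] += rng.sample(leftover, min(quota, len(leftover)))
    return picked

def tv_distance(p, q):
    keys = set(p) | set(q)
    sp, sq = sum(p.values()), sum(q.values())
    return 0.5 * sum(abs(p[k] / sp - q[k] / sq) for k in keys)

pool = {f"src{i}": list(range(i * 1000, i * 1000 + 200 * (i + 1))) for i in range(5)}
mini = sample_testmini(pool)
full_dist = Counter({s: len(v) for s, v in pool.items()})
mini_dist = Counter({s: len(v) for s, v in mini.items()})
print(tv_distance(full_dist, mini_dist))  # small values mean testmini mirrors the full set
```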
## Data Analysis

| Statistic | Number |
| --- | --- |
| Total questions | 6,141 |
| - Multiple-choice questions | 3,392 (55.2%) |
| - Free-form questions | 2,749 (44.8%) |
| - Questions with annotations | 5,261 (85.6%) |
| - Questions newly annotated | 736 (12.0%) |
| Unique number of images | 5,487 |
| Unique number of questions | 4,746 |
| Unique number of answers | 1,464 |
| Source datasets | 31 |
| - Existing VQA datasets | 19 |
| - Existing MathQA datasets | 9 |
| - Our newly annotated datasets | 3 |
| Visual context (image) classes | 19 |
| Maximum question length (words) | 213 |
| Maximum answer length (words) | 27 |
| Maximum number of choices | 8 |
| Average question length (words) | 15.6 |
| Average answer length (words) | 1.2 |
| Average number of choices | 3.4 |

Source dataset distribution of MathVista. FQA: figure question answering, GPS: geometry problem solving, MWP: math word problem, TQA: textbook question answering, VQA: visual question answering.
The main statistics of MathVista are presented in Table [tab:statistics]. There are two types of questions: multiple-choice and free-form. Answers to free-form questions are categorized as integers, floating-point numbers, or lists. The large number of unique images, questions, and answers ensures pattern diversity in MathVista. MathVista is derived from 31 source datasets, including three newly annotated datasets that address missing types of mathematical reasoning over specific visual contexts. Dataset examples in Table [tab:math_examples] (§[app:math_examples]) highlight the richness of the mathematical reasoning involved. Examples in §[app:visual_context] demonstrate the diverse visual contexts present in MathVista. Further details on data analysis are available in §[app:data_analysis].

[^1]:

# Experiments

## Evaluation Protocols [sec:evaluation_protocol]

Recent LLMs and LMMs are typically instructed to generate long responses rather than short text. Therefore, we propose a new strategy for benchmarking MathVista, unlike approaches using human-designed or template matching rules [lu2022learn](http://arxiv.org/pdf/2209.09513v2). The evaluation process consists of three stages: *response generation*, *answer extraction*, and *score calculation*. Initially, the baselines generate responses given the input query, which incorporates the task description, the question, the choices, and the metadata, using the template defined in Table [tab:promt_response_generation] (§[sec:promt_response_generation]). Next, the short answer text is extracted from the detailed response. We propose an answer extractor (§[sec:promt_answer_extraction]) based on LLMs such as GPT-4, inspired by its remarkable ability for text processing [wei2022chain](http://arxiv.org/pdf/2201.11903v6). A preliminary study of 200 examples shows that GPT-4 can extract the answer text with more than 99.5% accuracy. Finally, the extracted answer is normalized to the required answer format (e.g., an option letter or an integer), and the target metric scores are computed. Taking advantage of the fact that the instances in MathVista are either multiple-choice questions with textual answers or free-form questions with numerical answers, accuracy scores are used as the metric for deterministic evaluation. A minimal scoring sketch is given below, after the experimental setup.

## Experimental Setup [sec:experimental_setup]

We evaluate the models on MathVista under three setups: (a) *Text-only LLMs*, including ChatGPT [openai2022chatgpt](https://openai.com/blog/chatgpt), GPT-4 [openai2023gpt4](None), and Claude-2 [claude2](https://www.anthropic.com/index/claude-2), in zero-shot and two-shot settings with Chain-of-Thought (CoT) [wei2022chain](http://arxiv.org/pdf/2201.11903v6) and Program-of-Thought (PoT) [chen2022program](http://arxiv.org/pdf/2211.12588v4); (b) *Augmented LLMs*, where the LLMs are provided with additional visual information, including image captions generated by Multimodal Bard [google2023bard](https://bard.google.com/) and OCR text detected by EasyOCR [jaidedai2020easyocr](https://github.com/JaidedAI/EasyOCR); and (c) *LMMs*, including open-source models such as IDEFICS-9B [laurencon2023obelics](https://arxiv.org/pdf/2306.16527), mPLUG-OWL-LLaMA-7B [ye2023mplug](None), miniGPT-4-LLaMA-2-7B [zhu2023minigpt](None), LLaMA-Adapter-V2-7B [gao2023llamaadapterv2](None), InstructBLIP-Vicuna-7B [instructblip](https://arxiv.org/pdf/2305.06500), LLaVA-LLaMA-2-13B [liu2023llava](http://arxiv.org/pdf/2402.11690v1), and LLaVAR [zhang2023llavar](None), among others. We provide the prompts for LLMs and the hyperparameters used for LMMs in §[app:setup].
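As an illustration of the score-calculation stage, here is a minimal sketch of answer normalization and accuracy computation. The GPT-4-based answer extractor used in the paper is not reproduced here, and the normalization rules shown are simplified assumptions.

```python
# Minimal sketch of the score-calculation stage: a short answer extracted from
# a model's long response is normalized to the required answer type and
# compared against the ground truth. The normalization rules are simplified
# stand-ins for the protocol described in the paper.
def normalize_answer(extracted, answer_type, precision=2):
    extracted = extracted.strip()
    if answer_type == "option":                     # multiple-choice: keep the option letter
        return extracted.upper().strip("(). ")[:1]
    if answer_type == "integer":
        return int(float(extracted))
    if answer_type == "float":
        return round(float(extracted), precision)
    return extracted                                # free-form text / list fallback

def accuracy(predictions, references, answer_types):
    correct = 0
    for pred, ref, a_type in zip(predictions, references, answer_types):
        try:
            if normalize_answer(pred, a_type) == normalize_answer(ref, a_type):
                correct += 1
        except ValueError:                          # unparsable prediction counts as wrong
            pass
    return correct / len(references)

preds = ["(B)", "12", "1.20"]
refs  = ["B", "12", "1.2"]
types = ["option", "integer", "float"]
print(accuracy(preds, refs, types))  # 1.0
```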
## Experimental Results [sec:results]

We compare the performance of several models, including text-only LLMs, augmented LLMs, and LMMs, on MathVista in Table [tab:mathvista]. We include random chance (*i.e.*, one of the options in multiple-choice questions, and empty in the free-form questions) and frequency guess (§[sec:frequent_guess]) as naive baselines. Additionally, we established a human performance baseline using Amazon Mechanical Turk. Eligible human annotators must have a satisfactory annotating history, successfully pass the qualification examples, and possess a high school degree or higher. We asked each annotator to complete five questions within 20 minutes. Further details can be found in §[sec:human_performance].

Among text-only LLMs, all models outperform the random baselines, with the 2-shot GPT-4 using chain-of-thought (CoT) prompting achieving 29.2%. The limited performance of text-only LLMs suggests that our dataset requires models to reason within visual contexts for optimal results. When equipped with image captions and detected OCR text, augmented LLMs exhibit superior performance compared to their text-only counterparts on MathVista. Specifically, the best-performing augmented LLM is the 2-shot GPT-4 employing program-of-thought (PoT) prompting, which scores 33.9%.

The open-source models (IDEFICS to LLaVA) achieve underwhelming performance on MathVista. This can be attributed to their lack of math reasoning capabilities, text recognition (useful for math word problems), shape detection (useful for geometry problems), and chart understanding. Notably, these models utilize different architectures for processing the vision (e.g., OpenCLIP, CLIP, ViT-G) and language (e.g., LLaMA-1, LLaMA-2) components, different alignment strategies (e.g., MLP projection in LLaVA, Q-former in InstructBLIP, visual abstractor in mPLUG-Owl), and different instruction tuning data (e.g., 150K instruction-response pairs from LLaVA data, 3,500 instruction-response pairs from miniGPT-4). While fine-tuned with instruction-following data from text-rich images, LLaVAR does not perform well, indicating that strong text recognition abilities alone do not guarantee high performance on MathVista, which requires comprehensive visual perception and mathematical reasoning. This underscores that there are immense possibilities for innovations in models, data, or training objectives to improve the zero-shot performance of LMMs on MathVista.

## Fine-grained Results [sec:fine_grained_results]

We also report fine-grained scores for a comprehensive study of the capabilities of existing models across different tasks (Table [tab:mathvista]), mathematical reasoning abilities (Table [tab:mathvista], Figures [fig:tease_scores], [fig:math_reasoning_bar_chart]), visual context types (Figures [fig:tease_scores], [fig:visual_context_bar_chart]), and grade levels (Figure [fig:grade_level_bar_chart]). We perform an ablation study on the augmented LLMs and present the results in Table [fig:llm_ablation_study] (see §[sec:llm_ablation_study]). The gap in the performance of the augmented LLMs can be attributed to poor image captions, which may not adequately describe the math in visual contexts, the inability of the OCR to detect shapes useful for geometric reasoning, and the lack of mathematical reasoning capabilities.

## Qualitative Analysis [sec:qualitative_analysis]
Error analysis of Bard results: (a) presents errors in answers and explanations; (b) delves into the details of wrong explanations. Notations: “Answer” is “Ans.”, “Explanation” is “Exp.”, “Partially Correct” is “Partial”, and “Not applicable” refers to unanswerable or indeterminate cases.
image **Question:** Is the function ($f:$ $R$ to $R$) injective? **Choices:** (A) Yes (B) No **Response from Bard:** The function $f(x) = x^2$ is not injective because it takes two different inputs, $-1$ and $1$, to the same output, $1$. This means that there is no one-to-one correspondence between the domain and the range of the function. In the image, we can see that the graph of the function passes through the points $(-1, 1)$ and $(1, 1)$. This means that $f(-1) = f(1) = 1$. So, if we have two different inputs, $-1$ and $1$, they both map to the same output, $1$. This violates the definition of an injective function. Therefore, the answer is (B). **Correct output:**
image **Question:** Find $z$. **Choices:** (A) 7      (B) 9 (C) 12    (D) 15 **Response from Bard:** The Law of Cosines states that for a triangle with side lengths $a$, $b$, and $c$ and angle measure $\theta$ opposite side $c$, the following equation holds: $c^2 = a^2 + b^2 - 2ab\cos(\theta)$ In this case, we have $a = 7$, $b = 9$, and . Plugging these values into the Law of Cosines, we get: Since , we can simplify the equation to get: Taking the square root of both sides, we get . Since $\sqrt{130}$ is not among the answer choices, we round $z$ to the nearest integer, which is $\boxed{12}$. **Correct output:**
#### Success and failure analysis of Multimodal Bard.

In §1.3, we observe that Multimodal Bard achieves the highest average accuracy on . Here, we analyze its predictions through human evaluation to understand its modes of success and failure. To do so, we asked human workers from Amazon Mechanical Turk (AMT) to study Bard's predictions for 250 instances, given the math question, its associated image, and the ground truth from the dataset. Specifically, workers were instructed to decide whether the predictions contained the correct answer with the correct explanation. If a worker found the model's explanation incorrect, they had to choose whether the wrong explanation was due to a failure mode such as incorrect reasoning with *hallucination* or wrong calculations. In our setup, we define hallucination as the introduction of incorrect facts into the model explanation that are not mentioned in the context of the image or question (e.g., in Figure [fig:visual_hallucination] and Figure [fig:textual_hallucination]). More details can be found in §[sec:human_study_bard].

We present the distribution of the quality of Bard's predictions, as judged by the human annotators, in Figure 1 (a). We find that 44.6% of Bard's predictions have incorrect answers with incorrect explanations. Interestingly, we observe that Bard gives partially (6.8%) or completely (8.1%) incorrect explanations despite arriving at the correct answer for the input image and question, highlighting that it can reach the correct answer for the wrong reasons. In Figure 1 (b), we present the distribution over possible reasons when Bard provides incorrect explanations. Notably, we find that 49.6% of its responses contain hallucinations. Our analysis highlights that hallucination is a major source of errors in generative foundation models [lu2023dl4math](http://arxiv.org/pdf/2212.10535v2), [ji2023survey](http://arxiv.org/pdf/2311.05232v1). We also observe that the model responds with correct reasoning but either hallucinates (18.6%) or performs wrong calculations (19.5%), leaving the overall explanation incorrect.

#### Qualitative examples of Multimodal Bard.

We also present a few qualitative examples of Bard's predictions. In Figure [fig:bard_examples] (a), we find that Bard generates the correct answer with the correct explanation, including detecting the correct function (*i.e.*, $f(x) = x^2$) and analyzing its properties (*i.e.*, injectivity) to answer the question. However, in Figure [fig:bard_examples] (b), we observe that the model provides the correct answer (*i.e.*, 12) but with an incorrect explanation (*i.e.*, using the law of cosines when the question requires an understanding of the properties of isosceles triangles). We present more examples in §[sec:more_examples_bard]. Overall, our analysis of Bard highlights its modes of failure in detail, which could guide future foundation model design to address these issues.
image **Question:** A race car driver kept track of how many laps he drove in the past 5 days. What is the mode of the numbers? **Detected OCR text:** (\[55, 9\], Laps driven), (\[21, 34\], Day), (\[81, 37\], Number of laps), (\[1, 64\], Thursday),  (\[129, 67\], 53), (\[3, 99\], Friday),  (\[129, 99\], 53), (\[1, 126\], Saturday),  (\[129, 129\], 53), (\[3, 161\], Sunday),  (\[129, 161\], 55), (\[1, 188\], Monday),  (\[129, 191\], 55) **Response from GPT-4:** ``` python def mode(laps): return max(set(laps), key=laps.count) laps = [53, 53, 53, 55, 55] print(mode(laps)) ``` **Correct output:**
image **Question:** What is this function most likely be? **Choices:** (A) polynomial (B) trigonometric (C) exponential (D) logarithmic **Caption from Bard:** The image shows a graph of a function $f(x) = c$ on a graph paper. The $x$ axis is the horizontal axis and the $y$ axis is the vertical axis. The graph is a horizontal line that passes through the point $(0, c)$. ... The text “$(0, c)$” means that the point $(0, c)$ is on the graph. The text “$f(x) = c$” means that the function $f(x)$ is equal to $c$ for all values of $x$. ... The graph of a constant function is a horizontal line. . ... **Response from GPT-4:** The function shown in the image is a constant function, which is a type of polynomial function. Thus, the answer is A. **Correct output:**
#### Qualitative examples of Augmented GPT-4.

Augmented with external visual models, CoT GPT-4 and PoT GPT-4 achieve performance comparable to Multimodal Bard. As shown in Figure [fig:gpt4_examples] (a), provided with the accurate OCR text detected in the image, PoT GPT-4 correctly understands the structural information of the image and generates a code snippet to perform precise statistical reasoning. In Figure [fig:gpt4_examples] (b), the caption provides some accurate descriptions of the image (e.g., $f(x)=c$) along with hallucinations (e.g., $y=3$, the line passes through $(0,3)$) caused by the external Bard model. Although CoT GPT-4 predicts the correct answer from this partially correct information, the quality of the visual information supplied by external models affects visual perception and, in turn, the final mathematical reasoning performance. Examples in §[sec:model_comparison] show failure cases due to hallucination caused by external visual models.

# Related Work [sec:short_related_work]

Several benchmarks [amini2019mathqa](http://arxiv.org/pdf/1905.13319v1), [cobbe2021training](http://arxiv.org/pdf/2110.14168v2), [mishra2022lila](None), [frieder2023mathematical](http://arxiv.org/pdf/2306.16282v1) have emerged to assess the mathematical reasoning capabilities of LLMs, but most focus solely on text-based tasks. Current benchmarks, such as GSM-8K [cobbe2021training](http://arxiv.org/pdf/2110.14168v2), exhibit performance saturation. Given the rise of LMMs [li2023multimodal](http://arxiv.org/pdf/2309.10020v1), there is a need for robust multimodal benchmarks in scientific domains. To address this gap, we introduce a math reasoning dataset that incorporates visual contexts. VQA datasets [antol2015vqa](None), [gurari2018vizwiz](None), [mobasher101parsvqa](None) gauge the visual reasoning abilities of LMMs. Recent studies explore assessing LMMs beyond natural images, including abstract scenes, geometry diagrams, figures, charts, documents, and synthetic images [lu2021inter](None), [kahou2017figureqa](None), [masry2022chartqa](None). In this work, we introduce new datasets (IQTest, FunctionQA, PaperQA) to create a holistic benchmark for evaluating mathematical reasoning.

Generative foundation models like GPT-3, ChatGPT, GPT-4, Claude, and LLaMA have enabled diverse task solutions without fine-tuning. Specialized pretraining methods like Pix2Struct [lee2023pix2struct](None), MatCha [liu2022matcha](None), and UniChart [masry2023unichart](None) enhance chart reasoning in visual contexts. Models like LLaVA, miniGPT4, InstructBLIP, and Bard leverage large-scale image-text data, while specialized versions, such as LLaVAR [zhang2023llavar](None), [ye2023mplug](None), emphasize document understanding and math comprehension. Recent works [bitton2023visit](None), [yu2023mm](None) evaluate instruction-following and reasoning capabilities, underscoring the growing importance of generative foundation models in practical applications. We introduce as a benchmark to evaluate their math reasoning capabilities in varied visual contexts.

# Conclusion

# Detailed Related Work [sec:related_work]

#### Mathematical reasoning benchmarks.

Recently, numerous benchmarks [amini2019mathqa](http://arxiv.org/pdf/1905.13319v1), [cobbe2021training](http://arxiv.org/pdf/2110.14168v2), [mishra2022lila](None), [frieder2023mathematical](http://arxiv.org/pdf/2306.16282v1) have been proposed to evaluate the mathematical reasoning capabilities of Large Language Models (LLMs).
However, most of these are textual only [lu2023dl4math](http://arxiv.org/pdf/2212.10535v2), despite a substantial amount of mathematical information and reasoning being encapsulated in visual modalities. Meanwhile, some datasets exhibit performance saturation; for instance, GPT-4 achieves 92.0% accuracy on GSM-8K [cobbe2021training](http://arxiv.org/pdf/2110.14168v2), a dataset of grade-school mathematics questions. On the other hand, the recent rapid advancement of Large Multimodal Models (LMMs) necessitates the establishment of robust multimodal benchmarks. However, current multimodal reasoning benchmarks provide limited coverage of rigorous and scientific domains [antol2015vqa](None), [kembhavi2016diagram](http://arxiv.org/pdf/1603.07396v1), [kahou2017figureqa](None), [mathew2022infographicvqa](None), which are key components for creating general-purpose AI assistants. To bridge this gap, it is crucial to develop a robust math reasoning dataset that integrates visual contexts.

#### Vision-language reasoning benchmarks.

High-quality evaluation datasets and benchmarks are a cornerstone for assessing the progress of machine learning models in solving real-world tasks [liao2021we](None). Prior studies such as VQA [antol2015vqa](None), [goyal2017making](None), VizWiz [gurari2018vizwiz](None), and ParsVQA-Caps [mobasher101parsvqa](None) assess the general-purpose visual question answering abilities of LMMs, with or without task-specific training, on open-ended questions about images. In addition, several works focus on evaluating specific skills of LMMs beyond natural scenes, such as abstract scenes and shapes [antol2015vqa](None), [lu2021iconqa](None), [ji2022abstract](http://arxiv.org/pdf/2211.16492v1), geometry diagrams [seo2015solving](http://arxiv.org/pdf/2308.02823v1), [lu2021inter](None), [chen2022unigeo](None), [cao2022augmented](http://arxiv.org/pdf/2206.02978v1), figures and charts [methani2020plotqa](None), [masry2022chartqa](None), [kahou2017figureqa](None), [chang2022mapqa](None), [kafle2018dvqa](None), documents (text in images) [singh2019towards](None), [mathew2022infographicvqa](None), [liu2023hidden](None), and synthetic images [dahlgren2022clevr](None), [li2023super](None), [bitton2023breaking](None). Besides, there has been significant progress on developing datasets to judge LMMs on skills that require external knowledge [schwenk2022okvqa](None), [shah2019kvqa](None), common-sense reasoning [zellers2019recognition](http://arxiv.org/pdf/2402.17213v1), [yin2021broaden](http://arxiv.org/pdf/2402.17213v1), scientific knowledge [lu2022learn](http://arxiv.org/pdf/2209.09513v2), [kembhavi2017you](None), [kembhavi2016diagram](http://arxiv.org/pdf/1603.07396v1), and medical understanding [zhang2023pmc](None), [lau2018dataset](http://arxiv.org/pdf/2311.18681v1). In this work, we create new datasets (IQTest, FunctionQA, PaperQA) and subsequently design a benchmark for a holistic evaluation of the math reasoning capabilities of LMMs.

#### Generative foundation models and their evaluation.
Recently, there has been a surge of generative foundation models [bommasani2021opportunities](http://arxiv.org/pdf/2110.10024v1) that are trained on web-scale data, such as GPT-3, ChatGPT, GPT-4, Claude, LLaMA, and LLaMA-Adapter [brown2020language](http://arxiv.org/pdf/2112.07522v2), [openai2022chatgpt](https://openai.com/blog/chatgpt), [openai2023gpt4](None), [claude2](https://www.anthropic.com/index/claude-2), [touvron2023llama](None), [llamaadapter2023](None), with the ability to solve a wide range of downstream tasks [wei2022emergent](http://arxiv.org/pdf/2403.15796v2) without any task-specific finetuning. Prior work has focused on evaluating their ability to respond to queries from various disciplines, grounded in text, such as QA, math, medicine, coding, and science [bubeck2023sparks](http://arxiv.org/pdf/2303.12712v5), [nori2023capabilities](None), [chen2021evaluating](http://arxiv.org/pdf/1810.11895v3), [fu2023codeapex](None), [sun2023scieval](None), [wang2023scibench](None), [huang2023c](http://arxiv.org/pdf/2305.08322v3), [huang2022language](http://arxiv.org/pdf/2404.04619v1), [liu2023agentbench](None), [llamaadapter2023](None). Prior work, such as Pix2Struct [lee2023pix2struct](None), MatCha [liu2022matcha](None), and UniChart [masry2023unichart](None), has focused on developing specialized pretraining recipes for improved math and chart reasoning in visual contexts.

On the vision-language side, there are several generative foundation models such as LLaVA, miniGPT4, InstructBLIP, Flamingo, LLaMA-Adapter V2, and Multimodal Bard [liu2023llava](http://arxiv.org/pdf/2402.11690v1), [zhu2023minigpt](None), [instructblip](https://arxiv.org/pdf/2305.06500), [alayrac2022flamingo](http://arxiv.org/pdf/2205.07065v1), [awadalla2023openflamingo](None), [gao2023llamaadapterv2](None), [google2023bard](https://bard.google.com/) that are trained on vast amounts of paired [schuhmann2022laion](None), [sharma2018conceptual](None), [lin2014microsoft](None) and interleaved image-text data [zhu2023multimodal](None). In addition, there has been recent development of specialized versions of these LMMs for document understanding, where the visual context requires text recognition and math understanding [zhang2023llavar](None), [ye2023mplug](None). In recent times, there have been several works, such as Visit-Bench, LVLM-eHub, and MMBench [bitton2023visit](None), [yu2023mm](None), [liu2023mmbench](None), [xu2023lvlm](None), [shao2023tiny](None), that assess their instruction-following and reasoning capabilities. As generative foundation models become more relevant to real-world applications, unlike prior work, we propose to benchmark their capabilities of math reasoning (logical, arithmetic, statistical) on a diverse set of visual contexts (word problems in images, natural scenes, geometrical shapes, and plots).

# Limitations of the Benchmark

# Data Collection Guidelines [sec:collection_guideline]

## Mathematical Reasoning Definition [sec:math_reasoning]

Seven mathematical reasoning types are defined in Table 1.
Definitions and proportions of seven mathematical reasoning categories in .
| **Math Reasoning** | **Description** |
|:---|:---|
| Arithmetic Reasoning () | It covers the fundamental operations such as addition, subtraction, multiplication, division, and understanding of number properties. It may also include the ability to interpret numerical data in different forms. |
| Statistical Reasoning () | It focuses on data interpretation and analysis, including measures (mean, median, mode), dispersion metrics (standard deviation, range), probability concepts, regression, correlation, and data inferences. It also identifies trends, outliers, and patterns. |
| Algebraic Reasoning () | It encompasses understanding variables, equations, and the manipulation of expressions with polynomials and exponents. It also covers solving simple to complex equations, and grasping functions, their properties, and graphical depictions. |
| Geometry Reasoning () | It emphasizes spatial understanding, analysis of 2D and 3D figures, and reasoning about their shapes, sizes, and relationships. It includes symmetry, congruency, similarity, area, volume, and transformations. |
| Numeric Commonsense Reasoning () | It involves intuitive understanding of daily numerical concepts, including understanding time differences, numerical judgment, and estimates. It covers temporal reasoning, spatial numeric assessments, and practical uses like budgeting and time reading. |
| Scientific Reasoning () | It deals with the application of mathematical concepts in scientific contexts. This includes scientific notations, formula use, understanding rates, proportions, and percentages in practical situations, and problem-solving in scientific inquiries. |
| Logical Reasoning () | It focuses on critical thinking and deduction from provided information, including pattern recognition, sequence understanding, predictions, and statement evaluation. Key components include premises, conclusions, and the use of abstract reasoning. |
## Mathematical Reasoning Examples [app:math_examples]
| **Math** | **Examples** | |:---|:---| | ARI | image | | **Solution:** | | | Find the cost of the silk scraps. Multiply: \$9.08 $\times$ 4 = \$36.32 | | | Find the cost of the canvas scraps. Multiply: \$8.17 $\times$ 4 = \$32.68 | | | Now find the total cost by adding: \$36.32 + \$32.68 = \$69 | | | She spent \$69. | | | **Answer:** 69 | | | STA | image | | **Answer:** 2 | | | ALG | image | | **Choices:** (A) larger than (B) equal to (C) smaller than | | | **Answer:** (A) larger than | | | | | | **Question:** How many zeros does this function have? | | | **Answer:** 1 | | | | | | **Question:** What is the value of $y$ at $x=1$? | | | **Answer:** 0 | | | | | | GEO | image | | **Diagram logic forms:** | | | `PointLiesOnLine(D, Line(B, A))` | | | `PointLiesOnCircle(B, Circle(D, radius))` | | | `PointLiesOnCircle(A, Circle(D, radius))` | | | `PointLiesOnCircle(C, Circle(D, radius))` | | | **Answer:** (C) 8.5 | | | NUM | image | | **Named entities:** Winston Churchill, Charles de Gaulle | | | **Wiki caption**: Winston Churchill and General de Gaulle at Marrakesh, January 1944 | | | **Answer:** 16 | | | SCI | image | | **Answer:** 5.77 | | | LOG | image | | **Solution:** | | | Circle + Square = 5, Triangle + Triangle = 8, | | | Triangle = 4. | | | Circle + Triangle = 7, Circle = 3. | | | Therefore Square = 2 | | | **Answer:** 2 | | Examples of seven mathematical reasoning categories in .
## Visual Context Types [app:visual_context]
Examples of the visual context for the geometry diagram type.
Examples of the visual context for the synthetic scene type.
Examples of the visual context for the bar chart type.
Examples of the visual context for the natural image type.
Examples of the visual context for the scientific figure type.
Examples of the visual context for the table type.
Examples of the visual context for the function plot type.
Examples of the visual context for the abstract scene type.
Examples of the visual context for the puzzle test type.
Examples of the visual context for the scatter plot type.
Examples of the visual context for the line plot type.
Examples of the visual context for the pie chart type.
Examples of the visual context for the document image type.
Examples of the visual context for the medical image type.
Examples of the visual context for other types, including word cloud, map chart, radar chart, violin plot, and heatmap chart.
## Source Dataset Summary [sec:source_data] The source datasets are summarized in Table 3.
| **Dataset** | **Category** | **Task** | **Context** | **Math Skill** | |:---|:--:|:--:|:--:|:--:| | IQTest (Ours) | Math-Targeted | FQA | Puzzle Test | Logical, Arithmetic | | PaperQA (Ours) | Math-Targeted | FQA | Charts and Plots | Scientific | | FunctionQA (Ours) | Math-Targeted | TQA | Function Plot | Algebraic | | Geometry3K `\citeyearpar{lu2021inter}`{=latex} | Math-Targeted | GPS | Geometry Diagram | Geometry, Algebraic | | GeoQA+ `\citeyearpar{cao2022augmented}`{=latex} | Math-Targeted | GPS | Geometry Diagram | Geometry, Algebraic | | GEOS `\citeyearpar{seo2015solving}`{=latex} | Math-Targeted | GPS | Geometry Diagram | Geometry, Algebraic | | UniGeo `\citeyearpar{chen2022unigeo}`{=latex} | Math-Targeted | GPS | Geometry Diagram | Geometry, Algebraic | | CLEVR-Math `\citeyearpar{dahlgren2022clevr}`{=latex} | Math-Targeted | MWP | Synthetic Scene | Arithmetic | | IconQA `\citeyearpar{lu2021iconqa}`{=latex} | Math-Targeted | MWP | Abstract Scene | Arithmetic | | TabMWP `\citeyearpar{lu2023dynamic}`{=latex} | Math-Targeted | MWP | Table | Statistical, Arithmetic | | SciBench `\citeyearpar{wang2023scibench}`{=latex} | Math-Targeted | TQA | Scientific Figure | Scientific | | TheoremQA `\citeyearpar{chen2023theoremqa}`{=latex} | Math-Targeted | TQA | Scientific Figure | Scientific | | ChartQA `\citeyearpar{masry2022chartqa}`{=latex} | General VQA | FQA | Charts and Plots | Statistical | | FigureQA `\citeyearpar{kahou2017figureqa}`{=latex} | General VQA | FQA | Charts and Plots | Statistical | | DVQA `\citeyearpar{kafle2018dvqa}`{=latex} | General VQA | FQA | Bar Chart | Statistical | | MapQA `\citeyearpar{chang2022mapqa}`{=latex} | General VQA | FQA | Map Chart | Statistical | | PlotQA `\citeyearpar{methani2020plotqa}`{=latex} | General VQA | FQA | Scatter Plot | Statistical | | DocVQA `\citeyearpar{mathew2022infographicvqa}`{=latex} | General VQA | FQA | Document Image | Statistical | | AI2D `\citeyearpar{kembhavi2016diagram}`{=latex} | General VQA | TQA | Scientific Figure | Scientific | | ScienceQA `\citeyearpar{lu2022learn}`{=latex} | General VQA | TQA | Scientific Figure | Scientific | | TQA `\citeyearpar{kembhavi2017you}`{=latex} | General VQA | TQA | Scientific Figure | Scientific | | A-OKVQA `\citeyearpar{schwenk2022okvqa}`{=latex} | General VQA | VQA | Natural Image | Arithmetic, Numeric | | KVQA `\citeyearpar{shah2019kvqa}`{=latex} | General VQA | VQA | Natural Image | Arithmetic, Numeric | | ParsVQA-Caps `\citeyearpar{schwenk2022okvqa}`{=latex} | General VQA | VQA | Natural Image | Arithmetic, Numeric | | TextVQA `\citeyearpar{singh2019towards}`{=latex} | General VQA | VQA | Natural Image | Arithmetic, Numeric | | VizWiz `\citeyearpar{gurari2018vizwiz}`{=latex} | General VQA | VQA | Natural Image | Arithmetic, Numeric | | VQA2.0 `\citeyearpar{goyal2017making}`{=latex} | General VQA | VQA | Natural Image | Arithmetic, Numeric | | PMC-VQA `\citeyearpar{zhang2023pmc}`{=latex} | General VQA | VQA | Medical Image | Scientific | | VQA-RAD `\citeyearpar{lau2018dataset}`{=latex} | General VQA | VQA | Medical Image | Scientific | | Super-CLEVR `\citeyearpar{li2023super}`{=latex} | General VQA | VQA | Synthetic Scene | Arithmetic | | VQA-AS `\citeyearpar{antol2015vqa}`{=latex} | General VQA | VQA | Abstract Scene | Arithmetic | Summary of the 31 different source datasets in . Among these, FunctionQA, IQTest, and PaperQA are our newly annotated datasets. The table provides details on their category, task, visual context, and primary mathematical reasoning skill types.
# Data Collection Details [app:collection_details] ## Automatic Selection of Mathematical Problems [sec:automatic_selection]
most, least, fewest, more, less, fewer, largest, smallest, greatest, larger, smaller, greater, highest, lowest, higher, lower, increase, decrease, minimum, maximum, max, min, mean, average, median, total, sum, add, subtract, difference, quotient, gap, half, double, twice, triple, square, cube, root, approximate, approximation, triangle, rectangle, circle, square, cube, sphere, cylinder, cone, pyramid, multiply, divide, percentage, percent, ratio, proportion, fraction, rate
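As a rough illustration of how such a keyword list can be used to flag candidate mathematical problems automatically, consider the sketch below. The abbreviated `QUANTITY_WORDS` list and the matching logic are assumptions for illustration, not the exact filtering script used for .

``` python
import re

# Abbreviated subset of the quantity and math keywords listed above.
QUANTITY_WORDS = [
    "most", "least", "fewest", "more", "less", "fewer", "largest", "smallest",
    "mean", "average", "median", "total", "sum", "add", "subtract", "difference",
    "percentage", "percent", "ratio", "proportion", "fraction", "rate",
]

def is_candidate_math_problem(question):
    """Return True if the question mentions any quantity or math keyword."""
    tokens = re.findall(r"[a-z]+", question.lower())
    return any(word in tokens for word in QUANTITY_WORDS)

# Example usage (hypothetical question string):
# is_candidate_math_problem("What is the difference of largest and smallest bar?")  # -> True
```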
## Human Labeling of Mathematical Problems [sec:human_is_math]
GUI for labeling if a problem involves mathematical reasoning.
We are compiling a dataset that incorporates image context and involves mathematical reasoning (MathQA in visual contexts). We have gathered a set of examples in which some involve mathematical reasoning, while others do not.

In our task, a question can be classified as a mathematical problem if it:

- Involves numbers or symbols in the question text or the image context, AND requires further operations or transformations to be performed on them to reach a solution.
- Involves more complex forms of mathematical reasoning, including logical reasoning, abstract thought, and understanding of patterns.

Based on the definition above, a problem is classified as a negative example (NOT involving mathematical reasoning) if it:

- Does not involve any numbers or quantity words, OR
- Involves only counting, reading, or recognizing numbers, OR
- Relies solely on factual information, such as recalling years and dates.
We developed an annotation tool, as illustrated in Figure 1, to enable expert annotators to label problems that involve mathematical reasoning. Annotators were trained using detailed instructions, as shown in Table [tab:instruction_is_math], along with a variety of examples: positive ones that involve mathematical reasoning and negative ones that do not. We provided three labeling options:

- *Yes* - This indicates that the problem involves mathematical reasoning.
- *No* - This indicates that the problem does not involve mathematical reasoning.
- *Unsure* - This option should be selected if it is uncertain whether the problem involves mathematical reasoning. (Annotators are advised to use this option sparingly.)

Annotators may also leave comments if they find anything incorrect or offensive, for removal at a later stage.

In our study, we employed the Fleiss Kappa score to conduct an inter-annotator agreement analysis among the three annotators tasked with labeling examples based on mathematical reasoning. The Fleiss Kappa score is a statistical measure used to evaluate the reliability of agreement between multiple raters, providing a quantifiable metric to assess consistency across different annotators. A score of 1 indicates perfect agreement, while a score of 0 suggests no agreement beyond what would be expected by chance. Our analysis yielded a Fleiss Kappa score of 0.775, indicating a substantial level of consistency among the annotators. This high degree of agreement underscores the reliability of our annotation process and affirms the quality of the labeled data generated for our study.
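For reference, the Fleiss Kappa score can be computed with `statsmodels` as in the sketch below. The toy label matrix is fabricated purely for illustration and does not reproduce our annotations.

``` python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Toy example: rows are labeled problems, columns are the three annotators.
# Label codes: 0 = "No", 1 = "Yes", 2 = "Unsure".
labels = np.array([
    [1, 1, 1],
    [0, 0, 1],
    [1, 1, 1],
    [0, 0, 0],
    [1, 2, 1],
])

# Convert rater labels into an items-by-categories count table, then score it.
table, _ = aggregate_raters(labels)
print(fleiss_kappa(table, method="fleiss"))
```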
## Annotating Three New Datasets [sec:annotate_new_data]

GUI for annotating our new source datasets.
## Human Labeling of Mathematical Reasoning [sec:human_math_reasoning]
GUI for labeling mathematical reasoning skills.
# More Dataset Analysis [app:data_analysis] #### Question distribution. Apart from English questions, contains 6.57% non-English questions, including languages such as Chinese and Persian. The multilingual feature necessitates that models be capable of understanding and processing multiple languages to ensure accurate results across the dataset. As illustrated in Table [fig:source_dataset], the average number of words in English questions within is 15.58, while the maximum number of words in a question reaches 213. Figure 1 further elucidates the distribution of word counts, highlighting the diverse patterns of questions. features two types of questions: multiple-choice questions and free-form questions. For multiple-choice questions, the average number of choices is 3.4, while the maximum number of choices is 8. In the case of free-form questions, answers can be integers, floating-point numbers, or lists, which can be converted into a standard format. The standard settings in question and answer types facilitate consistent accuracy evaluation for existing models.
The distribution of the number of words per question in . Questions with a length greater than 60 are categorized as 61 for visualization simplicity.
#### Dataset category and task type. Source datasets in can be categorized into two types: math-targeted VQA datasets, which are originally proposed for assessing mathematical reasoning, and general VQA datasets, which address visual reasoning in everyday scenarios. The distribution proportions of these two categories (55.4% vs. 44.6%, as illustrated in Figure 2) within enable a balanced examination of mathematical reasoning in both domain-specific and general-purpose applications. The distribution of the five tasks contained within is visualized in Figure 3. The relatively balanced distribution of these tasks enhances the benchmarking robustness that our dataset provides.
Category distribution of problems within .
Task type distribution of problems within .
#### Grade level. The datasets within are categorized into four distinct grade levels: *elementary school*, *high school*, *college*, and *not applicable*, each representing a different level of reasoning complexity and contextual application. The *elementary school* category aligns with the typical mathematical curriculum of elementary education, introducing basic topics such as arithmetic operations and introductory geometry. *High school* level questions delve into more complex mathematical concepts such as algebra, geometry, and introductory calculus. The *college* category encapsulates the highest level of complexity, featuring questions on advanced mathematical and scientific concepts like calculus, linear algebra, and physics. Questions without specific grade levels are categorized as *not applicable*. The distribution of questions across these grade levels is visualized in Figure 4. This structured categorization enriches the diversity of , providing a meaningful framework for evaluating and benchmarking the mathematical and visual reasoning capabilities of various models across different educational contexts, thereby assessing their practical utility and educational relevance.
Distribution of questions across different grade levels within .
#### Visual context. The datasets within encompass over 10 different visual contexts (with the distribution shown in Figure 5), crucial for evaluating models’ ability to interpret and reason across diverse visual information. Common visual contexts include geometry diagrams, synthetic scenes, bar charts, natural images, and scientific figures as illustrated in Figure [fig:2] to Figure [fig:13]. Less frequent, yet equally important visual contexts such as medical images, word clouds, map charts, radar charts, violin plots, and heatmap charts are depicted in Figure [fig:14] and Figure [fig:15]. These visual contexts, ranging from common to specialized representations, challenge the models to decode and reason with varying visual information, contributing to a more robust and comprehensive evaluation. The diversity in visual contexts enriches , enhancing the benchmarking robustness and providing a solid foundation for understanding the practical utility and domain-specific performance of various models across different domains and applications.
Visual context distribution within .
#### Mathematical reasoning ability. The datasets within encompass a spectrum of seven distinct mathematical reasoning types, facilitating a thorough evaluation of models’ mathematical reasoning capabilities. Figure 6 illustrates the portion of each reasoning type involved in the problems, with arithmetic being the most frequent and logical reasoning being the least frequent. This distribution reflects the varying degrees of mathematical reasoning required across different problems. Figure 7 further delineates the distribution of reasoning types, showcasing a mean of 1.45. The sparse distribution observed aids in the precise analysis of each type’s performance by the models, providing a nuanced understanding of their strengths and weaknesses across different mathematical reasoning domains. This structured representation of mathematical reasoning types within not only enriches the dataset but also significantly contributes to a more robust and comprehensive evaluation of models, aiding in the identification of areas for improvement and the development of more proficient mathematical reasoning models.
Portion of each mathematical reasoning type involved in the problems of .
Distribution of the number of mathematical reasoning types within .
# More Details on the Setup [app:setup]

## Frequent Guess [sec:frequent_guess]

We employ a strategy where the most frequent answers in the *testmini* set are utilized as predictions for the various question and answer types. For multiple-choice questions, the most frequent option is selected based on the number of available options. For instance, option $B$ is chosen for questions with two options, aligning with the answer distribution in *testmini*. Similarly, for questions requiring an answer type of integer, a floating-point number with one decimal place, a floating-point number with two decimal places, or a list, we use $2$, $1.2$, $0.21$, and $[0, 2, 0, 2, 1, 7, 1, 2, 0, 3, 0, 6]$ respectively, in accordance with the answer distribution observed in *testmini*.
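A minimal sketch of this frequent-guess baseline is given below. Only the two-choice case ("B") and the free-form fallback values are stated above; the remaining option entries and the field names are illustrative assumptions.

``` python
# Most frequent free-form answers observed in the testmini set, keyed by answer type.
FREQUENT_FREE_FORM = {
    "integer": 2,
    "float_1": 1.2,     # one decimal place
    "float_2": 0.21,    # two decimal places
    "list": [0, 2, 0, 2, 1, 7, 1, 2, 0, 3, 0, 6],
}

# Most frequent option letter per number of choices. Only the two-choice case is
# stated in the text; other entries would be filled from the testmini distribution.
FREQUENT_OPTION = {2: "B"}

def frequent_guess(question_type, answer_type=None, choices=None):
    """Return the frequent-guess prediction for a single problem."""
    if question_type == "multi_choice":
        # Fall back to "B" when the choice count is not in the lookup table.
        return FREQUENT_OPTION.get(len(choices), "B")
    return FREQUENT_FREE_FORM[answer_type]
```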
## Prompt for Answer Extraction [sec:promt_answer_extraction]

The prompt used to instruct GPT-4 for answer extraction is illustrated in Table 1.

| **Element** | **Prompt** |
|:--:|:---|
| Task description | Please read the following example. Then extract the answer from the model response and type it at the end of the prompt. |
| Example 1 | **Question:** Which number is missing? |
| | **Model response:** The number missing in the sequence is 14. |
| | **Extracted answer:** |
| Example 2 | **Question:** What is the fraction of females facing the camera? |
| | **Model response:** The fraction of females facing the camera is 0.6, which means that six out of ten females in the group are facing the camera. |
| | **Extracted answer:** |
| Example 3 | **Question:** How much money does Luca need to buy a sour apple candy and a butterscotch candy? (Unit: \$) |
| | **Model response:** Luca needs \$1.45 to buy a sour apple candy and a butterscotch candy. |
| | **Extracted answer:** |
| Example 4 | **Question:** Between which two years does the line graph saw its maximum peak? |
| | **Model response:** The line graph saw its maximum peak between 2007 and 2008. |
| | **Extracted answer:** |
| Example 5 | **Question:** What fraction of the shape is blue? |
| | **Choices:** (A) 3/11 (B) 8/11 (C) 6/11 (D) 3/5 |
| | **Model response:** The correct answer is (B) 8/11. |
| | **Extracted answer:** |

Task description along with five examples used to prompt GPT-4 for answer extraction.
## Prompts for Response Generation [sec:promt_response_generation]
| **Question type** | **Answer type** | **Task instruction** |
|:--:|:--:|:---|
| multiple-choice | Text | Please answer the question and provide the correct option letter, e.g., A, B, C, D, at the end. |
| Free-form | Integer | Please answer the question requiring an integer answer and provide the final value, e.g., 1, 2, 3, at the end. |
| Free-form | Float (1) | Please answer the question requiring a floating-point number with one decimal place and provide the final value, e.g., 1.2, 1.3, 1.4, at the end. |
| Free-form | Float (2) | Please answer the question requiring a floating-point number with two decimal places and provide the final value, e.g., 1.23, 1.34, 1.45, at the end. |
| Free-form | List | Please answer the question requiring a Python list as an answer and provide the final list, e.g., \[1, 2, 3\], \[1.2, 1.3, 1.4\], at the end. |

The task instructions for different question and answer types in answer extraction. Here, Float (1) refers to a floating-point number with one decimal place, and Float (2) refers to a floating-point number with two decimal places.
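A sketch of how these task instructions could be combined with the question and choices to assemble the text query sent alongside the image. The dictionary keys, the field names, and the overall formatting are assumptions for illustration; only the instruction strings are taken from the table above.

``` python
TASK_INSTRUCTIONS = {
    ("multi_choice", "text"): "Please answer the question and provide the correct option letter, e.g., A, B, C, D, at the end.",
    ("free_form", "integer"): "Please answer the question requiring an integer answer and provide the final value, e.g., 1, 2, 3, at the end.",
    ("free_form", "float_1"): "Please answer the question requiring a floating-point number with one decimal place and provide the final value, e.g., 1.2, 1.3, 1.4, at the end.",
    ("free_form", "float_2"): "Please answer the question requiring a floating-point number with two decimal places and provide the final value, e.g., 1.23, 1.34, 1.45, at the end.",
    ("free_form", "list"): "Please answer the question requiring a Python list as an answer and provide the final list, e.g., [1, 2, 3], [1.2, 1.3, 1.4], at the end.",
}

def build_query(problem):
    """Assemble the text query for one problem (hypothetical problem dict)."""
    parts = [TASK_INSTRUCTIONS[(problem["question_type"], problem["answer_type"])]]
    parts.append("Question: " + problem["question"])
    if problem.get("choices"):
        # Render choices as "(A) ...", "(B) ...", and so on.
        parts.append("Choices:\n" + "\n".join(
            f"({chr(65 + i)}) {c}" for i, c in enumerate(problem["choices"])))
    return "\n".join(parts)
```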
## Prompt for Caption Generation We instruct Multimodal Bard to generate a detailed description for an input image, aiming to augment current LLMs with visual understanding capabilities. The prompt is shown in Table [tab:prompt_bard_caption].
Describe the fine-grained content of the image or figure, including scenes, objects, relationships, and any text present.
## Model Hyperparameters The hyperparameters for the experiments in §[sec:experimental_setup] are set to their default values unless specified otherwise. Table 3 and Table 4 detail specific generation parameters for the various large language models (LLMs) and large multimodal models (LMMs) we evaluated, respectively.
| **Model** | **Generation Setup** |
|:----------|:------------------------------------------------------------|
| Claude-2 | model = `claude-2`, temperature = 0, max_tokens = 1024 |
| ChatGPT | model = `gpt-3.5-turbo`, temperature = 0, max_tokens = 1024 |
| GPT-4 | model = `gpt-4-0613`, temperature = 0, max_tokens = 1024 |

Generation parameters for the various LLMs.
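As an illustration of how the parameters in the LLM table above map onto an API call (using the GPT-4 row as an example), a minimal sketch with the `openai` Python package is shown below. This is a sketch of the setup rather than our exact evaluation harness.

``` python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def query_gpt4(prompt):
    """Query GPT-4 with the generation parameters listed above."""
    response = client.chat.completions.create(
        model="gpt-4-0613",
        temperature=0,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```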
| **Model** | **Generation Setup** |
|:-----------------|:------------------------------------------------|
| IDEFICS-9B-Instruct | max_new_tokens = 256, temperature = 1.0 |
| mPLUG-Owl-LLaMA-7B | do_sample = True, top-k = 5, max_length = 512 |
| miniGPT4-LLaMA-2-7B | num_beams = 1, temperature = 1.0, max_new_tokens = 300, max_length = 1000 |
| | max_gen_len = 256, temperature = 0.1, top_p = 0.75 |
| LLaVAR | do_sample = True, temperature = 0.2, max_new_tokens = 1024 |
| InstructBLIP-Vicuna-7B | do_sample = False, num_beams = 5, max_length = 256, min_length = 1, top_p = 0.9, repetition_penalty = 1.0, temperature = 1 |
| | do_sample = True, temperature = 0.2, max_new_tokens = 1024 |
| | Chatbot URL: , evaluation dates range from Sep 8, 2023 to Sep 10, 2023 |
| | Chatbot URL: , evaluation dates range from Oct 7, 2023 to Oct 15, 2023 |

Generation parameters for the various LMMs.
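For the open-source LMMs, the decoding parameters above correspond to standard Hugging Face `generate` arguments. A schematic call using the mPLUG-Owl-style settings is sketched below; model loading is omitted and the processor interface is an assumption, so treat it as illustrative rather than the exact inference code.

``` python
import torch

@torch.no_grad()
def generate_response(model, processor, image, prompt):
    """Schematic decoding call using mPLUG-Owl-style parameters from the table above."""
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
    output_ids = model.generate(
        **inputs,
        do_sample=True,   # sampling-based decoding
        top_k=5,
        max_length=512,
    )
    return processor.batch_decode(output_ids, skip_special_tokens=True)[0]
```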
## Human Performance [sec:human_performance] We conducted a study to evaluate human performance on the *testmini* subset of the , utilizing Amazon Mechanical Turk (AMT). Each question from the *testmini* subset was assigned to five annotators, all of whom have a history of completing more than 5,000 HIT tasks and boast an acceptance score higher than 0.99, to ensure the quality of the results. The study comprised five test questions and two qualification questions, which were to be answered within a 20-minute timeframe. The qualification questions consisted of elementary math word problems requiring basic arithmetic operations (e.g., addition and subtraction). Only annotators who successfully answered the qualification questions were deemed eligible for the study, and their responses were included in the final analysis. Additionally, annotators were requested to provide information regarding their highest level of educational attainment. We retained the results exclusively from annotators who had achieved a high school diploma or higher, as 30.9% of the problems in are of high-school level difficulty and 10.8% correspond to college-level curricula. ## Multimodal Bard Assessment Task [sec:human_study_bard] A screenshot of our AMT worker interface, utilized for the Multimodal Bard assessment task, is provided in Figure 1. The workers were compensated at a rate of \$18 per hour.
Screenshot of the Multimodal Bard assessment task interface.
# More Experimental Results ## Results on the Test Set Table [tab:mathvista_test_result] reports the accuracy scores of two heuristic baselines, two leading augmented LLMs (CoT GPT-4, PoT GPT-4), and one leading LMM (LLaVA-LLaMA-2-13B) on the *test* subset. The minor differences between scores on the *test* subset and the *testmini* subset, as shown in Table [tab:mathvista], suggest that *testmini* effectively mirrors the *test* subset, serving as a valuable evaluation subset for model development, especially for those who have limited computing resources.
## Scores for Math Reasoning Types [sec:scores_math_reasoning]

## Scores for Various Visual Contexts [sec:scores_visual_contexts]

Figure 1 illustrates the accuracy scores of leading baselines on across a diverse range of visual contexts. The other foundation models trail behind humans in visual perception and reasoning across most visual context categories. Multimodal Bard demonstrates performance comparable to humans on questions with a visual context of geometry diagrams, showcasing its promising capabilities in recognizing geometric shapes and relationships. On the other hand, PoT GPT-4, augmented with Bard captions, achieves a significant performance advantage over the other baselines, exhibiting strong abilities in discerning structural information in tables and generating symbolic code for precise statistical reasoning.
Accuracy scores of leading baselines across various visual contexts in .
## Scores Across Different Grade Levels [sec:grade_level_bar_chart]

Figure 2 displays the average accuracy scores across different grade levels (*elementary school*, *high school*, and *college*) for the leading foundation models. Humans exhibit the highest performance on questions at the elementary school level, while they fare the worst on college-level questions within . The foundation model baselines exhibit varying performance behaviors: they achieve better accuracy scores on high-school-level questions than on the other two categories.
Average accuracy scores across different grade levels for .
## Ablation Study for LLMs [sec:llm_ablation_study] Table 3 presents an ablation study conducted on LLMs, examining their performance under varying visual information inputs.
Average accuracy scores of LLM baselines under various visual inputs.
## LLMs with Different Shots [sec:llm_fewshot]
## LMMs with Different Shots [sec:lmm_fewshot]
## Hallucinations in Model Explanations [sec:hallucination_in_explanation]
image **Question:** What is the highest amount this class measures? **Response from Bard:** . The markings on the side of the beaker are in milliliters (ml). . The other markings on the beaker are 300 ml, 200 ml, 100 ml, and . . **Wrong output:** **Ground truth:**
image **Question:** As shown in the figure, if $CB = 4.0$, $DB = 7.0$, and $D$ is the midpoint of $AC$, then the length of $AC$ is (). **Choices:** (A) $6$cm (B) $7$cm (C) $4$cm (D) $5$cm **Response from Bard:** Since $D$ is the midpoint of $AC$, then $AD = DC = (AC/2)$. We know that $CB = 4.0$ and $DB = 7.0$, and that $AD = DC$. So we can write the equation: $$\red{AD + DC = 4.0 + 7.0}$$ $$2AD = 11.0$$ $$AD = 5.5$$ $$AC = 2AD = 11.0 \text{cm}$$ Therefore, the length of $AC$ is $11.0$ cm, which is answer choice (D). **Wrong output:** **Ground truth:**
## More Examples for Multimodal Bard [sec:more_examples_bard]
image **Question:** As shown in the figure: In Rt$\triangle$ABC, $\angle$C = 90.0, AC = 8.0, AB = 10.0, then the value of $\text{sin} B$ is equal to (). **Choices:** (A) $\frac{3}{5}$ (B) $\frac{4}{5}$ (C) $\frac{3}{4}$ (D) $\frac{4}{3}$ **Response from Bard:** In a right triangle, the sine of an angle is equal to the ratio of the opposite side to the hypotenuse. In this case, the opposite side is $AC = 8$ and the hypotenuse is $AB = 10$. Therefore, $\text{sin} B$ = $8/10$ = . Here is the detailed solution: Given: \* $\triangle ABC$ is a right triangle with right angle at $C$ \* $AC = 8$ \* $AB = 10$ To find: $\text{sin} B$ $$\text{sin} B = \text{opposite}/\text{hypotenuse} = AC/AB = 8/10 = \red{3/5}$$ **Wrong output:** **Ground truth:**
image **Question:** What is the age gap between these two people in image? (unit: years) **Response from Bard:** **Wrong output:** **Ground truth:**
image **Question:** What is the accuracy of the algorithm with lowest accuracy? **Response from Bard:** The other algorithms are *desert*, *pine*, and *globe*. **Wrong output:** **Ground truth:**
image **Question:** What is the age gap between these two people in image? (unit: years) **Response from Bard:** The dissatisfied people are 78%, the satisfied people (20%). **Wrong output:** **Ground truth:**
image **Question:** Is the number of cars in front of the tiny metal thing less than the number of large matte things in front of the cyan rubber road bike? **Choice:** (A) Yes (B) No **Response from Bard:** The answer is (B), No. There are no cars in front of the tiny metal thing (the motorcycle). There are two large matte things in front of the cyan rubber road bike (the van and the bicycle). So the number of large matte things in front of the cyan rubber road bike is greater than the number of cars in front of the tiny metal thing. **Wrong output:** **Ground truth:**
image **Question:** What is the missing number in the picture? **Choice:** (A) 6 (B) 8 (C) 10 (D) 11 **Response from Bard:** . 2 10 (8) /12 **Correct output:**
image **Question:** The magnitude of the acceleration vector a is $10 \mathrm{~cm} / \mathrm{s}^2$. Use the figure to estimate the normal components of $\mathbf{a}$. **Response from Bard:** The normal component of the acceleration is given by $$a_n = a \cos \theta = 10 \text{ cm}/s^2 \cdot \cos 30^\circ = \dfrac{10 \sqrt{3}}{2} \text{ cm}/s^2$$ The normal component of the acceleration is approximately $7.07 \text{ cm}/s^2$. Here is a breakdown of the steps involved in finding the normal component of the acceleration: 1. Find the angle between the acceleration vector and the radius vector. 2. Use the cosine function to find the cosine of the angle. 3. Multiply the magnitude of the acceleration vector by the cosine of the angle to find the normal component of the acceleration. **Wrong output:** **Ground truth:**
image **Question:** The derivative of $y$ at $x=6$ is \_\_\_\_ that at $x=8$ **Choices:** (A) larger than (B) equal to (C) smaller than **Response from Bard:** The graph of the function $y=\log_2 x$ is a concave up curve. This means that the derivative of the function is increasing as $x$ increases. So, . Here is a table of the values of the derivative of $y$ at different values of $x$: $x \mid dy/dx$ –$\mid$– As you can see, . . **Wrong output:** **Ground truth:**
image **Question:** The derivative of $f(x)$ at $x=2$ is \_\_\_\_ that at $x=5$ **Choices:** (A) larger than (B) equal to (C) smaller than **Response from Bard:** The derivative of a function at a point gives the slope of the line tangent to the function’s graph at that point. The steeper the line, the larger the derivative. In mathematical terms, the derivative of $f(x)$ is given by the expression . Plugging in $x=2$ and $x=5$, . Therefore, . **Wrong output:** **Ground truth:**
## Comparisons of Different Models [sec:model_comparison]
image **Question:** Which number is missing? **Ground truth:** **Solution:** $4 + 1 = 5, 5 + 2 = 7 ~(2 = 1 + 1)$ $10 + 4 = 14, 14 + 5 = 19 ~(5 = 4 + 1)$ $25 + 7 = 32, 32 + 8 = 40 ~(8 = 7 + 1)$
**IDEFICS:**
**mPLUG-Owl:**
**miniGPT4:**
**LLaMA-Adapter-V2:**
**LLaVAR:**
**InstructBLIP:**
**LLaVA:**
**Multimodal Bard:** The sequence progresses by adding consecutive odd numbers to each number. The first number is 4, and we add 1 to get 5. , and so on. The missing number is therefore the next number after 32, . Here is the complete sequence: $$4 + 1 = 5$$ $$5 + \red{3 = 7}$$ $$7 + \red{5 = 10}$$ $$10 + \red{7 = 14}$$ $$14 + \red{9 = 19}$$ $$19 + \red{11 = 25}$$ $$25 + \red{13 = 32}$$ $$32 + \red{15 = 28}$$
**GPT-4V:** 4, 5, 7, 10, 14, 19, 25, 32, ?, 52 Answer: The missing number is 40. This sequence follows a pattern where the difference between each number increases by one. For example, the difference between 4 and 5 is 1, the difference between 5 and 7 is 2, the difference between 7 and 10 is 3, and so on. . **Correct output:**
image **Question:** Which number is missing? **Ground truth:** **Solution:** $4 + 1 = 5, 5 + 2 = 7 ~(2 = 1 + 1)$ $10 + 4 = 14, 14 + 5 = 19 ~(5 = 4 + 1)$ $25 + 7 = 32, 32 + 8 = 40 ~(8 = 7 + 1)$
**Augmented Caption from Bard:** The image shows a set of 8 numbers in circles on a white background. The numbers are 4, 5, 7, 10, 14, 19, 25, and 32. There is a question mark in the last circle. The numbers are arranged in increasing order, . Therefore, . The image is a simple math problem, and it is likely that it is being used as part of an IQ test or other kind of intelligence test. The image is also consistent with the webpages that you mentioned, which are all related to IQ tests. **Augmented OCR Text from EasyOCR:** (\[300, 80\], 5), (\[83, 281\], 10), (\[285, 281\], 14), (\[491, 279\], 19), (\[81, 485\], 25), (\[285, 485\], 32),
**CoT GPT-4:** The pattern in the numbers is that . For example, the second number (5) is 1 more than the first number (4) . The third number (7) is . Following this pattern, . $$32 + \red{9 + 1 = 42}$$ So, .
**PoT GPT-4:** ``` python numbers = [4, 5, 7, 10, 14, 19, 25, 32] diffs = [numbers[i+1] - numbers[i] for i in range(len(numbers)-1)] # [1, 2, 3, 4, 5, 6, 7] missing_number = numbers[-1] + diffs[-1] print(missing_number) ``` **Output:**
image **Question:** What fraction of the shape is blue? **Choices:** (A) 3/11 (B) 8/11 (C) 6/11 (D) 3/5 **Ground truth:**
**IDEFICS:**
**mPLUG-Owl:**
**miniGPT4:**
**LLaMA-Adapter-V2:**
**LLaVAR:**
**InstructBLIP:**
**LLaVA:**
**Multimodal Bard:** The shape is divided into 11 equal parts. . Therefore, . The other choices are incorrect. Option (A), . Option (B), 8/11, is the fraction of the shape that is not blue. Option (D), 3/5, is not a fraction of the shape.
**GPT-4V:**
image **Question:** What fraction of the shape is blue? **Choices:** (A) 3/11 (B) 8/11 (C) 6/11 (D) 3/5 **Ground truth:**
**Augmented Caption from Bard:** The image shows a blue circle with a white stripe in the middle. . The colors are blue, . There is no text present in the image. The image could be used to represent a variety of concepts, such as fractions, colors, or the four elements. It could also be used as a decorative element or a symbol.
**CoT GPT-4:**
**PoT GPT-4:** ``` python def fraction_of_blue(description, choices): parts = { 'blue': 1, 'green': 1, 'yellow': 1, 'red': 1 } total_parts = sum(parts.values()) blue_fraction = parts['blue'] / total_parts for choice in choices: fraction = choice.split('/) if int(fraction[0]) / int(fraction[1]) == blue_fraction: return choice return "" choices = ['3/11', '8/11', '6/11', '3/5'] description = "The image shows a blue circle with a white stripe in the middle. The circle is divided into four parts, each with a different color. The colors are blue, green, yellow, and red. There is no text present in the image." print(fraction_of_blue(description, choices)) ``` **Output:**
image **Question:** If you add the two visible numbers, on the jerseys, what is the total sum? **Ground truth:** **Solution:** 2 + 1 = 3
**IDEFICS:**
**mPLUG-Owl:**
**miniGPT4:**
**LLaMA-Adapter-V2:**
**LLaVAR:**
**InstructBLIP:**
**LLaVA:** The two visible numbers on the jerseys are 2 and 1. . **Correct output:**
**Multimodal Bard:**
**GPT-4V:**
**Augmented OCR Text from EasyOCR:** (\[110, 154\], 21) **CoT GPT-4:** The image only shows one visible number on the jerseys, which is 21.
**Augmented OCR Text from EasyOCR:** (\[110, 154\], 21) **PoT GPT-4:** ``` python visible_numbers = [21] total_sum = sum(visible_numbers) print(total_sum) ``` **Execution output:**
image **Question:** How much money does Luca need to buy a sour apple candy and a butterscotch candy? (Unit: \$) **Ground truth:** **Solution:** Add the price of a sour apple candy and the price of a butterscotch candy: \$0.06 + \$0.07 = \$0.13. Luca needs \$0.13.
**IDEFICS:** .
**mPLUG-Owl:**
**miniGPT4:** The answer to the question is: Explanation: \* The price of the sour apple candy is . \* The price of the butterscotch candy is . \* Therefore, to buy both candies.
**LLaMA-Adapter-V2:** .
**LLaVAR:** .
**InstructBLIP:**
**LLaVA:** .
**Multimodal Bard:** The price of a sour apple candy is \$0.06 and the price of a butterscotch candy is \$0.07. So, . In two decimal places, . **Correct output:**
**GPT-4V:** **Correct output:**
image **Question:** How much money does Luca need to buy a sour apple candy and a butterscotch candy? (Unit: \$) **Ground truth:** **Solution:** Add the price of a sour apple candy and the price of a butterscotch candy: \$0.06 + \$0.07 = \$0.13. Luca needs \$0.13.
**Augmented Caption from Bard:** The image shows a table with the prices of different types of candy. The table is on a pink background. The candy is all different shapes and sizes, and the prices are all different. The candy is all different colors, and the prices are all different. The following text is present in the image: \* Sour apple candy \$0.06 \* Piece of gum \$0.07 \* Gummy worm \$0.09 \* Lemon drop \$0.05 \* Piece of licorice \$0.07 \* Butterscotch candy \$0.07 **Augmented OCR Text from EasyOCR:** (\[9, 7\], sour apple candy),  (\[195, 5\], \$0.06), (\[7, 39\], piece of gum),  (\[195, 37\], \$0.07), (\[9, 73\], gummy worm),  (\[195, 69\], \$0.09), (\[7, 105\], lemon drop),  (\[195, 101\], \$0.05), (\[6, 132\], piece of licorice),  (\[195, 133\], \$0.07), (\[9, 167\], butterscotch candy),  (\[195, 165\], \$0.07)
**CoT GPT-4:** Find the total cost of a sour apple candy and a butterscotch candy. \$0.06 + \$0.07 = \$0.13 Luca needs \$0.13. **Correct output:**
**PoT GPT-4:** ``` python sour_apple_candy_price = 0.06 butterscotch_candy_price = 0.07 total_price = sour_apple_candy_price + butterscotch_candy_price print(total_price) ``` **Execution output:** **Correct output:**
# A Comparative Study of GPT-4V, Bard, and Other Models [app:gpt4v_study] GPT-4 with vision (GPT-4V) is the multimodal version of GPT-4 that is instructed to understand multiple modalities such as texts and images. Due to its remarkable improvements over other AI models (§[sec:results] and §[sec:fine_grained_results]), we have conducted a comprehensive evaluation to understand its capabilities, strengths, and areas for improvement. Our findings not only validate GPT-4V’s various problem-solving skills but also shed light on developing general-purpose multimodal AI agents. ## GPT-4V Playground for Manual Evaluation Given that GPT-4V does not offer API access, we have performed manual evaluations using the playground platform[^1]. For a fair comparison, we used the same input queries as those for all the other LMMs and recorded the responses in a single round of chat without additional feedback (Figure 1).
The GPT-4V playground for our manual evaluation.
## Leaderboard Scores [sec:gpt4v_leaderboard] The leaderboard in Figure 2 highlights GPT-4V’s substantial advancements over the current LLM and LMM baselines. Notably, there is a 15.1% improvement over the second-best performing Multimodal Bard model. However, a significant gap of 10.4% still exists between GPT-4V and human performance, indicating plenty of room for further improvement by developing new LMMs and tool-augmented LLMs.
[^1]:

## Abilities in Mathematical Reasoning [sec:gpt4v_math]

This section compares the mathematical reasoning ability of GPT-4V with that of other LMMs on , including LLaMA-Adapter-V2-7B (LLaMA-Adapter-V2 for simplification), LLaVA-LLaMA-2-13B (LLaVA for simplification), and Multimodal Bard.

### Algebraic Reasoning

Algebraic reasoning problems on require understanding function plots from figures and inferring their properties. As shown in Figure [fig:tease_scores], GPT-4V demonstrates outstanding capabilities in algebraic reasoning, surpassing all competing models and even humans. For instance, GPT-4V accurately identifies the function plot by its equation and subsequently infers its correct properties (Figure [fig:example_926]). However, both GPT-4V and the other LMMs face challenges in comprehending low-resolution figures (Figure [fig:example_346]) and those that depict multiple functions (Figure [fig:example_963]).
image **ID:** 926 **Question:** What is the size of the shaded area under the curve? Round the answer to 2 decimal places. **Ground truth:**
**LLaMA-Adapter-V2:** .
**LLaVA:**
**Multimodal Bard:** The given curve is a circle with center at $(3, 0)$ and radius 3. The shaded area is a quarter of the circle’s area. $$\begin{aligned} \text{Area of the circle} &= \pi r^2 \\ &= 3.14 \times 3^2 \\ &= 28.26.\\ \text{Area of the shaded region} &= \frac{28.26}{4} \\ &= 7.065. \end{aligned}$$ Rounding to two decimal places, the area is $7.07$. **Correct output:**
**GPT-4V:** To find the size of the shaded area under the curve $f(x) = \sqrt{9 - (x - 3)^2}$, we need to integrate the function over the given interval. The given function represents the upper half of a circle with a radius of 3 and centered at the point (3,0). The shaded region corresponds to the area of a quarter circle with a radius of 3. The formula for the area of a circle is $\pi r^2$. For a quarter of the circle, the area is $\frac{1}{4} \pi r^2$. Given: $r = 3$ $$\begin{aligned} \text{Area} &= \frac{1}{4} \pi (3^2) \\ \text{Area} &= \frac{1}{4} \pi (9) \\ \text{Area} &= \frac{9}{4} \pi \\ \text{Area} &= 7.07 \text{~(rounded to 2 decimal places)} \end{aligned}$$ So, the size of the shaded area under the curve is approximately $7.07$ square units. **Correct output:**
image **ID:** 346 **Question:** What is the $y$-intercept of this function? **Ground truth:**
**LLaMA-Adapter-V2:** .
**LLaVA:** .
**Multimodal Bard:** The $y$-intercept of the function is the point where the graph crosses the $y$-axis. Since the $y$-axis is also the line $x=0$, the $x$-coordinate of the $y$-intercept will always be 0. In the image, . Therefore, . Here are the steps on how to find the $y$-intercept of a function: 1. Set the $x$-coordinate to 0. 2. Solve for the $y$-coordinate. In this case, we set $x=0$ and get $y=3$. Therefore, .
**GPT-4V:** The $y$-intercept of a function is the point where the graph of the function crosses the $y$-axis. From the provided graph, . Thus, .
image **ID:** 963 **Question:** For what $x$ does $f$ reach its local maximum? **Ground truth:**
**LLaMA-Adapter-V2:** .
**LLaVA:** .
**Multimodal Bard:** The local maximum of the function is the point where the graph of the function changes from increasing to decreasing. In the given graph, this point is at $x = 3$. So the answer is **3**. Here is a more detailed explanation: - A local maximum point of a function is a point on the graph of the function where the function value is higher than all the function values in a small neighborhood of the point. - . However, for values of $x$ greater than 3, the function value starts to decrease. . - Therefore, the local maximum point of the function is at $x = 3$. **Correct output:**
**GPT-4V:** The function $f(x)$ reaches its local maximum where its first derivative, $f'(x)$, changes from positive to negative. Looking at the graph, . So, .
### Arithmetic Reasoning

Arithmetic reasoning problems in MathVista require accurate fundamental operations in conjunction with understanding diverse visual contexts. As illustrated in Figure [fig:tease_scores], GPT-4V exhibits a significant improvement in arithmetic reasoning compared to existing models. For instance, some baseline models struggle with basic arithmetic tasks, such as determining the difference between two values in a bar chart (Figure [fig:example_74]) or computing a probability from simple statistical data (Figure [fig:example_718]).
image **ID:** 74 **Question:** What is the difference of largest and smallest bar? **Ground truth:**
**LLaMA-Adapter-V2:** .
**LLaVA:** .
**Multimodal Bard:** The largest bar in the image is Iceland with 100%, and the smallest bar is Cambodia with 52.4%. So the difference between the largest and smallest bar is 100 - 52.4 = 47.6%. However, the question asks for the answer in one decimal place. . Therefore, .
**GPT-4V:** From the graph: Largest bar (Iceland) = 100% Smallest bar (Cambodia) = 52.4% Difference = 100% - 52.4% = 47.6% **Correct output:**
image **ID:** 718 **Question:** Jeffrey is the proud owner of an eclectic bow tie collection. He keeps track of how many bow ties he has, and organizes them by pattern and material. What is the probability that a randomly selected bow tie is designed with swirls and is made of velvet? Simplify any fractions. **Ground truth:**
**LLaMA-Adapter-V2:** .
**LLaVA:** To determine the probability of a randomly selected bow tie being designed with swirls and made of velvet, we need to divide the total number of bow ties with the specific characteristics (swirls and velvet) by the total number of bow ties in Jeffrey’s collection. In the image, . So, the probability of a randomly selected bow tie being designed with swirls and made of velvet is:
**Multimodal Bard:** The probability that a randomly selected bow tie is designed with swirls and is made of velvet is: $$(\text{Number of swirls and velvet bow ties}) / (\text{Total number of bow ties})$$ . We can find the total number of bow ties by adding the values in all three columns. This gives us $5 + 3 + 1 + 5 = 14$ bow ties. Therefore, the probability that a randomly selected bow tie is designed with swirls and is made of velvet is: $$\red{(8) / (14) = 0.57 = 57.14\%}$$ To two decimal places, .
**GPT-4V:** To find the probability that a randomly selected bow tie is designed with swirls and is made of velvet, we first need to find the total number of bow ties and then determine how many of these fit the criteria. Total number of bow ties: $$\text{Zigzags (Velvet)} = 5$$ $$\text{Zigzags (Cotton)} = 1$$ $$\text{Swirls (Velvet)} = 3$$ $$\text{Swirls (Cotton)} = 5$$ $$\text{Total} = 5 + 1 + 3 + 5 = 14$$ Number of bow ties that are designed with swirls and are made of velvet = 3 Probability = Number of desired outcomes / Total number of possible outcomes Probability = $3/14$ Now, converting the fraction to a decimal (rounded to two decimal places): $3 ÷ 14 = 0.21$ **Correct output:**
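The probability GPT-4V reports can be re-derived directly from the counts it read off the table (taking those counts as given):

``` python
from fractions import Fraction

# Counts as read from the table by GPT-4V (assumed correct here).
counts = {
    ("zigzags", "velvet"): 5,
    ("zigzags", "cotton"): 1,
    ("swirls", "velvet"): 3,
    ("swirls", "cotton"): 5,
}

total = sum(counts.values())                       # 14
p = Fraction(counts[("swirls", "velvet")], total)  # 3/14

print(p, round(float(p), 2))  # 3/14 0.21
```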
### Geometry Reasoning

In geometry reasoning, the performance of GPT-4V is comparable to that of humans on MathVista, as demonstrated in Figure [fig:tease_scores]. Figure [fig:example_606] and Figure [fig:example_640], respectively, present two geometry reasoning problems: one at an elementary level and the other at a college level. For both problems, GPT-4V produces the correct answers accompanied by detailed explanations.
image **ID:** 606 **Question:** What is the perimeter of the rectangle? **Ground truth:**
**LLaMA-Adapter-V2:** The perimeter of the rectangle is . **Correct output:**
**LLaVA:** The perimeter of the rectangle is 10 cm. **Correct output:**
**Multimodal Bard:** I see the image. The rectangle has a width of 3 cm and a length of 2 cm. .
**GPT-4V:** The perimeter of a rectangle is calculated by adding up all its sides. Using the provided dimensions: Perimeter = 2 × (length + width) Perimeter = 2 × (3 cm + 2 cm) Perimeter = 2 × 5 cm Perimeter = 10 cm The perimeter of the rectangle is 10 cm. **Correct output:**
image **ID:** 640 **Question:** Find $\widehat{\mathrm{WN}}$ if $\triangle \mathrm{IWN}$ is equilateral and $W N=5$ **Choices:** (A) $\frac { 3 } { 5 } \pi$ (B) $\frac { 5 } { 3 } \pi$ (C) $5 \pi$ (D) $10 \pi$ **Ground truth:**
**LLaMA-Adapter-V2:** .
**LLaVA:** Given that $\triangle \text{IWN}$ is equilateral, we can use the cosine rule to find the length of the side opposite . where $a$ and $b$ are the lengths of the adjacent sides of the equilateral triangle. $$a = 5$$ $$b = 5$$ Substituting the values, we get: $$\red{\cos \angle \text{WN} = \frac{5}{5} = 1}$$ So, the length of the side opposite . Therefore,
**Multimodal Bard:** The correct answer is (B). Since $\triangle IWN$ is equilateral, $\widehat{WIN} = \widehat{WNI} = 60^\circ$. Therefore, $\red{\widehat{WN} = \pi - 2 \cdot 60^\circ} = \boxed{\frac{5}{3}\pi}$. **Correct output:**
**GPT-4V:** For an equilateral triangle, each interior angle is $60^{\circ}$. The arc corresponding to a central angle in a circle is given by: $\text{Arc length} = \text{central angle in radians} \times \text{radius}$ Given that $WI = IN = NW = 5$ (since $\triangle IWN$ is equilateral), the radius of the circle is 5. Using the formula above: $\widehat{WN} = 60^{\circ} \times \frac{\pi}{180^{\circ}} \times 5 = \frac{\pi}{3} \times 5 = \frac{5\pi}{3}$ The correct option is: (B) $\frac{5}{3} \pi$. **Correct output:**
### Logical Reasoning

Logical reasoning problems represent a different type of question in MathVista. Solving these problems requires abstract thinking to deduce the underlying patterns of numbers or shapes from figures. Current foundation models struggle to effectively tackle logical reasoning problems: GPT-4V achieves only 21.6% accuracy in logical reasoning, a modest improvement of 8.1% over random chance, as shown in Table [tab:mathvista]. The challenges that logical reasoning problems present to current LMMs are further highlighted in Figures [fig:example_98], [fig:example_778], and [fig:example_750].
image **ID:** 98 **Question:** Find the value of the square in the figure. **Ground truth:**
**LLaMA-Adapter-V2:** .
**LLaVA:** .
**Multimodal Bard:** . Here’s how to solve it: 1. $5 + 8 = 13$ 2. $13 + 7 = 20$ 3. $20 / 2 = 10$
**GPT-4V:** From the given figure: $$\begin{aligned} \text{Circle} + \text{Square} &= 5 \\ \implies \text{Circle} &= 5 - \text{Square} \quad \dots (i) \end{aligned}$$ $$\begin{aligned} \text{Triangle} + \text{Triangle} &= 8 \\ \implies \text{Triangle} &= 4 \quad \dots (ii) \end{aligned}$$ $$\begin{aligned} \text{Circle} + \text{Triangle} &= 7 \\ \text{Using (ii),} \quad \text{Circle} + 4 &= 7 \\ \implies \text{Circle} &= 3 \quad \dots (iii) \end{aligned}$$ Using (i) and (iii), $$\begin{aligned} 5 - \text{Square} &= 3 \\ \implies \text{Square} &= 2. \end{aligned}$$ **Correct output:**
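GPT-4V's substitution steps amount to solving a small linear system; the sketch below writes the same reasoning explicitly, with the unknowns ordered as (circle, square, triangle).

``` python
import numpy as np

# Equations read from the figure:
#   circle + square   = 5
#   2 * triangle      = 8
#   circle + triangle = 7
A = np.array([[1.0, 1.0, 0.0],
              [0.0, 0.0, 2.0],
              [1.0, 0.0, 1.0]])
b = np.array([5.0, 8.0, 7.0])

circle, square, triangle = np.linalg.solve(A, b)
print(circle, square, triangle)  # 3.0 2.0 4.0 -> the square is 2
```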
image **ID:** 778 **Question:** Can you find the missing shape in this picture puzzle? **Choices:** (A) A (B) B (C) C (D) D **Ground truth:**
**LLaMA-Adapter-V2:** .
**LLaVA:** .
**Multimodal Bard:** The missing shape in the picture puzzle is **(D)**. The shapes in the puzzle are all regular polygons, and . , the second shape is a hexagon, , and . . The answer choice A is a pentagon, . . . . Therefore, .
**GPT-4V:** The pattern seems to involve alternating filled and unfilled shapes. Following the pattern: - The first shape is a filled . - The second shape is an unfilled hexagon. Thus, the missing shape should be a filled . .
image **ID:** 750 **Question:** Which number is missing? **Ground truth:** **Solution:** $5 \times 2 = 10$, $6 \times 2 = 12$, $9 \times 2 = 18$
**LLaMA-Adapter-V2:** .
**LLaVA:** .
**Multimodal Bard:** The missing number in the puzzle is **7**. The numbers in the puzzle are arranged in a circle, and . For example, .
**GPT-4V:** From the given image, : $$\red{12 - 10 = 2}$$ $$\red{9 - 5 = 4}$$ $$\red{6 - ? = 2}$$ . .
### Numeric Commonsense Reasoning

Problems involving numeric commonsense reasoning on MathVista require commonsense knowledge about daily objects and celebrities to answer visual questions. However, these problems present significant challenges to existing foundation models, including GPT-4V, as depicted in Figure [fig:tease_scores]. For instance, Multimodal Bard struggles to understand the optical illusion in an image (Figure [fig:example_693]) and to infer the age gap between two celebrities from another image (Figure [fig:example_29]). Figure [fig:example_19] poses a question about the maximum volume a beaker can measure. However, GPT-4V lacks commonsense knowledge regarding the use of a beaker, resulting in an incorrect prediction.
image **ID:** 693 **Question:** Is the orange larger than the car? **Choices:** (A) Yes (B) No **Ground truth:**
**LLaMA-Adapter-V2:** No, the orange is not larger than the car. **Correct output:**
**LLaVA:** .
**Multimodal Bard:** The answer is (B), No. The orange is not larger than the car. The car is clearly much larger than the orange. Even if it is a large orange, it is still much smaller than the car. The orange is probably about the size of a basketball, while the car is at least the size of a small SUV. . **Correct output:**
**GPT-4V:** The orange is not larger than the car; . **Correct output:**
image **ID:** 29 **Question:** What is the age gap between these two people in image? (unit: years) **Ground truth:**
**LLaMA-Adapter-V2:** .
**LLaVA:** .
**Multimodal Bard:** Based on the image you sent, . The king is Richard III of England, who was born in 1452. The queen is Anne Neville, who was born in 1456. So, . The stained glass window is likely depicting the couple after their marriage, so .
**GPT-4V:** The two figures depicted in the stained glass are King Richard III and Anne Neville. Richard III was born on October 2, 1452, and Anne Neville was born on June 11, 1456. Calculating the age gap: $$1456 - 1452 = 4$$ So, the age gap between Richard III and Anne Neville is 4 years. **Correct output:**
image **ID:** 19 **Question:** What is the highest amount this glass measures? **Ground truth:**
**LLaMA-Adapter-V2:** The highest amount this glass measures is 400 ml. **Correct output:**
**LLaVA:** .
**Multimodal Bard:** . The markings on the side of the beaker are in milliliters (ml). . . The negative markings indicate the amount of liquid that is below the zero mark.
**GPT-4V:** .
### Scientific Reasoning

Scientific reasoning represents a distinct mathematical reasoning ability within MathVista. To tackle problems in this area, a model must not only accurately interpret domain-specific information from figures, but also possess the necessary in-domain knowledge to reason rigorously on scientific topics. Figure [fig:tease_scores] shows that GPT-4V substantially outperforms the other foundation models. This superiority is further illustrated by the examples in Figures [fig:example_807] and [fig:example_278]. However, the failure of GPT-4V, as shown in Figure [fig:example_725], indicates that there is considerable room for improvement.
image **ID:** 807 **Question:** Which of the following leaf shapes would have the least amount of wind resistance and water loss? **Choices:** (A) Truncate (B) Acuminate (C) Rounded (D) Sagittate **Ground truth:**
**LLaMA-Adapter-V2:** .
**LLaVA:** .
**Multimodal Bard:** . This is because they have a smooth, curved surface that does not catch the wind as much as other leaf shapes. Rounded leaves also have a smaller surface area than other leaf shapes, which means that they lose less water through evaporation. . Truncate leaves have a flat end, which can catch the wind. . Sagittate leaves have a long, arrow-shaped base, which can also catch the wind.
**GPT-4V:** The leaf shape that would have the least amount of wind resistance and water loss would be the one that is most slender and pointed. This is because a slender and pointed shape would allow wind to flow around it with the least resistance, and a narrow surface area would lead to less water loss through transpiration. Among the options, the “acuminate” leaf shape is the most slender and pointed. **Correct output:**
image **ID:** 278 **Question:** Among the following objects, which one has the best PSNR score? **Choices:** (A) Lego (B) Mats (C) Mic (D) Ship **Ground truth:**
**LLaMA-Adapter-V2:** .
**LLaVA:** .
**Multimodal Bard:** The answer is **(D) Ship**. . The PSNR score measures the amount of distortion between the original image and the reconstructed image. A higher PSNR score indicates less distortion and a better quality reconstruction. The other objects have the following PSNR scores: \* Lego: 29.0 dB \* Mats: 27.0 dB \* Mic: 30.7 dB .
**GPT-4V:** Based on the table for PSNR (dB) scores: Lego: DMTet: 29.0 FLeXiCubes: 28.8 Mats: DMTet: 27.0 FLeXiCubes: 26.7 Mic: DMTet: 30.7 FLeXiCubes: 30.8 Ship: DMTet: 26.0 FLeXiCubes: 25.9 Among the choices, “Mic” has the highest PSNR scores with both DMTet and FLeXiCubes, so the answer is (C) Mic. **Correct output:**
image **ID:** 725 **Question:** In Figure, suppose that Barbara’s velocity relative to Alex is a constant $v_{B A}=52 \mathrm{~km} / \mathrm{h}$ and car $P$ is moving in the negative direction of the $x$ axis. If Alex measures a constant $v_{P A}=-78 \mathrm{~km} / \mathrm{h}$ for car $P$, what velocity $v_{P B}$ will Barbara measure? **Ground truth:**
**LLaMA-Adapter-V2:** .
**LLaVA:** .
**Multimodal Bard:** The velocity of car $P$ relative to Barbara is given by: $$v_{PB} = v_{PA} - v_{BA} = -78 \mathrm{~km} / \mathrm{h} - 52 \mathrm{~km} / \mathrm{h} = -130 \mathrm{~km} / \mathrm{h}$$ Therefore, Barbara will measure a velocity of **-130** $\mathrm{~km} / \mathrm{h}$ for car $P$. Here is the explanation: \* The velocity of car $P$ relative to Alex, $v_{PA}$, is measured by Alex, who is stationary in frame $A$. \* The velocity of Barbara relative to Alex, $v_{BA}$, is measured by Alex, who is also stationary in frame $A$. \* The velocity of car $P$ relative to Barbara, $v_{PB}$, is measured by Barbara, who is moving in frame $B$ with a velocity of $v_{BA}$. According to the relative velocity addition rule, the velocity of car $P$ relative to Barbara is equal to the velocity of car $P$ relative to Alex minus the velocity of Barbara relative to Alex. In this case, the velocity of car $P$ relative to Alex is -78 $\mathrm{~km} / \mathrm{h}$ and the velocity of Barbara relative to Alex is 52 $\mathrm{~km} / \mathrm{h}$. Therefore, the velocity of car $P$ relative to Barbara is -130 $\mathrm{~km} / \mathrm{h}$. **Correct output:**
**GPT-4V:** Given: $v_{BA}$ = 52 $\mathrm{~km} / \mathrm{h}$ (velocity of Barbara relative to Alex) $v_{PA}$ = -78 $\mathrm{~km} / \mathrm{h}$ (velocity of car $P$ as measured by Alex) We need to find $v_{PB}$ (velocity of car $P$ as measured by Barbara). Using the Galilean transformation for velocities in one dimension: $$\begin{aligned} v_{PB} &= \red{v_{PA} + v_{BA}} \\ v_{PB} &= \red{-78 \text{ $\mathrm{~km} / \mathrm{h}$} + 52 \text{ $\mathrm{~km} / \mathrm{h}$}} \\ v_{PB} &= \red{-26 \text{ $\mathrm{~km} / \mathrm{h}$}} \end{aligned}$$
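The sign error is easy to see when the one-dimensional Galilean transformation is written out: the velocity of P in Barbara's frame is $v_{PB} = v_{PA} - v_{BA}$, which matches the Multimodal Bard answer marked correct above.

``` python
# One-dimensional Galilean velocity transformation: v_PB = v_PA - v_BA.
v_BA = 52.0   # km/h, Barbara relative to Alex
v_PA = -78.0  # km/h, car P as measured by Alex

v_PB = v_PA - v_BA
print(v_PB)  # -130.0 km/h (adding instead of subtracting gives GPT-4V's -26 km/h)
```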
### Statistical Reasoning

In MathVista, problems encompass a variety of charts, plots, and graphs designed to assess the statistical reasoning capabilities of foundation models. As demonstrated in Figure [fig:tease_scores], GPT-4V shows strong statistical reasoning ability. For instance, GPT-4V produces accurate answers for the format-rich table in Figure [fig:example_649] and the data analysis table in Figure [fig:example_389].
image **ID:** 649 **Question:** How many values are below 30 in Mainly are incidents of individual misconduct? **Ground truth:**
**LLaMA-Adapter-V2:** .
**LLaVA:** .
**Multimodal Bard:** . , which is the percentage of women who believe that recent allegations of sexual harassment and assault are mainly incidents of individual misconduct. .
**GPT-4V:** Based on the “Mainly are incidents of individual misconduct” column, the following values are below 30: - 28 (Total) - 22 (Women) - 19 (College grad+) - 25 (Dem/Lean Dem) That’s a total of 4 values. So, the answer is 4. **Correct output:**
image **ID:** 389 **Question:** Look at the table. Then answer the question. At a price of \$320, is there a shortage or a surplus? **Choices:** (A) shortage (B) surplus **Ground truth:**
**LLaMA-Adapter-V2:** .
**LLaVA:** .
**Multimodal Bard:** . At a price of \$320, . Here is a table summarizing the supply and demand at each price:
|       |        |     |     |
|:-----:|:------:|:---:|:---:|
| \$270 |        |     |     |
| \$320 |        |     |     |
| \$370 |        |     |     |
| \$420 | 19,600 |     |     |
| \$470 |        |     |     |
**GPT-4V:** At a price of \$320: Quantity demanded = 20,000 Quantity supplied = 12,500 Since the quantity demanded (20,000) is greater than the quantity supplied (12,500), there is a shortage. **Correct output:**
## Abilities Across Visual Contexts [sec:gpt4v_visual]

This section compares the reasoning abilities of GPT-4V with other large multimodal models (LMMs) on MathVista, considering various types of visual contexts. Models used for comparison include LLaMA-Adapter-V2-7B (simplified as LLaMA-Adapter-V2), LLaVA-LLaMA-2-13B (simplified as LLaVA), and Multimodal Bard.

### Abstract Scene

Based on Figure [fig:tease_scores], current foundation models lag behind human performance in mathematical reasoning in abstract scenes by a substantial margin. Consider the problems in Figures [fig:example_742] and [fig:example_7] that are derived from math word problems found in elementary school curricula. Despite their advanced capabilities, foundation models such as Multimodal Bard and GPT-4V fail to produce the correct responses.
image **ID:** 742 **Question:** Move the ruler to measure the length of the twig to the nearest inch. The twig is about ( ) inches long. **Ground truth:**
**LLaMA-Adapter-V2:** .
**LLaVA:** .
**Multimodal Bard:** According to the image, . . Therefore, .
**GPT-4V:** .
image **ID:** 7 **Question:** What time is shown? Answer by typing a time word, not a number. It is ( ) after eight. **Choices:** (A) half (B) quarter (C) o’clock (D) quarter to (E) quarter past **Ground truth:**
**LLaMA-Adapter-V2:**
**LLaVA:** .
**Multimodal Bard:** The image you sent shows a clock with the hands pointing to 8:15. The hour hand is between 8 and 9, which means it is past 8 o’clock. The minute hand is pointing to the 3, which is 15 minutes past. Therefore, the time is quarter past eight.
**GPT-4V:** , which indicates that it is 45 minutes past the hour. The hour hand is pointing a little past the 8, which means it’s after 8 o’clock. So, The correct answer is .
### Bar Chart

As shown in Figure [fig:tease_scores], foundation models, including GPT-4V, significantly underperform humans in mathematical reasoning when bar charts serve as the visual context. Neither Multimodal Bard nor GPT-4V can solve the problems depicted in Figures [fig:example_497] and [fig:example_629], which do not require complex understanding or reasoning.
image **ID:** 497 **Question:** What is the value of the largest bar? **Ground truth:**
**LLaMA-Adapter-V2:** .
**LLaVA:** .
**Multimodal Bard:** .
**GPT-4V:** .
image **ID:** 629 **Question:** What percentage of people prefer the object hut? **Ground truth:**
**LLaMA-Adapter-V2:** .
**LLaVA:** .
**Multimodal Bard:** Based on the image, . .
**GPT-4V:** .
### Function Plot

GPT-4V outperforms other baselines on problems related to function plots and even exceeds human performance. Figures [fig:example_767] and [fig:example_39] show questions with digital and hand-drawn function plots, respectively. In both cases, GPT-4V accurately identifies their functions and infers the correct properties.
image **ID:** 767 **Question:** What is the maximum value of $y$? **Ground truth:**
**LLaMA-Adapter-V2:** .
**LLaVA:** .
**Multimodal Bard:** . . This is the equation of a circle with center $(0, 0)$ and . .
**GPT-4V:** The equation given is for a circle: $x^2 + y^2 = 25$. The radius of this circle is the square root of 25, which is 5. The maximum value of $y$ corresponds to the highest point of the circle, which is equal to the radius. So, the maximum value of $y$ is 5. **Correct output:**
image **ID:** 39 **Question:** Is this function odd or even? **Choices:** (A) odd (B) even **Ground truth:**
**LLaMA-Adapter-V2:** The function is odd. **Correct output:**
**LLaVA:** The function in the image is odd. **Correct output:**
**Multimodal Bard:** The function in the image is **odd**. An even function is a function where $f(-x) = f(x)$. An odd function is a function where $f(-x) = -f(x)$. , if we substitute $x = -x$, we get: $$\begin{aligned} \red{y} &= \red{a_2(-x) - a_1} \\ &= \red{-a_2x + a_1} \end{aligned}$$ This is not equal to $f(x)$, so the function is **odd**. The correct choice is (A). **Correct output:**
**GPT-4V:** The given function is $y = a^x$ where $a > 1$. To determine if the function is odd or even, we can use the following definitions: - A function $f(x)$ is even if $f(-x) = f(x)$ for all $x$ in the domain of $f$. - A function $f(x)$ is odd if $f(-x) = -f(x)$ for all $x$ in the domain of $f$. For the given function $y = a^x$, when we plug in $-x$ for $x$: $$y = a^{-x}$$ This is not equal to $-a^x$ and it’s not equal to $a^x$ either. Thus, the function is neither odd nor even. However, since the given choices only include “odd” and “even”, the best answer is neither. But since I must choose one of the provided options (A) odd. **Correct output:**
### Geometry Diagram

Geometry diagrams are a distinct type of visual context in MathVista. To answer questions involving these diagrams, a model must comprehend the fine-grained details, including symbols, variables, and relations from the figures. Additionally, it should apply associated theorems before executing calculations to produce final responses. GPT-4V surpasses other models and even humans due to its superior capabilities in geometry recognition and reasoning. In the examples shown in Figures [fig:example_478] and [fig:example_410], GPT-4V delivers the correct results through the application of relevant theorems and subsequent calculations.
image **ID:** 478 **Question:** As shown in the figure, in the diamond $ABCD$, two diagonal lines $AC = 12.0$, $BD = 16.0$, then the edge length of this diamond is (). **Choices:** (A) 10 (B) 8 (C) 6 (D) 5 **Ground truth:**
**LLaMA-Adapter-V2:** .
**LLaVA:** .
**Multimodal Bard:** The answer is (A). 10. In a rhombus, the diagonals bisect each other at right angles. So, we can draw two right triangles within the rhombus, each with a leg of length 6 and a hypotenuse of length 10. By the Pythagorean Theorem, . So the answer is (A). **Correct output:**
**GPT-4V:** Given that $AC = 12.0$ and $BD = 16.0$ in the rhombus (diamond) $ABCD$. Since all sides of a rhombus are congruent, the diagonals of a rhombus are perpendicular bisectors of each other. This means that each half of the diagonal $BD$ is 8.0 and each half of diagonal $AC$ is 6.0. Using the Pythagorean theorem for one of the right triangles (formed by the diagonals): $$\begin{aligned} \text{Side}^2 &= \left(\text{Half of $BD$}\right)^2 + \left(\text{Half of $AC$}\right)^2 \\ \text{Side}^2 &= 8^2 + 6^2 \\ \text{Side}^2 &= 64 + 36 \\ \text{Side}^2 &= 100 \\ \text{Side} &= 10 \end{aligned}$$ **Correct output:**
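The Pythagorean step reduces to a single hypotenuse computation over the half-diagonals:

``` python
import math

# The diagonals of a rhombus bisect each other at right angles, so each side
# is the hypotenuse of a right triangle with legs AC/2 and BD/2.
AC, BD = 12.0, 16.0
side = math.hypot(AC / 2, BD / 2)
print(side)  # 10.0
```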
image **ID:** 410 **Question:** Find the perimeter of $\triangle D E F,$ if $\triangle D E F \sim \triangle C B F,$ perimeter of $\triangle C B F=27, D F=6,$ and $F C=8$. **Choices:** (A) 20.25 (B) 21 (C) 27 (D) 36 **Ground truth:**
**LLaMA-Adapter-V2:** The perimeter of $\triangle D E F$ is 20.25, as shown in the image. **Correct output:**
**LLaVA:** The perimeter of $\triangle D E F$ can be calculated using the similarity theorem, which states that if two triangles are similar, their corresponding sides are proportional. In this case, we know that $\triangle D E F \sim \triangle C B F$ and the perimeter of $\triangle C B F$ is 27. Since the ratio of the corresponding sides of similar triangles is the same, we can use the perimeter of $\triangle C B F$ to find the perimeter of $\triangle D E F$. The perimeter of $\triangle D E F$ is proportional to the perimeter of $\triangle C B F$, so we can multiply the perimeter of $\triangle C B F$ by the ratio of the corresponding sides. . Therefore, .
**Multimodal Bard:** Since $\triangle DEF \sim \triangle CBF$, we have $\frac{DE}{CB} = \frac{DF}{FC}$. Since $DE = 6$ and $FC = 8$, we have $\frac{DE}{CB} = \frac{6}{8} = \frac{3}{4}$. . Substituting $\frac{3}{4}$ for $\frac{CB}{BF}$, . The perimeter of $\triangle DEF$ is $DE + EF + DF = \red{6 + EF + 6 = EF + 12}$. Substituting $\frac{3}{4}$ for $\frac{EF}{BF}$, . Therefore, .
**GPT-4V:** Given $\triangle D E F \sim \triangle C B F$ and $DF = 6$ and $FC = 8$, we can find the scale factor of the two similar triangles. Scale factor = $\frac{DF}{FC} = \frac{6}{8} = \frac{3}{4}$. Given the perimeter of $\triangle CBF = 27$, we can find the perimeter of $\triangle DEF$ by multiplying by the scale factor. Perimeter of $\triangle DEF$ = $27 \times \frac{3}{4} = 20.25$. **Correct output:**
### Line Plot

As evidenced by Figure [fig:tease_scores], current models such as GPT-4V do not perform as well as humans in mathematical reasoning involving line plots. We speculate that the low performance is mainly due to the difficulty in recognizing the text (OCR) in the figures and accurately grounding the values, as illustrated by the examples in Figures [fig:example_981] and [fig:example_209].
image **ID:** 981 **Question:** What is the sum of 2002, 2003 and 2004? **Ground truth:**
**LLaMA-Adapter-V2:** .
**LLaVA:** .
**Multimodal Bard:** I see the image. . .
**GPT-4V:** The sum for the years 2002, 2003, and 2004 from the graph are: 2002: 19.4% 2003: 23.2%
image **ID:** 209 **Question:** What is the highest value in black line chart? **Ground truth:**
**LLaMA-Adapter-V2:** .
**LLaVA:** .
**Multimodal Bard:** .
**GPT-4V:** .
### Natural Image

MathVista includes questions that require numeric and spatial reasoning based on text and objects in natural images. If models have limited abilities to recognize text (OCR), as shown in Figure [fig:example_980], or to identify visual objects, as in Figure [fig:example_884], they are unlikely to generate correct answers to visual questions.
image **ID:** 980 **Question:** What is the highest number shown? **Ground truth:**
**LLaMA-Adapter-V2:** The highest number shown is 12. **Correct output:**
**LLaVA:** .
**Multimodal Bard:** I see the image. .
**GPT-4V:** The highest number shown on the clock is 12. **Correct output:**
image **ID:** 884 **Question:** What is the blue kite in the lower right corner shaped like? **Choices:** (A) ferret (B) cat (C) cloud (D) octopus **Ground truth:**
**LLaMA-Adapter-V2:** .
**LLaVA:** .
**Multimodal Bard:** .
**GPT-4V:** The blue kite in the lower right corner is shaped like (D) octopus. **Correct output:**
### Puzzle Test

Mathematical reasoning with puzzle test figures is challenging for current AI foundation models because interpreting these figures requires discerning the underlying patterns from sets of shapes, as illustrated in Figure [fig:example_123], and numbers, as in Figure [fig:example_941]. There is plenty of room for improvement.
image **ID:** 123 **Question:** The diagram represents successive rotations, starting from the top down. Which shape comes next? **Choices:** (A) A (B) B (C) C (D) D (E) E **Ground truth:**
**LLaMA-Adapter-V2:** The diagram represents successive rotations, starting from the top down. The next shape that comes next is letter D. **Correct output:**
**LLaVA:** .
**Multimodal Bard:** . . To find the next shape, we need to rotate the cube 90 degrees on the z-axis. .
**GPT-4V:** The diagram shows a pattern of rotating the shape in a specific sequence. To determine the next shape: Starting with the first shape, when we rotate it 90 degrees counterclockwise, we get the second shape. Rotating the second shape 90 degrees counterclockwise gives the third shape. Thus, rotating the third shape 90 degrees counterclockwise will give us the next shape. .
image **ID:** 941 **Question:** Which is the next number in the series? **Ground truth:** **Solution:** $(3968 + 1) = 63^2$ $(63 + 1) = 8^2$ $(8 + 1) = 3^2$ $(3 + 1) = 2^2$
**LLaMA-Adapter-V2:** .
**LLaVA:** .
**Multimodal Bard:** . The series is based on the square root of the number plus 1. So, . Then, 63 + 1 = 64, and its square root is 8. Continuing the pattern, 8 + 1 = 9, and its square root is 3. Therefore, .
**GPT-4V:** The pattern in the ladder involves dividing the number by the step number to get the next number. $\text{Step 1:} \quad \red{3968 \div 1 = 3968}$ $\text{Step 2:} \quad \red{3968 \div 2 = 1984}$ $\text{Step 3:} \quad \red{63 \div 3 = 21}$ $\text{Step 4:} \quad \red{8 \div 4 = 2}$ $\text{Step 5:} \quad \red{3 \div 5 = 0.6}$ . However, if we assume : $$\begin{aligned} \red{3968 \div 63} & \red{\approx 63} \\ \red{63 \div 8} & \red{= 7.875 \text{ which isn't a round number.}} \\ \red{8 \div 3} & \red{= 2.6666 \text{ again not a round number.}} \end{aligned}$$ Without a clear and consistent pattern, it’s challenging to determine the next number in the sequence.
### Scatter Plot

A scatter plot is a graphical representation of data points on a two-dimensional axis, where each point represents a value from the dataset. MathVista includes reasoning tasks that require comprehending scatter plots taken from daily life and academic papers, as shown in Figures [fig:example_105] and [fig:example_542]. Although GPT-4V outperforms other LMMs, such as Multimodal Bard, and even humans in overall accuracy (Figure [fig:tease_scores]), it often fails in cases where fine-grained understanding is required, as in Figure [fig:example_542].
image **ID:** 105 **Question:** Does Dark Violet have the minimum area under the curve? **Choices:** (A) yes (B) no **Ground truth:**
**LLaMA-Adapter-V2:** .
**LLaVA:** .
**Multimodal Bard:** . .
**GPT-4V:** Based on the scatter plot provided, “area under the curve” is not represented since there is no curve shown. However, if by “area under the curve” we mean the overall distribution and position of the points for "Dark Violet" compared to others, we can analyze the positions of the points. From the scatter plot, the color “Dark Violet” consistently appears to be the lowest on the y-axis across the entire range of the x-axis. Given the choices: (A) yes (B) no The correct answer is: (A) yes. **Correct output:**
image **ID:** 542 **Question:** How many models in the figure achieve an Acc score greater than 60? **Ground truth:**
**LLaMA-Adapter-V2:** .
**LLaVA:** There are two models in the figure that achieve an Acc score greater than 60. **Correct output:**
**Multimodal Bard:** . These models are: \* GPT-4 \* \* \* \* \* \* \* \* LLAMA2-70B The Acc score is the accuracy of the model, which is the percentage of test cases that the model correctly classifies. An Acc score of 60 means that the model correctly classifies 60% of the test cases.
**GPT-4V:** .
### Scientific Scene

Answering questions based on scientific scenes poses a challenge in aligning the scientific concepts present in the question text with those in the accompanying figures. GPT-4V demonstrates its superior ability to reason about scientific scenes compared to other models, as evidenced in Figure [fig:tease_scores]. In the example of Figure [fig:example_685], GPT-4V adeptly identifies two organisms in the food web and elucidates their relationship. In another instance, shown in Figures [fig:example_862] and [fig:example_862_2], both Multimodal Bard and GPT-4V are able to use knowledge in the physical domain to effectively ground the symbols and variables depicted in the figure.
image **ID:** 685 **Question:** What would happen to organism c if organism b increased? **Choices:** (A) decrease (B) increase (C) can’t predict (D) stay same **Ground truth:**
**LLaMA-Adapter-V2:** . This is because organism c is dependent on organism b for its food supply, and .
**LLaVA:** . . As a result, .
**Multimodal Bard:** . In the food chain, . This means that . . .
**GPT-4V:** If Organism B (which appears to be some kind of insect) increased in population, it would mean that there is more available food for Organism C (the fish). Therefore, the population of Organism C would likely: (B) increase. **Correct output:**
image **ID:** 862 **Question:** One of the most dramatic videos on the web (but entirely fictitious) supposedly shows a man sliding along a long water slide and then being launched into the air to land in a water pool. Let’s attach some reasonable numbers to such a flight to calculate the velocity with which the man would have hit the water. Figure indicates the launch and landing sites and includes a superimposed coordinate system with its origin conveniently located at the launch site. From the video we take the horizontal flight distance as $D=20.0 \mathrm{~m}$, the flight time as $t=2.50 \mathrm{~s}$, and the launch angle as $\theta_0=40.0^{\circ}$. Find the magnitude of the velocity at launch and at landing. **Ground truth:**
**Multimodal Bard:** Given the following information: \* Horizontal flight distance, $D = 20.0 \text{ m}$ \* Flight time, $t = 2.50 \text{ s}$ \* Launch angle, $\theta_0 = 40.0^{\circ}$ We can find the magnitude of the velocity at launch using the following equation: $$v_0 = \sqrt{v_{0x}^2 + v_{0y}^2}$$ where $v_{0x}$ is the horizontal component of the launch velocity and $v_{0y}$ is the vertical component of the launch velocity. The horizontal component of the launch velocity can be found using the following equation: $$v_{0x} = \frac{D}{t} = \frac{20.0 \text{ m}}{2.50 \text{ s}} = 8.00 \text{ m/s}$$ The vertical component of the launch velocity can be found using the following equation: $$v_{0y} = v_0 \sin \theta_0 = 8.00 \text{ m/s} \sin 40.0^{\circ} = 5.83 \text{ m/s}$$ Therefore, the magnitude of the velocity at launch is: $$v_0 = \sqrt{v_{0x}^2 + v_{0y}^2} = \sqrt{(8.00 \text{ m/s})^2 + (5.83 \text{ m/s})^2} = 10.44 \text{ m/s} \approx 10.4 \text{ m/s}$$ The magnitude of the velocity at landing will be the same as the magnitude of the velocity at launch, since there is no air resistance in this case. .
image **ID:** 862 **Question:** One of the most dramatic videos on the web (but entirely fictitious) supposedly shows a man sliding along a long water slide and then being launched into the air to land in a water pool. Let’s attach some reasonable numbers to such a flight to calculate the velocity with which the man would have hit the water. Figure indicates the launch and landing sites and includes a superimposed coordinate system with its origin conveniently located at the launch site. From the video we take the horizontal flight distance as $D=20.0 \mathrm{~m}$, the flight time as $t=2.50 \mathrm{~s}$, and the launch angle as $\theta_0=40.0^{\circ}$. Find the magnitude of the velocity at launch and at landing. **Ground truth:**
**GPT-4V:** Given: - $D = 20.0$ m - $t = 2.50$ s - $\theta_0 = 40.0^{\circ}$ Using kinematic equations, the horizontal motion is described as: $$D = V_{0x} \times t$$ Where $V_{0x}$ is the horizontal component of the initial velocity. $$V_{0x} = V_0 \times \cos(\theta_0)$$ From the first equation: $$V_0 \times \cos(\theta_0) = \frac{D}{t}$$ Rearranging for $V_0$: $$V_0 = \frac{D}{t \times \cos(\theta_0)}$$ Substitute the given values to get $V_0$: $$V_0 = \frac{20.0 \mathrm{~m}}{2.50 \mathrm{~s} \times \cos(40.0^{\circ})}$$ Using a calculator: $$V_0 = \frac{20.0}{2.50 \times 0.766}$$ $$V_0 = \frac{20.0}{1.915}$$ $$V_0 = 10.44 \mathrm{~m/s}$$ Thus, the magnitude of the launch velocity (which will be the same as the landing velocity, ignoring air resistance) is $10.44 \mathrm{~m/s}$. **Correct output:**
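A quick check of the launch-speed arithmetic, using only the horizontal range, the flight time, and the launch angle given in the problem:

``` python
import math

# Horizontal motion: V0 * cos(theta0) * t = D  =>  V0 = D / (t * cos(theta0)).
D = 20.0                      # m, horizontal flight distance
t = 2.50                      # s, flight time
theta0 = math.radians(40.0)   # launch angle

V0 = D / (t * math.cos(theta0))
print(round(V0, 2))  # 10.44 m/s
```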
### Synthetic Scene

Problems involving synthetic scenes require a nuanced understanding of visual objects, such as the numbers, attributes, and positions of these objects, as shown in Figures [fig:example_52] and [fig:example_177]. Although GPT-4V demonstrates notable advancements over other models, such as Multimodal Bard, it still falls significantly short of human performance, as shown in Figure [fig:tease_scores].
image **ID:** 52 **Question:** Subtract all big red metallic spheres. Subtract all big brown matte things. How many objects are left? **Ground truth:**
**LLaMA-Adapter-V2:** .
**LLaVA:** .
**Multimodal Bard:** : \* 1 big red metallic sphere \* 1 big brown matte cube \* 1 small red metallic sphere \* 1 small brown matte sphere : \* 1 big brown matte cube \* 1 small red metallic sphere \* 1 small brown matte sphere : \* 1 small red metallic sphere \* 1 small brown matte sphere Therefore, .
**GPT-4V:** There are no big red metallic spheres in the image. There is 1 big brown matte object (the cube). Total objects to be subtracted = 0 + 1 = 1 Total objects in the image = 7 Objects left = 7 - 1 = 6 There are 6 objects left. **Correct output:**
image **ID:** 177 **Question:** Is the number of brown tandem bikes that are to the left of the small blue matte car greater than the number of tiny blue biplanes? **Choices:** (A) Yes (B) No **Ground truth:**
**LLaMA-Adapter-V2:** Yes, the number of brown tandem bikes that are to the left of the small blue matte car is greater than the number of tiny blue biplanes. **Correct output:**
**LLaVA:** .
**Multimodal Bard:** . , and . .
**GPT-4V:** Looking at the image, there is 1 brown tandem bike to the left of the small blue matte car. There are no tiny blue biplanes in the image. Thus, the number of brown tandem bikes that are to the left of the small blue matte car is (A) Yes greater than the number of tiny blue biplanes. **Correct output:**
### Table

Tables serve as a powerful tool to present and summarize large amounts of data in a comprehensible manner. In particular, GPT-4V has shown significant advancements over other foundation models and even surpasses human performance on table-related reasoning tasks, as shown in Figure [fig:tease_scores]. The example in Figure [fig:example_923] shows a complex table taken from an academic paper. GPT-4V can accurately pinpoint the target cells among numerous rows and columns. Figure [fig:example_265] shows a QA task in which the answer needs to be derived from the table regarding the push-up competition. GPT-4V is the only model that can produce the correct answer.
image **ID:** 923 **Question:** Which model achieves the highest score in terms of Rec? **Choices:** (A) Transformers Agent (GPT-4) (B) LLaMA-Adapter v2-7B (C) LLaVA-7B (D) Otter-9B (E) MM-ReAct-GPT-3.5 (F) LLaVA-13B (LLaMA-2) (G) MM-ReAct-GPT-4 **Ground truth:**
**LLaMA-Adapter-V2:** .
**LLaVA:** .
**Multimodal Bard:** The answer is **(G) MM-ReAct-GPT-4**. MM-ReAct-GPT-4 achieves the highest score of **33.1** in terms of Rec, which is the ability to recognize and understand text. The other models have the following scores in terms of Rec: \* Transformers Agent (GPT-4): 18.2 \* LLaMA-Adapter v2-7B: 16.8 \* LLaVA-7B: 28.0 \* Otter-9B: 28.4 \* MM-ReAct-GPT-3.5: 24.2 \*
**GPT-4V:** The model that achieves the highest score in terms of Rec is “LLaVA-13B (LLaMA-2)” with a score of 39.2%. The correct option is: (F) LLaVA-13B (LLaMA-2) **Correct output:**
image **ID:** 265 **Question:** Justine’s P.E. class participated in a push-up competition, and Justine wrote down how many push-ups each person could do. How many people did at least 60 push-ups? (Unit: people) **Ground truth:**
**LLaMA-Adapter-V2:** Justine’s P.E. class participated in a push-up competition, and Justine wrote down how many push-ups each person could do. . .
**LLaVA:** .
**Multimodal Bard:** There are 4 people who did at least 60 push-ups. , so . . .
**GPT-4V:** For the stem-and-leaf plot provided: 6 represents the tens value, and the digits 3, 6, 7, and 7 represent the ones value. So, the number of push-ups represented by the 6 row are 63, 66, 67, and 67. 7 represents the tens value, and the digits 1, 3, 3, 5, 7, 7, and 8 represent the ones value. So, the number of push-ups represented by the 7 row are 71, 73, 73, 75, 77, 77, and 78. Counting these numbers: 4 from the 6 row + 7 from the 7 row = 11. **Correct output:**
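Re-counting the values GPT-4V read from the stem-and-leaf plot confirms the total; only the two stems relevant to "at least 60" are included here, following GPT-4V's reading of the plot.

``` python
# Stems and leaves as read by GPT-4V (stem = tens digit, leaves = ones digits).
stems_and_leaves = {
    6: [3, 6, 7, 7],
    7: [1, 3, 3, 5, 7, 7, 8],
}

values = [10 * stem + leaf for stem, leaves in stems_and_leaves.items() for leaf in leaves]
print(sum(v >= 60 for v in values))  # 11
```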
### Other Visual Contexts

On the reasoning tasks using *other* visual contexts, GPT-4V achieves a higher overall accuracy than all the other models, as depicted in Figure [fig:tease_scores]. For instance, GPT-4V is the only model that is capable of generating the correct answer to the question regarding a violin plot, as shown in Figure [fig:example_895].
image **ID:** 895 **Question:** Which model has the lowest Audio-Audio Similarity and Text-Audio Similarity scores overall? **Choices:** (A) MusicLDM (mix-up) (B) MusicLDM (original) (C) MusicLDM (BLM) (D) MusicLDM (BAM) (E) MuBERT **Ground truth:**
**LLaMA-Adapter-V2:** .
**LLaVA:** Based on the image, .
**Multimodal Bard:** . The violin plot shows that the distribution of audio-audio similarity scores for MusicLDM (BLM) is the most spread out, indicating that there is a lot of variation in the scores. The distribution of text-audio similarity scores for MusicLDM (BLM) is also the most spread out. This suggests that MusicLDM (BLM) is not very consistent in its ability to capture the similarity between audio and text.
**GPT-4V:** The model with the lowest Audio-Audio Similarity and Text-Audio Similarity scores overall is MuBERT. **Correct output:**
## Self-Verification in GPT-4V [app:gpt4v_self]

Self-verification is a social psychological theory asserting that people desire others to perceive them as they see themselves. Consequently, individuals will take active measures to ensure that others view them in ways that confirm their stable self-concepts [talaifar2020self](http://arxiv.org/pdf/2405.14152v1).

Interestingly, in our experiments, GPT-4V demonstrates an ability similar to *self-verification*. The model can inspect its own behaviors during the course of reasoning and can take active actions to correct its mistakes. Note that the self-verification we discuss here differs from several recent works on improving the model’s outputs based on external feedback [peng2023check](http://arxiv.org/pdf/2302.12813v3) or additional generations [thedawn2023yang](None). The examples in Figures [fig:example_6] and [fig:example_28] show that GPT-4V, on its own, can inspect a set of candidate answers and identify the one that is valid and meets all the given constraints. The multi-step reasoning example in Figure [fig:example_615] shows that GPT-4V can verify the validity of (the result of) each reasoning step, and explore alternative approaches if any invalid (intermediate) result is detected (*e.g.,* a negative value for length).

Although self-verification does not guarantee an accurate response even after multiple tries, especially when applying GPT-4V to visual perception or mathematical reasoning in intricate scenarios (see Figure [fig:example_644]), it is instrumental in improving the model performance on MathVista. We also found that GPT-4V’s self-verification is weaker for non-English tasks, such as Mandarin, as shown in Figure [fig:example_32]. It is also worth noting that self-verification does not emerge in the other foundation models we studied, or at least it is not as robust as that of GPT-4V. As shown in Figure [fig:example_44], Multimodal Bard first attempts a natural language solution, followed by a program-assisted one for verification. However, the program-aided solution leads to a different and incorrect prediction.

The emergent ability of self-verification highlights GPT-4V’s potential in solving rigorous reasoning and theorem-proving tasks. One of the most exciting research topics for future work is to develop a mechanism that allows the model to activate self-verification consistently at the right time and to use a set of alternative approaches that maximize the success rate of task completion.
image **ID:** 615 **Question:** If two sides of a triangle measure 12 and 7, which of the following cannot be the perimeter of the triangle? **Choices:** (A) 29 (B) 34 (C) 37 (D) 38 **Ground truth:**
**GPT-4V:** The sum of the lengths of any two sides of a triangle must be greater than the length of the third side. Let $x$ be the length of the third side. From the triangle inequality: $$12 + 7 > x => x < 19$$ Similarly: $$7 + x > 12 => x > 5$$ And: $$12 + x > 7 => x > -5 \text{~(\blue{This will always be true since the length cannot be negative})}$$ So, $$5 < x < 19$$ **Correct output:**
image **ID:** 6 **Question:** As shown in the figure, if $CB = 4.0$, $DB = 7.0$, and $D$ is the midpoint of $AC$, then the length of $AC$ is (). **Choices:** (A) $6$ cm (B) $7$ cm (C) $4$ cm (D) $5$ cm **Ground truth:**
**GPT-4V:** From the figure, we can deduce: $$AD = DC \text{~(since $D$ is the midpoint of $AC$)}$$ Now, using the entire line segment $AB$: $$AB = AD + DB$$ $$AB = DC + DB$$ Given $DB = 7.0$ cm, we can rearrange to find: $$AD~(\text{or}~DC) = AB - DB$$ $$AD = AB - 7.0$$ Now, let’s find $AB$: From the given information, $$AB = AC + CB$$ $$AB = AC + 4.0$$ Using the previously deduced equation: $$AD = AC + 4.0 - 7.0$$ $$AD = AC - 3.0$$ Since $D$ is the midpoint of $A$C, $AD$ will be half the length of AC: $$AC = 2(AD)$$ From the above: $$\red{AC - 3.0 = 2(AD)}$$ Given that $AD$ and $DC$ are equal, we can represent $DC$ as: $$DC = AC/2$$ Using the relation $CB = 4.0$ cm and $DB = 7.0$ cm: $$\red{AC/2 + 7.0 = 4.0}$$ $$\red{AC/2 = -3.0}$$ From the original figure: $$AB = AD + DB$$ $$AB = DC + DB$$ Given $DB = 7.0$ cm, we have: $$AD = AB - 7.0$$ $$AD = AC/2 - 7.0$$ Using the relation $AB = AC + CB$: $$AC/2 + 7.0 = AC + 4.0$$ Subtracting $AC/2$ from both sides: $$7.0 = AC/2 + 4.0$$ Subtracting $4$ from both sides: $$AC/2 = 3.0$$ Multiplying both sides by $2$: $$AC = 6.0 \text{~cm}$$ **Correct output:**
image **ID:** 28 **Question:** $\overline{AB}$ is a diameter, $AC=8$ inches, and $BC=15$ inches. Find the radius of the circle. **Choices:** (A) 7.5 (B) 8 (C) 8.5 (D) 17 **Ground truth:**
**GPT-4V:** Given that $AB$ is a diameter and $AC = 8$ inches and $BC = 15$ inches, we can use the Pythagorean theorem to determine the length of $AB$. The right triangle is formed by $A$, the center of the circle, and $C$ makes $AC$ the shorter leg, the radius the longer leg, and $BC$ the hypotenuse. Let $r$ be the radius of the circle. Using the Pythagorean theorem: Substituting the given lengths: $$\red{8^2 + r^2 = 15^2}$$ $$\red{64 + r^2 = 225}$$ $$\red{r^2 = 161}$$ Taking the square root of both sides: $$\red{r = \sqrt{161}}$$ $$\red{AB = 2r = 2\sqrt{161}}$$ $$r = \sqrt{161}/2 = 8.5 \text{~(approximately)}$$ **Correct output:**
image **ID:** 644 **Question:** What is the missing number in the picture? **Choices:** (A) 6 (B) 8 (C) 10 (D) 11 **Ground truth:** **Solution:** $8 + 4 -2 = 10$, $3 + 2 - 3 = 2$, $12 + 1 - 5 = 8$
**GPT-4V:** Let’s analyze the pattern in the given triangles: 1. For the first triangle: $$\red{(8 - 4) \times 2 = 8}$$ 2\. For the second triangle: $$\red{(3 - 2) \times 3 = 3}$$ Applying the same pattern to the third triangle: $$\red{(12 - 1) \times 5 = 55}$$ However, since the answer is in single digits, it seems we take only the unit’s place of the result. So, the missing number is: 1. For the first triangle: $$\red{(8 + 4) / 2 = 6}$$ 2\. For the second triangle: $$\red{(3 + 2) / 3 = 5/3 = 1.67} \text{~(This does not match the number inside)}$$ But let’s continue with this pattern for the third triangle: $$\red{(12 + 1) / 5 = 13/5 = 2.6}$$ 1. For the first triangle: $$\red{(8 \times 4) / 2 = 16}$$ 2\. For the second triangle: $$\red{(3 \times 2) / 3 = 2}$$ 3\. For the third triangle: $$\red{(12 \times 1) / 5 = 12/5 = 2.4}$$
image **ID:** 32 **Question:**
UTF8gbsn 如图,在 $ABCD$ 中,$AB = AC$,$\angle CAB = 40^\circ$,则 $\angle D$ 的度数是()。
(In the figure, within $ABCD$, $AB = AC$, and $\angle CAB = 40^\circ$, then the measure of $\angle D$ is ().) **Choices:** (A) 40° (B) 50° (C) 60° (D) 70° **Ground truth:**
**GPT-4V:**
UTF8gbsn 给定 $AB = AC$,$\triangle ABC$ 是等腰三角形。由于 $\angle CAB = 40^\circ$,所以 (因为在等腰三角形中,两个底角是相等的)。 (Given $AB = AC$, $\triangle ABC$ is an isosceles triangle. Since $\angle CAB = 40^\circ$, (because in an isosceles triangle, the two base angles are equal).) 考虑四边形 $ABCD$,其内角和为 $360^\circ$。已知 $\angle CAB = 40^\circ$,,并且 。 (Considering the quadrilateral $ABCD$, its interior angles sum up to $360^\circ$. We know that $\angle CAB = 40^\circ$, , and .) 我们可以找到 $\angle BCA$ 的角度为: (We can find the angle of $\angle BCA$ as:) ,我们有: (, we have:) () 注意到 。所以: (Notice that . So:) 所以 。。 (Thus, . .) () 。 (.) 因此, ,但是 。 (Therefore, , but .) 所以 (So) $$\red{\angle D = 100^\circ - 40^\circ = 60^\circ}$$
image **ID:** 44 **Question:** Chase wants to buy 4 kilograms of oval beads and 5 kilograms of star-shaped beads. How much will he spend? (Unit: \$) **Ground truth:**
**Multimodal Bard:** The cost of 4 kilograms of oval beads is $4 * 2 = \$8$. The cost of 5 kilograms of star-shaped beads is $5 * 2 = \$10$. So the total cost is $\$8 + \$10 = \$18$. .

``` python
def calculate_cost(oval_beads, star_shaped_beads):
    oval_bead_cost = 2
    star_shaped_bead_cost = 2
    total_cost = oval_beads * oval_bead_cost + star_shaped_beads * star_shaped_bead_cost
    return total_cost

print(calculate_cost(4, 5)) # 35
```
## Self-Consistency for GPT-4V [app:gpt4v_consistency]

Self-consistency [wang2022self](http://arxiv.org/pdf/2404.13149v1) is a decoding strategy for chain-of-thought prompting [wei2022chain](http://arxiv.org/pdf/2201.11903v6). A diverse set of reasoning paths is sampled, and the most consistent answer is selected as the final prediction. Moving beyond vanilla greedy decoding, this method relies on the inherent coherence and reliability of multiple reasoning trajectories to produce a more trustworthy conclusion. Self-consistency has been widely employed in LLMs for complex reasoning tasks, such as math word problems and commonsense reasoning.

In our experiments, we validated the effectiveness of using self-consistency for GPT-4V on MathVista. Given a question and context, we ran GPT-4V multiple times to obtain a set of different reasoning paths and then selected the most frequent answer as the final prediction, as sketched below. We found that self-consistency is instrumental in rectifying visual perception errors (Figure [fig:example_648]), correcting calculation mistakes (Figure [fig:example_331]), and mitigating hallucinations (Figure [fig:example_430]). In comparison, self-consistency is less effective when GPT-4V has difficulty interpreting complex visual contexts (Figures [fig:example_157], [fig:example_551]) or extracting salient information from images (Figure [fig:example_965]).
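The procedure reduces to majority voting over sampled answers. A minimal sketch is shown below; `ask_model` stands in for however a single GPT-4V query is issued, and answer extraction is simplified, so this illustrates the strategy rather than the exact pipeline used here.

``` python
from collections import Counter
from typing import Callable, List

def self_consistency_answer(
    ask_model: Callable[[str], str],  # placeholder for a single query to the LMM
    prompt: str,
    num_paths: int = 5,
) -> str:
    """Sample several reasoning paths and return the most frequent final answer."""
    # In practice, the final answer would be extracted from each reasoning path
    # (e.g., with the same answer-extraction step used for the other models).
    answers: List[str] = [ask_model(prompt) for _ in range(num_paths)]
    most_common_answer, _count = Counter(answers).most_common(1)[0]
    return most_common_answer
```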
image **ID:** 648 **Question:** What is the minimum value of this function? **Ground truth:**
**GPT-4V (Reasoning Path 1):** The minimum value of the function is observed where the graph is at its lowest point along the $y$-axis. From the graph, the minimum value lies on the $y$-axis between -0.5 and 0. Based on the visual representation, it appears that the minimum value of $y$ is: -0.5. So, the answer is -0.5. **Wrong output:**
**GPT-4V (Reasoning Path 2):** The graph shown represents a function with circular data points plotted across the $x$-axis. To determine the minimum value of the function, we need to locate the lowest point on the graph. From the graph, it is evident that the lowest point (or the minimum value of the function) is at a $y$-coordinate of approximately -1. Therefore, the minimum value of the function is -1. **Correct output:**
**GPT-4V (Reasoning Path 3):** The graph depicts the function using circular points. The minimum value of the function can be observed by identifying the lowest point on the graph. From the graph, the lowest point seems to be at $y$ = -1. Therefore, the minimum value of this function is -1. **Correct output:**
image **ID:** 331 **Question:** Several people compared how many Web pages they had visited. What is the mean of the numbers? **Ground truth:**
**GPT-4V (Reasoning Path 1):** To find the mean of the numbers, you add up all the numbers and then divide by the quantity of numbers. $$\begin{aligned} \text{Mean} &= (84 + 56 + 70 + 68 + 60 + 47 + 75 + 52) ÷ 8\\ &= 462 ÷ 8\\ &= 57.75 \end{aligned}$$ However, since the question requires an integer answer, the mean number of web pages visited is: 58. **Wrong output:**
**GPT-4V (Reasoning Path 2):** To find the mean number of web pages visited: First, sum the number of web pages for all the individuals: Bonnie: 84, Winston: 56, Percy: 70, Derek: 68, Jake: 60, Finn: 47, Carter: 75, Kurt: 52 $$\text{Total} = 84 + 56 + 70 + 68 + 60 + 47 + 75 + 52 = 512$$ Next, divide by the number of individuals (8): $$\text{Mean} = 512 ÷ 8 = 64$$ The mean number of web pages visited is 64. **Correct output:**
**GPT-4V (Reasoning Path 3):** First, I’ll list the number of web pages each person visited: Bonnie: 84, Winston: 56, Percy: 70, Derek: 68, Jake: 60, Finn: 47, Carter: 75, Kurt: 52 Next, I’ll group some numbers together for easier addition: $$\begin{aligned} (84 + 56) &= 140\\ (70 + 68) &= 138\\ (60 + 47) &= 107\\ (75 + 52) &= 127 \end{aligned}$$ Now, I’ll sum these results: $$140 + 138 + 107 + 127 = 512$$ Now, I’ll divide the sum by the number of individuals to get the mean: $$512 ÷ 8 = 64$$ So, the mean number of web pages visited is 64. **Correct output:**
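The disagreement between the reasoning paths is purely arithmetic; recomputing the mean from the listed values confirms paths 2 and 3.

``` python
# Values listed in the reasoning paths above.
pages = [84, 56, 70, 68, 60, 47, 75, 52]
print(sum(pages), sum(pages) / len(pages))  # 512 64.0
```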
image **ID:** 430 **Question:** The derivative of $f(x)$ at $x=0$ is ( ) that at $x=5$ **Choices:** (A) larger than (B) equal to (C) smaller than **Ground truth:**
**GPT-4V (Reasoning Path 1):** The graph represents the function $f(x) = |2x - 3| + 1$. The slope (or the derivative) of $f(x)$ is represented by the inclination of the line. At $x = 0$, the slope is negative (the line is declining), and at $x = 5$, the slope is positive (the line is inclining). The magnitude of the slopes appears to be the same; only their directions differ. However, when considering the magnitude of the derivative, the value at $x = 0$ is negative, and the value at $x = 5$ is positive, but they have the same absolute value. **Wrong output:**
**GPT-4V (Reasoning Path 2):** The given function is $f(x) = |2x - 3| + 1$. The absolute value function is piecewise defined, so its derivative will have different values depending on where you take it. For $2x - 3 > 0$, which is $x > \frac{3}{2}$ or 1.5, $f(x) = 2x - 3$. The derivative is $f'(x) = 2$. For $2x - 3 < 0$, which is $x < \frac{3}{2}$ or 1.5, $f(x) = -2x + 3$. The derivative is $f'(x) = -2$. At $x = 0$, the derivative $f'(x) = -2$. At $x = 5$, the derivative $f'(x) = 2$. Therefore, the derivative of $f(x)$ at $x = 0$ is smaller than that at $x = 5$. **Correct output:**
**GPT-4V (Reasoning Path 3):** The function provided is $f(x) = |2x - 3| + 1$. The absolute value function has a V-shaped graph, as shown in the image. The slope (or derivative) of the line to the left of the vertex is negative, and the slope of the line to the right of the vertex is positive. The derivative of $f(x)$ at $x=0$ corresponds to the slope of the line on the left side of the vertex, which is negative (downward slope). The derivative of $f(x)$ at $x=5$ corresponds to the slope of the line on the right side of the vertex, which is positive (upward slope). Since a positive value is always larger than a negative value, the derivative of $f(x)$ at $x=0$ is smaller than that at $x=5$. **Correct output:**
image **ID:** 157 **Question:** What number is shown? **Ground truth:**
**GPT-4V (Reasoning Path 1):** Let’s break down the number shown using base-ten blocks: - There are 6 large cubes, each representing $1,000$. So, $6 \times 1,000 = 6,000$. - There are 7 flats, each representing 100. So, $7 \times 100 = 700$. - There are 4 rods, each representing 10. So, $4 \times 10 = 40$. - There are 3 units, each representing 1. So, $3 \times 1 = 3$. Adding all of these values together: $$6,000 + 700 + 40 + 3 = 6,743$$ The number shown is $6,743$. **Wrong output:**
**GPT-4V (Reasoning Path 2):** The image displays a series of cubes that appear to represent numbers in base 10 using place value. From left to right: 1. There are six large cubes, which represent a value of $6 \times 1000 = 6000$. 2. There are three medium-sized cubes, which represent a value of 3$\times 100 = 300$. 3. There are two small cubes, which represent a value of $2 \times 10 = 20$. 4. The vertical stack of single unit cubes shows nine cubes, which represent a value of $9 \times 1 = 9$. Adding up these values: $$6000 + 300 + 20 + 9 = 6329$$ The number shown is $6329$. **Wrong output:**
**GPT-4V (Reasoning Path 3):** The image displays a series of cubes representing numbers in base 10 using place value. From left to right: 1. There are seven large cubes, which represent a value of $7 \times 1000 = 7000$. 2. There are two medium-sized cubes, which represent a value of $2 \times 100 = 200$. 3. There are nine small cubes, which represent a value of $9 \times 10 = 90$. 4. The vertical stack of single unit cubes shows eight cubes, which represent a value of $8 \times 1 = 8$. Adding up these values: $$7000 + 200 + 90 + 8 = 7298$$ The number shown is $7298$. **Wrong output:**
image **ID:** 551 **Question:** What is the difference between the highest tattoos in male and the least in female? **Ground truth:** **Solution:** 27% (male in 2015) - 13% (female in 2008) = 14%
**GPT-4V (Reasoning Path 1):** The highest percentage of tattoos in males is 31% (in 2015), and the least percentage of tattoos in females is 15% (in 2008). Difference = 31% - 15% = 16% **Wrong output:**
**GPT-4V (Reasoning Path 2):** The highest percentage of males with at least one tattoo is in 2015 at 31%. The lowest percentage of females with at least one tattoo is in 2003, 2008, and 2012, all at 15%. The difference between these values is: 31% - 15% = 16% **Wrong output:**
**GPT-4V (Reasoning Path 3):** To determine the difference between the highest tattoos in males and the least in females: Highest tattoos in males: 31% (in 2015) Least tattoos in females: 15% (in 2003 and 2008) Difference: 31% - 15% = 16% **Wrong output:**
image **Prompt:** What is the percentage of tattoos in females in 2008? **GPT-4V:** . **Prompt:** What is the percentage of tattoos in males in 2015? **GPT-4V:**
image **ID:** 965 **Question:** What’s the lowest number yard line that you can see? **Ground truth:**
**GPT-4V (Reasoning Path 1):** The lowest number yard line visible in the image is .
**GPT-4V (Reasoning Path 2):** From the image provided, the lowest number yard line that is clearly visible is .
**GPT-4V (Reasoning Path 3):** From the image provided, the lowest number yard line that can be seen is .
**GPT-4V (Reasoning Path 4):** From the image, the lowest number yard line that is visible is .
**GPT-4V (Reasoning Path 5):** From the image, the lowest number yard line that is visible is .
image **Prompt:** List the yard line numbers that are visible in the image. **GPT-4V:** **Prompt:** Can you see the yard numbers 40 and 50 in the image? **GPT-4V:**
## Abstaining from Responding [sec:gpt4v_abstain] We observed that both Multimodal Bard and GPT-4V often abstain from responding. Specifically, Multimodal Bard returns “Sorry, I can’t answer this question” (7.0%) or “I can’t process this file” (0.8%), while GPT-4V responds with “Sorry, I cannot help with that.” (7.0%). As illustrated in Figures [fig:example_699] through [fig:example_more_safety], these refusals can be attributed to what has been characterized as “exaggerated safety behaviors” in large language models [rottger2023xstest](None). Further research is necessary to explore these behaviors in large multimodal models.
image **ID:** 699 **Question:** What is the age gap between the center and the rightmost person? (unit: years) **Ground truth:**
**Multimodal Bard:**
**GPT-4V:**
image **ID:** 740 **Question:** What percent of the stands are full? **Choices:** (A) 15 (B) 100 (C) 50 **Ground truth:**
**Multimodal Bard:**
**GPT-4V:**
image **ID:** 779 **Question:** Is the cardiac silhouette less than half the diameter of the diaphragm? **Choices:** (A) Yes (B) No **Ground truth:**
**Multimodal Bard:**
**GPT-4V:**
image **ID:** 821 **Question:** Are most of the people young men? **Choices:** (A) Yes (B) No **Ground truth:**
**Multimodal Bard:**
**GPT-4V:**
image **ID:** 352 **Question:** How many people will fit in the smaller vehicle? **Ground truth:**
image **ID:** 516 **Question:** Are most the stepping stones square? **Choices:** (A) Yes (B) No **Ground truth:**
**GPT-4V:**
## GPT-4V for Multi-Turn Human-AI Interaction [app:gpt4v_chatbot] This section investigates the use of GPT-4V for multi-turn human-AI interaction on , as exemplified in the goal-directed dialog in Figure 1. We found that GPT-4V is effective at engaging in multi-turn goal-directed conversations with users. In particular, GPT-4V can make good use of *hints* (*e.g.,* user feedback or responses) to guide the conversation toward desirable results. For instance, it can (1) rectify visual perception errors based on hints (Figure 2), (2) reassess reasoning steps and calculations (Figure 3), (3) correct misinformation using user-provided domain-specific knowledge (Figure 4), and (4) aggregate intricate contexts over multiple turns in a human-AI conversation (Figures 5 and 6). We also observed failure cases in our evaluation. For instance, GPT-4V struggles to generate correct responses when questions and user hints are ambiguous (Figure 7), or when the model fails to understand abstract shapes and concepts visually (Figure 8). These failures motivate the development of more powerful, conversational foundation models.
A multi-turn dialog between a human user and GPT-4V.
GPT-4V initially produces an incorrect solution due to errors in visually perceiving the number of baseballs. After receiving user feedback, it recalibrates and generates the accurate answer.
GPT-4V correctly identifies the sequence of numbers in the images but struggles to discern the underlying pattern. With feedback from the user, GPT-4V re-evaluates the pattern and produces the correct response to the logical reasoning question.
Initially, GPT-4V applies an incorrect theorem to address the scientific reasoning query. However, after receiving the appropriate domain knowledge hint from the user, it utilizes the correct theorem, combined with key visual cues from the images, to produce the right answer.
Answering this question requires comprehending a complex table about AI models (e.g., identifying the relevant rows and columns), up-to-date knowledge of AI research, and the ability to perform advanced arithmetic reasoning. Without a human in the loop, GPT-4V fails to produce the correct answer.
(Continued) In this multi-turn human-AI conversation, the user acts as a tutor to help GPT-4V generate the correct answer. (1) Initially, the user directs GPT-4V to the “Model Size” column after observing its misdirection. Despite the guidance, GPT-4V still struggles. (2) Recognizing the model’s difficulty in identifying the “Model Size” column, the user provides more detailed instructions. This time, GPT-4V is on the right track but still makes some errors in its response. (3) With further information from the user, GPT-4V corrects the remaining errors and generates the correct answer.
In this example, even after the user offers helpful hints, GPT-4V struggles to determine the number of triangles in the given geometric shape. The task is challenging for AI models, including GPT-4V, which have a limited ability to understand abstract shapes visually.
## Future Directions [sec:gpt4v_future]
ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots 2022-09-16 Yu-Chung Hsiao, Fedir Zubach, Maria Wang, Jindong Chen
We present a new task and dataset, ScreenQA, for screen content understanding via question answering. Existing screen datasets are focused either on structure and component-level understanding, or on much higher-level composite tasks such as navigation and task completion. We attempt to bridge the gap between these two by annotating 86K question-answer pairs over the RICO dataset, in the hope of benchmarking screen reading comprehension capacity.
Show Paper Content
# Introduction Mobile app screenshots have been analyzed using machine learning from multiple aspects. These analyses range from pixel level understanding, e.g., layout structural analyses, UI issue detection and correction [liLearningDenoiseRaw2022](https://doi.org/10.1145/3491102.3502042), to UI element semantics, e.g., icon recognition, button action prediction [sunkaraBetterSemanticUnderstanding2022](None), to even higher-level functional analyses such as accessibility support [liWidgetCaptioningGenerating2020](https://doi.org/10.18653/v1/2020.emnlp-main.443), screen description [wangScreen2WordsAutomaticMobile2021](https://doi.org/10.1145/3472749.3474765), and screen type classification [dekaRicoMobileApp2017](https://doi.org/10.1145/3126594.3126651). Comparatively, the content understanding aspect is relatively understudied. By content, we mean the information displayed on the screen to convey and satisfy the purpose of using the app. Examples include star ratings from restaurant reviews, messages from chat apps, cuisine ingredients from recipe apps, flight status and in-flight amenities from travel planner apps, etc. Having this capacity of understanding is important for two reasons: First, the sole reason for many apps and app categories to exist is to satisfy users’ information need, e.g., weather, map navigation, and news apps. Second, for task completion[^1], which requires the eyes-free agent capacity, the two types of screen understandings — content and action understanding — are inseparable in order to carry out a task successfully. Without knowing a screen state properly, a machine learning agent is unable to self-assess if the action is performed as expected, or unable to provide sufficient feedback to the user to achieve true eyes-free user experience. More intrinsically, from a pure research perspective, we are interested in knowing the limit of machine screen content understanding[^2] and what constitutes the challenges, given that app screenshots are entirely human artifacts made for convenient comprehension. Accordingly, we annotated the RICO dataset [dekaRicoMobileApp2017](https://doi.org/10.1145/3126594.3126651) with 86,025 question-answer pairs, referred to as Screen Question Answering, or, in short, **ScreenQA** annotations later in this work, and released the dataset in the public domain[^3]. The ScreenQA task requires an agent to answer a user’s question by selecting one or multiple UI elements from the given screenshot, as will be formulated in Section [sec:problem_setting]. Question answering is employed as a touchstone to sparsely[^4] verify the quality of screen content understanding. To the best of our knowledge, this is the first large-scale questions answering dataset over mobile app screenshots, and the first one to be publicly available. Much inspired by the SQuAD dataset [rajpurkarSQuAD1000002016a](https://doi.org/10.18653/v1/D16-1264), we hope, by releasing this set of annotations, to encourage the community to advance technologies toward better screen content understanding. We anticipate that the advance of such technologies will benefit beyond just the screen UI and the human computer interaction (HCI) domains. As we will discuss in Section [sec:related_work], other vision-language related multimodal domains share similar challenges with different emphases on respective modalities and contexts. 
Comparatively, ScreenQA is language and layout heavy, but it also includes visual ingredients such as icons and symbols as concise representations in place of texts, to declutter the UI. It may also include images or art designs that pose challenges to language centric machine learning agents. The remaining paper is organized in the following way: Section [sec:problem_setting] formulates the problem, including the problem description and the evaluation metrics. We discuss relevant prior datasets and annotations in Section [sec:related_work] to put this work into perspective. Section [sec:annotation_method] describes our annotation method. The annotations are then analyzed in Section [sec:annotation_analysis] to provide readers both the qualitative and quantitative views. The paper is concluded in Section [conclusion] with a summary and a remark on future works. [^1]: Also referred to as automation or app control. [^2]: This term is analogous to machine reading comprehension from natural language processing. [^3]: ScreenQA dataset is released at . [^4]: Because questions are not exhaustively asked against a given screenshot. # Problem Setting [sec:problem_setting]
Ambiguous UI element boundaries. Three possibilities are annotated.
Answers to “What’s the temperature on Saturday?”
Semantic groups are the basic unit for ordering, not element coordinates.
We state the problem and define the evaluation metrics in this section. ## Problem statement [sec:problem_statement] The ScreenQA task requires an agent to answer a user’s question by selecting relevant UI elements from a given single screenshot. When multiple UI elements are relevant, a list of such UI elements whose contents *minimally* satisfy the question should be selected and ranked in descending order of relevance to the question, if applicable, or following the common reading order by semantic groups, as will be described in Section 1.2. This assumes that answers are directly selectable from the screen and that logical reasoning and calculation are not needed. If the screenshot does not contain the answer to the question, the agent should return an empty answer list. This is summarized in Task [task:sreenqa].
**Input:** a question $Q$ and a screenshot $S$.
**Output:** an answer list $A$ of UI elements selected from $S$ such that their contents minimally satisfy $Q$. The order of $A$ is further required to be
- ranked in descending order of relevance to $Q$, if applicable;
- otherwise, following the common reading order by semantic groups.
If no contents in $S$ can satisfy $Q$, return an empty list $A$.
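To make the input/output contract above concrete, here is a minimal sketch of the data structures involved (the field names are illustrative and not the dataset's actual schema):

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class UIElement:
    text: str                         # text content of the selected UI element
    bbox: Tuple[int, int, int, int]   # (left, top, right, bottom) in pixels

@dataclass
class ScreenQAExample:
    question: str                     # natural-language question Q
    screenshot_path: str              # screenshot S (pixels; view hierarchy optional)
    answer: List[UIElement] = field(default_factory=list)  # ranked list A; empty if not answerable
```

A predicted `answer` list can then be scored against ground-truth annotations with the metrics defined in the next section.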
## Properties and terminologies [sec:properties_and_terminologies] The mobile app UI comes with some nuances. It is worth mentioning a few properties below.
- View hierarchy, or the structural representation used to render the screen, is not required in Task [task:sreenqa], to be consistent with the human annotation process in Section [sec:annotation_method]. View hierarchy usually provides useful UI element candidates, but it may not always be reliable, for example, when using WebView or screen overlays. In such cases, a human annotator can still answer screen questions entirely from pixels without an issue, so we want to benchmark similarly. We leave the choice of dependency on view hierarchies to the modelers and, hence, do not require it. However, this comes with an ambiguity for UI element boundaries. See an example in Figure 1. We devise a more flexible answer matching process to mitigate such an impact, as will be discussed in Section 1.3.3.
- Avoid question answering over long paragraphs. Although it is permitted by Task [task:sreenqa], we discourage annotators from asking such questions during the annotation process. For ScreenQA, we want to focus on learning the relationships between text segments arranged two-dimensionally on the screen, and leave the long paragraph question answering, which investigates the relationships between words, to the traditional NLP domain.
- Avoid logical reasoning. This task assumes answers can directly be extracted from the screenshot without reasoning, entailment, counting and comparing numbers. This further excludes yes/no and why questions if not explicitly displayed on the screen. The reason is that we want to separate “able to read” and “able to reason” and focus on the former first without generating an overly challenging dataset. A few such excluded examples are: counting items, asking about the weather a few days from now, what are the items cheaper than X dollars, etc.
- Ordered by relevance. The task is designed to enable the eyes-free user experience. That is, a user may not be fully aware of how many relevant answers are displayed on the screen. For example, in Figure 2, when a user asks “What’s the temperature on Saturday?”, there are actually two temperatures, high and low, for each day and two Saturdays on the screen. In this case, the two temperatures should just follow the reading order, and the two Saturdays follow the relevance order as a user usually refers to the upcoming Saturday. For a well-designed mobile app, these two usually overlap well and we do not expect a large ambiguity here.
- Reading order by semantic groups. Some UI elements are designed as semantic groups and should be referred to together to keep their semantic meaning. For example, in Figure 3, when a user asks “What are the first two movements of deep squat?”, the answer should be “Deep Squat, 3 sets, 15x”, followed by “Lunge, 3 sets, 10x”. In other words, the common reading order should use semantic groups as the unit, rather than simply sorting by the coordinates of UI elements.
Note that we set up the problem this way strategically in order to prioritize its solvability considering the progress of current technologies. However, practically, long vs. short texts and retrieval vs. reasoning are naturally mixed together in the daily usage of mobile apps. We will leave this type of composite problem to future work.
## Evaluation metrics [sec:metrics] We consider two types of metrics: 1) Average normalized Discounted Cumulative Gain (Average nDCG) [jarvelinCumulatedGainbasedEvaluation2002](https://doi.org/10.1145/582415.582418), which is commonly used in information retrieval and ranking systems, and 2) Average F1 score, which has been employed in closed-domain question answering problems, such as the SQuAD dataset [rajpurkarSQuAD1000002016a](https://doi.org/10.18653/v1/D16-1264). One major difference between our metrics described below and the commonly used definitions is the unit of predictions. We use the element in the answer list $A$, described in Task [task:sreenqa], as the unit to determine a hit or a miss for both metrics. Besides, as UI elements can be ambiguous as mentioned in Section 1.2, we will describe an answer matching algorithm that mitigates such an impact in Section 1.3.3. ### Average nDCG [sec:avg_ndcg] We use a variant of nDCG that allows varying positions (number of returns) as opposed to a typical fixed position. This is because, unlike the search problem, where it is fair to evaluate, say, the top-10 retrieved documents across queries, ScreenQA can have different needs of answer lengths across different questions. For example, a question like “what is the address of the store” expects a single returned result. A question like “what are the login options?” expects an enumeration of options on the screen that can easily go beyond five. Accordingly, we allow $v$arying positions as follows: Given a 1-indexed list $A$, which is the predicted answer for the screen-question pair $(S, Q)$, and a ground truth answer list $A^g$ for $(S, Q)$, the Discounted Cumulative Gain at $v$arying positions (DCG$_v$) is computed by: $$\label{eq:dcg} \mbox{DCG}_v = \sum^{\|A\|}_{i=1} \frac{r_i}{\log_2{(i+1)}},$$ where $\|\cdot\|$ is the size of the list argument, and $r_i$ is the relevance score of the $i$-th item of $A$. We assign a relevance score of 1 for a hit and 0 for a miss compared with the ground truth $A^g$. The corresponding Ideal Discounted Cumulative Gain (IDCG$_v$) is computed by: $$\label{eq:idcg} \mbox{IDCG}_v = \sum^{\|A^g\|}_{i=1} \frac{1}{\log_2{(i+1)}}.$$ The nDCG$_v$ is then $$\label{eq:ndcg} \mbox{nDCG}_v = \frac{\mbox{DCG}_v}{\mbox{IDCG}_v}.$$ Note that nDCG$_v$ is still between 0 and 1, hence convenient for comparing scores and computing the average. For a dataset of $N$ examples, each of which is indexed by $i$ and has a predicted answer $A_i$ and $K$ ground truth annotations $A^g_{i, j=1 \dots K}$, the average nDCG$_v$ can be computed by $$\label{eq:avg_ndcg} \mbox{avg}(\mbox{nDCG}_v) = \frac{1}{N}\sum_{i=1}^N \mbox{max}_j [ \mbox{nDCG}_v(A_i, A^g_{i,j} ) ].$$ We choose this variant of nDCG as the metric because 1) we want to measure the quality of the ranking. For example, if one predicts the correct result in the third position instead of the first, the discount factor brings the score down from 1.0 to only 0.5. 2) nDCG has an orthogonal design, which is easier to tweak toward a specific need than the mean average precision (mAP) metric. For example, one can choose to discount faster or slower by changing the base of the denominator $\log_2(i+1)$, and one can choose to penalize irrelevant predictions by assigning negative scores. Mean reciprocal rank (MRR) and mAP are much less controllable in these two aspects. One known drawback of nDCG is that it does not naturally penalize excessive predictions after the last relevant item.
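A minimal sketch of the metric as defined above (relevance 1 for a hit, 0 for a miss; the `hits` flags would come from the answer-matching procedure of Section 1.3.3):

```python
import math

def dcg_v(hits):
    """DCG at varying positions: hits[i] is 1 if the i-th predicted item is a hit, else 0."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(hits))  # i + 2: items are 1-indexed in the formula

def ndcg_v(hits, num_ground_truth):
    """nDCG_v of one prediction against one ground-truth annotation."""
    idcg = sum(1.0 / math.log2(i + 2) for i in range(num_ground_truth))
    return dcg_v(hits) / idcg if idcg > 0 else 0.0

def average_ndcg_v(scores_per_example):
    """scores_per_example[i] holds nDCG_v against each of the K annotations; take the max, then average."""
    return sum(max(scores) for scores in scores_per_example) / len(scores_per_example)
```

The drawback noted above (no penalty for excessive trailing predictions) is what motivates the complementary F$_1$ view discussed next.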
We therefore use the average F$_1$ score as a complementary view of the agent performance. ### Average F$_1$ [sec:avg_f1] Similar to the definition in SQuAD, the average F$_1$ score is computed as below, following the same notation as in [eq:avg_ndcg]: $$\label{eq:avg_f1} \mbox{avg}(\mbox{F}_1) = \frac{1}{N}\sum_{i=1}^N \mbox{max}_j [ \mbox{F}_1(A_i, A^g_{i,j} ) ].$$ Note that F$_1$ does not concern ranking. For some cases, such as enumeration questions, this is desirable, as the ranking order is merely the reading order, even if the item order is permuted, the answer quality is in general not compromised, hence, reasonable to be assigned the same evaluation score. On the contrary, if relevance ranking is important, such as in Figure 2, then nDCG provides a better view. Since both types of questions exist in the ScreenQA annotations, it is more complete to evaluate against both metrics. Also note that the unit of precision and recall computation is based on items in $A$, unlike SQuAD, which uses words as the unit instead. We describe how to compare items in an answer $A$ with the ground truth $A^g$ in the next section. ### Answer matching [sec:answer_matching] As mentioned in Section 1.2, the segmentation of UI elements provided in the predicted answer list $A$ may not coincide with the UI elements in the ground truth list $A^g$. Yet, if the overall answers are the same, the segmentation difference should not affect the evaluation score. Therefore, we use the following empirical procedure to mitigate such an impact, using an illustrated example (each capitalized character is a word token): $$\begin{aligned} A &= ["AB", "B", "BB", "CBA"] \\ A^g &= ["AB", "BC", "AB"], \end{aligned}$$ 1. Concatenate items in $A$ into a single item list $A^c = [``ABBBBCBA"]$. 2. Iterate through each $g \in A^g$ and check if $g$ is contained in any item in $A^c$. If so, mark the $g$ as HIT ($\cmark$) and mark the corresponding matched word token in the original $A$ and remove the matched part and split the remaining parts in $A^c$. Otherwise, mark the $g$ as MISS ($\xmark$). In this example, when $g = "AB"$, it is a HIT: $$\begin{aligned} A &= [``A_\cmark B_\cmark", ``B", ``BB", ``CBA"] \\ A^c &= [``BBBCBA"] \\ A^g &= [``AB"_\cmark, ``BC", ``AB"]. \end{aligned}$$ Then when $g = ``BC"$, it is a HIT. Note that the item in $A^c$ is split into two because of matching in the middle: $$\begin{aligned} A &= [``A_\cmark B_\cmark", ``B", ``BB_\cmark", ``C_\cmark BA"] \\ A^c &= [``BB", ``BA"] \\ A^g &= [``AB"_\cmark, ``BC"_\cmark, ``AB"]. \end{aligned}$$ Last, when $g = ``AB"$ again, it is a MISS, $A$ and $A^c$ unchanged, hence, omitted: $$\begin{aligned} A^g &= [``AB"_\cmark, ``BC"_\cmark, ``AB"_\xmark]. \end{aligned}$$ 3. Finally, iterate through each $a \in A$. If any $a$ has at least one word token marked as HIT, then the whole $a$ is a HIT, otherwise, a MISS. $$\begin{aligned} A &= [``AB"_\cmark, ``B"_\xmark, ``BB"_\cmark, ``CBA"_\cmark]. \end{aligned}$$ This procedure converts $A$ and $A^g$ into lists of HITs and MISSes. Then the evaluation metrics in [eq:avg_ndcg] and [eq:avg_f1] can be applied. Note that this procedure is not order invariant. This in turn makes the F$_1$ score not entirely order independent if any UI element ambiguity happens. This choice is to avoid the permutation complexity in evaluation. In practice, this is rarely an issue because when the ambiguity happens, the UI elements involved are almost always tightly close to each other, making their order practically fixed. 
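Below is a minimal sketch of the answer-matching procedure, following our reading of the three steps above (items are lists of word tokens; this is illustrative code, not the authors' reference implementation). The returned HIT/MISS flags can then feed the nDCG$_v$ and F$_1$ computations, e.g., precision as hits in $A$ over $\|A\|$ and recall as hits in $A^g$ over $\|A^g\|$.

```python
def match_answers(pred, gold):
    """HIT/MISS matching between a predicted answer list and a ground-truth list.
    pred, gold: lists of items, where each item is a list of word tokens.
    Returns (pred_hit, gold_hit): boolean flags per predicted / ground-truth item."""
    # Step 1: concatenate predicted items into one token sequence, remembering
    # which predicted item each token came from.
    segments = [[(tok, i) for i, item in enumerate(pred) for tok in item]]
    pred_hit = [False] * len(pred)
    gold_hit = [False] * len(gold)

    # Step 2: look for each ground-truth item as a contiguous token run in A^c;
    # on a hit, remove the matched span and split the remaining segment.
    for g_idx, g in enumerate(gold):
        for s_idx, seg in enumerate(segments):
            tokens = [tok for tok, _ in seg]
            for start in range(len(tokens) - len(g) + 1):
                if tokens[start:start + len(g)] == g:
                    gold_hit[g_idx] = True
                    for _, item_idx in seg[start:start + len(g)]:
                        pred_hit[item_idx] = True  # token-level HIT for step 3
                    segments[s_idx:s_idx + 1] = [
                        s for s in (seg[:start], seg[start + len(g):]) if s]
                    break
            if gold_hit[g_idx]:
                break
    # Step 3: a predicted item is a HIT if any of its tokens was matched.
    return pred_hit, gold_hit

# The worked example above: A = ["AB", "B", "BB", "CBA"], A^g = ["AB", "BC", "AB"].
pred = [list("AB"), list("B"), list("BB"), list("CBA")]
gold = [list("AB"), list("BC"), list("AB")]
print(match_answers(pred, gold))
# ([True, False, True, True], [True, True, False])
```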
See Case 3 in Figure 1 as an example. # Related Datasets and Annotations [sec:related_work] ScreenQA has two aspects: multimodality and question answering. We discuss related problems and datasets from these two aspects and focus our survey on datasets that are 1) human annotated and 2) released to the public domain. ## Multimodality Mobile app screenshots contain nearly all possible representations of information through pixels. Most commonly, the information is conveyed mainly by text, blended with icons, symbols, and images.[^1] We discuss three related multimodal domains. ### Screen UI for mobile apps For data released in the public domain, the RICO dataset [dekaRicoMobileApp2017](https://doi.org/10.1145/3126594.3126651) is, to the best of our knowledge, still the largest collection of mobile app screenshots [dekaEarlyRicoRetrospective2021](https://doi.org/10.1007/978-3-030-82681-9_8). It contains 66k unique screenshots and their corresponding view hierarchies from 9.7k Android apps spanning 27 app categories. Its overall approach extended ERICA [dekaERICAInteractionMining2016](https://doi.org/10.1145/2984511.2984581), an interactive trace recording tool, which also released 3k traces for 18k unique screenshots from 1k Android apps for the search intent. LabelDroid [chenUnblindYourApps2020](https://doi.org/10.1145/3377811.3380327) and [chenWireframebasedUIDesign2020](https://doi.org/10.1145/3391613) by the same authors released a dataset of 55k UI screenshots from 25 categories of 7.7k top-downloaded Android apps. Annotations and the corresponding problems can be roughly categorized by the scope of the contexts. At the UI element level, [sunkaraBetterSemanticUnderstanding2022](None) annotated 77 icon types by shape, 15 of which are additionally annotated with 38 semantic types, reaching about 500k unique annotations in total. This work is further concerned with how UI elements are associated with companion labels such that the screen understanding between UI elements can be established. CLAY [liLearningDenoiseRaw2022](https://doi.org/10.1145/3491102.3502042) attempted to resolve the layout and view hierarchy denoising problem, annotating 60k RICO screenshots, a total of 1.4M UI elements with bounding boxes and types. [liWidgetCaptioningGenerating2020](https://doi.org/10.18653/v1/2020.emnlp-main.443) annotated 163k free-form descriptions for 61k UI elements from 22k RICO screenshots. At the single-screen level, [wangScreen2WordsAutomaticMobile2021](https://doi.org/10.1145/3472749.3474765) collected text summarizations for screens, consisting of 112k screen descriptions across 22k RICO screenshots. At the multi-screen level, one challenging direction is screen navigation, which requires the understanding of screen states, feasible action spaces of the current screen, and overall task goals. Since multiple types of understandings are involved, this problem is not strictly focused on screen content understanding. PixelHelp [liMappingNaturalLanguage2020b](https://doi.org/10.18653/v1/2020.acl-main.729) contains 187 multi-step instructions over 780 screenshots for four task categories. MoTIF [burnsDatasetInteractiveVisionLanguage2022](https://doi.org/10.48550/arXiv.2202.02312) contains 6k fine-grained instructions mixed with infeasible ones, for 125 apps spanning 15 app categories. From the data perspective, annotating this type of problem is labor intensive and usually does not scale well.
In comparison, the ScreenQA dataset is single-screen, focused on screen contents, and based on the RICO screenshots. ### Document image understanding Document image understanding[^2] concerns understanding documents represented in pixels or scanned, photographed formats. This domain is similar to mobile app screens for its text-heavy and non-sequential nature. The most noticeable dataset is RVL-CDIP [harleyEvaluationDeepConvolutional2015](https://doi.org/10.1109/ICDAR.2015.7333910), a 400k-image subset from IIT-CDIP [lewisBuildingTestCollection2006](https://doi.org/10.1145/1148170.1148307), a collection of low-resolution noisy documents, with balanced 16 document-level classes. FUNSD [jaumeFUNSDDatasetForm2019](https://arxiv.org/pdf/1905.13538) extracted a 199 scanned form images from RVL-CDIP and annotated them with bounding boxes and 4 text-segment-level classes. SROIE [huangICDAR2019CompetitionScanned2019](https://doi.org/10.1109/ICDAR.2019.00244) has 1k scanned receipt images for text localization, OCR, and key information extraction of 4 entity types. CORD [parkCORDConsolidatedReceipt2019](None) contains 11k scanned receipt images, annotated with 9 classes and 54 subclasses for text segments in OCR boxes. These earlier works are more about classification for text segments or for the whole document image. A more recent work, DocVQA [mathewDocVQADatasetVQA2021](https://doi.org/10.1109/WACV48630.2021.00225), uses a question answering format for span/segment extraction, with an annotation of 50k questions over 12k rectified, higher resolution document images. DocVQA is highly related to ScreenQA for its 2D arrangement of texts and for its extractive question answering format. We believe that the techniques developed for screens and document images are cross applicable. ### Visual question answering Visual question answering (VQA) [antolVQAVisualQuestion2015](https://doi.org/10.1109/ICCV.2015.279) and screen UI are oftentimes mentioned together, especially in the latter community, because of their vision-language multimodal nature. However, VQA is distinctively different from screen understanding for two reasons: 1) The visual context for VQA is usually light in, or even free from, any text, while screen UI is the opposite, and 2) The images for VQA are typically photos of natural or daily scenes with objects, while screen UIs are information oriented and arranged in a certain visual structure. There are some VQA variants comparatively closer to screen UI, to mention a few: VQA for texts on objects in photos, e.g., VizWiz [gurariVizWizGrandChallenge2018](https://doi.org/10.1109/CVPR.2018.00380) and TextVQA [singhVQAModelsThat2019](https://doi.org/10.1109/CVPR.2019.00851), and VQA for figures and charts, e.g., DVQA [kafleDVQAUnderstandingData2018](https://doi.org/10.1109/CVPR.2018.00592), FigureQA [kahouFigureQAAnnotatedFigure2018](None), and LEAF-QA [chaudhryLEAFQALocateEncode2020](https://doi.org/10.1109/WACV45572.2020.9093269). These VQA tasks may appear as part of screens but in general are different problems. 
## Question answering Question answering tasks can be categorized by 1) open- or closed-domain, 2) answer formats and 3) main capacities to evaluate.[^3] The common answer formats include span [rajpurkarSQuAD1000002016a](https://doi.org/10.18653/v1/D16-1264), entity [talmorWebKnowledgeBaseAnswering2018](https://doi.org/10.18653/v1/N18-1059), multiple choice [mihaylovCanSuitArmor2018](https://doi.org/10.18653/v1/D18-1260), and generation [xiongTWEETQASocialMedia2019](https://doi.org/10.18653/v1/P19-1496). The capacities to evaluate range from reading comprehension [yangWikiQAChallengeDataset2015](https://doi.org/10.18653/v1/D15-1237), multi-hop reasoning [yangHotpotQADatasetDiverse2018](https://doi.org/10.18653/v1/D18-1259), [chenFinQADatasetNumerical2021](https://doi.org/10.18653/v1/2021.emnlp-main.300), logic reasoning [yuReClorReadingComprehension2020](None), and commonsense reasoning [talmorCommonsenseQAQuestionAnswering2019](https://doi.org/10.18653/v1/N19-1421). From this question answering perspective, ScreenQA is a closed-domain question answering task that expects answers by span (or UI element phrase) selection for screen reading comprehension. As described in Section [sec:problem_setting], we instructed the data annotators to avoid multi-hop, mathematical counting, and logic reasoning, in order to focus on the fundamental screen comprehension capacity. [^1]: Also videos, if we consider consecutive screenshots. We leave out the video modality here in the context of annotating the underlying RICO screenshots. [^2]: Also referred to as document analysis and recognition (DAR) or simply document understanding. [^3]: Here we only include one or two examples per format and per capacity. This is by no means to be comprehensive. # Annotation Method [sec:annotation_method]
ScreenQA annotation process.
We perform several steps to collect the ScreenQA annotations, as depicted in Figure 1. Each step is described below.
In-sync VH with occluded UI elements.
In-sync VH for main content, with ghosting VH from hamburger menu.
Out-of-sync VH for main content, though top bar VH is in-sync.
## Pre-filtering [sec:pre-filtering] The pre-filtering stage filters out 1) screenshots from non-English apps[^1], and 2) screenshots whose view hierarchies (VHs) are out of sync with the main contents. It is a known issue that in the RICO dataset, some screenshots and their corresponding view hierarchies are not perfectly in sync: there exists a certain time difference between view hierarchy extraction and screenshot capturing. We want to remove those screenshots to ensure that all ScreenQA annotations are not subject to such data noise. Classifying the sync quality is tricky, even for human readers. One may not be able to differentiate between occlusion, ghosting, and actual out-of-sync cases. See Figure [fig:vh-sync] for examples. Accordingly, we instructed the annotators to focus on the main content area of the screen and make sure the bounding boxes in that area are not corrupted, as this is where most contents of interest and questions come from. We used 27 annotators to perform this step. Among RICO’s 66k unique screenshots, about 11k screenshots are from non-English apps, and about 13k screenshots have out-of-sync view hierarchies.[^2] With the union of these two filtered out, there remain about 51k screenshots from English apps with in-sync VHs. ## Question annotations [sec:question-annotation] For question annotation, we asked the annotators to frame questions given a screenshot as the context. The annotators were expected to compose 1) natural, daily-life questions as if they were using the app; 2) questions that ask for information that can be directly read off the screen; and 3) questions that do not require logical reasoning, counting, calculation, mathematical comparison, etc. We further required the annotators 4) not to ask questions about any advertisement on the screen. The annotation UI is depicted in Appendix [appendix:question_annotation_ui]. We asked the annotators to compose up to five questions given a screenshot in the first pass. In the second pass, we asked for up to three questions given a screenshot and the questions previously composed. Each pass involved one annotator for each screenshot, and whoever annotated the screenshot before was excluded from being assigned to the same screenshot. This ensures that every screenshot is assigned precisely two annotators to compose questions. We chose this sequential process 1) to avoid tricky deduplication of similar questions, and 2) to encourage annotators to diversify their questions. Note that the same set of annotators were involved in both passes such that each annotator had an opportunity to develop their own question style in the first pass before seeing others’ in the second pass. This ensures a certain diversity of question styles in the dataset before they converge to each other in repeated passes. We again involved the 27 annotators. The first pass of question annotation generated 46k questions. The second pass added an additional 36k questions. These amount to a total of 82k questions, leaving about 15k screenshots with no questions annotated, due to a lack of interesting content. ## Answer annotations [sec:answer-annotation] We used the total of 82k questions over 35k screenshots from the previous two-pass question annotation step to further annotate the corresponding answers. The annotator who composed a question was excluded from annotating its answer to avoid potential biases. The answer annotation UI is shown in Appendix [appendix:answer_annotation_ui].
Given an example, which contains a screenshot and a question, the annotators are tasked to
1. Fix any grammatical errors or typos in the given question without altering its intention.
2. Answer the question, based on the context of the given screenshot, by 1) selecting bounding boxes from the underlying view hierarchy leaf nodes that contain the relevant answers, or drawing bounding boxes if no suitable leaf nodes can be used, and 2) ranking the answers in descending order of relevance if applicable, or by the common reading order.
3. Additionally, provide a full-sentence answer to the question.
4. Consider two exceptions: 1) the question may be incomprehensible, or 2) the screenshot may not contain the answer to the question, due to the questioner’s lack of understanding of the app. The example should then be marked as “invalid question” or “not answerable from the screenshot”, respectively.
5. One answer is annotated for the train split, and three for the validation and the test splits, to improve the evaluation quality. The data split details will be described in Section 1.5.
The “invalid question” annotations are then filtered out, and the questions that have no other answer annotations are excluded from the overall ScreenQA dataset, as they are considered incorrectly annotated during the question annotation phase. ## Not-answerable question annotations [sec:not-answerable-question-annotation]
Chart showing the fraction of questions with answers versus those marked not answerable. Note that the validation and test splits have roughly 3 answers per question on average, so there are cases where some annotators considered a question not answerable while others provided an answer to that same question. Specifically, the validation and the test splits have 2.18% and 3.23% of such questions (the segments in red).
Question: ‘What is the name of the application?’
Question: ‘What is the contact email for tech support?’
Question: ‘What is the date of version 1.3.1?’
The questions marked as “not answerable from the screenshot” represent a special category of questions that check model overtriggering (attempting to answer those which are not supposed to be answered). Being able to come to a conclusion that the answer is not present on the screen is an important aspect of screen understanding. Note that it is possible that one annotator considered a question to be not answerable, and another provided an answer to that same question. As described in Section 1.2, the first two passes of question annotations aimed to compose questions that can be answered from the screen, so as expected, the fraction of not answerable questions was small. We then had a third pass of question annotation to raise this fraction to nearly 10%, see Figure 5. For this, we used nearly 5k screenshots selected randomly from those where there were no such questions yet. In this pass, we asked annotators for exactly one additional question per screenshot that had some relation to the information there, but could not be answered. See examples in Figure [fig:no-answer-examples]. Answer annotation was not used for these 5k questions. ## Dataset statistics [sec:dataset-statistics]
| | Screenshots | Questions | |:-----------|------------:|----------:| | Train | $28,378$ | $68,980$ | | Validation | $3,485$ | $8,618$ | | Test | $3,489$ | $8,427$ | | Total | $35,352$ | $86,025$ | ScreenQA dataset split stats.
The ScreenQA dataset contains 35,352 screenshots and 86,025 questions. It is split into train, validation and test sets in approximately 80-10-10 ratio, see Table 1. Note that all questions for the same screenshot belong to only one split. [^1]: This is different from “non-English screenshots”, as translation and dictionary apps could pose confusion. [^2]: This out-of-sync number is different from [liMappingNaturalLanguage2020a](https://doi.org/10.18653/v1/2020.acl-main.729) because we focus on the main content area. # Annotation Analysis [sec:annotation_analysis] We analyze the annotations of questions and answers in this section.
| Category | % | Examples | |
|:---|---:|:---|:---|
| UI selection & config | 18.1 | Which option is selected? | What is the selected ringtone? |
| Quantity number | 11.7 | How many unread messages? | How many pictures are there in Western Europe? |
| App name | 10.4 | What is the name of the application? | What is the app name? |
| Date time | 9.4 | When was “Heal the Living” released? | When is happy hour? |
| Price | 3.4 | How much is the gift bonus in 3rd place? | What is the price? |
| Name of item | 3.3 | What is the name of the drug? | What is the name of chef? |
| User name | 2.8 | What is the name of the user? | What is the username on telegram? |
| Duration | 2.5 | What is the duration of video? | How long is the song? |
| Enum. of avail. options | 2.5 | Which social media options are given there? | What are the options available for logging in? |
| Address and direction | 2.4 | What is the current location? | What is the service zip code? |
| Email address | 2.4 | What is an email address? | What is customer service email? |
| Person’s name | 2.1 | Who sang the song? | What is the last name? |
| Signup/login | 1.6 | Which application can be used to sign up / login? | What are the alternative choices for signing up? |
| Version information | 1.6 | What is the version number? | What is the new feature in version v3.1.3? |
| Weather | 1.5 | What is the range of temperature shown on Sunday? | What is the weather forecast for Sunday? |
| Score & value | 1.4 | What is height/weight of the person? | What is the score? |
| Yes/No | 1.1 | Is there any travel plans? | Is there any favorite? |
| Phone number | 1.0 | What is the phone number? | What is the prefix for the international mobile number? |
| \# of Stars | 0.8 | What is the star rating? | How many stars are given to the product? |
| Share/sharing | 0.8 | Which application can be used to share? | Where can I share this application? |
| Age | 0.8 | How old is ...? | What is the age? |
| Percentage | 0.7 | What is the percentage of ... ? | What is the brightness percentage for foreground? |
| Settings | 0.6 | What is the setting of ... ? | Which settings are switched on? |
| Quantity amount | 0.6 | How much fat is there? | What is the amount? |
| Permission | 0.5 | Which application is asking for permissions? | What permissions are required for MyCarTracks? |
| \# of Likes | 0.5 | How many likes for ... ? | How many likes does ... get? |
| Country | 0.5 | What is the name of the country? | Which country has the +54 code? |
| Distance | 0.5 | What is the visibility distance? | How far is it from ... ? |
| \# of Reviews | 0.4 | What is the number of comments on ... ? | How many comments? |
| Website | 0.3 | What is the url? | What’s the website address? |
| Gender | 0.3 | What is the gender? | Which gender is displayed on the screen? |
| How to | 0.3 | How to start on boot? | How to pronounce his name? |
| Currency | 0.3 | What is the currency? | What is the currency for the price? |
| Unit of measurement | 0.2 | What is the unit of temperature? | What is the unit of weight and length? |
| Language | 0.1 | Which language is used in the setting? | Which language is being translated into which language? |
| Color | 0.0 | What is the UI color? | What is the amount of green color? |
| Others | 12.8 | What’s the average speed? | What is the user’s middle initial? |
| | | What is the spending limit? | Which team has 41 points? |
| Total | 100.0 | | |
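The categories in the table above were assigned using regular expressions over the question text, as described in Section [sec:question-analysis]. A minimal illustrative sketch is shown below; the patterns are guesses matched against the table's examples, not the authors' actual rules:

```python
import re

# Illustrative patterns only; the real category list is far longer (see the table above).
CATEGORY_PATTERNS = [
    ("App name", re.compile(r"\bname of the (app|application)\b", re.I)),
    ("Price", re.compile(r"\b(price|how much)\b", re.I)),
    ("Date time", re.compile(r"\bwhen\b|\bwhat (date|time)\b", re.I)),
    ("Email address", re.compile(r"\bemail\b", re.I)),
    ("Quantity number", re.compile(r"\bhow many\b", re.I)),
]

def categorize(question: str) -> str:
    for category, pattern in CATEGORY_PATTERNS:
        if pattern.search(question):
            return category
    return "Others"

print(categorize("What is the name of the application?"))      # App name
print(categorize("How much is the gift bonus in 3rd place?"))  # Price
print(categorize("How many unread messages?"))                 # Quantity number
```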
Number of composed questions per screenshot.
Number of bounding boxes used to answer the question.
## Question analysis [sec:question-analysis] We collected overall 86k questions over 35k unique screenshots from RICO. Among the 86k questions, there are 47.5k unique questions.[^1] Some screenshots receive more questions because they usually contain more information to be asked about. Yet, the histogram still exhibits a reasonable exponential decay with a mild slope, as depicted in Figure 1. To further understand what questions have been asked, we categorize the questions using regular expressions based on a list of empirically determined question categories. The categories are meant to provide a rough overview of the question annotations and by no means to provide a precise categorization. The distribution and examples by these categories are tabulated in Table [tab:q_cate]. Note that the questions were not composed at the annotators’ full discretion: They are conditioned on the given screenshots. That is to say, the distribution is implicitly influenced by the RICO crawling process. For example, as RICO crawled screen traces from freshly installed apps and did not login an account, a noticeable number of the screen traces end at a login page. This in turn translates to a higher percentage of questions asked about app names, email addresses, permissions to login, etc. ## Answer analysis [sec:answer-analysis] We analyze the answer annotations in two aspects: 1) How often do we need more than one bounding box and its text to answer the question, and 2) How often do human annotators find the view hierarchy useful to provide a minimal answer to the question. Figure 2 illustrates the histogram of number of bounding boxes used in each answer. About 84% of answers contain a single bounding box. Among these single-bounding-box answers, 51% uses a VH leaf node directly, while 49% uses a manually drawn bounding box. If we consider all answers together, excluding cases when there is no answer, still 51% uses VH leaf nodes entirely, while 48% uses manually drawn bounding boxes. That is, for about half of the total number of screenshots, human annotators preferred to manually draw the bounding boxes in order to provide answers that minimally satisfy the question. This observation reflects the necessity not to require the view hierarchy input for ScreenQA as described in Task [task:sreenqa]. Interestingly, there exist some cases, about 0.8% of the questions, that the human annotators used a mixture of VH leaf nodes and manually drawn bounding boxes as their full answer. By inspecting those cases, we found that these usually happen 1) when the answer is an enumeration of “inhomogeneous” options that are organized differently on the screen, such as using email vs. other APIs to login, and 2) when an answer needs multiple parts to be complete, such as a date consisting of year, month, and day scattered on the calendar UI, and a temperature or a measurement requiring a number followed by the corresponding unit. These parts may not be displayed in the same way, resulting in lack of useful VH leaf nodes for some of the parts. [^1]: Note that it is natural and valid to ask the same common questions over various screenshots, for example, “Which option is selected on the screen?” and “What is the email address?” # Baselines [sec:baseline] # Results [sec:result] # Conclusion In this work, we proposed the ScreenQA task. We annotated a large-scale ScreenQA dataset, which contains 86,025 question-answer pairs. 
Compared to other vision-language multimodal problems, such as document image understanding and visual question answering, ScreenQA poses its unique challenges: rich in text, diverse in apps, and blended with icons and symbols. We hope to use the ScreenQA task and the dataset to encourage the community to look into this screen content understanding problem, as it enables new technologies and new user experiences. # Acknowledgements The authors would like to thank Srinivas Sunkara for his valuable discussions and comments on this manuscript. # Data annotation interfaces for question and answer collection [appendix:annotation_ui] ## Question annotation interface [appendix:question_annotation_ui]
image
The question annotation interface is shown in Figure [fig:question-annotation-ui]. Question annotation was performed in a sequential manner: later, non-overlapping annotators could see all previously composed questions, which helped diversify question framing and avoid duplication. We also used the sequential process to provide more feedback and training to the annotators for quality improvement. ## Answer annotation interface [appendix:answer_annotation_ui]
image
The answer annotation interface is shown in Figure [fig:answer-annotation-ui]. Answer annotators were tasked to determine whether the question is valid and whether it is answerable from the screen context. If both conditions hold, the annotators answer the question by 1) selecting or drawing the bounding boxes of UI elements, 2) filling in the text for each selected/drawn bounding box on the right, and 3) ranking them appropriately. The annotators were also tasked to review the question and make necessary corrections if it contains grammatical errors or typos.
DocVQA: A Dataset for VQA on Document Images 2020-07-01 Minesh Mathew, Dimosthenis Karatzas, C. V. Jawahar
We present a new dataset for Visual Question Answering (VQA) on document images called DocVQA. The dataset consists of 50,000 questions defined on 12,000+ document images. A detailed analysis of the dataset in comparison with similar datasets for VQA and reading comprehension is presented. We report several baseline results by adopting existing VQA and reading comprehension models. Although the existing models perform reasonably well on certain types of questions, there is a large performance gap compared to human performance (94.36% accuracy). The models particularly need to improve on questions where understanding the structure of the document is crucial. The dataset, code and leaderboard are available at docvqa.org
Show Paper Content
# Introduction Research in Document Analysis and Recognition (DAR) is generally focused on information extraction tasks that aim to convert information in document images into machine readable form, such as character recognition [doermann2014handbook](http://arxiv.org/pdf/1509.03456v1), table extraction [kavasidis2019saliency](http://arxiv.org/pdf/1804.06236v1) or key-value pair extraction [palm2017cloudscan](http://arxiv.org/pdf/1708.07403v1). Such algorithms tend to be designed as task specific blocks, blind to the end-purpose the extracted information will be used for. Progressing independently in such information extraction processes has been quite successful, although it is not necessarily true that holistic document image understanding can be achieved through a simple constructionist approach, building upon such modules. The scale and complexity of the task introduce difficulties that require a different point of view. In this article we introduce Document Visual Question Answering (DocVQA), as a high-level task dynamically driving DAR algorithms to conditionally interpret document images. By doing so, we seek to inspire a “purpose-driven” point of view in DAR research.
image
Q: Mention the ZIP code written? A: 80202
Q: What date is seen on the seal at the top of the letter? A: 23 sep 1970
Q: Which company address is mentioned on the letter? A: Great western sugar Co.
Example question-answer pairs from DocVQA. Answering questions in the new dataset requires models not just to read text but to interpret it within the layout/structure of the document.
In the case of Document VQA, as illustrated in Figure 1, an intelligent reading system is expected to respond to ad-hoc requests for information, expressed as natural language questions by human users. To do so, reading systems should not only extract and interpret the textual (handwritten, typewritten or printed) content of the document images, but exploit numerous other visual cues including layout (page structure, forms, tables), non-textual elements (marks, tick boxes, separators, diagrams) and style (font, colours, highlighting), to mention just a few. Departing from generic VQA [vqa2](https://arxiv.org/pdf/1612.00837) and Scene Text VQA [textvqa](http://arxiv.org/pdf/1811.11903v1), [stvqa_iccv](http://arxiv.org/pdf/2304.01603v1) approaches, document images warrant a different approach to exploit all the above visual cues, making use of prior knowledge of the implicit written communication conventions used, and dealing with the high-density semantic information conveyed in such images. Answers in document VQA cannot be sourced from a closed dictionary; they are inherently open ended. Previous approaches to bringing VQA to the documents domain have either focused on specific document elements such as data visualisations [dvqa](http://arxiv.org/pdf/1810.02358v2), [kahou2017figureqa](http://arxiv.org/pdf/2109.02226v1) or on specific collections such as book covers [mishra2019ocr](None). In contrast to such approaches, we recast the problem to its generic form, and put forward a large scale, varied collection of real documents. The main contributions of this work can be summarized as follows:
- We introduce DocVQA, a large scale dataset of $12,767$ document images of varied types and content, over which we have defined $50,000$ questions and answers. The questions defined are categorised based on their reasoning requirements, allowing us to analyze how DocVQA methods fare for different question types.
- We define and evaluate various baseline methods over the DocVQA dataset, ranging from simple heuristic methods and human performance analysis that allow us to define upper performance bounds given different assumptions, to state of the art Scene Text VQA models and NLP models.
# Related Datasets and Tasks Machine reading comprehension (MRC) and open-domain question answering (QA) are two problems which are being actively pursued by the Natural Language Processing (NLP) and Information Retrieval (IR) communities. In MRC, the task is to answer a natural language question given a paragraph (or a single document) as the context. In open-domain QA, no specific context is given and the answer needs to be found from a large collection (say, Wikipedia) or from the Web. MRC is often modelled as an extractive QA problem where the answer is defined as a span of the context on which the question is defined. Examples of datasets for extractive QA include SQuAD 1.1 [squad](http://arxiv.org/pdf/1606.02270v2), NewsQA [newsqa](None) and Natural Questions [naturalquestions](http://arxiv.org/pdf/2105.00811v1). MS MARCO [ms_marco](http://arxiv.org/pdf/1611.09268v3) is an example of a QA dataset for abstractive QA where answers need to be generated rather than extracted. Recently, Transformer-based pretraining methods like Bidirectional Encoder Representations from Transformers (BERT) [bert](None) and XLNet [xlnet](http://arxiv.org/pdf/1906.08237v2) have helped to build QA models outperforming humans on reading comprehension on SQuAD [squad](http://arxiv.org/pdf/1606.02270v2).
In contrast to QA in NLP where context is given as computer readable strings, contexts in case of DocVQA are document images. Visual Question Answering (VQA) aims to provide an accurate natural language answer given an image and a natural language question. VQA has attracted an intense research effort over the past few years [vqa2](https://arxiv.org/pdf/1612.00837), [agrawal2017c](None), [johnson2017clevr](http://arxiv.org/pdf/1612.06890v1). Out of a large body of work on VQA, scene text VQA branch is the most related to our work. Scene text VQA refers to VQA systems aiming to deal with cases where understanding scene text instances is necessary to respond to the questions posed. The ST-VQA [stvqa_iccv](http://arxiv.org/pdf/2304.01603v1) and TextVQA [textvqa](http://arxiv.org/pdf/1811.11903v1) datasets were introduced in parallel in 2019 and were quickly followed by more research [singh2019strings](http://arxiv.org/pdf/1904.08920v2), [gao2020multi](http://arxiv.org/pdf/2003.13962v1), [wang2020general](http://arxiv.org/pdf/2002.10215v2). The ST-VQA dataset [stvqa_iccv](http://arxiv.org/pdf/2304.01603v1) has $31,000\texttt{+}$ questions over $23,000\texttt{+}$ images collected from different public data sets. The TextVQA dataset [textvqa](http://arxiv.org/pdf/1811.11903v1) has $45,000\texttt{+}$ questions over $28,000\texttt{+}$ images sampled from specific categories of the OpenImages dataset [OpenImages2](http://arxiv.org/pdf/1809.05929v7) that are expected to contain text. Another dataset named OCR-VQA [mishra2019ocr](None) comprises more than 1 million question-answer pairs over 207K+ images of book covers. The questions in this dataset are domain specific, generated based on template questions and answers extracted from available metadata.
Industry-wise distribution of the documents.
Year wise distribution of the documents.
Various types of documents used.
Scene text VQA methods [m4c](http://arxiv.org/pdf/1911.06258v3), [gao2020multi](http://arxiv.org/pdf/2003.13962v1), [textvqa](http://arxiv.org/pdf/1811.11903v1), [gomez2020multimodal](http://arxiv.org/pdf/2006.00923v2) typically make use of pointer mechanisms in order to deal with out-of-vocabulary (OOV) words appearing in the image and to provide the open answer space required. This goes hand in hand with the use of word embeddings capable of encoding OOV words into a pre-defined semantic space, such as FastText [bojanowski2017enriching](http://arxiv.org/pdf/2102.02270v2) or BERT [bert](None). More recent, top-performing methods in this space include the M4C [m4c](http://arxiv.org/pdf/1911.06258v3) and MM-GNN [gao2020multi](http://arxiv.org/pdf/2003.13962v1) models. In parallel, there has been work on certain domain-specific VQA tasks that require reading and understanding text in images. The DVQA dataset presented by Kafle et al. [kafle2020answering](http://arxiv.org/pdf/1908.01801v2), [dvqa](http://arxiv.org/pdf/1810.02358v2) comprises synthetically generated images of bar charts and template questions defined automatically based on the bar chart metadata. The dataset contains more than three million question-answer pairs over 300,000 images. FigureQA [kahou2017figureqa](http://arxiv.org/pdf/2109.02226v1) comprises over one million yes/no questions grounded on over 100,000 images. Three different types of charts are used: bar, pie and line charts. Similar to DVQA, the images are synthetically generated and the questions are generated from templates. Another related QA task is Textbook Question Answering (TQA) [textbookqa](http://arxiv.org/pdf/2010.00562v1), where multiple choice questions are asked on multimodal context, including text, diagrams and images; here the textual information is provided in computer-readable format. Compared to these existing datasets, which concern either VQA on real-world images or domain-specific VQA for charts or book covers, the proposed DocVQA comprises document images. The dataset covers a multitude of different document types that include elements like tables, forms and figures, as well as a range of different textual, graphical and structural elements.

# DocVQA

In this section we explain the data collection and annotation process and present statistics and analysis of DocVQA.

## Data Collection

**Document Images:** Images in the dataset are sourced from documents in the UCSF Industry Documents Library[^1]. The documents are organized under different industries and further under different collections. We downloaded documents from different collections and hand-picked pages from these documents for use in the dataset. The majority of documents in the library are binarized, and the binarization has taken a toll on image quality. We tried to minimize the number of binarized images in DocVQA since we did not want poor image quality to be a bottleneck for VQA. We also prioritized pages with tables, forms, lists and figures over pages containing only running text. The final set of images in the dataset is drawn from pages of $6,071$ industry documents. We made use of documents from as early as 1900 to as recent as 2018 ([fig:doc_year_distr]). Most of the documents are from the 1960-2000 period and they include typewritten, printed, handwritten and born-digital text. There are documents from all 5 major industries for which the library hosts documents: tobacco, food, drug, fossil fuel and chemical.
We use many documents from food- and nutrition-related collections, as they have a good number of non-binarized images. See [fig:industry_distr] for the industry-wise distribution of the $6,071$ documents used. The documents comprise a wide variety of document types, as shown in [fig:doc_type_distr]. **Questions and Answers:** Questions and answers on the selected document images are collected with the help of remote workers, using a Web based annotation tool. The annotation process was organized in three stages. In stage 1, workers were shown a document image and asked to define at most 10 question-answer pairs on it. We encouraged the workers to add more than one ground truth answer per question in cases where this is warranted.
The 9 question types and share of questions in each type.
Workers were instructed to ask questions that can be answered using text present in the image and to enter the answer verbatim from the document. This makes VQA on the DocVQA dataset an extractive QA problem, similar to extractive QA tasks in NLP [squad](http://arxiv.org/pdf/1606.02270v2), [newsqa](None) and to VQA in the case of ST-VQA [stvqa_iccv](http://arxiv.org/pdf/2304.01603v1).
Top 15 most frequent questions.
Top 15 most frequent answers.
Top 15 non numeric answers.
Questions with a particular length.
Answers with a particular length.
Images/contexts with a particular length.
The second annotation stage aims to verify the data collected in the first stage. Here a worker was shown an image and the questions defined on it in the first stage (but not the answers from the first stage), and was required to enter answers for the questions. In this stage workers were also required to assign one or more question types to each question. The different question types in DocVQA are discussed in [sec:stats_analysis]. During the second stage, if the worker found a question inapt owing to language issues or ambiguity, an option to flag the question was provided. Such questions are not included in the dataset. If none of the answers entered in the first stage match exactly with any of the answers from the second stage, the question is sent for review in a third stage, where questions and answers are editable and the reviewer either accepts the question-answer pair (after editing if necessary) or ignores it. The third-stage review was done by the authors themselves.

## Statistics and Analysis [sec:stats_analysis]

DocVQA comprises $50,000$ questions framed on $12,767$ images. The data is split randomly in an $80-10-10$ ratio into train, validation and test splits. The train split has $39,463$ questions and $10,194$ images, the validation split has $5,349$ questions and $1,286$ images, and the test split has $5,188$ questions and $1,287$ images. As mentioned before, questions are tagged with question type(s) during the second stage of the annotation process. [fig:question_types] shows the 9 question types and the percentage of questions under each type. A question type signifies the type of data in which the question is grounded. For example, ‘table/list’ is assigned if answering the question requires understanding of a table or a list. If the information is in the form of a key:value pair, the ‘form’ type is assigned. ‘Layout’ is assigned to questions which require spatial/layout information to find the answer; for example, questions asking for a title or heading require one to understand the structure of the document. If the answer to a question is based on information in the form of sentences/paragraphs, the type assigned is ‘running text’. For all questions where the answer is based on handwritten text, the ‘handwritten’ type is assigned. Note that a question can have more than one type associated with it. (Examples from DocVQA for each question type are given in the supplementary material.) In the following analysis we compare statistics of questions, answers and OCR tokens with other similar VQA datasets (VQA 2.0 [vqa2](https://arxiv.org/pdf/1612.00837), TextVQA [textvqa](http://arxiv.org/pdf/1811.11903v1) and ST-VQA [stvqa_iccv](http://arxiv.org/pdf/2304.01603v1)) and with the SQuAD 1.1 [squad](http://arxiv.org/pdf/1606.02270v2) reading comprehension dataset. Statistics for other datasets are computed based on their publicly available data splits. For statistics on OCR tokens in DocVQA we use OCR tokens generated by a commercial OCR solution. For VQA 2.0, TextVQA and ST-VQA we use OCR tokens made available by the authors of LoRRA [textvqa](http://arxiv.org/pdf/1811.11903v1) and M4C [m4c](http://arxiv.org/pdf/1911.06258v3) as part of the MMF [mmf](https://github.com/facebookresearch/mmf) framework.
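As a concrete illustration of the stage-2 verification logic described above, the sketch below routes a question either to acceptance or to the stage-3 manual review, depending on whether any stage-1 answer matches a stage-2 answer verbatim. This is not the authors' annotation tooling; the function name and the whitespace trimming are illustrative assumptions.

```python
# Minimal sketch of the stage-2 verification routing: a question is accepted if
# any stage-1 answer matches a stage-2 answer verbatim, otherwise it is sent to
# the stage-3 manual review.

def route_question(stage1_answers, stage2_answers):
    """Return the pool a question lands in after stage-2 verification."""
    s1 = {a.strip() for a in stage1_answers}
    s2 = {a.strip() for a in stage2_answers}
    return "accepted" if s1 & s2 else "stage-3 review"

print(route_question(["03/17/98"], ["03/17/98"]))   # accepted
print(route_question(["$35,864"], ["$35 , 864"]))   # stage-3 review
```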

Word clouds of words in answers (left) and words spotted on the document images in the dataset (right)
[fig:compare_question_lengths] shows the distribution of question lengths for questions in DocVQA compared to other similar datasets. The average question length is $8.12$, which is the second highest among the compared datasets. In DocVQA, $35,362$ ($70.72\%$) questions are unique. [fig:top_questions] shows the top $15$ most frequent questions and their frequencies. There are questions repeatedly being asked about dates, titles and page numbers. A sunburst of the first 4 words of the questions is shown in [fig:sunburst_4grams]. It can be seen that a large majority of questions start with “what is the”, asking for a date, title, total, amount or name. The distribution of answer lengths is shown in [fig:compare_answer_lengths]. We observe that both DocVQA and SQuAD 1.1 have a higher number of longer answers compared to the VQA datasets. The average answer length is $2.17$. $63.2\%$ of the answers are unique, which is second only to SQuAD 1.1 ($72.5\%$). The top $15$ answers in the dataset are shown in [fig:top_anwers]. We observe that almost all of the top answers are numeric values, which is expected since there are a good number of document images of reports and invoices. In [fig:top_non_numeric_Answers] we show the top $15$ non-numeric answers. These include named entities such as names of people, institutions and places. The word cloud on the left in [fig:wordcloud] shows frequent words in answers. The most common words are names of people and names of calendar months. In [fig:compare_document_lengths] we show the number of images (or ‘contexts’ in the case of SQuAD 1.1) containing a particular number of text tokens. The average number of text tokens in an image or context is highest for DocVQA ($182.75$). It is considerably higher than for SQuAD 1.1, where contexts are usually small paragraphs with an average length of $117.23$. For the VQA datasets, which comprise real-world images, the average number of OCR tokens is not more than $13$. The word cloud on the right in [fig:wordcloud] shows the most common words spotted by the OCR on the images in DocVQA. We observe that there is high overlap between common OCR tokens and words in answers.
Distribution of questions by their starting 4-grams. Most questions aim to retrieve common data points in documents, such as date, title, total amount and page number.
# Baselines [sec:baselines]

In this section we explain the baselines we use, including heuristics and trained models.

## Heuristics and Upper Bounds [sec:heuristics]

The heuristics we evaluate are: (i) **Random answer:** measures performance when we pick a random answer from the answers in the train split. (ii) **Random OCR token:** performance when a random OCR token from the given document image is picked as the answer. (iii) **Longest OCR token:** the longest OCR token in the given document is selected as the answer. (iv) **Majority answer:** measures performance when the most frequent answer in the train split is always given as the answer.

We also compute the following upper bounds: (i) **Vocab UB:** the performance upper bound one can achieve by predicting correct answers for the questions, provided the correct answer is present in a vocabulary of answers comprising all answers which occur more than once in the train split. (ii) **OCR substring UB:** the upper bound on predicting the correct answer, provided the answer can be found as a substring in the sequence of OCR tokens. The sequence is made by serializing the OCR tokens recognized in the document into a space-separated string, in top-left to bottom-right order. (iii) **OCR subsequence UB:** the upper bound on predicting the correct answer, provided the answer is a subsequence of the OCR tokens’ sequence.

## VQA Models [sec: vqa models]

To evaluate the performance of existing VQA models on DocVQA we employ two models which take the text present in the images into consideration while answering questions: Look, Read, Reason & Answer (LoRRA) [textvqa](http://arxiv.org/pdf/1811.11903v1) and Multimodal Multi-Copy Mesh (M4C) [m4c](http://arxiv.org/pdf/1911.06258v3).

**LoRRA:** follows a bottom-up and top-down attention [topdown_bottomup](https://arxiv.org/pdf/1707.07998) scheme with additional bottom-up attention over OCR tokens from the images. In LoRRA, tokens in a question are first embedded using a pre-trained embedding (GloVe [glove](http://arxiv.org/pdf/1608.02094v1)) and these tokens are then iteratively encoded using an LSTM [lstm](http://arxiv.org/pdf/2103.15232v1) encoder. The model uses two types of spatial features to represent the visual information in the images: (i) grid convolutional features from a ResNet-152 [resnet](https://arxiv.org/pdf/1512.03385) pre-trained on ImageNet [imagenet](http://arxiv.org/pdf/1903.10412v1) and (ii) features extracted from bounding box proposals from a Faster R-CNN [faster-r-cnn](http://arxiv.org/pdf/1506.01497v3) object detection model pre-trained on Visual Genome [visual_genome](http://arxiv.org/pdf/1602.07332v1). OCR tokens from the image are embedded using a pre-trained word embedding (FastText [fasttext](http://arxiv.org/pdf/2102.02270v2)). An attention mechanism is used to compute an attention-weighted average of the image features as well as of the OCR tokens’ embeddings. These averaged features are combined and fed into an output module. The classification layer of the model predicts an answer either from a fixed vocabulary (made from answers in the train set) or copies an answer from a dynamic vocabulary, which is essentially the list of OCR tokens in the image. The copy mechanism can copy only one of the OCR tokens from the image; consequently, it cannot output an answer which is a combination of two or more OCR tokens.

**M4C:** uses a multimodal transformer and an iterative answer prediction module.
Here, tokens in the question are embedded using a BERT model [bert](None). Images are represented using (i) appearance features of the objects detected by a Faster R-CNN pretrained on Visual Genome [visual_genome](http://arxiv.org/pdf/1602.07332v1) and (ii) location information, i.e., the bounding box coordinates of the detected objects. Each OCR token recognized in the image is represented using (i) a pretrained word embedding (FastText), (ii) the appearance feature of the token’s bounding box from the same Faster R-CNN used for the object appearance features, (iii) a PHOC [phoc](http://arxiv.org/pdf/1712.07487v1) representation of the token and (iv) the bounding box coordinates of the token. These feature representations of the three entities (question tokens, objects and OCR tokens) are projected to a common, learned embedding space, and a stack of Transformer [attention_is_all_you_need](http://arxiv.org/pdf/2107.08000v1) layers is applied over them. The multi-head self-attention in the Transformer enables both inter-entity and intra-entity attention. Finally, answers are predicted through iterative decoding in an auto-regressive manner. The fixed vocabulary used for the closed answer space is made up of the most common answer words in the train split; note that in this case the fixed vocabulary comprises answer words, not whole answers as in the case of LoRRA. At each step of the decoding, the decoded word is either an OCR token from the image or a word from the fixed vocabulary of common answer words.

In our experiments we use the original LoRRA and M4C models and a few variants of them. Document images in DocVQA usually contain a higher number of text tokens than images in scene text VQA datasets. Hence we try out larger dynamic vocabularies (i.e., more OCR tokens are considered from the images) for both LoRRA and M4C. For both models we also evaluate performance when no fixed vocabulary is used. Since the notion of visual objects in real-world images does not directly apply to document images, we also try variants of LoRRA and M4C where the object features are omitted.

## Reading Comprehension Models [sec:RC_models]

In addition to the VQA models which can read text, we try out extractive question answering / reading comprehension models from NLP. In particular, we use BERT [bert](None) question answering models. BERT is a method of pre-training language representations from unlabelled text using transformers [attention_is_all_you_need](http://arxiv.org/pdf/2107.08000v1). These pretrained models can then be used for downstream tasks with just an additional output layer; in the case of extractive question answering, this is an output layer that predicts the start and end indices of the answer span.

# Experiments [sec:experiments]

In this section we explain the evaluation metrics and our experimental settings, and report the results of our experiments.

## Evaluation Metrics [sec:evaluation]

The two evaluation metrics we use are Average Normalized Levenshtein Similarity (ANLS) and Accuracy (Acc.). ANLS was originally proposed for the evaluation of VQA on ST-VQA [st-vqa_challenge](None). The Accuracy metric measures the percentage of questions for which the predicted answer matches exactly with any of the target answers for the question. Accuracy awards a zero score even when the prediction is only slightly different from the target answer.
Since no OCR is perfect, we propose to use ANLS as our primary evaluation metric, so that minor answer mismatches stemming from OCR errors are not severely penalized.
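For reference, the sketch below computes ANLS as defined for the ST-VQA challenge: each question scores $1 - \mathrm{NL}$, where $\mathrm{NL}$ is the normalized Levenshtein distance between the prediction and the closest ground-truth answer, provided $\mathrm{NL} < 0.5$, and $0$ otherwise; the final metric is the average over questions. The lowercasing and whitespace trimming are common conventions assumed here, not details taken from this paper.

```python
# A minimal, self-contained ANLS implementation (threshold tau = 0.5).

def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def anls(predictions, gt_answers, tau=0.5):
    """predictions: list of strings; gt_answers: list of lists of strings."""
    total = 0.0
    for pred, answers in zip(predictions, gt_answers):
        best = 0.0
        for ans in answers:
            p, a = pred.strip().lower(), ans.strip().lower()
            nl = levenshtein(a, p) / max(len(a), len(p), 1)
            best = max(best, 1.0 - nl if nl < tau else 0.0)
        total += best
    return total / max(len(predictions), 1)

print(anls(["$35 , 864"], [["$35,864"]]))  # ~0.78: an OCR-style mismatch is partially credited
```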
| Baseline | val ANLS | val Acc. | test ANLS | test Acc. |
|:-------------------|:-----:|:-----:|:-----:|:-----:|
| Human | - | - | 0.981 | 94.36 |
| Random answer | 0.003 | 0.00 | 0.003 | 0.00 |
| Random OCR token | 0.013 | 0.52 | 0.014 | 0.58 |
| Longest OCR token | 0.002 | 0.05 | 0.003 | 0.07 |
| Majority answer | 0.017 | 0.90 | 0.017 | 0.89 |
| Vocab UB | - | 31.31 | - | 33.78 |
| OCR substring UB | - | 85.64 | - | 87.00 |
| OCR subsequence UB | - | 76.37 | - | 77.00 |

Evaluation of different heuristics and upper bounds. Predicting random answers or the majority answer does not even yield 1% accuracy. Answers are a substring of the serialized OCR output in more than 85% of the cases.
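The two OCR-based upper bounds in the table above can be made concrete with a short sketch. The matching rules used here (character-level substring search on the space-joined OCR string, and greedy in-order token matching for the subsequence bound) are one plausible reading of the definitions, not the authors' evaluation code.

```python
# Sketch of the OCR substring and OCR subsequence upper-bound checks.

def substring_ub_hit(answer, ocr_tokens):
    """Answer appears verbatim inside the serialized (space-joined) OCR string."""
    return answer.lower() in " ".join(ocr_tokens).lower()

def subsequence_ub_hit(answer, ocr_tokens):
    """Answer tokens appear, in order, within the OCR token sequence."""
    remaining = iter(t.lower() for t in ocr_tokens)
    return all(tok in remaining for tok in answer.lower().split())

ocr = ["Page", "2", "of", "2020", "Annual", "Report"]
print(substring_ub_hit("2", ocr))                # True (would also match inside "2020")
print(subsequence_ub_hit("annual report", ocr))  # True: both tokens found in order
print(subsequence_ub_hit("report annual", ocr))  # False: order is not preserved
```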
| Method | Objects’ feature | Fixed vocab. | Dynamic vocab. size | val ANLS | val Acc. | test ANLS | test Acc. |
|:------|:----------------:|:------------:|:-------------------:|:---------:|:-----:|:---------:|:-----:|
| LoRRA | | | 50 | **0.110** | 7.22 | **0.112** | 7.63 |
| | | | 50 | 0.041 | 2.64 | 0.037 | 2.58 |
| | | | 50 | 0.102 | 6.73 | 0.100 | 6.43 |
| | | | 150 | 0.101 | 7.09 | 0.102 | 7.22 |
| | | | 500 | 0.094 | 6.41 | 0.095 | 6.31 |
| M4C | | | 50 | 0.292 | 18.34 | 0.306 | 18.75 |
| | | | 50 | 0.216 | 12.44 | 0.219 | 12.15 |
| | | | 50 | 0.294 | 18.75 | 0.310 | 18.92 |
| | | | 150 | 0.352 | 22.66 | 0.360 | 22.35 |
| | | | 300 | 0.367 | 23.99 | 0.375 | 23.90 |
| | | | 500 | **0.385** | 24.73 | **0.391** | 24.81 |
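The ‘Dynamic vocab. size’ column above controls how many OCR tokens from the page are exposed to the copy mechanism alongside the fixed vocabulary. The toy sketch below illustrates that joint output space for a single decoding step; the random scores stand in for what a trained LoRRA or M4C head would actually produce, so it is illustrative only and not the authors' implementation.

```python
import numpy as np

# One decoding step over a joint output space: fixed-vocabulary answer words
# plus the first K OCR tokens read from the page (the "dynamic vocabulary").

rng = np.random.default_rng(0)

fixed_vocab = ["yes", "no", "total", "date", "report"]   # common answer words
ocr_tokens = [f"token_{i}" for i in range(300)]          # words spotted on the page
dynamic_size = 50                                        # K, as varied in the table above

candidates = fixed_vocab + ocr_tokens[:dynamic_size]

# Stand-in scores; a real model derives these from fused question/image/OCR features.
scores = rng.normal(size=len(candidates))

print(candidates[int(np.argmax(scores))])  # word emitted at this step
```

With a larger K, more of the page's text becomes copyable, which is consistent with M4C improving as the dynamic vocabulary grows.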
## Experimental setup [sec: experimental setup]

To measure human performance, we collect answers for all questions in the test split with the help of a few volunteers from our institution. In all our experiments, including heuristics and trained baselines, the OCR tokens we use are extracted using a commercial OCR application. For the heuristics and upper bounds we use a vocabulary of the $4,341$ answers which occur more than once in the train split. For the LoRRA and M4C models we use the official implementations available as part of the MMF framework [mmf](https://github.com/facebookresearch/mmf). The training settings and hyperparameters are the same as the ones reported in the original works. The fixed vocabulary we use for LoRRA is the same as the vocabulary we use for computing the vocabulary-based heuristics and upper bounds. For M4C the fixed vocabulary is a vocabulary of the $5,000$ most frequent words from the answers in the train split.

For QA using BERT, three pre-trained BERT models[^2] from the Transformers library [huggingface](http://arxiv.org/pdf/1910.03771v5) are used: bert-base-uncased, bert-large-uncased-whole-word-masking and bert-large-uncased-whole-word-masking-finetuned-squad, which we abbreviate as bert-base, bert-large and bert-large-squad respectively. Among these, bert-large-squad is a pre-trained model which has also been finetuned on SQuAD 1.1 for question answering. In extractive question answering or reading comprehension datasets, the ‘contexts’ on which questions are asked are passages of electronic text, whereas in DocVQA the ‘contexts’ are document images. Hence, to finetune the BERT QA models on DocVQA we need to prepare the data in SQuAD-style format, where the answer to a question is a ‘span’ of the context defined by start and end indices. To this end we first serialize the OCR tokens recognized on each document image into a single space-separated string, in top-left to bottom-right order. To approximate the answer spans we follow the approach proposed in TriviaQA [triviaqa](None), which is to take the first match of the answer string in the serialized OCR string. The bert-base model is finetuned on DocVQA on 2 Nvidia GeForce 1080 Ti GPUs, for 2 epochs, with a batch size of 32, using the Adam optimizer [adam](None) with a learning rate of $5e-05$. The bert-large and bert-large-squad models are finetuned on 4 GPUs for 6 epochs with a batch size of 8 and a learning rate of $2e-05$.
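The SQuAD-style data preparation described above can be sketched as follows: OCR tokens are serialized top-left to bottom-right and the answer span is approximated by the first match of the answer string, as in TriviaQA. The OCR record format and the simple row-then-column sort are assumptions made for illustration; this is not the released preprocessing code.

```python
# Serialize OCR output and approximate a SQuAD-style answer span.

def serialize_ocr(ocr_results):
    """ocr_results: [{'text': str, 'box': (x0, y0, x1, y1)}, ...] (assumed format)."""
    ordered = sorted(ocr_results, key=lambda r: (r["box"][1], r["box"][0]))  # rows, then columns
    return " ".join(r["text"] for r in ordered)

def approximate_span(context, answer):
    """Character start/end of the first case-insensitive match, or None."""
    start = context.lower().find(answer.lower())
    return (start, start + len(answer)) if start >= 0 else None

ocr = [{"text": "Date:", "box": (10, 10, 60, 25)},
       {"text": "03/17/98", "box": (70, 10, 150, 25)},
       {"text": "Total", "box": (10, 40, 55, 55)}]

context = serialize_ocr(ocr)                  # "Date: 03/17/98 Total"
print(approximate_span(context, "03/17/98"))  # (6, 14)
```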
| Pretrained model | DocVQA finetune | val ANLS | val Acc. | test ANLS | test Acc. |
|:-----------------|:---------------:|:---------:|:-----:|:---------:|:-----:|
| bert-base | yes | 0.556 | 45.6 | 0.574 | 47.6 |
| bert-large | yes | 0.594 | 49.28 | 0.610 | 51.08 |
| bert-large-squad | no | 0.462 | 36.72 | 0.475 | 38.26 |
| bert-large-squad | yes | **0.655** | 54.48 | **0.665** | 55.77 |

Performance of BERT question answering models. A BERT~LARGE~ model fine-tuned on both SQuAD 1.1 [squad](http://arxiv.org/pdf/1606.02270v2) and DocVQA performs best.
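To reproduce something like the bert-large-squad row that is not finetuned on DocVQA, one could run the off-the-shelf SQuAD checkpoint from the Transformers library directly on the serialized OCR text of a page. The snippet below is a hedged illustration of that zero-shot setup (the context string is invented for the example), not the paper's evaluation pipeline.

```python
from transformers import pipeline

# Off-the-shelf SQuAD-finetuned BERT, applied to serialized OCR text.
qa = pipeline(
    "question-answering",
    model="bert-large-uncased-whole-word-masking-finetuned-squad",
)

context = "Date: 03/17/98 R. J. Reynolds Tobacco Company New Item Request Total: $35,864"
result = qa(question="What is the date given at the top left?", context=context)
print(result["answer"], result["score"])  # an extracted span plus its confidence
```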
## Results [sec:results]

Results of all heuristic approaches and upper bounds are reported in [tab:human_heuristics]. We can see that none of the heuristics reaches even $1\%$ accuracy on the validation or test splits.

Q: What is the underlined heading just above the table?

GT: Indications for implantation

M4C best: indications for implantation

BERT best: total aneurism

Human: indications for implantation

Q: What is the Extension Number as per the voucher?

GT: (910) 741-0673

M4C best: 963.12

BERT best: (910) 741-0673

Human: (910) 741-0673

Q: How many boxed illustrations are there ?

GT: 9

M4C best: 4

BERT best: 4

Human: 9

*OCR substring UB* yields more than $85\%$ accuracy on both the validation and test splits. A downside is that a substring match need not be an actual answer match: for example, the answer “2”, which is the most common answer in the dataset, will match the “2” in “2020” or the “2” in “2pac”. This is why we also evaluate the *OCR subsequence UB*. An answer is a subsequence of the serialized OCR output for around $76\%$ of the questions in both the validation and test splits. Results of our trained VQA baselines are shown in [tab:vqa_results]. The first row for each method reports results of the original model proposed by the respective authors. In the case of LoRRA, the original setting proposed by the authors yields the best results among the variants we try. With no fixed vocabulary, the performance of the model drops sharply, suggesting that the model primarily relies on the fixed vocabulary to output answers. A larger dynamic vocabulary results in a slight performance drop, suggesting that incorporating more OCR tokens from the document images helps little. Unlike LoRRA, M4C benefits from a larger dynamic vocabulary: increasing the size of the dynamic vocabulary from $50$ to $500$ improves ANLS by around $50\%$. In the case of M4C, the setting where object features are omitted performs slightly better than the original setting.
Best baselines from VQA space and reading comprehension space pitted against the human performance for different question types. We need models which can understand figures and text on photographs better. We need better handwriting recognizers too!
Results of the BERT question answering models are reported in [tab:bert_results]. We observe that all BERT models perform better than the best VQA baseline using M4C (last row in [tab:vqa_results]). The best performing model among all the baselines analysed is the bert-large-squad model finetuned on DocVQA. Answers predicted by this model match one of the target answers exactly for around $55\%$ of the questions. In [fig:performance_question_type] we show performance by question type, comparing the best models among the VQA models and the BERT question answering models against human performance on the test split. We observe that human performance is uniform across question types, while the models’ performance varies. In [fig:qualitative_results] we show a few qualitative results from our experiments.

# Conclusion

We introduce a new dataset and an associated VQA task with the aim to inspire a “purpose-driven” approach in document image analysis and recognition research. Our baselines and initial results motivate the simultaneous use of visual and textual cues for answering questions asked on document images. This could drive methods that use both low-level cues (text, layout, arrangements) and high-level goals (purpose, relationship, domain knowledge) in solving problems of practical importance.

**Acknowledgements** We thank Amazon for supporting the annotation effort, and Dr. R. Manmatha for many useful discussions and inputs. This work is partly supported by MeitY, Government of India, the project TIN2017-89779-P, an Amazon AWS Research Award and the CERCA Programme.

# Screen grabs of Annotation Tool [appendix:screen grabs]

As mentioned in Section 3.1 of the main paper, the annotation process involves three stages. In [fig:ann_stage1], [fig:ann_stage2] and [fig:ann_stage3] we show screen grabs from stages 1, 2 and 3 of the annotation process respectively.
# Examples of Question Types [appendix:question_types]

We define 9 question types, based on the kind of reasoning required to answer a question. Question types are assigned in the second stage of the annotation. We discuss the question types in Section 3.2 of the main paper. Examples for the types *form*, *yes/no* and *layout* are shown in [fig:question_type_examples yesno and layout]. An example of a question based on a handwritten date in a form (types *form* and *handwritten*) is shown in [fig:question_type_examples handwritten date form]. An example of a question based on information in the form of sentences or paragraphs (type *running text*) is shown in [fig:question_type running text]. Examples for the types *photograph* and *table* are shown in [fig:question_types photo and table]. An example of a question based on a plot (type *figure*) is shown in [fig:question_type_examples figure]. In all examples, a crop of the original image is shown below the original image for better viewing of the image region on which the question is based.

Q: Is it an existing item ?

Question types: form and yes/no

A: yes

Q: What is the date given at the top left?

Question types: layout

A: 03/17/98


Q: What is the date written next to RSM approval?

Question types: form and handwritten

A: 3-17-98


Q: If the request needs to be warehoused by RJR, what needs to be done ?

Question types: running text

A: write to RJR


Q: Whose picture is given?

Question types: photograph and layout

A: Dr. Dwayne G. Westfall

Q: What is the average sucrose % for N level 501+ ?

Question types: table

A: 15.9


Q: What is the highest value for “Intake, mg/1000kcal" plotted on the ‘X’ axis of the graph?

Question types: figure

A: 300

# Additional Qualitative Examples [appendix:Additional Qualitative Examples]

Here we show more qualitative results from our baseline experiments. These results supplement the Results section (Section 5.3) in the main paper. Recall that the BERT [bert](None) question answering model is designed to answer questions asked on sentences or paragraphs of text (reading comprehension). In [fig:qual : bert wines] we show two examples where the model answers questions outside the ambit of reading-comprehension-style question answering. In [fig:qual : m4c_wins] we show examples where the M4C [m4c](http://arxiv.org/pdf/1911.06258v3) model outperforms the BERT model in answering questions based on text seen on pictures or photographs. Such questions are similar to questions in the TextVQA [textvqa](http://arxiv.org/pdf/1811.11903v1) or ST-VQA [stvqa_iccv](http://arxiv.org/pdf/2304.01603v1) datasets, where the M4C model yields state-of-the-art results. In [fig:qual : inconsistent] we show an example where both models yield inconsistent results when posed with questions of a similar nature, highlighting the lack of reasoning behind the answering. In [fig: qual: reasoning] we show two examples where both the M4C and BERT models fail to answer questions which require understanding of a figure or a diagram. In [fig: qual: ocr error] we show how OCR errors have resulted in wrong answers although the models manage to ground the questions correctly.

Q: What is the total cost for Fat cell size (Mt. SInai) in the -05 year ?

GT: $35,864

M4C best: 4400

BERT best: $35 , 864

Human: $35,864

Q: What is the first recipe on the page?

GT: hawaiian fruit cake

M4C best: island desserts (continued from cake

BERT best: hawaiian fruit cake

Human: hawaiian fruit cake


Q: What is written inside logo in the bottom of the document?

GT: let yourself grow!

M4C best: yourself grow!

BERT best: <no prediction>

Human: let yourself grow!

Q: What Tobacco brand of GPI is shown in the picture?

GT: Prince

M4C best: prince

BERT best: <no prediction>

Human: prince


Q: What was the committee strength for the first meeting?

GT: 6

M4C best: 6

BERT best: 6

Human: 6

Q: What was the committee strength for the last meeting?

GT: 5

M4C best: 6

BERT best: 6

Human: 5


Q: What is the position above "vice chairman" ?

GT: chairman

M4C best: legal counsel

BERT best: legal counsel

Human: chairman

Q: What is the highest value shown on the vertical axis?

GT: 99.99

M4C best: 50

BERT best: 32

Human: 99.99


Q: What is the name of the passenger?

GT: dr. william j. darby

M4C best: larry

BERT best: larry

Human: dr. william j. darry

Q: What is the date present in the memo ?

GT: 1/7/77

M4C best: 1 7 77

BERT best: 1 / 7

Human: 1/7/77

[^1]: [^2]: