Document AI encompasses several tasks, such as Document Classification, Document Information Extraction, Document Reconstruction, Document Captioning, Document Summarization, and Document Question Answering.
Three trends are notable: (1) multipage VrDU datasets have recently emerged and are steadily increasing in number, indicating a shift in the field towards this type of task. These datasets cover a wide variety of document types, including traditional PDF documents, charts, tables, web pages, arXiv papers, diagrams, and application pages, showcasing the diversity of the field. (2) There is a growing emphasis on datasets whose tasks require abstract or numerical reasoning, which demands higher levels of cognitive processing. (3) There is an increasing focus on incorporating multiple types of data through varied instructions within these datasets.
Question Answering
Question answering is a prevalent task in natural language processing (NLP) in which a model must provide a natural language response to a question based on a given passage. This concept extends to images as Visual Question Answering (VQA): answering questions posed in natural language about the content of images, combining computer vision and NLP to interpret and respond to specific queries about visual elements.
VQA datasets are somewhat tricky: one can always ask what is actually answering the question, the LLM alone, or the LLM together with the visual representation it received as input. It is therefore important to choose datasets whose questions can only be answered using the image(s) they contain. One way to assess this is to compare performance on such datasets with and without the visual representation given to the LLM.
[Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs](https://arxiv.org/pdf/2406.16860) presents such an overview of several datasets and their performance with and without the visual encoder, showing which datasets genuinely require vision and which do not:
![](/literature_review/doc_ai/vlm/bench_cat.png)
For instance, AI2D (a dataset with questions about figures) shows less than a 5% gap between vision enabled and disabled, suggesting that such benchmarks may not depend significantly on visual input and instead rely heavily on the base LLM.
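As a rough illustration of this diagnostic, the sketch below computes the accuracy gap between vision-enabled and vision-disabled runs for a few benchmarks; the benchmark names, scores, and threshold are hypothetical placeholders, not values from the Cambrian-1 paper.

```python
# Minimal sketch: flag benchmarks whose accuracy barely changes when the
# visual input is removed, i.e. benchmarks that mostly test the base LLM.
# The scores below are hypothetical placeholders, not reported results.

scores = {
    # benchmark: (accuracy with vision, accuracy without vision)
    "BenchmarkA": (0.81, 0.78),
    "BenchmarkB": (0.74, 0.41),
}

VISION_GAP_THRESHOLD = 0.05  # assumed cutoff, in line with the ~5% figure above

for name, (with_vision, without_vision) in scores.items():
    gap = with_vision - without_vision
    vision_dependent = gap >= VISION_GAP_THRESHOLD
    print(f"{name}: gap={gap:.2f} -> vision-dependent={vision_dependent}")
```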
While VQA typically involves a single question and answer per image, Conversational VQA (Conv VQA) features a series of questions and answers within a single conversation, enabling multiple interactions about the same image. In addition, VQA images can vary widely in format:
- Conventional VQA focuses on realistic or synthetic photos, with questions about object recognition, attributes, and spatial relationships within the image.
- Scene Text VQA deals with realistic photos that include textual elements within the scene, such as a restaurant sign, focusing on recognizing and understanding text associated with objects in the image.
- Chart VQA addresses images of charts, such as bar graphs, line charts, or pie charts, with questions often involving trend recognition, value comparisons, and identifying specific chart properties.
- Diagram VQA, also known as Infographic VQA, involves interpreting diagrammatic images that explain processes or relationships. Diagrams are complex graphics comprising a graphic space, various elements, and their interrelationships. Questions in this category typically ask about components and connections depicted in the diagrams.
- Document VQA relates to images of various document types, including business PDFs, web pages, forms, receipts, and invoices. VQA tasks on documents resemble Machine Reading Comprehension (MRC) tasks, involving questions about textual content that are answered using text extracted from the document. Document-based VQA adds image processing to analyze the visual input alongside the textual queries.
- Multipage VQA involves answering questions that require understanding and integrating information across multiple pages of a document, such as a presentation or report, rather than focusing on a single page per question, as is typical for traditional VQA tasks.
- Open-domain VQA involves answering questions based on a broad database of documents, images, tables, texts, and charts. The model includes a retriever component that identifies the specific document likely to contain the answer before generating a response (a minimal retrieve-then-answer sketch follows this list).
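Here is a minimal sketch of such a retrieve-then-answer pipeline; the toy corpus, the bag-of-words "embedding", and the answering step are hypothetical stand-ins, not a real retriever or model API.

```python
# Minimal sketch of open-domain VQA: retrieve the most relevant document,
# then answer the question conditioned on it. The corpus, the bag-of-words
# "embedding", and the answering step are hypothetical placeholders.
from collections import Counter
import math

corpus = {
    "doc1": "quarterly revenue table for 2023 with totals per region",
    "doc2": "bar chart comparing energy consumption across countries",
}

def embed(text: str) -> Counter:
    # Toy bag-of-words representation standing in for a real dense retriever.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question: str) -> str:
    q = embed(question)
    return max(corpus, key=lambda doc_id: cosine(q, embed(corpus[doc_id])))

def answer(question: str, doc_id: str) -> str:
    # Placeholder for the generator (an MLLM reading the retrieved page image).
    return f"[answer to '{question}' grounded in {doc_id}]"

question = "Which region had the highest revenue in 2023?"
print(answer(question, retrieve(question)))
```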
Image classification
Image classification is a core task in computer vision that involves sorting images into predefined categories based on their content. When applied to documents, this process narrows down to identifying various types of document images, such as charts, diagrams, and other visual elements. For instance, chart image classification focuses on recognizing and categorizing charts extracted from documents into specific predefined categories. This task is complex due to the wide range of chart types, each with distinct features and structures. Likewise, diagrams and other document visuals are classified based on their characteristics and the information they convey, thereby improving the automation and comprehension of document content across numerous applications.
Information Extraction
Information extraction (IE) is a prevalent task in natural language processing (NLP). It involves the automatic extraction of structured information from unstructured or semi-structured sources
like text documents, web pages, or databases. This process includes identifying and extracting specific types of entities
(such as names of people, organizations, and locations) and their relationships from the text. When applied to documents, this task begins by identifying relevant elements within the document, such as text blocks, images, charts, or specific visual patterns. These identified elements are then annotated with meaningful labels or metadata that describe their content or function within the document.
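For illustration only, here is a minimal sketch of the kind of entity extraction this describes, with simple regular expressions standing in for a trained extraction model; the patterns, labels, and example text are hypothetical.

```python
# Minimal sketch of information extraction on document text: tag spans with
# entity labels. Regular expressions stand in for a trained extraction model.
import re

text = "Invoice issued by Acme Corp. on 2023-05-14 for a total of 1,250.00 EUR."

patterns = {
    "DATE": r"\b\d{4}-\d{2}-\d{2}\b",
    "AMOUNT": r"\b\d{1,3}(?:,\d{3})*\.\d{2}\s?(?:EUR|USD)\b",
    "ORG": r"\b[A-Z][a-zA-Z]+\s(?:Corp|Inc|Ltd)\.?",
}

entities = []
for label, pattern in patterns.items():
    for match in re.finditer(pattern, text):
        entities.append({"label": label, "text": match.group(), "span": match.span()})

print(entities)
# e.g. [{'label': 'DATE', 'text': '2023-05-14', ...}, ...]
```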
Document Reconstruction
Document reconstruction is a specialized task that shares similarities with optical character recognition (OCR). Its aim is to reconstruct a document page image in a manner that preserves both its physical appearance and its informational content. For textual content, this involves assembling the text blocks of the image and converting the document image into a textual format. For VrDs, which contain a specific layout, the reconstruction task aims to convert the document image into structured formats such as Markdown or HTML, facilitating clear organization and presentation of the content. This means that the input data are a set of images (screenshots of a scientific paper, a web app, a web page, …) and the output data are the corresponding Markdown / LaTeX / HTML code of these screenshots. In the case of graphical documents such as charts, reconstruction extends to extracting and reformatting the underlying data into non-image formats, such as tables or Markdown.
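As a concrete, purely illustrative way to picture such a dataset, each training sample can be thought of as one or more page screenshots paired with the target markup; the file name and markup below are made up.

```python
# Illustrative sketch of a document-reconstruction training sample:
# input = page screenshot(s), target = the corresponding structured markup.
# Paths and markup content are hypothetical examples.
sample = {
    "images": ["paper_page_03.png"],          # screenshot(s) of the document page
    "target_format": "markdown",
    "target": (
        "## 3. Method\n"
        "Our approach has two stages...\n\n"
        "| Model | Accuracy |\n"
        "|-------|----------|\n"
        "| Base  | 71.2     |\n"
    ),
}

print(sample["target"])
```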
Captioning
Captioning is a common task in computer vision that involves creating descriptive text for images, charts, diagrams, and tables. This process provides context and summarizes the visual content, making it accessible and understandable.
MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering
2024-05-20
Jingqun Tang, Qi Liu, Yongjie Ye, Jinghui Lu, Shu Wei, Chunhui Lin, Wanqing Li, Mohamad Fitri Faiz Bin Mahmood, Hao Feng, Zhen Zhao, Yanjie Wang, Yuliang Liu, Hao Liu, Xiang Bai, Can Huang
Text-Centric Visual Question Answering (TEC-VQA) in its proper format not only facilitates human-machine interaction in text-centric visual environments but also serves as a de facto gold proxy to evaluate AI models in the domain of text-centric scene understanding. However, most TEC-VQA benchmarks have focused on high-resource languages like English and Chinese. Despite pioneering works to expand multilingual QA pairs in non-text-centric VQA datasets using translation engines, the translation-based protocol encounters a substantial ``Visual-textual misalignment'' problem when applied to TEC-VQA. Specifically, it prioritizes the text in question-answer pairs while disregarding the visual text present in images. Furthermore, it does not adequately tackle challenges related to nuanced meaning, contextual distortion, language bias, and question-type diversity. In this work, we address the task of multilingual TEC-VQA and provide a benchmark with high-quality human expert annotations in 9 diverse languages, called MTVQA. To our knowledge, MTVQA is the first multilingual TEC-VQA benchmark to provide human expert annotations for text-centric scenarios. Further, by evaluating several state-of-the-art Multimodal Large Language Models (MLLMs), including GPT-4V, on our MTVQA dataset, it is evident that there is still room for performance improvement, underscoring the value of our dataset. We hope this dataset will provide researchers with fresh perspectives and inspiration within the community. The MTVQA dataset will be available at https://huggingface.co/datasets/ByteDance/MTVQA.
# Introduction
In the era of burgeoning AI, especially in
LLMs/MLLMs [gpt4v](http://arxiv.org/pdf/2311.15732v2), [gpt4v_explore](http://arxiv.org/pdf/2312.15011v1), [team2023gemini](http://arxiv.org/pdf/2405.12107v1), [anthropic2024claude](http://arxiv.org/pdf/2007.04626v3), [reid2024gemini](http://arxiv.org/pdf/2312.17661v1), [bai2023qwen](http://arxiv.org/pdf/2309.16609v1), [lu2024deepseek](http://arxiv.org/pdf/2402.17510v1), [young2024yi](http://arxiv.org/pdf/2304.11090v4), [feng2023docpedia](http://arxiv.org/pdf/2311.11810v3), [feng2023unidoc](http://arxiv.org/pdf/2308.11592v2), [hu2024mplug](None), [liu2024textmonkey](http://arxiv.org/pdf/2403.14252v1), [tang2024textsquare](http://arxiv.org/pdf/2307.04087v3), [chen2024far](http://arxiv.org/pdf/2404.16821v2), [dong2024internlm](http://arxiv.org/pdf/2404.06512v1), [li2024mini](http://arxiv.org/pdf/2305.16318v2), [liu2024llavanext](https://llava-vl.github.io/blog/2024-01-30-llava-next/),
**Te**xt-**C**entric **V**isual **Q**uestion
**A**nswering (**TEC-VQA**) [biten2019scene](http://arxiv.org/pdf/2304.01603v1), [singh2019towards](http://arxiv.org/pdf/1811.11903v1), [feng2023unidoc](http://arxiv.org/pdf/2308.11592v2), [feng2023docpedia](http://arxiv.org/pdf/2311.11810v3), [tang2024textsquare](http://arxiv.org/pdf/2307.04087v3), [liu2024textmonkey](http://arxiv.org/pdf/2403.14252v1), [hu2024mplug](None)
has served as a *de facto* gold proxy to evaluate AI models in the
domain of text-centric scene understanding. Compared with general
VQA [biten2019scene](http://arxiv.org/pdf/2304.01603v1), [mathew2021docvqa](http://arxiv.org/pdf/2111.05547v1), [pham2024viocrvqa](http://arxiv.org/pdf/2404.18397v1), [singh2019towards](http://arxiv.org/pdf/1811.11903v1), [mishra2019ocr](http://arxiv.org/pdf/2010.02582v1), [mathew2022infographicvqa](http://arxiv.org/pdf/2104.12756v2), [masry-etal-2022-chartqa](https://doi.org/10.18653/v1/2022.findings-acl.177), [zhu2016visual7w](http://arxiv.org/pdf/2306.04938v1), [krishna2017visual](http://arxiv.org/pdf/1602.07332v1), [antol2015vqa](http://arxiv.org/pdf/1309.1125v1), [marino2019ok](http://arxiv.org/pdf/1906.00067v2), [sheng2021human](http://arxiv.org/pdf/1810.02358v2), [liu2024visual](http://arxiv.org/pdf/2402.11690v1), [gao2015you](http://arxiv.org/pdf/1505.05612v3), [gan2020large](http://arxiv.org/pdf/2302.02502v2), [liu-etal-2021-visually](https://doi.org/10.18653/v1/2021.emnlp-main.818),
TEC-VQA places greater emphasis on answering questions that require
understanding textual information within images. It provides a
streamlined avenue for individuals without specialized expertise to
articulate their requirements and access applications in text-centric
visual environments. However, the majority of advancements in TEC-VQA
have predominantly concentrated on high-resource languages, *e.g.*,
English [biten2019scene](http://arxiv.org/pdf/2304.01603v1), [singh2019towards](http://arxiv.org/pdf/1811.11903v1), [mathew2021docvqa](http://arxiv.org/pdf/2111.05547v1), [mathew2022infographicvqa](http://arxiv.org/pdf/2104.12756v2),
Chinese [qi-etal-2022-dureadervis](https://doi.org/10.18653/v1/2022.findings-acl.105), [gao2015you](http://arxiv.org/pdf/1505.05612v3),
Japanese [shimizu2018visual](http://arxiv.org/pdf/1810.02358v2), [nguyen2023vlsp2022](http://arxiv.org/pdf/1810.02358v2) and
*etc.*, thus restricting the applicability of AI models to the global
community, particularly populations speaking low-resource languages.
To tackle the problem of language diversity, several seminal
studies [raj-khan-etal-2021-towards-developing](https://doi.org/10.18653/v1/2021.findings-emnlp.151), [pfeiffer-etal-2022-xgqa](https://doi.org/10.18653/v1/2022.findings-acl.196), [changpinyo-etal-2023-maxm](https://doi.org/10.18653/v1/2023.findings-emnlp.176)
in the general VQA field, leverage off-the-shelf translation engines to
expand existing question-answer pairs from high-resource languages to
their multilingual counterparts including low-resource ones. However,
when applied to TEC-VQA, this translation-based approach may fall prey
to the “*Visual-textual misalignment*” problem as only the text in
question-answer pairs can be processed, while the visual text present in
the images is overlooked. Not to mention issues such as nuanced meaning,
contextual distortion, language bias, and question type diversity
further render the transferability of the translation protocol
infeasible for TEC-VQA. The *status quo* begs the question: “*How can
we address the visual-textual misalignment problem for multilingual
TEC-VQA, and where do we stand in the MLLM era?*”
In this work, to answer the question above, we establish MTVQA, a novel
and high-quality multilingual TEC-VQA benchmark, where all images are
collected from real-world and meticulously annotated by human experts in
nine languages: Arabic (AR), Korean (KO), Japanese (JA), Thai (TH),
Vietnamese (VI), Russian (RU), French (FR), German (DE), and
Italian (IT). More concretely, to ensure visual-textual alignment as
much as possible, the annotation process follows the raise-then-correct paradigm,
where a group of human annotators raises several distinct questions,
ranging from simple content extraction to text-related reasoning, and
subsequently provides answers. These QA pairs are then double-checked by
another group to ensure accuracy and consistency. Consequently, as
illustrated in
Fig. [fig:leng_statistics], 6,678
training images and 21,829 question-answer pairs, as well as 2,116 test
images and 6,778 question-answer pairs are obtained, covering several
fine-grained scenarios, such as menus, logos, maps, bills, PPTs,
research papers, *etc*. To the best of our knowledge, MTVQA is the first
TEC-VQA dataset to provide native human annotations for multilingual
text-rich scenarios, especially for low-resource languages. Furthermore,
we investigate recent representative MLLMs, including GPT-4V, Gemini,
QwenVL *etc*., by juxtaposing experimental results regarding their
performance on our newly proposed MTVQA. Both for general MLLMs and
document-focused ones, the results unequivocally demonstrate that
opportunities for improvement persist within these MLLMs when applied in
multilingual text-rich scenarios.
In summary, the main contributions of this paper can be categorized into
three points:
- We introduce the MTVQA dataset, to the best of our knowledge, which
is the first multilingual TEC-VQA benchmark to provide human expert
annotations for text-centric scenarios.
- We benchmark the state-of-the-art MLLMs on our new dataset and show
there is still room for performance improvement for these models
under multilingual text-rich scenarios.
- We propose a set of baselines for multilingual TEC-VQA tasks.
# Related Work
## LLMs/MLLMs for text-centric VQA
Recent advancements in
LLMs/MLLMs [gpt4v](http://arxiv.org/pdf/2311.15732v2), [gpt4v_explore](http://arxiv.org/pdf/2312.15011v1), [team2023gemini](http://arxiv.org/pdf/2405.12107v1), [anthropic2024claude](http://arxiv.org/pdf/2007.04626v3), [reid2024gemini](http://arxiv.org/pdf/2312.17661v1), [bai2023qwen](http://arxiv.org/pdf/2309.16609v1), [lu2024deepseek](http://arxiv.org/pdf/2402.17510v1), [young2024yi](http://arxiv.org/pdf/2304.11090v4), [feng2023docpedia](http://arxiv.org/pdf/2311.11810v3), [feng2023unidoc](http://arxiv.org/pdf/2308.11592v2), [hu2024mplug](None), [liu2024textmonkey](http://arxiv.org/pdf/2403.14252v1), [tang2024textsquare](http://arxiv.org/pdf/2307.04087v3), [chen2024far](http://arxiv.org/pdf/2404.16821v2), [dong2024internlm](http://arxiv.org/pdf/2404.06512v1), [li2024mini](http://arxiv.org/pdf/2305.16318v2), [liu2024llavanext](https://llava-vl.github.io/blog/2024-01-30-llava-next/)
have revolutionized VQA tasks, as demonstrated by the remarkable
zero-shot performance of these models. Notably, the high
generalizability of LLMs/MLLMs, when explicitly trained on visual text
understanding datasets and fine-tuned with instructions, has
significantly enhanced their application in text-centric VQA
scenarios [feng2023unidoc](http://arxiv.org/pdf/2308.11592v2), [feng2023docpedia](http://arxiv.org/pdf/2311.11810v3), [tang2024textsquare](http://arxiv.org/pdf/2307.04087v3), [liu2024textmonkey](http://arxiv.org/pdf/2403.14252v1), [hu2024mplug](None).
For example, LLaVAR [zhang2023llavar](http://arxiv.org/pdf/2306.17107v2) and
UniDoc [feng2023unidoc](http://arxiv.org/pdf/2308.11592v2) extend
LLaVA [liu2024visual](http://arxiv.org/pdf/2402.11690v1) into the realm of document
understanding, pioneering text-centric VQA with MLLMs by training them
to predict texts and coordinates from document images. Furthermore,
DocPedia [feng2023docpedia](http://arxiv.org/pdf/2311.11810v3) processes visual input in the
frequency domain rather than in pixel space, which enables higher input
resolution without increasing the input sequence length. Lately,
mPLUG-DocOwl [mPLUG-DocOwl](None),
Qwen-VL [bai2023qwen](http://arxiv.org/pdf/2309.16609v1), and
TextMonkey [liu2024textmonkey](http://arxiv.org/pdf/2403.14252v1) leverage publicly
available document-related VQA datasets to further enhance the
text-centric VQA capabilities. Despite the promising results achieved by
existing LLMs/MLLMs in text-centric VQA tasks, their focus on
high-resource languages such as English or Chinese has posed challenges
in achieving reasonable performance for low-resource languages. This is
primarily due to the lack of data or benchmarks for these low-resource
languages.
## Multilingual text-centric VQA Benchmarks
VQA has garnered significant attention in recent years, with numerous
studies, datasets, and benchmarks being proposed to advance the
field [biten2019scene](http://arxiv.org/pdf/2304.01603v1), [mathew2021docvqa](http://arxiv.org/pdf/2111.05547v1), [pham2024viocrvqa](http://arxiv.org/pdf/2404.18397v1), [singh2019towards](http://arxiv.org/pdf/1811.11903v1), [mishra2019ocr](http://arxiv.org/pdf/2010.02582v1), [mathew2022infographicvqa](http://arxiv.org/pdf/2104.12756v2), [masry-etal-2022-chartqa](https://doi.org/10.18653/v1/2022.findings-acl.177), [zhu2016visual7w](http://arxiv.org/pdf/2306.04938v1), [krishna2017visual](http://arxiv.org/pdf/1602.07332v1), [antol2015vqa](http://arxiv.org/pdf/1309.1125v1), [marino2019ok](http://arxiv.org/pdf/1906.00067v2), [sheng2021human](http://arxiv.org/pdf/1810.02358v2), [liu2024visual](http://arxiv.org/pdf/2402.11690v1), [gao2015you](http://arxiv.org/pdf/1505.05612v3), [gan2020large](http://arxiv.org/pdf/2302.02502v2), [liu-etal-2021-visually](https://doi.org/10.18653/v1/2021.emnlp-main.818).
Many datasets have been created that encompass scene text of various
domains, including natural
images [biten2019scene](http://arxiv.org/pdf/2304.01603v1), [singh2019towards](http://arxiv.org/pdf/1811.11903v1), scanned
documents [mathew2021docvqa](http://arxiv.org/pdf/2111.05547v1), [mathew2022infographicvqa](http://arxiv.org/pdf/2104.12756v2),
book and movie covers [mishra2019ocr](http://arxiv.org/pdf/2010.02582v1). One notable
limitation of these datasets is their predominant focus on
English [biten2019scene](http://arxiv.org/pdf/2304.01603v1), [singh2019towards](http://arxiv.org/pdf/1811.11903v1), [mathew2021docvqa](http://arxiv.org/pdf/2111.05547v1), [mathew2022infographicvqa](http://arxiv.org/pdf/2104.12756v2)
or other high-resource languages such as
Chinese [qi-etal-2022-dureadervis](https://doi.org/10.18653/v1/2022.findings-acl.105), [gao2015you](http://arxiv.org/pdf/1505.05612v3) and
Japanese [shimizu2018visual](http://arxiv.org/pdf/1810.02358v2), [nguyen2023vlsp2022](http://arxiv.org/pdf/1810.02358v2), which
restricts the applicability of VQA systems for low-resource languages
such as Thai and Vietnamese.
There is a recent effort toward extending VQA tasks to a wider range of
languages [gupta2020unified](http://arxiv.org/pdf/2204.14264v2), [pfeiffer-etal-2022-xgqa](https://doi.org/10.18653/v1/2022.findings-acl.196), [vivoli2022must](http://arxiv.org/pdf/1902.05660v1), [changpinyo-etal-2023-maxm](https://doi.org/10.18653/v1/2023.findings-emnlp.176), [li2023empirical](http://arxiv.org/pdf/1810.02358v2), [raj-khan-etal-2021-towards-developing](https://doi.org/10.18653/v1/2021.findings-emnlp.151)
by providing multilingual VQA datasets. For example,
[gao2015you](http://arxiv.org/pdf/1505.05612v3) created a free-form bilingual VQA dataset
(FM-IQA) containing over 150,000 images and 310,000 freestyle Chinese
question-answer pairs together with their English translations.
[raj-khan-etal-2021-towards-developing](https://doi.org/10.18653/v1/2021.findings-emnlp.151) developed a
large-scale multilingual and code-mixed VQA dataset (MuCo-VQA)
supporting five languages. Of more relevance are the works xGQA (8
languages) [pfeiffer-etal-2022-xgqa](https://doi.org/10.18653/v1/2022.findings-acl.196) and MaXM (7
languages) [changpinyo-etal-2023-maxm](https://doi.org/10.18653/v1/2023.findings-emnlp.176), which apply
translation-based protocols to expand VQA data beyond English. However,
the translation-based multilingual VQA datasets inherently face issues,
such as the “Visual-textual misalignment” problem, where only the text
in question-answer pairs is processed, while the visual text in images
is overlooked. Additionally, nuanced meaning and context are often
distorted, language bias is introduced by machine translation models, and
the coverage of certain question types is limited, as highlighted
by [changpinyo-etal-2023-maxm](https://doi.org/10.18653/v1/2023.findings-emnlp.176). Moreover, none of the
previous multilingual datasets focus on text-centric scenarios where
multilingual text frequently occurs.
Our benchmark distinguishes itself by focusing on multilingual
text-centric VQA scenarios using human expert annotations. To the best
of our knowledge, the MTVQA benchmark is the first dataset to provide
native human annotations for such scenarios. It covers 9 languages,
thereby facilitating the training and evaluation of multilingual models
in diverse linguistic contexts. Additionally, our dataset can gauge the
VQA system’s ability for not only high-resource languages but also those
that are typically underrepresented in current
datasets [biten2019scene](http://arxiv.org/pdf/2304.01603v1), [singh2019towards](http://arxiv.org/pdf/1811.11903v1), [mathew2021docvqa](http://arxiv.org/pdf/2111.05547v1), [mathew2022infographicvqa](http://arxiv.org/pdf/2104.12756v2), [gao2015you](http://arxiv.org/pdf/1505.05612v3).
The MTVQA benchmark addresses a significant gap in existing datasets by
catering to the crucial needs of low-resource languages through
annotations from native speakers across multiple languages. Our
pioneering efforts distinctly position the MTVQA benchmark as a unique
multilingual VQA resource, advancing the frontier of machine learning
research.
# MTVQA Benchmark
The MTVQA Benchmark covers 9 languages: Arabic (AR), Korean (KO),
Japanese (JA), Thai (TH), Vietnamese (VI), Russian (RU), French (FR),
German (DE), and Italian (IT). In this section, we describe in detail
how we establish the MTVQA benchmark, including the collection of raw
image data and two-round human expert annotations, which are independent
of each other.
## Data Collection
Our purpose is to develop a multilingual VQA benchmark capable of
evaluating the QA performance of MLLMs in multilingual text-centric
scenarios, thus the raw data collection process is mainly oriented
towards text-centric images from natural scenarios and document
scenarios. To ensure the diversity and quality of data, we collect not
only the raw image data from publicly available datasets, including the
multilingual scene text recognition images from
MLT2019 [nayef2019icdar2019](http://arxiv.org/pdf/1909.07145v1) and PowerPoint slides (PPTs)
sourced from the internet, but also data collected from the countries
where each language is spoken. Furthermore, the collected data covers multiple fine-grained
scenarios (Fig. [fig:data_distribution]),
such as menus, logos, maps, bills, PPTs, research papers, *etc*. As
a result, we gather a total of 1,220 images from document scenarios and
876 images from natural scenarios in the test set of the MTVQA
benchmark. To ensure the visual-textual alignment, for text-rich images
lacking text and language annotations, we subject them to a standardized
data cleaning process, which includes text recognition and language
classification. Afterward, we organize all the text-rich images we have
obtained into language-specific groups, preparing them for the
subsequent stage of data annotation.
## Human Expert Annotation
In order to obtain informative and accurate text-related QA pairs on the
language-specific grouped images, we recruit a group of annotators with
expertise from local regions of each language. It is worth noting that
all these annotators are native speakers of their respective languages,
ensuring their deep understanding and proficiency in the linguistic
nuances and cultural context necessary for precise annotations.
Considering the subjective nature of the text-image understanding task,
we have implemented a further division within the annotation team. This
division involves separating the team into two independent groups, with
one group dedicated to generating and responding to questions based on
the provided images, while the other group focuses on evaluating and
correcting the QA pair results. This raise-then-correct paradigm ensures
a comprehensive and reliable assessment of the text-image understanding
process. Additionally, each language’s annotation results undergo a 10%
sampling inspection by a quality inspector. If the QA pairs fail to meet
the criteria, they are sent back for re-annotation. Prior to commencing
the formal human expert annotation task, all annotators undergo unified
training and receive annotation examples. The brief diagram of the
two-round annotation process is shown in Figure
[fig:anno_process] and we
elaborate on it in the following subsections.
**First Round Questioning and Answering.** For the first round of
annotation tasks, we assigned 3 annotators for each language to manually
generate original QA results. Given a text-centric image from our
collection, annotators are first required to read the texts in the image
and analyze other contents in the image in a comprehensive and detailed
manner. They must then raise 4 meaningful and distinct questions based
on the content in the image and give the answers. All annotators adhere
to the following criteria: (1) the first three questions should satisfy
that answering these questions requires direct reading of the textual
information in the image, (2) the fourth question requires reasoning
about the text in the image to answer (3) the questions and answers must
be reasonably correct and consistent with the content of the image, and
(4) the answer should be as concise as possible and free of
nonsense (*e.g.*, when the question is “When is the volunteer
recruitment period”, the answer should be “9:00-16:00” rather than “The
volunteer recruitment period is 9:00-16:00”). It is worth mentioning that
we require concise answers to make the evaluation process
more friendly and more reliable, because we try to keep the evaluation
metrics unaffected by extraneous content in the answer sentence.
**Second Round Evaluation and Correction.** To reduce the effect of human
subjective cognitive bias on our MTVQA benchmark and get high-quality
question-answer pairs, we assigned 2 annotators for each language for
the annotation evaluation and correction process. Based on the provided
images and the first-round annotation results, the annotators must
follow these rules of judgment and steps for the annotation: (1) Whether
the question is related to the text in the image. If not, discard the
current question-answer pair, (2) Whether the answer is correct. If not,
modify the answer, and (3) Whether the answer repeats the content from
the question. If so, remove the repeated content to ensure a concise
answer.
## Data Statistics
We instruct the annotators to complete the above human expert annotation
work towards the text-centric VQA tasks and construct the MTVQA
benchmark consisting of 8,794 images and 28,607 question-answer pairs
that cover the 9 languages. The MTVQA benchmark is divided into a
training set containing 6,678 images and 21,829 question-answer pairs,
and a test set containing 2,116 images and 6,778 question-answer pairs.
The detailed data distribution can be seen in Figure
[fig:data_distribution]. To
visualize the vocabulary richness of our benchmark, we calculate the
word frequencies for each language and present them in the form of word
clouds as shown in Figure
[fig:word_cloud]. In Figure
[fig:leng_statistics] we show the statistics of question and answer lengths using the GPT-4o tokenizer.
# Experiments
## Baseline Models
For the MTVQA benchmark, we evaluate the following instruction-tuned
general MLLMs, (1) **Open-source MLLMs:**
InternVL-V1.5 [chen2023internvl](http://arxiv.org/pdf/2312.14238v3),
InternLM-Xcomposer2-4KHD [dong2024internlm](http://arxiv.org/pdf/2404.06512v1),
Mini-Gemini-HD-34B [li2024mini](http://arxiv.org/pdf/2305.16318v2),
Llava-Next-34B [liu2024llavanext](https://llava-vl.github.io/blog/2024-01-30-llava-next/),
DeepSeek-VL [lu2024deepseek](http://arxiv.org/pdf/2402.17510v1),
YI-VL-34B [young2024yi](http://arxiv.org/pdf/2304.11090v4),
TextSquare [tang2024textsquare](http://arxiv.org/pdf/2307.04087v3),
TextMonkey [liu2024textmonkey](http://arxiv.org/pdf/2403.14252v1) and mPLUG-DocOwl
1.5 [hu2024mplug](None); (2) **Closed-source MLLMs:** GPT-4V,
Gemini Ultra, QwenVL Max, QwenVL Plus, Claude3 Opus, Claude3 Sonnet and
GLM4V. For the closed-source MLLMs, we use the chat version through the
official APIs, while for the open-source MLLMs, we utilize the instruct
versions. Note that all the model weights of the open-source
MLLMs evaluated in our experiments can be downloaded from the
HuggingFace Model Hub. For the open-source MLLMs, the model size varies
from 7B to 34B.
## Implementation Details
We conduct the evaluation experiments over the baseline MLLMs with their
default settings, ignoring the effect of generation configuration on the
results. To make the output of MLLMs more evaluation-friendly, we design
the following prompt format to limit the output length: “Answer the
question using a word or phrase in the language of the question. +
\<Question\>”, where \<Question\> represents the corresponding question
for the input image. The extra prefix added to the raw question
constrains the answer to be as concise as possible. Besides, we utilize the
InternLM-Xcomposer2-4KHD [dong2024internlm](http://arxiv.org/pdf/2404.06512v1) as the
backbone for the instructional fine-tuning experiment on the MTVQA
training set. In the instructional fine-tuning process, we follow the
default training settings [dong2024internlm](http://arxiv.org/pdf/2404.06512v1) with “HD-16”
and train on MTVQA training set for 1 epoch.
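A minimal sketch of how such an evaluation prompt might be assembled is shown below; the prefix is the one quoted above, while the joining convention and the example question are assumptions.

```python
# Sketch of assembling the evaluation prompt described above; the prefix is
# the one quoted in the text, the question is a hypothetical example.
PREFIX = "Answer the question using a word or phrase in the language of the question."

def build_prompt(question: str) -> str:
    # The prefix is prepended to the raw question to keep answers concise.
    return f"{PREFIX} {question}"

print(build_prompt("Quel est le prix du menu du jour ?"))
```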
## Evaluation Results
**Zero-shot testing** To demonstrate the quantitative comparison results
in the above MLLMs, we follow
TextMonkey [liu2024textmonkey](http://arxiv.org/pdf/2403.14252v1) with accuracy as the
evaluation metric. That is, the model output is only counted as correct
if it contains the ground truth. The complete evaluation results are
shown in Table 2, where Claude3 Opus achieves the
highest average accuracy of 25.7$\%$ on the 9 languages. It indicates
that the multilingual text-centric VQA tasks remain a big challenge,
even for the state-of-the-art open-source and closed-source MLLMs. From
the metrics across languages, both open-source and closed-source models
performed significantly better on Indo-European languages using the
Latin alphabet, including DE, FR, and IT in our benchmark, compared to
other languages, which results from the distribution of realistically
available training data and the genetic relationship of different
languages. In addition, all closed-source models except GLM4V outperform
the open-source models overall across the nine languages, which may be
due to the contribution of pre-training on multilingual data. We also
found that the document-focused MLLMs, like
TextSquare [tang2024textsquare](http://arxiv.org/pdf/2307.04087v3) and
TextMonkey [liu2024textmonkey](http://arxiv.org/pdf/2403.14252v1), do not significantly
outperform other open-source models on the metrics of these 9 languages.
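A minimal sketch of the containment-based accuracy metric described above (the protocol the paper follows from TextMonkey); the lowercasing and stripping details are an assumption.

```python
# Sketch of the evaluation metric: a prediction counts as correct if it
# contains the ground-truth answer. Normalization (lowercase/strip) is assumed.
def is_correct(prediction: str, ground_truth: str) -> bool:
    return ground_truth.strip().lower() in prediction.strip().lower()

def accuracy(predictions: list[str], ground_truths: list[str]) -> float:
    correct = sum(is_correct(p, g) for p, g in zip(predictions, ground_truths))
    return correct / len(ground_truths)

print(accuracy(["The period is 9:00-16:00", "Paris"], ["9:00-16:00", "Lyon"]))  # 0.5
```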
**Instruction tuning** As shown in Table
2, the instruction tuning
experiment on the MTVQA benchmark brings an 8.5$\%$ improvement in average
accuracy. With respect to specific languages, French sees the largest
improvement of 14.2$\%$ in accuracy, while Russian has the smallest
improvement of 1.7$\%$ in accuracy. The results demonstrate that MLLMs
vary in their ability to understand and learn from text-centric data in
different languages, leaving great potential for future research of
multilingual text-centric MLLMs pre-training.
# Limitation
The current iteration of MTVQA exhibits certain constraints that warrant
attention. Primarily, the linguistic diversity incorporated is not
exhaustive; several lesser-spoken languages remain unrepresented. Future
enhancements will aim to broaden the multilingual scope of the dataset.
Additionally, the dataset currently offers a singular canonical response
for each question. Recognizing the multifaceted nature of the inquiry,
subsequent versions will endeavor to include a spectrum of plausible
answers to reflect the varied perspectives inherent to each question.
# Conclusion
In this paper, we introduce MTVQA, a multilingual TEC-VQA benchmark
featuring high-quality human expert annotations in 9 diverse languages.
We believe that MTVQA is the first benchmark of its kind to provide
fully manual annotations specifically tailored to text-centric
scenarios. The results obtained from both closed- and open-source MLLMs
on our MTVQA dataset indicate that there is still room for improving
their performance in multilingual text-centric scenarios. Although the
current version of MTVQA has constraints regarding linguistic diversity
and singular responses per question, we are confident that this dataset
can still inspire researchers within the TEC-VQA community with new
perspectives and ideas.
TinyChart: Efficient Chart Understanding with Visual Token Merging and Program-of-Thoughts Learning
2024-04-25
Liang Zhang, Anwen Hu, Haiyang Xu, Ming Yan, Yichen Xu, Qin Jin, Ji Zhang, Fei Huang
Charts are important for presenting and explaining complex data relationships. Recently, multimodal large language models (MLLMs) have shown remarkable capabilities in various chart understanding tasks. However, the sheer size of these models in terms of parameters and computational requirements limits their use in resource-constrained environments. In this paper, we present TinyChart, an efficient MLLM for chart understanding with only 3B parameters. TinyChart overcomes two key challenges in efficient chart understanding: (1) reduce the burden of learning numerical computations through a Program-of-Thoughts (PoT) learning strategy, which trains the model to generate Python programs for numerical calculations, and (2) reduce lengthy vision feature sequences produced by the vision transformer for high-resolution images through a Vision Token Merging module, which gradually merges most similar vision tokens. Extensive experiments demonstrate that our 3B TinyChart achieves SOTA performance on a variety of chart understanding benchmarks including ChartQA, Chart-to-Text, Chart-to-Table, OpenCQA, and ChartX. It outperforms several chart understanding MLLM with up to 13B parameters such as ChartLlama and ChartAst, and close-sourced general-purpose MLLM GPT-4V on ChartQA. It also demonstrates its superior efficiency with higher throughput during inference due to a smaller model scale and more efficient vision encoding. Our code and model are available at https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/TinyChart.
# Introduction
As an important information source, charts can intuitively visualize
data in various visual presentation forms and have become an
indispensable part of information dissemination, business
decision-making, and academic research [chartsurvey](chartsurvey).
With the rapid growth of multimodal data, automatically comprehending
charts has become a pressing need and received increasing attention from
the research
community [chartllama](chartllama), [chartast](chartast), [chartinstruct](chartinstruct), [onechart](onechart).
Recently, Multimodal Large Language Models (MLLMs) have shown strong
capability in comprehending images and following
instructions [gpt4](gpt4), [llava](llava), [mplugowl](mplugowl), [llava1.5](llava1.5), [sphinx](sphinx), [mplugowl2](mplugowl2), [xcomposer](xcomposer), [xcomposer2](xcomposer2), [xcomposer2-4k](xcomposer2-4k).
Based on these MLLMs, some recent
works [chartllama](chartllama), [chartast](chartast), [chartinstruct](chartinstruct), [paperowl](paperowl)
further build chart understanding models by collecting and constructing
versatile chart comprehension datasets and performing supervised
fine-tuning.
However, despite their remarkable success, current chart understanding
models still face three main limitations: (1) Considerable amount of
parameters makes training and deployment challenging. For example,
ChartLlama [chartllama](chartllama) is a model with 13 billion
parameters, which is hard to deploy on a single consumer GPU with less
than 26GB of VRAM. (2) They are prone to errors when tackling questions
involving numerical calculations [chartast](chartast), which are
difficult to directly answer without any reasoning steps. (3) They
struggle with efficiently encoding for high-resolution images since the
standard vision transformer would produce lengthy feature sequences.
To overcome such limitations in chart understanding, we propose an
efficient and powerful MLLM, namely **TinyChart**. As shown in
Figure 1, through the efficient visual encoding
and Program-of-Thoughts learning strategy, TinyChart achieves
state-of-the-art performances on various chart understanding benchmarks
with only 3B parameters, while excelling in faster inference throughput.
For efficient visual encoding, we propose to merge visual tokens based
on the observation that chart images often contain large areas of color
and white spaces. Inspired by [tome](tome), we adopt a
parameter-free Visual Token Merging module inside each vision
transformer layer, which aggregates the most similar visual tokens and
gradually reduces the length of the visual feature sequence, thus making
it possible to efficiently encode high-resolution chart images. This
enables the model to maintain high-resolution chart image input while
controlling the computation load.
Moreover, inspired by [pot](pot), we propose
Program-of-Thoughts learning that enhances the model’s ability to
resolve mathematical problems. According to statistics on
ChartQA [chartqa](chartqa), 42% of questions for charts require
numerical answers, and most existing models struggle to perform
numerical question answering [matcha](matcha), [chartast](chartast). To learn
chart understanding more efficiently, we train the model to generate
Python programs for the computation problems step by step. The programs
are then passed to a Python interpreter to produce the final answer. To
support Program-of-Thoughts learning, we further construct the
ChartQA-PoT dataset based on ChartQA [chartqa](chartqa). The QA
pairs in our ChartQA-PoT are constructed in two ways: (1) Template-based
PoT construction, which generates questions and programs by filling in
the manually written templates based on chart data. (2) GPT-based PoT
construction, which leverages `gpt-3.5-turbo` [gpt3.5](gpt3.5) to
generate programs based on human-written questions. Experimental results
show that Program-of-Thoughts learning can significantly improve the
question-answering, especially numerical question answering ability of
TinyChart.
The main contributions of this work are as follows:
- We introduce TinyChart, an efficient multimodal chart understanding
model, which outperforms several 13B MLLMs and achieves
state-of-the-art performances on a variety of chart understanding
benchmarks, while excelling in faster inference speed at the same
time.
- We propose a Program-of-Thoughts (PoT) learning strategy to enhance
the model in learning numerical calculation and carefully build a
PoT dataset ChartQA-PoT.
- We adopt Visual Token Merging for efficient vision encoding, which
significantly reduces the length of vision feature sequences and
enables the model to encode high-resolution chart images with
constrained computing resources.
# Related Work
## Chart Understanding
Chart understanding requires the model to comprehend chart contents and
accomplish related tasks specified by the instructions. This field
encompasses low-level recognition tasks, such as data
extraction [deplot](deplot), and high-level tasks, such as
question-answering (QA) [chartqa](chartqa), [plotqa](plotqa), [dvqa](dvqa),
summarization [chart2text](chart2text), [chart2text-8k](chart2text-8k), and
re-rendering [chartllama](chartllama). As charts often contain OCR
text pivotal for data interpretation, and many instructions require the
model to perform numerical calculations, chart understanding demands
robust text recognition capabilities and computational reasoning from
the model. Early
approaches [lorra](lorra), [plotqa](plotqa), [deplot](deplot), [chartstamp](chartstamp), [mpmqa](mpmqa), [qc_cap](qc_cap)
rely on pipeline methods that use off-the-shelf OCR tools or component
detectors to transform charts into data tables and other textual
representations. They then employ language models to complete the
specified tasks. These pipeline approaches, limited by their inability
to optimize jointly, were hampered by error accumulation. Recent
studies [unichart](unichart), [matcha](matcha), [chartllama](chartllama), [chartast](chartast), [chartinstruct](chartinstruct), [mmc](mmc)
have shifted towards end-to-end methods based on multimodal large
language models. These studies adopt the structure of multimodal large
language
models [llava](llava), [llava1.5](llava1.5), [mplugowl](mplugowl), [mplugowl2](mplugowl2), [sphinx](sphinx) and
enhance chart understanding abilities through supervised
fine-tuning [instructgpt](instructgpt) with substantial chart
instruction data [chartllama](chartllama), [chartast](chartast), [chartinstruct](chartinstruct).
Although these models demonstrate improvement in performance, their
extensive parameter size prevents them from being easily trained or
deployed under resource-constrained scenarios. In this paper, we
demonstrate that a 3B MLLM is enough to achieve state-of-the-art
performance on several chart understanding tasks. Meanwhile, it has been
well observed that these models are prone to numerical
errors [matcha](matcha), [chartinstruct](chartinstruct), [chartast](chartast).
Though [chartast](chartast) tries to construct executable command
lines in JSON format based on a template to eliminate numerical errors,
we argue that it is insufficient to fully address this issue for two
reasons: 1) The executable command lines in JSON format produced
by [chartast](chartast) rely on a specific computational
backend, which limits their versatility. 2) Template-based
programs can only cover rather limited scenarios. Instead, we construct
the Program-of-Thoughts learning dataset with the combination of both
templates and GPT-generated programs. This allows the model to more
effectively learn how to solve numerical problems.
## Multimodal Large Language Model
Multimodal large language models (MLLM) exhibit strong capabilities in
visual understanding and instruction
following [gpt4](gpt4), [gemini](gemini). They typically comprise
transformer-based visual encoders, large language models, and
vision-language
connectors [llava](llava), [llava1.5](llava1.5), [tinyllava](tinyllava), [mplugowl](mplugowl), [mplugowl2](mplugowl2), [xcomposer](xcomposer), [xcomposer2](xcomposer2), [mplug-octopus](mplug-octopus).
These models are generally trained on extensive general image-text data
for cross-modal alignment and instruction fine-tuning. Although some
studies have showcased a degree of OCR capability in these multimodal
large language models [ocr_mllm](ocr_mllm), [trie](trie), their performance
on document and chart understanding benchmarks remains suboptimal due to
their low input resolution [ureader](ureader), [xcomposer2-4k](xcomposer2-4k).
Efforts in the general document domain have attempted to improve the
fine-grained understanding capabilities of MLLMs by increasing
resolution [qwenvl](qwenvl), segmenting
images [ureader](ureader), [sphinx](sphinx), [docowl1.5](docowl1.5), [xcomposer2-4k](xcomposer2-4k),
utilizing frequency domain signals [docpedia](docpedia), and
introducing additional high-resolution
encoders [cogagent](cogagent). However, these models often suffer
from low efficiency, primarily due to the excessive length of the
high-resolution visual sequences. The visual token merging method
adopted in this paper can significantly reduce the length of visual
feature sequences and relax the computational requirements with
high-resolution input.
# TinyChart
## Model Architecture
Figure [fig:overview] shows the overview
framework of our proposed TinyChart. It follows the typical architecture
of the multimodal large language model (MLLM), which consists of a
vision transformer encoder, a vision-language connector, and a large
language model. To encode high-resolution visual input effectively, we
insert the visual token merging module inside each vision transformer
layer.
### Vision Transformer Encoder
The vision transformer encoder aims to encode chart images into vision
features. A standard vision transformer [vit](vit) first
resizes the input image $I$ into a fixed resolution and crops the image
into patches. Then the patches are treated as vision tokens and
processed with transformer encoder layers [transformer](transformer).
Suppose the input image $I^{N\times N}$ is in resolution $N \times N$,
and the patch size is $P \times P$, the length of vision tokens would be
$(N // P)^2$. Since the standard transformer layer does not reduce the
sequence length, the vision transformer finally produces a vision
feature in length $(N // P)^2$. In practice, when $N$ is large, the
vision feature can be very long and inefficient for the language model
to handle.
Since key information (such as OCR words) in a chart can be
unrecognizable in low-resolution images [docowl1.5](docowl1.5),
high-resolution input is essential for chart understanding. However,
charts typically contain a large number of color blocks and blank
spaces, where patches are visually similar. To achieve efficient and
effective chart understanding, we apply Visual Token
Merging [tome](tome) in each transformer layer. The process of
Visual Token Merging is shown in
Figure [fig:tokenmerge]. By merging the $r$
most similar token pairs, it reduces the length of the vision feature by
$r$ in each layer. We measure the similarity between two tokens using
the cosine distance between Keys from self-attention
following [tome](tome). As shown in the lower part of
Figure [fig:tokenmerge], Vision Token
Merger finds the top-$r$ similar token pairs through bipartite graph
matching. It first divides the vision tokens into two disjoint sets.
Then, for each token in one set, it finds the most similar tokens in the
other set and draws an edge between the two tokens. After that, it only
keeps the top-$r$ most similar edges and merges the features of the two
endpoints through average pooling. Note that not only spatially adjacent
visual tokens are subject to merging. Non-adjacent tokens can also be
merged if they belong to different subsets and are similar enough.
The visual token merging operation aggregates tokens with a similar
feature into one. Therefore, it will reduce the proportion of this
visual feature in the attention calculation in the following transformer
layer, since the number of this feature has decreased. To solve this
issue, we let the attention operation consider the actual number of
patches $s$ represented by each token as follows: $$\begin{aligned}
\mathrm{Attention}=\mathrm{softmax}\left( \frac{QK^\top}{\sqrt{d}} + \log s \right) V
\end{aligned}$$
Where $Q$, $K$, $V$ denotes the query, key, and value of self-attention
which are linear projected from the hidden
states [transformer](transformer). By adding $\log s$ inside
$\mathrm{softmax}$, the token that merged from $s$ patches are
duplicated by $s$ times in the attention
calculation [tome](tome).
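Below is a minimal NumPy sketch of one visual-token-merging step as described above (bipartite matching on key similarity, size-weighted average pooling, and the per-token size kept for the $\log s$ attention bias); the alternating split and the shapes are simplifying assumptions, not the exact TinyChart implementation.

```python
# Sketch of one visual token merging step (ToMe-style), under simplifying
# assumptions: tokens are split into two sets by alternating index, the r most
# similar cross-set pairs are merged by size-weighted average pooling, and the
# per-token size s is kept for the log(s) attention bias.
import numpy as np

def merge_step(tokens, keys, sizes, r):
    # tokens: (n, d) hidden states, keys: (n, d) attention keys, sizes: (n,)
    a_idx, b_idx = np.arange(0, len(tokens), 2), np.arange(1, len(tokens), 2)
    ka = keys[a_idx] / np.linalg.norm(keys[a_idx], axis=1, keepdims=True)
    kb = keys[b_idx] / np.linalg.norm(keys[b_idx], axis=1, keepdims=True)
    sim = ka @ kb.T                       # cosine similarity between the two sets
    best_b = sim.argmax(axis=1)           # most similar partner in set B for each A token
    best_sim = sim[np.arange(len(a_idx)), best_b]
    merge_a = np.argsort(-best_sim)[:r]   # the r most similar edges get merged

    merged_tokens, merged_sizes = list(tokens[b_idx]), list(sizes[b_idx])
    for i in merge_a:
        j = best_b[i]                     # destination token in set B
        sa, sb = sizes[a_idx[i]], merged_sizes[j]
        merged_tokens[j] = (tokens[a_idx[i]] * sa + merged_tokens[j] * sb) / (sa + sb)
        merged_sizes[j] = sa + sb
    keep_a = [i for i in range(len(a_idx)) if i not in set(merge_a)]
    out_tokens = np.vstack([tokens[a_idx[keep_a]], np.stack(merged_tokens)])
    out_sizes = np.concatenate([sizes[a_idx[keep_a]], np.array(merged_sizes)])
    return out_tokens, out_sizes          # sizes feed the +log(s) bias in attention

tokens = np.random.randn(16, 8)
toks, sz = merge_step(tokens, tokens.copy(), np.ones(16), r=4)
print(toks.shape, sz.sum())               # (12, 8) 16.0
```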
### Vision-Language Connector
The vision language connector aims to project the vision features into
the embedding space of the large language model.
Following [llava1.5](llava1.5), [tinyllava](tinyllava), we implement the
vision-language connector as a multiple-layer perceptron with one hidden
layer and GeLU [gelu](gelu) activation.
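In code, such a connector is simply a two-layer MLP with GELU; the sketch below uses NumPy with hypothetical dimensions and random weights, and is not the actual TinyChart module.

```python
# Sketch of a vision-language connector: a 2-layer MLP with GELU that maps
# vision features (dim d_v) into the LLM embedding space (dim d_llm).
# Dimensions and the random weights are hypothetical.
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

d_v, d_llm = 1152, 2560
rng = np.random.default_rng(0)
W1, b1 = rng.normal(scale=0.02, size=(d_v, d_llm)), np.zeros(d_llm)
W2, b2 = rng.normal(scale=0.02, size=(d_llm, d_llm)), np.zeros(d_llm)

def connector(vision_features):
    # vision_features: (num_tokens, d_v) -> (num_tokens, d_llm)
    return gelu(vision_features @ W1 + b1) @ W2 + b2

print(connector(rng.normal(size=(144, d_v))).shape)  # (144, 2560)
```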
### Large Language Model
The large language model aims to comprehend both visual features and
language instructions, and then generate responses to accomplish chart
understanding tasks. It is implemented as a transformer
decoder [transformer](transformer) with a causal attention mask. The
training objective of the model is language modeling. Assuming the
visual features is $V$, the language instruction is $L$, and the
response is $R$, then the loss function is defined as follows:
$$\begin{aligned}
\mathcal{L}=\frac{1}{T}\sum_{i=1}^T\mathrm{LLM}(R_i \mid V, L, R_{<i})
\end{aligned}$$

## Program-of-Thoughts Learning

### Template-based PoT

As shown in
Figure [fig:pot_construction], the
Template-based PoT is constructed based on human-written templates
containing placeholders for both questions and code. The template
questions involve common numerical operations such as calculating the
sum, average, minimal, and maximum values. We adopt the 40 template
questions proposed by PlotQA [plotqa](plotqa) and manually write
their corresponding template Python code to solve them. As shown in the
top-left part of
Figure [fig:pot_construction], the
template code consists of several variable assignment operations with
NumPy [numpy](numpy) functions to perform calculations. The
beginning steps usually involve extracting the relevant data from the
chart and assigning them to variables. The final computed result is
stored in a variable named "Answer". For each placeholder in the
template, we identify all possible values from the data table of the
chart and randomly select one to fill in the placeholder. After removing
incorrect or unreasonable fill-ins using rule-based methods, we
finally construct 119,281 (question, PoT answer) pairs over 17,498
images from ChartQA.
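To make the format concrete, here is a small, hypothetical example of what a filled-in template question and its program could look like (it is not taken from ChartQA-PoT itself); the final result is stored in a variable named `Answer`, as described above.

```python
# Hypothetical template-style PoT sample: the question asks for the sum of two
# bars, and the program extracts the relevant values and computes the answer,
# storing it in a variable named "Answer" as described in the text.
import numpy as np

# Question (filled-in template): "What is the total revenue of France and Germany?"
values = np.array([12.5, 9.5])   # revenue of France, revenue of Germany (from the chart)
Answer = np.sum(values)

print(Answer)  # 22.0
```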
### GPT-based PoT
Although the template-based method allows for the construction of a
large number of question-answer pairs, the diversity of these pairs is
limited due to the fixed templates. To improve the generalization
ability of PoT learning, we have additionally built GPT-generated PoT
data by leveraging the powerful command-following and code-generation
capabilities of large language models. Specifically, we prompt
`gpt-3.5-turbo` [gpt3.5](gpt3.5) to generate PoT answers similar
to the template PoT format for questions annotated in ChartQA using
in-context examples. As shown in
Figure [fig:pot_construction], since
`gpt-3.5-turbo` does not accept image input, we also provide the data
table corresponding to the chart as text input to `gpt-3.5-turbo`. We
screen the quality of the generated PoT answers by running them through
a Python interpreter. If the generated PoT answer cannot run on the
Python interpreter, or if the answer it produces differs from the one
annotated in ChartQA, the corresponding PoT answer is deleted.
In the end, we construct 21,303 (question, PoT Answer) pairs on 15,521
chart images.
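A minimal sketch of this screening step is given below, under the assumption that each candidate program stores its result in a variable named `Answer` (as in the template format); the `exec`-based check is our illustration, not necessarily the authors' exact tooling.

```python
# Sketch of screening GPT-generated PoT answers: execute the candidate program
# and keep it only if it runs and reproduces the annotated ChartQA answer.
# Using exec() here is an illustrative choice, not necessarily the authors' tooling.
def is_valid_pot(program: str, gold_answer: float, tol: float = 1e-6) -> bool:
    scope: dict = {}
    try:
        exec(program, scope)                      # run the candidate program
        return abs(float(scope["Answer"]) - gold_answer) <= tol
    except Exception:
        return False                              # syntax/runtime errors -> discard

candidate = "values = [12.5, 9.5]\nAnswer = sum(values)"
print(is_valid_pot(candidate, 22.0))  # True
```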
## Multitask Learning
We perform multitask learning to train our TinyChart model. We collect a
chart understanding dataset that contains 1.36M samples for supervised
fine-tuning. It covers various chart understanding tasks including chart
question answering, chart-to-text generation, chart-to-table generation,
and chart instruction following.
Table 1 shows the collection of our
training dataset. We mix data in different tasks together to jointly
train the model, and use task-specified instructions to enable the model
to differentiate between them. The training objective is language
modeling on response tokens as presented in
Eq.[eq:loss]. Note that in ablation studies,
we train solely with benchmark datasets due to limited computational
resources.
# Experiment
## Implementation Details
TinyChart is initialized from TinyLlava [tinyllava](tinyllava),
which utilizes the SigLIP [siglip](siglip) as the vision encoder
and Phi-2 [phi1.5](phi1.5) as the large language model. The
original input resolution of the vision encoder is
384$\times$384. We extend the input resolution to
512$\times$512 and 768$\times$768 and
apply visual token merging with $r=20$ and $r=84$ in each transformer
layer respectively. We train the entire model for 3 epochs with a batch
size of 512. The learning rate is set to $1e-4$, with a warmup in the
beginning 3% steps, and then decays to 0 at the end of training. The
total training process costs 3 days on 32 Tesla V100 GPUs with 32 GB
VRAMs.
## Evaluation Benchmarks
ChartQA [chartqa](chartqa) aims to generate a short answer to the
question based on the chart content. It includes a lot of questions that
require numerical calculation. We report the relaxed accuracy that
allows numerical error within 5% as the metric
following [chartqa](chartqa), [chartllama](chartllama), [chartast](chartast). Note that our
TinyChart with Program-of-Thoughts learning can perform ChartQA in the
following four settings:
- **Direct**: the model produces short answers directly.
- **PoT**: the model produces Python code. The answer is then
calculated through the Python interpreter.
- **Combine**: the model produces Python code for questions that
require calculation, and Direct answers for others. We determine
whether a question requires calculation with a simple rule-based
keyword detector. If the question contains one of the calculative
keywords[^1], the detector will treat it as a computational question
and prompt the model to generate a PoT answer. Otherwise, the model
is instructed to produce a Direct answer. Additionally, if the
generated program of a calculative question encounters syntax
errors, we let the model produce Direct answers for this question in
the Combine setting.
- **Oracle** We further introduce the Oracle setting for ChartQA
evaluation. Under this setting, we always choose the correct one
between the Direct and PoT answers after evaluating under both
settings. It is the upper bound of the combination across the two
answers.
We evaluate TinyChart under the Combine setting by default.
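A minimal sketch of the rule-based routing used in the Combine setting described above is shown below; the keyword list and the stub model/interpreter calls are hypothetical stand-ins (the paper's exact keyword list is given in a footnote not reproduced here).

```python
# Sketch of the Combine-setting router described above: questions containing
# calculative keywords go to the PoT branch (generate code, run it), the rest
# go to the Direct branch. Keyword list and the stub model calls are hypothetical.
CALC_KEYWORDS = ["sum", "total", "difference", "average", "ratio", "how many more"]

def generate_pot_answer(question: str) -> str:
    # Stub standing in for the model generating a Python program.
    return "values = [12.5, 9.5]\nAnswer = sum(values)"

def generate_direct_answer(question: str) -> str:
    # Stub standing in for the model generating a short textual answer.
    return "France"

def run_program(program: str) -> str:
    scope: dict = {}
    exec(program, scope)          # the interpreter computes the final answer
    return str(scope["Answer"])

def needs_calculation(question: str) -> bool:
    q = question.lower()
    return any(kw in q for kw in CALC_KEYWORDS)

def answer_chart_question(question: str) -> str:
    if needs_calculation(question):
        program = generate_pot_answer(question)
        try:
            return run_program(program)
        except Exception:
            pass                  # fall back to a Direct answer on program errors
    return generate_direct_answer(question)

print(answer_chart_question("What is the total of the two largest bars?"))  # "22.0"
print(answer_chart_question("Which country has the tallest bar?"))          # "France"
```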
Chart-to-Text aims to generate a chart summarization based on chart
content. We evaluate the model with the Pew
benchmark [chart2text](chart2text), and report
BLEU4 [bleu](bleu) as the metric.
Chart-to-Table aims to extract the underlying data table presented by
the chart. We evaluate the performance of Chart-to-Table with the data
table annotation provided by ChartQA [chartqa](chartqa)
following [chartllama](chartllama), [chartast](chartast). We report
RMS$_{F1}$ [deplot](deplot) as the metric.
Different from ChartQA, OpenCQA [opencqa](opencqa) evaluates the
ability of models to generate free-form answers to the chart-related
questions. We report BLEU4 [bleu](bleu) as the metric
following [chartinstruct](chartinstruct), [chartast](chartast).
ChartX [chartx](chartx) is a recently proposed benchmark that
contains more chart types. We evaluate the ChartX cognition tasks since
they are more challenging. It covers Question Answering, Chart
Description Generation, Chart Summary Generation, and Chart Redrawing.
We report the GPT-Accuracy for QA and GPT-score for the remaining 3
tasks as the metrics following ChartX [chartx](chartx).
## Main Results
Table [tab:main_result] shows an
extensive comparison between TinyChart and existing multimodal large
language models on 4 chart understanding benchmarks. Our TinyChart model
achieves state-of-the-art performance on ChartQA, Chart-to-Text,
Chart-to-Table, and OpenCQA, while also delivering higher inference
throughput. Specifically, with the input resolution set at
768$\times$768, TinyChart achieves an accuracy of 83.60
on ChartQA [chartqa](chartqa), surpassing several closed-source
models including GPT-4V, Gemini-Ultra, and
Qwen-VL-Max [qwenvl](qwenvl). It also outperforms previous
open-source SOTA ChartAst [chartast](chartast) on chart
understanding.
We find that previous models performed poorly on the ChartQA human
subset, with none of them achieving over 70%. In contrast, the
performance on the ChartQA augmented subset has approached 93.9%. This is
because the questions posed by human annotators involve more
computational problems [chartqa](chartqa) and are more
challenging. By leveraging the Program-of-Thoughts learning, TinyChart
achieves an accuracy of 73.34% on ChartQA-human, which is an improvement
of 7.44% over the previous state-of-the-art
ChartAst [chartast](chartast). This demonstrates the effectiveness
of our proposed learning method based on the Program-of-Thoughts.
We observe that models with higher input resolutions generally perform
better on chart understanding tasks. However, encoding high-resolution
charts leads to a decrease in inference speed (e.g., Qwen-VL vs.
Llava1.5, DocOwl1.5 vs. UReader, ChartAst vs. ChartLlama). By leveraging
visual token merging, TinyChart is able to accept higher-resolution
input images with a limited increase in computing demands, thus
achieving better performance. Due to the smaller model size and the
efficient visual token merging strategy, TinyChart achieves
significantly larger inference throughput compared to previous models.
In summary, these results demonstrate that TinyChart can achieve
efficient chart understanding with enhanced performance and faster
inference.
Table 1 shows the performance
comparison under different settings. Note that the performance of
ChartAst under the Combine setting is from [chartast](chartast),
which leverages a combination of Direct answer and executive JSON to
produce the final answer. The results indicate that our TinyChart model
could achieve SOTA performance on the Direct answer. By combining with
PoT answers, TinyChart could make further improvements. In addition,
since the combination of Direct and PoT answers is very simple, the
performance under the Combine setting still falls well behind the Oracle
setting. Further work could explore better ways to combine the two answers.
We divide the questions in ChartQA test set [chartqa](chartqa)
into two categories: calculative questions (761 of 2500) and
non-calculative questions (1739 of 2500) by checking whether they
contain calculative keywords mentioned above.
Table [tab:cal_questions] shows the
performance of TinyChart@768 on these two types of questions under
different settings. We observe that PoT significantly improves the
performance on calculative questions compared to Direct settings (78.98
vs. 56.64) and thus it shows overall performance gains (80.84 vs.
76.36). The simple combination of the Direct and PoT strategies yields
further improvements.
To further assess the generalizability of TinyChart, we compare our
model with end-to-end General MLLM and Chart MLLM on ChartX-Cognition
benchmark [chartx](chartx), since it covers visually diverse
chart types. We use TinyChart@768 to perform inference on ChartX without
additional fine-tuning. As shown in
Table 2, benefiting from our
Program-of-Thoughts learning method, TinyChart achieves a 33.35
GPT-Accuracy on the QA task, even surpassing the GPT-4V model. Though it
falls behind GPT-4V in Summary, Description, and Redrawing tasks,
TinyChart still performs better than open-source Chart MLLMs including
ChartLlama and ChartAst. It indicates that TinyChart has a strong
capability to generalize across various chart types.
## Ablation Studies
To verify the effectiveness of visual token merging and
program-of-thoughts learning, we conduct ablation studies in
Table [tab:ablation].
The upper block in
Table [tab:ablation] shows the performance
of the model with and without the use of Program-of-Thoughts training
data. Comparing Row 2 with Row 1, we observe that training solely with
template-based PoT improves the model’s ability to generate direct
answers (71.12 vs. 70.72). This improvement is attributed to PoT
learning enhancing the model’s reasoning abilities. At this point, the
PoT answers produced by the model are less accurate than direct answers
(55.44 vs. 71.12), which may be due to the inability of template-based
PoT to cover all questions. However, when we ask the model to generate
PoT answers for questions that require calculation and combine with
direct answers, it outperforms solely direct answers (73.00 vs. 71.12).
This indicates that PoT answers have advantages in computational
problems. After incorporating GPT-based PoT into training, the
performance of PoT answering surpasses direct answering (76.88 vs.
72.44), and both direct (72.44 vs. 71.12) and combined answering (79.48
vs. 73.00) show further improvements. These results confirm the
effectiveness of our proposed Program-of-Thoughts learning method,
suggesting that it not only strengthens the model’s computational
capabilities but also enhances overall problem-solving capability.
The middle block in
Table [tab:ablation] compares the
performance with and without using visual token merging when the input
resolution is 512$\times$512, and with different
numbers of tokens to merge in each layer. Comparing Row 4 and Row 3,
increasing the input resolution from 384 to 512 significantly improves
the model’s performance on three chart understanding benchmarks,
demonstrating that high resolution is crucial for comprehending chart
images. However, a direct increase in resolution leads to a substantial
drop in the inference throughput (2.38 it/s vs. 3.73 it/s). The reason
is that, given high-resolution images, the standard vision transformer
produces a lengthy visual feature sequence that is then processed by the
large language model. This brings considerable computational expenses.
By adopting the visual token merging, we can control the length of the
visual feature sequence by regulating the number of tokens to merge at
each layer, thereby achieving efficient high-resolution encoding.
When setting $r=20$, we attain an inference throughput nearly equal to
that with an input resolution of 384$\times$384 (3.65
it/s vs. 3.73 it/s), while providing the performance benefits of higher
resolutions.
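As a rough illustration of the mechanism, the following is a minimal single-image sketch of ToMe-style bipartite soft matching, which the visual token merging module builds on; it omits the per-layer bookkeeping used in the full method (e.g., tracking merged group sizes for proportional attention), so it is a sketch rather than the actual implementation.

```python
import torch
import torch.nn.functional as F

def bipartite_token_merge(x: torch.Tensor, r: int) -> torch.Tensor:
    """Merge the r most similar token pairs of one image's token sequence.

    x: (N, C) visual tokens at some transformer layer; requires r <= N // 2.
    Tokens are split into two alternating sets A and B; each A token is
    matched to its most similar B token (cosine similarity), and the r
    best-matched A tokens are averaged into their partners.
    Returns an (N - r, C) tensor.
    """
    a, b = x[0::2], x[1::2]
    sim = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).T  # (|A|, |B|)
    score, match = sim.max(dim=-1)                           # best B partner per A token
    order = score.argsort(descending=True)
    merged_a, kept_a = order[:r], order[r:]

    # Average each merged A token into its matched B token; index_add_
    # correctly handles several A tokens mapping to the same B token.
    accum = b.clone()
    counts = torch.ones(b.size(0), 1, dtype=x.dtype, device=x.device)
    accum.index_add_(0, match[merged_a], a[merged_a])
    counts.index_add_(0, match[merged_a],
                      torch.ones(r, 1, dtype=x.dtype, device=x.device))
    merged_b = accum / counts
    return torch.cat([a[kept_a], merged_b], dim=0)
```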
To further highlight the advantages of visual token merging, we increase
the input resolution to 768 in the bottom block of
Table [tab:ablation]. At this point, the
length of the visual feature sequence is 2,916, which could not be
trained using 32GB V100 due to insufficient VRAM. However, after
employing the visual token merging module with $r$=84, the input
sequence length is reduced to 732 and we can perform training normally.
In this setting, the model’s inference throughput is 3.14 it/s, and it
demonstrates a performance advantage on ChartQA (81.04 vs.
80.76) and Chart-to-Table (88.90 vs. 87.81). This illustrates that by
utilizing visual token merging, we are able to leverage
higher-resolution chart images under constrained resources, thereby
improving performance.
## Visualization
To investigate the effects of visual token merging, we visualized the
token merging results at the final layer of the vision transformer. In
Figure [fig:vis_tokenmerge], we
visualize the top ten groups with the largest numbers of tokens. Each
group is outlined with a different color. The visualization reveals that
these largest groups typically correspond to blank or colored areas. By
compressing these areas down to a single token for encoding, our visual
token merging module can thus reduce the length of the encoded sequence
without losing much information, thereby achieving efficient visual
encoding.
## Case study
We conduct case studies with TinyChart on chart question
answering, chart-to-table extraction, chart-to-text generation, and chart redrawing in
Figure [fig:vis_cases],
[fig:table_cases],
[fig:summary_cases], and
[fig:redraw_cases].
In Figure [fig:vis_cases], we present a case
study on ChartQA. As shown in
Figure [fig:vis_cases] (a-c), much key
information within the chart is provided by visually situated texts
within the image, which requires the model to have the ability to
process high-resolution images. Since ChartLlama only supports 336
resolutions, it struggles to retrieve accurate information in these
charts. In contrast, thanks to the visual token merging, our TinyChart
can accept higher-resolution inputs without introducing excessive
computations. Thus it can successfully find clues related to the
questions. Meanwhile, ChartLlama suffers from numerical errors when
faced with calculative questions in
Figure [fig:vis_cases] (d-e), and our PoT
(Program-of-Thoughts) learning method can accurately solve these
problems. These examples further illustrate the advantages of our
methods. For chart-to-table extraction, we find that our TinyChart model
can successfully extract values from several visually diverse charts
in Figure [fig:table_cases] (a-c), thanks to
its excellent text recognition ability with high-resolution input.
However, as shown in
Figure [fig:table_cases] (d), the model
struggles to estimate the values of data points in the absence of OCR
words. It seems that the model could make reasonable predictions based
on surrounding points, but hardly provide accurate values based on the
coordinate axis. This indicates that the model still lacks the ability
to understand spatial relationships across large areas. From
Figure [fig:summary_cases], we observe
that the model can understand the data presented in the chart and
generate descriptions and summaries in natural language. Though it can
retrieve the data values correctly, we find it sometimes produces
content that does not match the chart, as shown in
Figure [fig:summary_cases] (c-d). This
may be due to the inherent limitations of hallucination in
MLLMs [chair](chair), [pope](pope), [wang2023evaluation](wang2023evaluation), [amber](amber), and may be
alleviated by addressing
hallucinations [vcd](vcd), [opera](opera), [jiang2024hallucination](jiang2024hallucination), [less_eos](less_eos).
We present four cases of chart redrawing in
Figure [fig:redraw_cases]. As shown in
Figure [fig:redraw_cases] (a-c), our
TinyChart model can generate Python code to redraw visually diverse
chart types including lines, heatmaps, and rings. However, it struggles
to draw unseen chart types such as 3D bar charts
(Figure [fig:redraw_cases] (d)). This may
be mitigated by improving the coverage of different chart types in
training data through automatic data construction
techniques [chartllama](chartllama), [chartx](chartx).
[^1]: sum, mean, average, ratio, mode, divide, dividing, differ,
subtract, add, division, times, absolute, minus, exceed, below,
less, fewer, bigger, biggest, greater, higher, longer, tallest,
lowest, number, how many colors, what is the value
# Conclusion
This paper introduces TinyChart, a chart understanding Multimodal Large
Language Model with 3 billion parameters. To address the inefficiency of
lengthy visual token sequences with high-resolution images, TinyChart
incorporates a visual token merging module that merges similar vision tokens
together, thereby enabling efficient encoding of high-resolution images.
To tackle the challenges of learning numerical computations, we propose
a Program-of-Thoughts learning method that trains the model to generate
Python programs to answer questions. Our TinyChart model achieves
state-of-the-art (SOTA) performance on multiple chart understanding
benchmarks, surpassing existing 13-billion-parameter chart MLLMs, and
outperforming closed-source models like GPT-4V on ChartQA. Extensive
ablation studies confirm the effectiveness of our methods. Our code and
model are released at
.
# ChartQA-PoT Details
## Dataset Statistic
We build ChartQA-PoT based on the images and questions in the training
split of ChartQA [chartqa](chartqa). ChartQA-PoT consists of two
subsets: Template-based PoT and GPT-based PoT. We present the statistics
over ChartQA-PoT in
Table [tab:chartqa_pot]. We find that
answers provided by `gpt-3.5-turbo` are longer than template-based PoT,
since they cover more diverse scenarios.
We further present the leading 2-grams of the questions, after
removing stop words, for Template-based PoT and GPT-based PoT in
Figure 3. It is observed that GPT-based PoT covers
more diverse ‘what’-type questions, while questions in
Template-based PoT are more evenly distributed across all question
types.
## Instructions for GPT-based PoT
Figure [fig:gpt_prompt] shows the
instructions for constructing GPT-based PoT answers. Note that we prompt
`gpt-3.5-turbo` to provide Python code consisting of assignment
statements and avoid using loops or judgment statements. This can
simplify the program and reduce syntax errors. We also provide meta
information including the chart title, type, and colors to
`gpt-3.5-turbo` since some questions rely on this information to answer.
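For illustration, a GPT-based PoT answer under these constraints might look like the snippet below; the question, the values, and the convention of storing the result in `Answer` are invented for this example rather than taken from the dataset.

```python
# Hypothetical question: "What is the difference between the highest and
# lowest values shown in the chart?"  (values are invented for illustration)
highest_value = 48.5
lowest_value = 12.3
Answer = highest_value - lowest_value
```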
TextSquare: Scaling up Text-Centric Visual Instruction Tuning
2024-04-19
Jingqun Tang, Chunhui Lin, Zhen Zhao, Shu Wei, Binghong Wu, Qi Liu, Hao Feng, Yang Li, Siqi Wang, Lei Liao, Wei Shi, Yuliang Liu, Hao Liu, Yuan Xie, Xiang Bai, Can Huang
Text-centric visual question answering (VQA) has made great strides with the development of Multimodal Large Language Models (MLLMs), yet open-source models still fall short of leading models like GPT4V and Gemini, partly due to a lack of extensive, high-quality instruction tuning data. To this end, we introduce a new approach for creating a massive, high-quality instruction-tuning dataset, Square-10M, which is generated using closed-source MLLMs. The data construction process, termed Square, consists of four steps: Self-Questioning, Answering, Reasoning, and Evaluation. Our experiments with Square-10M led to three key findings: 1) Our model, TextSquare, considerably surpasses open-source previous state-of-the-art Text-centric MLLMs and sets a new standard on OCRBench(62.2%). It even outperforms top-tier models like GPT4V and Gemini in 6 of 10 text-centric benchmarks. 2) Additionally, we demonstrate the critical role of VQA reasoning data in offering comprehensive contextual insights for specific questions. This not only improves accuracy but also significantly mitigates hallucinations. Specifically, TextSquare scores an average of 75.1% across four general VQA and hallucination evaluation datasets, outperforming previous state-of-the-art models. 3) Notably, the phenomenon observed in scaling text-centric VQA datasets reveals a vivid pattern: the exponential increase of instruction tuning data volume is directly proportional to the improvement in model performance, thereby validating the necessity of the dataset scale and the high quality of Square-10M.
# Introduction
Recent research on multimodal large language models (MLLMs) has achieved
significant advancements in the text-centric visual
question-answering(VQA) domain
[Text-MLLM-1](None), [Text-MLLM-2](None), [Text-MLLM-3](None), [docpedia](None), with
several closed-source state-of-the-art (SOTA) models leading the way.
Two representative examples are GPT4V [gpt4v](http://arxiv.org/pdf/2312.04344v2) and Gemini
[gemini-pro](http://arxiv.org/pdf/2312.17661v1), which have demonstrated remarkable
performance and have even surpassed human-level capabilities in certain
aspects. Nevertheless, as illustrated in Figure
1, the performance of open-source
models still lags significantly behind that of pioneering closed-source
models. This phenomenon can be attributed to various factors, including
model architecture, the scale of model parameters, image resolution, the
volume of pretraining and instruction tuning data, and training
strategies, among others.
The performance of TextSquare in various VQA tasks compared
to existing models. (a) shows the comparison with state-of-the-art
closed-source models (Gemini and GPT4V), and (b) shows the comparison with the
leading open-source models. The numbers in parentheses after the model
names in the legend indicate the average performance ranking across 10
text-centric multimodal benchmarks. TextSquare is marginally superior to
GPT4V. Best viewed on screen.
Many pioneering studies [allava](None), [bonito](None), [sharegpt4v](None), [llavar](None)
have recently conducted data-centric research into the challenges of
insufficient instruction tuning data. For instance, Monkey
[monkey](None) initially employed expert models to generate
descriptions of different aspects of images, which were then summarized
by GPT-4 to produce high-quality and detailed image caption data. For
better text-based knowledge injection, For better text-based knowledge
injection, LLaVAR [llavar](None) and TG-Doc
[tg-doc](None) used GPT-4 to generate conversations for
text-rich images by integrating OCR results into the instructions. In
order to improve the image caption ability for MLLMs, ShareGPT4V
[sharegpt4v](None) constructs a high-quality image caption
dataset through GPT4V. While these efforts have achieved remarkable
success, they also left some challenges unresolved. Image caption data
and VQA data belong to different domains, with inconsistencies in the
granularity and scope of image content presentation. Furthermore, the
scale of synthetic data remains relatively small, preventing MLLMs from
fully realizing their potential. The exploration of methods that
leverage large-scale text-centric VQA data for instruction tuning of
existing open-source models remains limited.
To bridge the gap, this paper proposes a strategy termed Square for
obtaining massive, high-quality text-centric VQA data from sophisticated
and versatile closed-source MLLMs, resulting in the construction of a
dataset (Square-10M) comprising tens of millions of instances for
instruction tuning. Specifically, the method consists of four steps:
Self-Questioning, Answering, Reasoning, and Evaluation. The
self-questioning step involves utilizing the MLLM’s capabilities in
text-image analysis and understanding to generate questions related to
the textual content of images. The answering step involves answering
these generated questions, leveraging various prompting techniques such
as Chain-of-Thought and few-shot prompting. The reasoning step entails
probing the model for the reasoning behind its answers, leveraging the
powerful reasoning abilities of MLLMs. The evaluation step involves
evaluating the question-answer pairs, assessing the validity of the
questions and their relevance to the textual content of the images, as
well as the correctness of the answers, thereby improving data quality
and mitigating hallucinations. Overall, Square comprehensively leverages
the capabilities of MLLMs in various aspects, significantly enhancing
the data quality.
Besides, enriching the diversity of images is also crucial. We collect a
diverse set of text-rich images from various public sources, including
natural scenes, charts, tables, receipts, books, slides, PDFs,
documents, products, and web images. Subsequently, deduplication is
performed on this collection. By applying the Square method to these
images, Square-10M is constructed.
Based on Square-10M, we achieve several remarkable results with
extensive and rigorous experiments. First, as shown in Figure
1, our model (TextSquare) achieves
comparable or superior performance to advanced closed-source models and
substantially outperforms recent state-of-the-art open-source models on
various benchmarks. It is notable that the input image resolution of
TextSquare is $700$ and the parameter count is $8.6$B. Second, our
experiments validate the beneficial impact of reasoning data on VQA
tasks, demonstrating its ability to enhance model performance while
mitigating hallucinations. With reasoning data for instruction tuning,
TextSquare has a strong reasoning capability to provide elaborate
explanations for VQA scenarios. Last but not least, by leveraging the
dataset’s massive scale, we unveil the relationships between instruction
tuning data scale, training convergence loss, and model performance.
While a small amount of instruction tuning data can already elicit strong
behavior from an MLLM, it is not
sufficient. Large amounts of high-quality data can still significantly
reduce convergence loss and improve performance. The performance of
TextSquare grows and the loss of convergence decreases while
continuously scaling up the instruction tuning data, which also
demonstrates the effectiveness of our dataset.
In summary, the main contributions of this paper can be categorized into
four points:
- A high-quality dataset (Square-10M) comprising tens of millions of
instances for text-centric VQA instruction tuning is constructed by
comprehensively collecting text-rich images from various scenarios
and employing the Square (Self-Questioning, Answering, Reasoning,
and Evaluation) strategy on closed-source MLLMs.
- Leveraging Square-10M, TextSquare achieves a significant
outperformance of existing open-source models and even comparable or
superior performance to SOTA closed-source models on various
benchmarks, e.g., +0.9% on ChartQA, +2.1% on WTQ, +4.3% on SROIE.
Notably, TextSquare outperforms GPT4V in overall rankings across ten
text-centric benchmarks (ranking 2.2 *vs.* 2.4).
- Reasoning data is demonstrated to be beneficial in improving model
performance and mitigating hallucinations in text-centric VQA
scenarios, as it can deliver rich question-specific contextual
information.
- Through extensive experiments, we reveal the relationships between
data scale, convergence loss, and model performance for text-centric
VQA instruction tuning, which demonstrates the effectiveness and
necessity of Square-10M.
# Related Work
## Multi-modal Large Language Models
Recent work has increasingly focused on introducing visual knowledge
into LLMs [MLLM-1](None), [MLLM-2](http://arxiv.org/pdf/2308.12966v3), [MLLM-3](None). General attempts
connect a visual encoder and an LLM with intermediate modules like
Projector [llava](None), Q-Former [blip2](None),
Perceiver Resampler [flamingo](None), etc, and go through
pre-training alignment and instruction fine-tuning for vision-language
understanding.
Recently, several studies
[Text-MLLM-1](None), [Text-MLLM-2](None), [docpedia](None), [structextv2](None), [vary](None), [omniparser](None), [layoutllm](None), [hrvda](None)
propose to enhance MLLMs’ capabilities in understanding textual elements
(OCR, text-centric VQA, etc). Among them, mPLUG-DocOwl
[Text-MLLM-1](None) creates novel instruction-following
datasets to enhance the tuning process. TextMonkey
[MLLM-3](None) adopts shifted window attention and filters out
significant tokens. DocPedia [docpedia](None) and HRVDA
[hrvda](None) enlarge the input resolution to bridge the gap
between MLLMs and visual document understanding.
Despite the extraordinary progress of existing open-source MLLMs, they
still suffer from the huge gap against SOTA closed-source models like
GPT4V [gpt4v](http://arxiv.org/pdf/2312.04344v2) and Gemini Pro [gemini-pro](http://arxiv.org/pdf/2312.17661v1).
In this paper, we propose to mitigate this gap by training with
large-scale and high-quality instruction-following data.
## Text-Centric Visual Question Answering
Text-Centric Visual Question Answering aims to understand the
interactions between the textual and the visual elements in the image.
Donut [donut](None) first proposes an end-to-end training
method based on a Transformer without OCR. Pix2Struct
[pix2struct](None) introduces a variable-resolution input
representation to adapt to document images. DoCo [doco](None)
enhances the visual representation of the image encoder in LVLMs by
aligning the document object of multi-modal inputs. BLIVA
[bliva](None) enlarges the input token space by concatenating
learned query embeddings and encoded patch embeddings. Several studies
[Text-MLLM-2](None), [tg-doc](None), [llavar](None) have performed data-centric
attempts in this regard. UniDoc [Text-MLLM-2](None) constructs
600k document-oriented image-text pairs from PowerPoint presentations.
LLaVAR [llavar](None) and TG-Doc [tg-doc](None) prompt
text-only GPT-4 to generate conversations for text-rich images by
integrating OCR results into the instructions. These efforts are
restricted to small-scale annotations or generation based on uni-modal
inputs.
## Generating Instruction-Tuning Data via LLMs
The success of LLMs has inspired recent work to employ them as training
data generators
[sharegpt4v](None), [allava](None), [self-instruct](None), [synthetic-prompting](None). In
this regard, we anchor on generating instruction-following data.
Self-Instruct [self-instruct](None) took the initial step
towards synthesizing instructions via language models and improving the
instruction-following capabilities. Llama-GPT4
[llama-gpt4](None) uses GPT-4 to generate instruction-following
data for LLM fine-tuning. Synthetic Prompting
[synthetic-prompting](None) leverages a few handcrafted
examples to prompt LLMs to generate more examples. Bonito
[bonito](None) converts unannotated text into task-specific
training datasets for instruction tuning. Recently, ALLAVA
[allava](None) employs GPT4V to generate reasoning instructions
and detailed answers from unlabeled images. All of the above attempts
suffer from the low quality of the generated data and are typically
performed on a small scale. In contrast, we collect massive text-centric
images (*i.e.*, tens of millions) and devise comprehensive generating
methods and filtering rules to ensure the quantity and quality of the
instruction tuning dataset.
Overview of Square-10M: the distribution of images, the
average tokens of the QAs, etc.
# Square-10M: A Massive and High-quality Text-Centric VQA Instruction Tuning Dataset
Square-10M is synthesized by our proposed Square pipeline, *i.e.*,
Self-Questioning, Answering, Reasoning, and Evaluation.
## Overview of Square
Figure 3 presents an overview of our proposed
Square. Square generally consists of three stages for synthesizing
high-quality instruction tuning data for text-centric VQA: (1) Data
Collection for collecting large-scale images with textual elements of
diverse properties. (2) Data Generation involves self-questioning,
answering, and reasoning of the collected data. In this phase, the MLLM
is prompted to generate VQA pairs based on the given image, as well as
the reasoning behind its answers. (3) Data Filtering for self-evaluation
of the generated content, aiming to discard meaningless questions and
erroneous answers by employing the evaluation capabilities of MLLMs.
The above procedures result in our Square-10M dataset, standing out with
its massive and high-quality text-centric VQA pairs and reasoning
context. To be more specific, a total of 3.8 million images with rich
textual elements are collected from diverse sources. After that, 20
million question-answer pairs are obtained from Data Generation.
Finally, 9.1 million QA pairs as well as the reasoning context are
distilled with our Square strategy. A more precise analysis of
Square-10M is depicted in Figure
2.
## Data Collection
The data collection strategy is driven by the primary objective of
encompassing a broad range of real-world text-rich scenarios. To this
end, we collect 3.8 million unlabeled text-rich images (Figure
2). These images exhibit diverse
properties. For instance, Chart and Table focus on textual elements with
intense statistical information; Slide, Screenshot, and WebImage are
designed for the interaction between text and prominent visual messages;
Document/PDF, Receipt, and e-commerce contain images with fine and dense
text; Street-View is derived from natural scenes. The collected images
form a mapping of the textual elements in the real world and constitute
the foundation of our research on text-centric VQA.
Pipeline for the proposed Square strategy. Gemini’s
versatile multi-modal comprehension capabilities are utilized to
synthesize Square-10M, which consists of four stages: self-questioning,
answering, reasoning, and evaluation.
## Data Generation: Self-Questioning, Answering, and Reasoning
We build our Square-10M dataset by employing the multi-modal
understanding capabilities of Gemini Pro, one of the most advanced MLLMs.
For each image selected from a specific data source, Gemini Pro is
instructed to generate VQA pairs and reasoning context through the
subsequent three stages:
**Stage 1: Self-Questioning.** In this stage, Gemini Pro is prompted to
generate profound, meaningful, and non-trivial questions about the given
image. We ask Gemini Pro to first comprehensively analyze the image and
then raise questions based on its understanding, as shown in Figure
3. Considering that advanced MLLMs
typically have weaker understanding capabilities of the textual elements
than visual elements, we also prepend the extracted text to the prompt
by employing expert OCR models.
**Stage 2: Answering.** Gemini Pro is then instructed to give
appropriate answers to the generated questions. We leverage various
prompting techniques to enrich the contextual information and improve
the reliability of the generated answers, such as Chain-of-Thought and
few-shot prompting. Figure
3 shows an example prompt for generating
answers to a given question.
**Stage 3: Reasoning.** We require Gemini Pro to elaborate on the
detailed reasons behind its answers. Such an effort forces Gemini Pro
to reason more about the connections between the questions and the visual
elements, thus reducing hallucinations and providing accurate answers.
Moreover, the generated reasons could serve as extra contextual
information specific to individual questions, favoring possible research
on the mechanism behind in-context learning. We present an example
prompt for self-reasoning in Figure
3.
## Data Filtering: Self-Evaluation and Answering Consistency
Despite the effectiveness of Self-Questioning, Answering, and Reasoning,
the generated image-text pairs could still contain hallucinated content,
meaningless questions, and erroneous answers. We thus devise filtering
rules based on the Evaluation capabilities of LLMs to select
high-quality VQA pairs. The whole filtering system is established upon
three aspects.
**Self-Evaluation of MLLMs.** We prompt Gemini Pro as well as other
advanced MLLMs to judge whether the generated questions are meaningful
and whether the answers are good enough to correctly address the
questions.
Figure 3 depicts an example prompt for
self-evaluation.
**Multi-Prompt Consistency.** Besides direct evaluation of the generated
content, we manually augment the prompt and context space in Data
Generation. A correct and meaningful VQA pair should be semantically
consistent when provided with different prompts. Specifically, in the
stage of Answering we provide Gemini Pro with different but semantically
similar prompts to answer the given question. Then we discard the VQA
pairs if the generated answers are not stable in semantics. An example
is given in Figure 3.
**Multi-Context Consistency.** Similar to Multi-Prompt Consistency, we
further validate the VQA pairs by prepending the question with varied
context information. Given the generated question, three types of
answers are produced by Gemini Pro with different contexts: (1)
Answering with reasoning. Gemini Pro answers the question with a
detailed explanation prepended (*i.e.*, content generated in the stage
of Reasoning). (2) In-Context answering. Gemini Pro answers the question
with chain-of-thought or few-shot prompts prepended. (3) Naive
answering. Gemini Pro answers the question with no extra context. We
then discard the VQA pairs if the generated answers are not semantically
consistent.
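Putting the generation and filtering stages together, the pipeline can be sketched as below. `query_mllm` and `is_consistent` are hypothetical stand-ins for the Gemini Pro API call and a semantic-consistency check, and none of the prompt strings are the paper's actual templates; this is an illustrative sketch, not the released pipeline.

```python
def square_pipeline(image, ocr_text, query_mllm, is_consistent):
    """Sketch of Square: Self-Questioning, Answering, Reasoning, Evaluation.

    query_mllm(image, prompt) is a hypothetical wrapper around a closed-source
    MLLM (e.g., Gemini Pro); for the questioning prompt it is assumed to return
    a list of question strings, otherwise a single string. is_consistent is a
    hypothetical semantic-consistency check. Prompt texts are illustrative.
    """
    # Stage 1: Self-Questioning (OCR text prepended to help with textual elements)
    questions = query_mllm(
        image,
        f"Extracted text: {ocr_text}\nAnalyze the image and raise meaningful, "
        "non-trivial questions about its textual content.")

    samples = []
    for q in questions:
        # Stage 2: Answering (chain-of-thought / few-shot prompting in practice)
        answer = query_mllm(image, f"Answer step by step: {q}")
        # Stage 3: Reasoning
        reason = query_mllm(image, f"Explain in detail why the answer to '{q}' is '{answer}'.")

        # Evaluation 1: self-evaluation of question quality and answer correctness
        verdict = query_mllm(image, f"Is '{q}' meaningful, and does '{answer}' answer it correctly? yes/no")
        if not verdict.strip().lower().startswith("yes"):
            continue

        # Evaluation 2: multi-prompt consistency (paraphrased prompt, same question)
        alt = query_mllm(image, f"Please respond to the following question: {q}")
        # Evaluation 3: multi-context consistency (answering with the reasoning prepended)
        with_reason = query_mllm(image, f"{reason}\nGiven the above, answer: {q}")
        if is_consistent([answer, alt, with_reason]):
            samples.append({"question": q, "answer": answer, "reasoning": reason})
    return samples
```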
# TextSquare: A Text-Centric Multimodal Large Language Model
## Model Architecture
The model architecture of TextSquare follows the paradigm established by
InternLM-Xcomposer2 [internlm-xcomposer2](None), including
three integral components: (1) A Vision Encoder modified from OpenAI
CLIP ViT-L-14-336 [clip](None), where the resolution is
increased to 700 for improved performance. (2) A LLM based on InternLM-2
[internlm2](None), utilizing InternLM2-7B-ChatSFT as the
practical variant. (3) A Projector, which semantically aligns the vision
token and the text token.
## Supervised Fine-Tuning with Square-10M
TextSquare is achieved by performing Supervised Fine-Tuning (SFT) with
Square-10M. The SFT process comprises three stages: In the first stage,
we unfreeze all three components (*i.e.*, the Vision Encoder, the
LLM, and the Projector) and train the model in a resolution of 490. In
the second stage, the input resolution is increased to 700 and only the
Vision Encoder is trained to adapt to the resolution change. In the
third stage, we further perform full-parameter fine-tuning in the
resolution of 700. TextSquare demonstrates that with our Square-10M
dataset, a model with 8B parameters and normal-size image resolution can
achieve extraordinary performance on text-centric VQA, surpassing most
available MLLMs and even the closed-source SOTA models.
# Experiment
## Implementation Details
The training data contains Square-10M and in-domain datasets (consistent
with Monkey’s SFT data). The training process is divided into three
phases, using the same data and the AdamW [adamw](None)
optimizer with 64 A100-80G GPUs. In the first phase, we fine-tune
InternLM-Xcomposer2 with full parameters, and the learning rate
decreases from 1e-5 to 1e-6, taking about 9520 GPU hours. In the second
phase, we scale the image resolution up to 700 and train only the ViT, with
the learning rate decreasing from 1e-4 to 1e-5, taking about 7280 GPU
hours. In the third phase, we perform full-parameter fine-tuning at 700
image resolution, and the learning rate drops from 1e-5 to 1e-6,
taking about 12350 GPU hours.
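The three training phases can be summarized as a simple configuration; the field names are illustrative, while the values restate the numbers above.

```python
# Field names are illustrative; the values restate the numbers given above.
SFT_PHASES = [
    {"phase": 1, "resolution": 490, "trainable": "all",
     "lr_start": 1e-5, "lr_end": 1e-6, "gpu_hours": 9520},
    {"phase": 2, "resolution": 700, "trainable": "vision_encoder_only",
     "lr_start": 1e-4, "lr_end": 1e-5, "gpu_hours": 7280},
    {"phase": 3, "resolution": 700, "trainable": "all",
     "lr_start": 1e-5, "lr_end": 1e-6, "gpu_hours": 12350},
]
```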
## Benchmark Evaluation
We report the results on Scene Text-centric VQA, Document-oriented VQA,
Table VQA, Text-centric KIE, OCRBench, and General VQA for a
comprehensive comparison of the performance of our model with existing
models. The metrics of each benchmark are listed in Table
[benchmark] in the Supplementary
Material.
**Document-Oriented Benchmark.** While the documents have a clean
background, dense text and complex typography pose distinct challenges.
To effectively evaluate our model, we select representative benchmarks
including DocVQA [docvqa](None), ChartQA
[chartqa](None), and InfographicVQA
[infographicvqa](None). The results, detailed in Table
[table-text-bench], show that
TextSquare outperforms all the open-source models in these three
document-oriented VQA tasks with an average improvement of $3.5$%,
specifically, DocVQA $84.3$% *vs.* $81.6$% (Cogagent and mPLUG-DocOwl
1.5), ChartQA $79.4$% *vs.* $72.7$% (Intern-Xcomposer2), InfographicVQA
$51.5$% *vs.* $50.4$% (mPLUG-DocOwl 1.5). On the ChartQA dataset,
TextSquare outperforms GPT4V and Gemini Pro by a slight margin. Note
that TextSquare employs an image resolution of 700, which is smaller
than most document-oriented MLLMs. Our model relies on comprehensively
high-quality VQA information specific to the text in the document,
improving its ability to recognize and understand various document
elements such as text, diagrams, infographics, and so on. If the image
resolution were further increased, the model performance would likely
improve further, as demonstrated by Monkey et al.
**Scene Text-centric Benchmark.** The ability to answer questions about
text in images is an important aspect of the task, as textual
information is commonly present in real-world scenes. In the
evaluation, we utilize two datasets: TextVQA [textvqa](None)
and AI2D [ai2d](None). As shown in Table
[table-text-bench], in this
scenario, although TextSquare achieves SOTA performance on the AI2D
dataset, there is no major improvement over our baseline
Intern-Xcomposer2, which may be due to the fact that Intern-Xcomposer2
has been adequately optimized with high-quality in-domain data.
**Table VQA Benchmark.** Due to the complex structure of tables and the
dense text, the understanding of the content of tables remains a
challenging issue. In order to evaluate the performance of the
comprehension of table content and structure, we choose two widely
utilized datasets, Wiki Table Questions (WTQ) [wtq](None) and
Table Fact (TabFact) [tabfact](None), as shown in Table
[table-text-bench]. On the Table
VQA benchmarks, TextSquare achieves optimal performance among the
leading models with an average $3.0$% improvement. This demonstrates
that our model has reached a new level of table understanding, where
high-quality generated table VQA and reasoning data play a key role.
**Text-centric KIE Benchmark.** Text-centric key information extraction
tasks are frequently encountered in the information processing of
various types of products, certificates, and receipts. We select a
receipt information extraction dataset (SROIE) [sroie](None)
and a product information extraction dataset (POIE)
[poie](None), and the KIE task is converted to the VQA task.
TextSquare achieves optimal performance in both datasets, with a major
average lift of $14.8$% (shown in Table
[table-text-bench]). It is worth
noting that the POIE training split is not included in our training data,
and there is little in-domain data for product scenarios. This
illustrates the extensive textual comprehension capabilities of our
model.
**OCRBench.** OCRBench [ocrbench](None) is a comprehensive
benchmark consisting of 29 OCR-related assessments, with text
recognition, formula recognition, text-centric VQA, KIE, etc. TextSquare
achieves the best OCRBench performance among open-source models and
becomes the first MLLM with about $10$B parameters to exceed $600$ points.
This indicates that the model performs well in both
text-centric perception and comprehension tasks, especially in text
recognition, where little in-domain data is included in the training
set.
| Method | VizWiz | VQAv2 | GQA | POPE$^{adv}$ | Average |
|:---|:--:|:--:|:--:|:--:|:--:|
| Qwen-VL [MLLM-2](http://arxiv.org/pdf/2308.12966v3) | 35.2 | 79.5 | 59.3 | \- | \- |
| Monkey [monkey](None) | 61.2 | 80.3 | 60.7 | 80.3$^*$ | 70.6 |
| Cogagent [cogagent](None) | 36.7$^*$ | **83.7** | 62.3$^*$ | 85.9 | 67.2 |
| DocOwl 1.5 [docowl-1.5](None) | 43.5$^*$ | 68.0$^*$ | 48.5$^*$ | 79.7$^*$ | 59.9 |
| Llava Next 34B [llava-next](http://arxiv.org/pdf/2404.05465v1) | 63.8 | **83.7** | **67.1** | 83.4 | 74.5 |
| GPT4V [gpt4v](http://arxiv.org/pdf/2312.04344v2) | 64.9$^*$ | 77.2 | 48.4$^*$ | 79.6$^*$ | 67.5 |
| Gemini Pro [gemini-pro](http://arxiv.org/pdf/2312.17661v1) | 42.8$^*$ | 71.2 | 52.2$^*$ | 84.5$^*$ | 62.7 |
| Xcomposer2 [internlm-xcomposer2](None) | 58.9$^*$ | 81.8 | 64.5 | 78.5 | 70.9 |
| TextSquare (ours) | **71.4** | 78.0 | 64.5 | **86.6** | **75.1** |
Quantitative comparison of our model with existing MLLMs on
representative General VQA and hallucination evaluation benchmarks.
VizWiz and POPE are relevant to both VQA and hallucination. Following
Cogagent, we evaluate the adversarial part of POPE.
**General VQA and Hallucination Evaluation Benchmark.** General VQA
requires the ability to learn both visual and textual information and a
deep understanding of their inter-relationships. For general VQA, we
validate on four benchmarks: VizWiz [vizwiz](None), VQAv2
[vqav2](None), GQA [gqa](None), and POPE
[pope](None). The VizWiz and POPE benchmarks are also relevant
for hallucination evaluation. The results are shown in Table
1. On VQAv2 and GQA, TextSquare
shows no significant degradation compared to InternLM-Xcomposer2
and maintains comparable performance. TextSquare exhibits superior
capabilities in VizWiz and POPE, outperforming the closest competing
method by an average of $3.6$%. These results highlight the
effectiveness of our approach, which can also mitigate model
hallucinations, particularly with large-scale instruction tuning. We
observe that this is partly attributable to the high-quality reasoning data
that provides detailed explanations for VQA.
## Qualitative Analysis
As illustrated in Figure
4, TextSquare has a formidable
capability to provide plausible explanations of the answers to questions
in a variety of text-centric VQA scenarios. Figure
4(a) shows that TextSquare has
simple arithmetic capabilities. Figure
4(b) shows the ability to understand
textual content and provide approximate location in dense text. Figure
4(c) shows the comprehension of
table structure and the ability to extract contextual information
relevant to the question.
Qualitative results of VQA and reasoning for various
text-centric scenarios.
| Model | OCRBench | DocVQA | ChartQA | InfoVQA | WTQ | SROIE | Average |
|:---------------|:--------:|:------:|:-------:|:-------:|:----:|:-----:|:-------:|
| Xcomposer2$^*$ | 571 | 74.8 | 73.2 | 41.6 | 40.3 | 44.7 | 54.9 |
| TextSquare | 622 | 84.3 | 79.4 | 46.2 | 49.7 | 53.2 | 62.6 |
Ablation study on Incorporating Square-10M for Instruction Tuning.
## Ablation Study
**The Effect of Incorporating Square-10M for Instruction Tuning.**
In order to verify the effectiveness of Square-10M, we fine-tune the
baseline model InternLM-Xcomposer2 on the public text-centric VQA
instruction tuning dataset (consistent with Monkey’s training data). As
shown in the table above, TextSquare substantially outperforms Xcomposer2$^*$
(fine-tuned) on various text-centric VQA benchmarks by $7.7$%, which
corroborates that Square-10M can fully exploit MLLM’s ability in
text-centric VQA scenarios and that a large amount of high-quality
instruction tuning data has a major improvement in performance.
**The Effect of Evaluation Step of the Square Strategy.** As shown in
Table [Tab1], there is a distinct improvement in
model performance after incorporating the evaluation of the generated
VQA data, which verifies that the evaluation step of the Square strategy
improves the quality of VQA instruction tuning data.
**The Effect of VQA Reasoning Data on Model Performance and
Hallucination Evaluation.** From Table
[Tab2], we can find that VQA Reasoning data is
helpful in both improving VQA performance and mitigating hallucinations.
Specifically, in terms of enhancing VQA performance, there is a $1.4$% and
$1.3$% gain on DocVQA and ChartQA, respectively. In terms of mitigating hallucinations,
there is a $2.7$% and $3.2$% gain on POPE and VizWiz.
The relationship between instruction tuning dataset scale,
convergence loss, and model performance in text-centric VQA scenarios.
Figures (a) and (b) show the relationship between data scale and
convergence loss, where the horizontal axis of (b) is on a log10 scale.
Figures (c) and (d) show the relationship between data scale and model
performance, where the horizontal axis of (d) is on a log10 scale.
## Relationships between Instruction Tuning Data Scale, Convergence Loss, and Model Performance
To explore the relationship between instruction tuning data scale,
convergence loss, and model performance based on the merged large-scale
Square-10M and the in-domain instruction tuning dataset, we conduct 10
sets of experiments for different data scales. The average performance
of the models is evaluated on DocVQA, ChartQA, InfoVQA, WTQ, and SROIE.
As shown in Figure 5(a)(b), the convergence loss of the
model continues to decrease as the data scale grows, whereas the rate of
decrease becomes progressively slower. The relationship between the
convergence loss and the instruction tuning data scale approximately
conforms to a logarithmic function. Similarly, from Figure
5(c)(d), it can be seen that as the
instruction tuning data grows, the model performs better and better, but
the rate of growth continues to slow down. Their relationship is also
approximately in accordance with a logarithmic function. Overall,
there is a scaling law for the instruction tuning phase in
text-centric VQA scenarios, where model performance is proportional to
the logarithm of the data scale. This can guide the construction of
potentially larger datasets and help predict model performance.
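A logarithmic relationship of this kind can be fitted and used for extrapolation with a few lines of NumPy; the sketch below assumes illustrative (data size, score) pairs rather than the paper's measurements.

```python
import numpy as np

def fit_log_law(data_sizes, scores):
    """Fit score ≈ a * log10(data_size) + b and return (a, b)."""
    a, b = np.polyfit(np.log10(np.asarray(data_sizes, dtype=float)),
                      np.asarray(scores, dtype=float), deg=1)
    return a, b

def predict_score(data_size, a, b):
    """Extrapolate the expected score at a larger data scale."""
    return a * np.log10(data_size) + b

# Illustrative usage with made-up (data size, average score) pairs:
# a, b = fit_log_law([1e5, 1e6, 1e7], [45.0, 55.0, 62.0])
# predict_score(3e7, a, b)
```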
# Limitation
Although our approach achieves remarkable results in various scenarios,
there are some limitations. First, large-scale data requires many
GPUs for long training runs, which greatly increases the training
cost. Second, while the Square strategy improves the quality of
synthetic data, it still cannot reach the human level.
# Conclusion
In this paper, we present the Square strategy for constructing a
high-quality text-centric instruction tuning dataset (Square-10M).
Leveraging this dataset, TextSquare significantly surpasses recent
open-source models and even achieves performance comparable to GPT4V
across various benchmarks. Furthermore, we derive the relationship
between instruction tuning dataset scale, convergence loss, and model
performance, paving the way for constructing even larger
datasets. Our approach provides a data-centric perspective that revisits
the role of instruction-tuning data in text-centric VQA, confirming that
both the quantity and quality of data are crucial to model performance.
We believe that further improving data quantity and quality is a promising
direction for closing the gap between open-source
models and the leading ones.
# Supplementary Material
## Summary of the Evaluation Benchmarks
We summarize the evaluation benchmarks used in this paper in Table
[benchmark].
Best Practices and Lessons Learned on Synthetic Data for Language Models
2024-04-11
Ruibo Liu, Jerry Wei, Fangyu Liu, Chenglei Si, Yanzhe Zhang, Jinmeng Rao, Steven Zheng, Daiyi Peng, Diyi Yang, Denny Zhou, Andrew M. Dai
The success of AI models relies on the availability of large, diverse, and high-quality datasets, which can be challenging to obtain due to data scarcity, privacy concerns, and high costs. Synthetic data has emerged as a promising solution by generating artificial data that mimics real-world patterns. This paper provides an overview of synthetic data research, discussing its applications, challenges, and future directions. We present empirical evidence from prior art to demonstrate its effectiveness and highlight the importance of ensuring its factuality, fidelity, and unbiasedness. We emphasize the need for responsible use of synthetic data to build more powerful, inclusive, and trustworthy language models.
# Introduction
The rapid advancement of artificial intelligence (AI) technologies has
led to their widespread adoption across numerous domains, from assistant
agents (e.g., ACT-1, from Adept AI[^1]) and software development (e.g.,
Devin, from Cognition Lab[^2]) to healthcare
[singhal2022large](https://arxiv.org/abs/2212.13138) and finance
[zheng2022ai](http://arxiv.org/pdf/2106.01901v1). However, the success of AI models heavily
relies on the availability of large, diverse, and high-quality datasets
for training and evaluation. Acquiring such datasets can be a
significant challenge due to data scarcity
[babbar2019data](http://arxiv.org/pdf/2208.00147v1), privacy concerns
[abay2019privacy](http://arxiv.org/pdf/1801.01594v2), and the sheer cost of data collection
and annotation [gilardi2023chatgpt](http://arxiv.org/pdf/2303.15056v2). Pessimists predict
that we will run out of fresh text data in 2050 and image data in 2060
[villalobos2022will](https://arxiv.org/abs/2211.04325).
Synthetic data has emerged as a promising solution to address these
challenges [nikolenko2021synthetic](http://arxiv.org/pdf/1909.11512v1). Synthetic data
refers to artificially generated data that mimics the characteristics
and patterns of real-world data, but is created through algorithms
[saxton2019analysing](https://openreview.net/forum?id=H1gR5iR5FX), generative models
[borisov2022language](https://arxiv.org/abs/2210.06280), [meng2022generating](http://arxiv.org/pdf/2004.13952v2), or even
simulations [vezhnevets2023generative](https://arxiv.org/abs/2312.03664), [liu2023training](https://arxiv.org/abs/2305.16960),
rather than being directly created by humans. By leveraging synthetic
data, we can not only overcome the limitations of real-world data but
also unlock the potential to develop more robust, reliable, and fair AI
models [lucini2021real](http://arxiv.org/pdf/2208.07943v1), [lu2023machine](https://arxiv.org/abs/2302.04062).
One of the many benefits of synthetic data is that it can be generated
at scale, providing an abundant supply of training and testing data for
AI models. This is particularly valuable in domains where real-world
data is scarce or difficult to obtain (e.g., weather data covering all
conditions [li2023seeds](https://arxiv.org/abs/2306.14066), [lam2023learning](http://arxiv.org/pdf/2402.00059v1)). Second,
synthetic data can be tailored to specific requirements, such as
ensuring a balanced representation of different classes by introducing
controlled variations (e.g., up-weighting low-resource languages in
multilingual language learning [przystupa2019neural](https://doi.org/10.18653/v1/W19-5431)).
This level of control over data characteristics can improve model
performance and generalization. Third, synthetic data can help mitigate
privacy concerns by creating anonymized or de-identified datasets that
do not contain sensitive personal
information [howe2017synthetic](https://arxiv.org/abs/1710.08874), [el2020practical](http://arxiv.org/pdf/2401.06883v1). This is
crucial in domains such as healthcare, where patient privacy is of
utmost importance [dahmen2019synsys](http://arxiv.org/pdf/2304.03243v1), [wei2019generative](http://arxiv.org/pdf/1910.05827v1).
Despite its promise, synthetic data also presents challenges that need
to be addressed. One of them is ensuring the factuality and fidelity of
synthetic data [wood2021fake](https://doi.org/10.1109/ICCV48922.2021.00366), [heusel2017gans](https://proceedings.neurips.cc/paper/2017/hash/8a1d694707eb0fefe65871369074926d-Abstract.html), as models
trained on false, hallucinated or biased synthetic data may fail to
generalize to real-world
scenarios [van2023synthetic](http://arxiv.org/pdf/2305.09235v2), [guarnera2020deepfake](http://arxiv.org/pdf/2004.10448v1).
Researchers must develop sophisticated generative models and evaluation
metrics to create synthetic data that accurately reflects the complex
patterns and relationships found in real-world data. Another challenge
is the potential for synthetic data to amplify biases or introduce new
biases if not carefully designed and
validated [barbierato2022methodology](http://arxiv.org/pdf/2203.04462v1), [gupta2021transitioning](https://arxiv.org/abs/2105.04144).
We believe rigorous testing and fairness assessments are necessary to
mitigate these risks.
In this paper, we track the current state of synthetic data research and
discuss current best practices and lessons learned. The rest of the
paper is organized as follows.
Section [sec:training] provides an overview of
synthetic data generation techniques and their applications in model
training, presenting case studies and empirical evidence.
Section [sec:evaluation] discusses the
usefulness of synthetic data in evaluation.
Section [sec:limitation_risks]
discusses the challenges and limitations of synthetic data, and in
Section [sec:future] we outline potential
solutions and future research directions.
[^1]: ACT-1:
[^2]: Devin:
# Synthetic Data in Training [sec:training]
Synthetic data, which is generated by mimicking authentic data collected
from the real world, has proven to be an effective and relatively
low-cost alternative of real data. This section explores several notable
domains that leverage synthetic training data.
# Synthetic Data in Evaluation [sec:evaluation]
Synthetic data is widely used in evaluations of different perspectives:
#### Factuality.
AI systems may generate information or responses that are not grounded
in factual knowledge or data, leading to the creation of misleading or
false content, formally known as
*hallucination* [ji2023survey](http://arxiv.org/pdf/2311.05232v1). Factuality evaluation
aims to ensure the consistency of the knowledge in the AI system’s
output with the knowledge provided by its training data and knowledge
base [ji2023survey](http://arxiv.org/pdf/2311.05232v1), [zhang2023siren](https://arxiv.org/abs/2309.01219). Early
statistical-based hallucination evaluation methods relied on n-grams to
directly calculate the overlap of vocabulary between the input and
output content [dhingra2019handling](https://doi.org/10.18653/v1/P19-1483), [wang2020towards](https://doi.org/10.18653/v1/2020.acl-main.101).
However, these methods have limitations, as they only consider lexical
overlap and do not account for semantics or sentence meaning
[ji2023survey](http://arxiv.org/pdf/2311.05232v1), making them unsuitable for evaluating
more complex forms of hallucination. Subsequent assurance methods
shifted from statistical approaches to model-based methods, which are
more robust compared to token-difference-based methods
[honovich2021q2](https://doi.org/10.18653/v1/2021.emnlp-main.619). While these model-based evaluation
methods are more advanced than their predecessors, they still have
limitations. For example, the models can only output the degree of
hallucination and may struggle to pinpoint specific errors
[falke2019ranking](https://doi.org/10.18653/v1/P19-1213).
[feng-etal-2023-factkb](https://doi.org/10.18653/v1/2023.emnlp-main.59) propose to combine LLMs
generation with random walks on knowledge graphs to generate synthetic
evaluation data for factuality, which is aware of entities and relations
on the graphs. [Wei2024LongformFI](https://api.semanticscholar.org/CorpusID:268724304) created a synthetic
dataset called LongFact for long-form factuality evaluation and used
Google Search as the grounding source and an LLM for the automated
judgement, achieving human-level accuracy at a significantly lower
cost [min2023factscore](http://arxiv.org/pdf/2402.05629v3).
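As a concrete example of the early statistical approach mentioned above, a toy n-gram overlap check between the input and the output can be written as follows; it is intentionally simplistic and, as the text notes, blind to semantics.

```python
from collections import Counter

def ngram_overlap(source: str, output: str, n: int = 2) -> float:
    """Fraction of the output's n-grams that also appear in the source.

    A toy version of the early statistical hallucination checks: low overlap
    suggests the output introduces content not grounded in the input, but
    the measure ignores semantics entirely.
    """
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(zip(*[tokens[i:] for i in range(n)]))

    src, out = ngrams(source), ngrams(output)
    total = sum(out.values())
    if total == 0:
        return 0.0
    covered = sum(min(cnt, src[g]) for g, cnt in out.items())
    return covered / total
```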
#### Safety.
Red teaming is a powerful technique for evaluating the safety and
robustness of AI
models [ganguli2022red](https://arxiv.org/abs/2209.07858), [casper2023explore](https://arxiv.org/abs/2306.09442). By generating
diverse and realistic scenarios designed to elicit unaligned or harmful
outputs [casper2023red](http://arxiv.org/pdf/2302.10894v3), red teaming can expose
vulnerabilities and weaknesses in AI
systems [perez2022red](https://aclanthology.org/2022.emnlp-main.225). For example,
[perez2022discovering](http://arxiv.org/pdf/2211.04476v2) use LMs to generate datasets for
evaluating the behavior of other LMs. They end up producing 154
high-quality datasets which are verified by humans, and discover new
cases of inverse scaling where LMs get worse with size.
[hubinger2024sleeper](https://arxiv.org/abs/2401.05566) leverage synthetic data to trigger
backdoor attacks to LMs at scale; they find LMs can exhibit deceptive
behavior and create a false impression of safety under such attacks, and
standard “safety training” could not remove such deception easily. These
methods demonstrate the feasibility of using AI assistance to scale up
human oversight [bowman2022measuring](https://arxiv.org/abs/2211.03540) over complex
problems and unseen domains.
#### Assisting human evaluation.
Recent studies have shown that in many cases, synthetic judgements from
large-scale LMs (LLMs) can serve as qualified, fast, and low-cost
alternatives to actual human
evaluation [doi:10.1073/pnas.2305016120](https://doi.org/10.1073/pnas.2305016120). Using GPT-4 as
the judge, Alpaca Eval [alpaca_eval](https://github.com/tatsu-lab/alpaca_eval) and MT
Bench [zheng2023judging](https://arxiv.org/pdf/2306.05685) are two popular benchmarks that
measure the comprehensive abilities of LM-based chatbots. In coding
tasks, synthetic environments are a common choice to aid human evaluation,
as humans can make the assessment more efficiently via actual execution
and analysis of running logs. [gu2024cruxeval](https://arxiv.org/abs/2401.03065) propose
CRUXEval, a code execution reasoning benchmark consisting of 800 Python
functions generated by CodeLLaMA-34B. Similarly,
[liu2024codemind](https://arxiv.org/abs/2402.09664) introduce CodeMind, a framework to
gauge the code reasoning abilities of LLMs on Independent Execution
Reasoning (IER), Dependent Execution Reasoning (DER), and Specification
Reasoning (SR). All these evaluations based on synthetic data show
strong correlation with real human judgements.
# Challenges and Limitations of Synthetic Data [sec:limitation_risks]
While synthetic data offers numerous benefits and applications, it is
crucial to acknowledge and address the potential challenges and
limitations associated with its use. This section delves into three
significant concerns surrounding synthetic data:
#### Misuse of synthetic data might proliferate misinformation.
The potential misuse of synthetic data is a significant concern that
must be addressed to ensure the responsible development of AI systems.
As current AI models become increasingly capable of generating human-like
data, ranging from text [reid2024gemini](https://arxiv.org/abs/2403.05530), [team2023gemini](https://arxiv.org/abs/2312.11805) and
images [saharia2022photorealistic](http://arxiv.org/pdf/2205.11487v1), [ramesh2022hierarchical](https://arxiv.org/abs/2204.06125) to
songs [^1] and even videos (e.g., OpenAI SORA [^2]), the risk of misuse grows. It can be
particularly dangerous when synthetic data is used to impersonate real
people, manipulate public opinion, or influence political processes.
Moreover, the dissemination of synthetic data-driven misinformation can
erode trust in legitimate information sources, making it increasingly
difficult for people to distinguish between truth and
falsehood [byman2023deepfakes](http://arxiv.org/pdf/2209.09111v1), [rid2020active](http://arxiv.org/pdf/2005.13466v2). To
mitigate these risks, it is crucial for researchers, developers, and
policymakers to establish clear guidelines and best practices for the
ethical generation and use of synthetic data, including robust
mechanisms for detecting and countering synthetic
misinformation [groh2022deepfake](http://arxiv.org/pdf/2105.06496v2). By proactively
addressing these challenges, we can harness the benefits of synthetic
data while minimizing its potential for harm.
#### Synthetic data might cause ambiguity in AI alignment.
The increasing use of synthetic data in aligning AI models (e.g.,
Constitutional AI [bai2022constitutional](https://arxiv.org/abs/2212.08073)) can introduce
significant ambiguity and uncertainty. The goal of AI alignment is to
ensure that AI systems behave in ways that are aligned with human values
and intentions. However, synthetic data, which is artificially generated
rather than collected from real-world sources, may not accurately
represent the nuances and complexities of human values and
preferences [zhou2024real](https://arxiv.org/abs/2403.05020). This discrepancy can lead to
AI models learning from data that is
biased [feng2023pretraining](https://arxiv.org/abs/2305.08283), [liu2021mitigating](https://ojs.aaai.org/index.php/AAAI/article/view/17744),
ungrounded [liu2022mind](https://arxiv.org/abs/2210.05359), [patel2021mapping](https://openreview.net/forum?id=gJcEM8sxHK), or
misrepresentative of real-world
scenarios [weidinger2021ethical](https://arxiv.org/abs/2112.04359), [ji2023survey](http://arxiv.org/pdf/2311.05232v1). As a
result, AI systems trained on synthetic data may exhibit behaviors that
are misaligned with human expectations, potentially leading to
unintended consequences or even harmful
actions [zou2023universal](https://arxiv.org/abs/2307.15043), [anderljung2023frontier](https://arxiv.org/abs/2307.03718).
Moreover, the ambiguity introduced by synthetic data can make it
challenging to interpret and understand the decision-making processes of
AI models [lightman2023let](https://arxiv.org/abs/2305.20050), further complicating the
task of ensuring alignment. To mitigate these risks, it is crucial for
researchers to carefully consider the limitations and potential
drawbacks of using synthetic data in alignment research and to develop
robust methods for validating and testing AI models trained on such
data.
#### Training with synthetic data makes evaluation decontamination harder.
The use of synthetic data in model training poses significant challenges
to fair evaluation. Evaluation benchmarks are often created by referring
to public text sources, such as coursework websites or forums.
Consequently, it is arguable that all publicly available benchmark test
cases might occasionally be included in the pre-training data of
LLMs [hoffmann2022empirical](http://arxiv.org/pdf/2309.08777v2), [gao2020pile](https://arxiv.org/abs/2101.00027). The use of
synthetic data exacerbates this issue rather than mitigating it.
Although the community has proposed several techniques to detect such
evaluation contamination, such as *min-$k$%
prob* [shi2023detecting](https://arxiv.org/pdf/2310.16789), which checks the probabilities
of $k$ long-tail tokens, these token-level decontamination methods are
inadequate when the model is trained with synthetic data. Synthetic data
might include rephrased versions of the benchmark
data [oren2023proving](https://arxiv.org/abs/2310.17623), [mattern2023membership](https://arxiv.org/abs/2305.18462), rendering
token-level decontamination ineffective. In addition to developing more
advanced evaluation contamination detection techniques, we recommend
that model developers invest in creating and maintaining in-house and
protected evaluation benchmarks. These proprietary benchmarks should be
carefully safeguarded to prevent leakage and ensure the integrity of the
evaluation process.
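As an illustration of the token-level detection approach mentioned above, here is a minimal sketch of the *min-$k$% prob* idea; the model name, the value of $k$, and any decision threshold are placeholder assumptions rather than the exact setup of [shi2023detecting](https://arxiv.org/pdf/2310.16789).

```python
# Sketch of min-k% prob: score a text by the average log-probability of its
# k% least likely tokens under the model; memorized (contaminated) texts tend
# to contain few very unlikely tokens, so higher scores suggest the text was
# seen during training. Model name and k are illustrative choices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def min_k_percent_prob(text: str, model, tokenizer, k: float = 0.2) -> float:
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits                     # (1, seq_len, vocab)
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    token_log_probs = log_probs.gather(
        1, enc["input_ids"][0, 1:].unsqueeze(-1)
    ).squeeze(-1)                                        # log p(token_t | tokens_<t)
    n = max(1, int(k * token_log_probs.numel()))
    lowest = torch.topk(token_log_probs, n, largest=False).values
    return lowest.mean().item()                          # higher => more likely seen in training

tokenizer = AutoTokenizer.from_pretrained("gpt2")        # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")
score = min_k_percent_prob("Example benchmark question ...", model, tokenizer)
```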
[^1]: Make songs with Suno AI:
[^2]: OpenAI Sora:
# Directions for Future Work [sec:future]
As the field of synthetic data continues to evolve, there are several
promising directions for future research and development. This section
outlines three key areas that warrant further exploration:
#### Synthetic data scaling.
The impressive performance of many over-trained small language models
(e.g., Mistral series models [jiang2023mistral](https://arxiv.org/abs/2310.06825), and
Gemma series models [team2024gemma](https://arxiv.org/abs/2403.08295), *inter alia*)
demonstrates the necessity of training with a large number of tokens (even
surpassing the compute-optimal Chinchilla
law [rae2021scaling](https://arxiv.org/abs/2112.11446)). However, whether we have similar
conclusions on the training with synthetic data is still an open
question, as the quality of synthetic data may not be as consistent as
real-world data [yu2024large](http://arxiv.org/pdf/2306.15895v2). Future research should
investigate the scaling laws for synthetic data and determine the
optimal balance between the quantity and quality of synthetic samples.
This exploration could help us understand the most effective strategies
for leveraging synthetic data in training large-scale language models,
potentially leading to more efficient and cost-effective
approaches [muennighoff2024scaling](http://arxiv.org/pdf/2202.03371v1).
#### Further improving quality and diversity of synthetic data.
While existing methods for generating synthetic data have shown promise,
there is still room for improvement in terms of creating high-quality,
attributed synthetic samples that closely mimic real-world data. Future
research should focus on developing new advanced techniques (or based on
existing ones such as Generative Adversarial Networks
(GANs) [goodfellow2020generative](http://arxiv.org/pdf/1810.12576v1) or Diffusion
Models [ho2020denoising](https://proceedings.neurips.cc/paper/2020/hash/4c5bcfec8584af0d967f1ab10179ca4b-Abstract.html), *inter alia*) that can control
and manipulate specific attributes of the generated data, enabling the
creation of diverse and customizable synthetic datasets. Additionally,
researchers should explore methods that can incorporate domain-specific
knowledge to ensure the generated data adheres to the underlying
constraints and patterns present in the target domain (e.g., via
Retrieval Augmented Generation
(RAG) [lewis2020retrieval](https://proceedings.neurips.cc/paper/2020/hash/6b493230205f780e1bc26945df7481e5-Abstract.html), [borgeaud2022improving](https://proceedings.mlr.press/v162/borgeaud22a.html)) while
maintaining the data quality. By advancing the state-of-the-art in
attributed synthetic data generation, we can unlock new opportunities
for privacy-preserving analysis [assefa2020generating](http://arxiv.org/pdf/2111.12984v1),
and model training across various fields, from healthcare (e.g.,
synthetic medical
images [frid2018synthetic](http://arxiv.org/pdf/1803.01229v1), [wei2019generative](http://arxiv.org/pdf/1910.05827v1)) and
finance (e.g., simulated trading
trajectories [zheng2022ai](http://arxiv.org/pdf/2106.01901v1)) to social
sciences [argyle2023out](http://arxiv.org/pdf/2209.06899v1), [park2023generative](http://arxiv.org/pdf/2208.04024v1) and beyond.
#### Towards high-fidelity and more efficient scalable oversight.
As AI models become increasingly complex and autonomous, it becomes
challenging to monitor and assess their behavior using traditional
oversight methods that rely on human supervision or real-world
data [amodei2016concrete](https://arxiv.org/abs/1606.06565). Future research should
explore the use of synthetic data for high-fidelity scalable oversight
of these advanced systems. Existing methods typically simulate a certain
scenario of social interaction, such as
debate [leike2018scalable](https://arxiv.org/abs/1811.07871),
reflection [zhang2023exploring](https://arxiv.org/abs/2310.02124), or
revision [liu2023training](https://arxiv.org/abs/2305.16960), to obtain synthetic data,
while new approaches could cover more comprehensive scenarios and more
modalities [sun2023aligning](https://arxiv.org/abs/2309.14525), as recent studies have
found that simulations covering only
narrow [cheng-etal-2023-compost](https://doi.org/10.18653/v1/2023.emnlp-main.669) or
over-simplified [zhou2024real](https://arxiv.org/abs/2403.05020) scenes raise many issues. Looking forward,
another growing direction could be how to achieve scalable oversight
more efficiently—given that we have the full control over the synthetic
data generation, we can probably provide more targeted oversights with
less synthetic data. As the need for effective AI governance and
regulation grows, synthetic data will play an increasingly vital role in
enabling more trustworthy scalable oversight mechanisms that promote
robust, accountable, and safe deployment of AI technologies for the
benefit of
society [askell2021general](https://arxiv.org/abs/2112.00861), [bowman2022measuring](https://arxiv.org/abs/2211.03540).
#### The emergent self-improvement capability.
We typically choose the most capable model to generate synthetic data,
as its generation is of higher quality. However, an intriguing question
arises: can a model generate synthetic data that is better than the data
it was trained on, thus enabling it to improve itself? This concept of
self-improvement through synthetic data generation is an exciting avenue
for future research. If a model can generate higher-quality data than
its original training set, it could potentially bootstrap its own
performance by iteratively learning from the enhanced synthetic
data [chen2024selfplay](https://arxiv.org/pdf/2401.01335). This self-improvement
capability could lead to the emergence of more advanced AI systems that
can autonomously refine their skills and knowledge over
time [burns2023weak](https://arxiv.org/abs/2312.09390), [huang-etal-2023-large](https://doi.org/10.18653/v1/2023.emnlp-main.67). Although
recent work shows encouraging progress in this
direction [chen2024selfplay](https://arxiv.org/pdf/2401.01335), [yuan2024self](https://arxiv.org/abs/2401.10020), the upper
bound of self-improvement and the underlying reasons for its
effectiveness remain open questions. Future research should investigate
the theoretical underpinnings and practical feasibility of
self-improvement through synthetic data generation in more diverse
scenarios, examining the necessary conditions, potential limitations,
and associated risks. By unlocking the potential of emergent
self-improvement capabilities, we could enable more adaptable,
efficient, and autonomous learning
processes [lecun2022path](http://arxiv.org/pdf/1409.8027v2).
# Conclusion
Synthetic data has emerged as a promising solution to address the
challenges of data scarcity, privacy concerns, and high costs in AI
development. By generating realistic and diverse datasets, synthetic
data enables the training and evaluation of AI models at scale across
various domains. As we approach human-level or even superhuman-level
intelligence, obtaining synthetic data becomes even more crucial, given
that models need better-than-average-human quality data to progress.
However, ensuring the factuality, fidelity, and lack of bias in
synthetic data remains a critical challenge.
Future research directions on synthetic data could focus on improving
the fidelity and controllability of generative models and developing
standardized evaluation and contamination protocols and tools. We could
also explore the integration of synthetic data with other techniques and
its application in other domains. Despite the challenges, the potential
benefits of synthetic data in advancing AI research are significant. By
leveraging synthetic data responsibly and effectively, we can build more
powerful, inclusive, and trustworthy AI systems that benefit society as
a whole.
Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset
2024-03-14
Hugo Laurençon, Léo Tronchon, Victor Sanh
Using vision-language models (VLMs) in web development presents a promising strategy to increase efficiency and unblock no-code solutions: by providing a screenshot or a sketch of a UI, a VLM could generate the code to reproduce it, for instance in a language like HTML. Despite the advancements in VLMs for various tasks, the specific challenge of converting a screenshot into a corresponding HTML has been minimally explored. We posit that this is mainly due to the absence of a suitable, high-quality dataset. This work introduces WebSight, a synthetic dataset consisting of 2 million pairs of HTML codes and their corresponding screenshots. We fine-tune a foundational VLM on our dataset and show proficiency in converting webpage screenshots to functional HTML code. To accelerate the research in this area, we open-source WebSight.
Hugging Face
Comparison of an original web page (input) on the left, and
the rendering of the code generated by our model - Sightseer - (output)
on the right. To match the visual aspect of the original webpage, the
model picked a suitable image background from www.unsplash.com
# Introduction
Current advancements in vision-language models (VLMs) have significantly
improved their capabilities, enabling them to master a variety of tasks
including image captioning, question answering, and optical character
recognition (OCR)
[openai2023gpt4](https://arxiv.org/pdf/2303.08774), [geminiteam2023gemini](https://arxiv.org/pdf/2312.11805), [hong2023cogagent](https://arxiv.org/pdf/2312.08914), [liu2024llavanext](https://llava-vl.github.io/blog/2024-01-30-llava-next/).
Despite these achievements, the task of converting screenshots of
websites or web components into usable HTML code—a process highly
valuable to web developers—remains relatively unexplored, particularly
in the open-source community. The development and open-source release of
a model capable of such a conversion could unlock new AI-powered tools
for UI developers, facilitating the creation of no-code modules and
plugins for design tools like Figma. For instance, the ability to
rapidly transform a design sketch into a functional UI component and
code could significantly increase the iteration pace for UI developers.
We posit that the primary challenge for VLMs to achieve proficiency in
this specific task does not stem from the inherent difficulty of the
task itself. Rather, it is the lack of a large, high-quality, dataset of
pairs of HTML codes and their associated screenshots that poses the
primary obstacle. In fact, VLMs are commonly trained on web-scale
datasets of image-text pairs
[schuhmann2022laion5b](https://arxiv.org/pdf/2210.08402), [gadre2023datacomp](https://arxiv.org/pdf/2304.14108) or multimodal
web documents [laurencon2023obelics](https://openreview.net/forum?id=SKN2hflBIZ), [zhu2023multimodal](https://openreview.net/forum?id=tOd8rSjcWz).
Having such a dataset of screenshots-HTML pairs as an open and
accessible artifact would significantly accelerate research in this area
by enabling the community to inspect the data, its limitations and
improve upon the dataset. Consequently, our initial focus is on
developing a dataset useful for the fine-tuning of VLMs for this task.
To accomplish this, several strategies can be considered:
1. *Leveraging existing webpages and their HTML codes.* The vast
repository of HTML files available online (and often captured in web
crawls like Common Crawl) presents a tempting resource for
generating pairs of screenshots and corresponding HTML codes by
simply rendering the HTML and capturing the output. However, this
approach poses significant challenges. HTML files found on the web
are often laden with noise like comments, scripts or data, and can
be excessively lengthy, encompassing a very large number of tokens,
sometimes even exceeding the maximum sequence length of most current
models. This complexity hinders a model’s ability to accurately
learn the correlation between the contents of a screenshot and the
underlying HTML syntax. Additionally, HTML codes frequently
incorporate references to external JavaScript (JS) or Cascading
Style Sheets (CSS) scripts, or rely on files located in separate
directories. This dependency complexifies the creation of a
self-contained HTML file that faithfully reproduces the intended
design in a screenshot. Given these obstacles, we opted to forego
this method in favor of a more controlled approach.
2. *Synthesizing HTML codes with Large Language Models (LLMs).* The
most recent large language models, especially those trained
extensively on programming languages, show remarkable proficiency in
generating high-quality code applicable to various domains,
including website development. This capability opens the door to
artificially create a vast corpus of HTML codes using an LLM
specialized in coding that has been further fine-tuned to follow
instructions. By adapting the prompts, we can introduce specific
constraints to the code generation process, such as controlling the
topic, the text length or the image placement in the websites. This
level of control not only ensures the production of relevant HTML
code but also makes them more suitable for VLMs by providing the
models with cleaner, more concise, and structured data that models
can be effectively trained on. Our study adopts this approach.
In response to the identified gap, we develop WebSight, a comprehensive
synthetic dataset comprising 2 million examples of HTML code paired with
corresponding screenshots. Leveraging this dataset, we proceed to
fine-tune our forthcoming foundational VLM of 8 billion parameters,
notably enhanced by robust OCR capabilities, to obtain the specialized
model Sightseer. This fine-tuning process yields promising outcomes,
demonstrating the model’s proficiency in converting webpage screenshots
into functional HTML code. Remarkably, the model also exhibits the
versatility to adapt to untrained scenarios, such as transforming
handwritten sketches into functional HTML code. To accelerate
advancements in this direction, we open source WebSight.
# Related work
[Nguyen2015ReverseEM](https://api.semanticscholar.org/CorpusID:7499368) uses a classical pipeline of
interface element recognition (images, texts, containers, etc.) with
computer vision and optical character recognition, followed by heuristics to
generate code from these detections. The authors show the effectiveness of
this approach on mobile UIs. [beltramelli2017pix2code](http://arxiv.org/pdf/1705.07962v2)
introduces an end-to-end method for generating computer code from
graphical user interface (GUI) screenshots using deep learning. The
model, trained end-to-end, can generate code for different platforms
(iOS, Android, and web) from a single input image. It uses convolutional
and recurrent neural networks to interpret GUI screenshots and generate
corresponding code. In [pix2struct](https://proceedings.mlr.press/v202/lee23g.html), authors pre-train a
VLM to convert masked screenshots of web pages into simplified HTML, and
show the effectiveness of this training objective to pretrain
foundational VLM that transfers well to a variety of downstream tasks.
Similar to Sightseer, their model accepts images of varying resolutions
as input.
In our recent beta release of WebSight-v0.1, we provided a dataset with
823K synthetic pairs of screenshots and associated HTML + traditional
CSS code. In the current version of WebSight discussed in this paper
(v0.2), we introduce significant improvements. First, WebSight-v0.2
replaces the colored rectangles used as image placeholders in
WebSight-v0.1 with real images that match the website’s content.
Additionally, we adopt Tailwind CSS to streamline the code and
facilitate the creation of visually appealing designs. Other notable
upgrades include a 2.5x increase in dataset size, higher-resolution
screenshots, and richer metadata.
WebSight-v0.1 has already proven to be a helpful resource. In
Design2Code [si2024design2code](https://arxiv.org/pdf/2403.03163), the authors create a
benchmark for evaluating VLMs at generating HTML code given a
screenshot. They also fine-tune an 18B-parameter VLM on WebSight-v0.1,
after observing that models trained on synthetic examples outperform
those trained on longer, more complex real-world code data.
# Construction of the dataset
Examples of synthetic web pages present in
WebSight.
#### Overview of the strategy
Our synthetic HTML code generation process involves two key steps for
maximizing diversity and quality. First, we employ a smaller language
model to generate a variety of website themes and designs. These
creative outputs serve as the foundation for the next stage, where they
are fed into the prompts of a larger language model, mostly trained on
code data. This LLM then generates the final HTML code, ensuring that
our dataset encompasses a wide range of styles while generating
high-quality codes.
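The snippet below is a condensed sketch of this two-step strategy using the `transformers` text-generation pipeline; the model identifiers match the ones named in this paper, but the prompts are illustrative placeholders, not the actual WebSight prompts (which are not reproduced here).

```python
# Two-step synthetic HTML generation sketch: a small instruct model proposes
# website concepts, and a code-specialized model turns each concept into a
# self-contained Tailwind CSS page. Prompts are illustrative placeholders.
from transformers import pipeline

concept_lm = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.1")
code_lm = pipeline("text-generation", model="deepseek-ai/deepseek-coder-33b-instruct")

# Step 1: generate a diverse website concept (topic, audience, visual style).
concept = concept_lm(
    "Invent a short concept for a website (topic, audience, visual style).",
    max_new_tokens=100,
    return_full_text=False,
)[0]["generated_text"]

# Step 2: generate a single, self-contained HTML file styled with Tailwind CSS.
html = code_lm(
    f"Write a complete HTML page using Tailwind CSS for this website concept:\n{concept}",
    max_new_tokens=2000,
    return_full_text=False,
)[0]["generated_text"]
```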
#### Generating diverse website concepts
We employ Mistral-7B-Instruct [jiang2023mistral](https://arxiv.org/pdf/2310.06825) to
generate several million unique website concepts and designs with the
prompt:
Here are 5 examples of generated concepts:
#### Opting for Tailwind CSS over traditional CSS
Generating visually diverse and appealing designs requires more than
just pure HTML. However, to simplify the learning process of VLMs,
employing standalone code is preferable to managing separate files. In
this context, Tailwind CSS emerges as an ideal solution. This
utility-first framework allows creating unique designs by providing a
wide array of utility classes, enables direct styling within the HTML
document, and eliminates the need for external style files. Tailwind CSS
offers an extensive array of predefined classes that mirror various CSS
properties. By integrating these utility classes into HTML elements, we
can efficiently style web pages, resulting in concise code that is
easier for VLMs to learn from.
#### Using a code specialized LLM to generate the HTML codes
To generate the final HTML codes, we leverage
Deepseek-Coder-33b-instruct [guo2024deepseekcoder](https://arxiv.org/pdf/2401.14196), a
state-of-the-art language model mostly trained on code data and
fine-tuned to follow instructions. We use the prompt:
An initial challenge was the text-only nature of our outputs,
in contrast with real websites, which contain many images. Integrating
images into HTML code is difficult, especially when trying
to find images relevant to the context of the web page. However, we
discovered an effective solution through photo stocks like
`https://source.unsplash.com/`, which offers the capability to
dynamically generate images based on keywords, thus providing images of
any size and relevant to any specified topics.
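A small illustration of this mechanism (an assumed usage pattern, not code from the paper): a keyword-based URL can be embedded directly in the generated HTML so that a topical image of the desired size is fetched at render time, without hosting any files.

```python
# Build an <img> tag pointing at a keyword-based Unsplash URL, so generated
# pages show a topical image of the requested size at render time.
def unsplash_img_tag(keywords: str, width: int = 800, height: int = 400) -> str:
    url = f"https://source.unsplash.com/{width}x{height}/?{keywords}"
    return f'<img src="{url}" alt="{keywords}" class="w-full object-cover">'

print(unsplash_img_tag("bakery,bread"))
# <img src="https://source.unsplash.com/800x400/?bakery,bread" alt="bakery,bread" ...>
```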
After a filtering step in which we discard web pages with insufficient
text, generic content or images not aligning with the website’s topic,
we finally ended up with 2 million web pages.
#### Screenshot capture process
We use Playwright[^1] to visualize and capture the output of our
generated HTML codes. We ensure that screenshots encompass the entire
web page, regardless of its length. As a result, our dataset features
screenshots in a wide range of resolutions. This diversity in image size
and format is useful for enhancing the robustness of our model.
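A minimal sketch of such a capture step with Playwright's Python API is shown below; the HTML string and output file name are placeholders.

```python
# Render an HTML string in a headless browser and take a full-page screenshot,
# so the capture covers the entire page regardless of its length.
from playwright.sync_api import sync_playwright

html = "<html><body class='p-8'><h1>Example page</h1></body></html>"

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page(viewport={"width": 1280, "height": 720})
    page.set_content(html, wait_until="networkidle")  # wait for images to load
    page.screenshot(path="screenshot.png", full_page=True)
    browser.close()
```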
#### Visualization of WebSight examples
Five examples present in WebSight are shown in Figure
2.
# Fine-tuning a foundation vision-language model on WebSight
#### Model prerequisites for webpage conversion
For a model to accurately convert webpage screenshots into HTML code, it
necessitates several capabilities. These include advanced OCR to
transcribe text from images, spatial understanding to arrange elements
on the page, and object recognition abilities to replicate images
similar to those in the input with the strategy explained above.
We use our forthcoming foundation VLM as the base model. It is built
upon Mistral-7B [jiang2023mistral](https://arxiv.org/pdf/2310.06825) and SigLIP-SO400M
[zhai2023sigmoid](https://arxiv.org/pdf/2303.15343), and is using the Patch n’ Pack
strategy [dehghani2023patch](https://arxiv.org/pdf/2307.06304) to preserve the original
aspect ratio of the input images, with a resolution of up to 980 pixels
for each side.
This base model was trained mostly on OBELICS
[laurencon2023obelics](https://openreview.net/forum?id=SKN2hflBIZ), synthetic captions of image/text
pairs datasets, and a combination of OCR datasets
[biten2022ocridl](https://arxiv.org/pdf/2202.12985).
Further insights into the model’s architecture and its training process
will be detailed upon its release.
#### Fine-tuning on WebSight
For the fine-tuning, instead of unfreezing all the weights, which
requires significantly lowering the learning rate for stable training,
we use the parameter efficient DoRA method [liu2024dora](https://arxiv.org/pdf/2402.09353)
with a rank 64. We use the same learning rate that was chosen during the
pre-training, $10^{-4}$, while seeing 2016 examples per iteration, for a
total of 1100 iterations, representing a bit less than one epoch.
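A rough sketch of this kind of DoRA fine-tuning setup with the `peft` library is given below; the base model identifier and the target modules are placeholder assumptions, since the paper's base VLM is unreleased at the time of writing.

```python
# Parameter-efficient DoRA fine-tuning sketch with PEFT; only the rank (64),
# learning rate (1e-4), and step budget are taken from the paper, the rest
# are illustrative assumptions.
from transformers import AutoModelForVision2Seq
from peft import LoraConfig, get_peft_model

model = AutoModelForVision2Seq.from_pretrained("HuggingFaceM4/idefics2-8b")  # placeholder base VLM
dora_config = LoraConfig(
    r=64,                     # rank 64, as in the paper
    lora_alpha=64,
    use_dora=True,            # weight-decomposed low-rank adaptation (DoRA)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
)
model = get_peft_model(model, dora_config)
model.print_trainable_parameters()
# Training would then run with learning rate 1e-4, ~2016 examples per
# optimizer step, for ~1100 steps (a bit less than one epoch of WebSight).
```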
We find that the validation loss is not a good indicator of the trained
model and in particular the quality of generated codes in real-world
cases. Consequently, we perform checkpoint selection by manually
inspecting generated samples rather than relying on the validation loss.
Despite the validation loss continuing to decrease significantly over
several epochs, it did not translate into an increased ability to
generalize to websites that differ from those in the training dataset.
# Qualitative evaluation
## Results on different screenshots
Comparison of an original web page (input) on the left, and
the rendering of the code generated by our model, Sightseer, (output) on
the right.
Figure 3 showcases various outputs
from Sightseer when fed with simple website designs. Notably, in
instances where the input contains a limited amount of text, this text
tends to be accurately preserved in the output.
Remarkably, Sightseer sometimes exhibits the capability to generalize
beyond its training dataset to websites that differ significantly in
appearance, as evidenced by its conversion of a handwritten website
sketch into functional HTML code.
## Failure cases
Examples where Sightseer-generated codes fall short in
replicating the intended design.
In our analysis, Sightseer struggles with complex website layouts,
excessive text, or designs significantly divergent from its training
data.
In some instances, generated code includes elements such as images,
text, or buttons that fail to appear upon rendering. This can result
from issues like text colored identically to its background or incorrect
syntax use, suggesting that Sightseer has not fully mastered the HTML +
Tailwind CSS syntax.
While the model produces visually more attractive websites, it sometimes
produces errors not observed in our initial model[^2] trained on
WebSight-v0.1, which used traditional CSS instead of Tailwind CSS. As a
more recent framework than traditional CSS, Tailwind CSS appears less
frequently in the pre-training data of the base LLM, and we
hypothesize that the LLM therefore has greater difficulty fully mastering its
syntax. We posit that starting with a foundational VLM pre-trained with
text-only HTML + Tailwind CSS in the mixture of data could significantly
enhance Sightseer’s translation accuracy, and we are exploring related
strategies to achieve this improvement.
# Conclusion
In this work, we introduce WebSight, a large synthetic dataset of 2
million pairs of HTML codes and corresponding renderings, and Sightseer,
a vision and language model with OCR ability fine-tuned on WebSight, as
contributions towards automating the conversion of webpage screenshots
to HTML code. By leveraging synthetic data generation and fine-tuning a
high-capacity base VLM on the dataset, we demonstrate a viable path to
accelerate UI development tasks and enhance no-code solutions with
increasingly more powerful AI-powered tools. By open-sourcing WebSight,
we aim to foster further innovation and research in this area.
[^1]:
[^2]:
MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts
2023-10-03
Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, Jianfeng Gao
Large Language Models (LLMs) and Large Multimodal Models (LMMs) exhibit impressive problem-solving skills in many tasks and domains, but their ability in mathematical reasoning in visual contexts has not been systematically studied. To bridge this gap, we present MathVista, a benchmark designed to combine challenges from diverse mathematical and visual tasks. It consists of 6,141 examples, derived from 28 existing multimodal datasets involving mathematics and 3 newly created datasets (i.e., IQTest, FunctionQA, and PaperQA). Completing these tasks requires fine-grained, deep visual understanding and compositional reasoning, which all state-of-the-art foundation models find challenging. With MathVista, we have conducted a comprehensive, quantitative evaluation of 12 prominent foundation models. The best-performing GPT-4V model achieves an overall accuracy of 49.9%, substantially outperforming Bard, the second-best performer, by 15.1%. Our in-depth analysis reveals that the superiority of GPT-4V is mainly attributed to its enhanced visual perception and mathematical reasoning. However, GPT-4V still falls short of human performance by 10.4%, as it often struggles to understand complex figures and perform rigorous reasoning. This significant gap underscores the critical role that MathVista will play in the development of general-purpose AI agents capable of tackling mathematically intensive and visually rich real-world tasks. We further explore the new ability of self-verification, the application of self-consistency, and the interactive chatbot capabilities of GPT-4V, highlighting its promising potential for future research. The project is available at https://mathvista.github.io/.
# Introduction [sec:introduction]
The complexity of mathematical reasoning is observed not only in textual
scenarios but also, significantly, in visual contexts. For instance, when assessing a child’s
mathematical and reasoning capabilities, problems are often designed to
encompass visual contexts in addition to arithmetic calculations
[stipek1989developmental](http://arxiv.org/pdf/2002.02184v2), [pollitt2020assessing](http://arxiv.org/pdf/1506.02794v1). At the
same time, AI agents with strong mathematical reasoning capabilities in
visual contexts have a wide range of real-world applications, such as
solving complex problems in educational
disciplines [seo2015solving](http://arxiv.org/pdf/2308.02823v1), [wang2017deep](http://arxiv.org/pdf/2302.03145v1), helping
analysts with logical queries about statistical
data [wu2023bloomberggpt](None), [yang2023fingpt](None), and assisting
in theorem proving and scientific discovery in advanced research
fields [taylor2022galactica](http://arxiv.org/pdf/2405.11265v1), [dong2023large](None), [trinh2024solving](http://arxiv.org/pdf/2404.06405v2).
Accuracies on our proposed MathVista benchmark across mathematical reasoning
and visual context types. PoT GPT-4 is a textual, program-aided LLM
augmented with the Bard caption and OCR text.
Numerous datasets have been curated to assess the mathematical reasoning
abilities of AI systems, with most presented purely in text form. Some
datasets such as ChartQA
[lu2021inter](None), [dahlgren2022clevr](None), [masry2022chartqa](None) have
explored mathematical reasoning in vision-language settings. However,
these datasets tend to either focus on specific tasks, like math word
problems, or particular visual contexts, such as geometry problems or
bar charts. General-purpose visual question answering (VQA) datasets on
natural scenes contain only a small portion of questions necessitating
mathematical reasoning, leaving a comprehensive investigation of
vision-language reasoning within a mathematical framework largely
unexplored.
Large Language Models (LLMs)
[openai2022chatgpt](https://openai.com/blog/chatgpt), [openai2023gpt4](None) and Large Multimodal
Models (LMMs)
[google2023bard](https://bard.google.com/), [openai2023gpt4v](https://openai.com/research/gpt-4v-system-card), [team2023gemini](http://arxiv.org/pdf/2405.12107v1) have
exhibited impressive problem-solving skills in many tasks and domains.
Recently, some studies have aimed to augment existing LLMs with
mathematical and scientific reasoning capabilities using external
tools [lu2023chameleon](http://arxiv.org/pdf/2305.00061v1), [wang2023scibench](None). However, the
ability of these foundation models to perform mathematical reasoning in
visual contexts has not been systematically examined.
In this paper, we present MathVista, a consolidated **Math**ematical
reasoning benchmark in **Vis**ual contexts. We propose a task
taxonomy to guide the development of MathVista: (1) we identify seven
mathematical reasoning types: *algebraic reasoning*, *arithmetic
reasoning*, *geometry reasoning*, *logical reasoning*, *numeric common
sense*, *scientific reasoning*, and *statistical reasoning*; (2) we
focus on five primary tasks: *figure question answering* (FQA),
*geometry problem solving* (GPS), *math word problem* (MWP), *textbook
question answering* (TQA), and *visual question answering* (VQA); and
(3) we encompass a diverse array of visual contexts, including natural
images, geometry diagrams, abstract scenes, synthetic scenes, as well as
various figures, charts, and plots. MathVista incorporates 28 existing multimodal
datasets, including 9 math-targeted question answering (MathQA) datasets
and 19 VQA datasets. In addition, we have created three new datasets
(*i.e.*, IQTest, FunctionQA, PaperQA) which are tailored to evaluating
logical reasoning on puzzle test figures, algebraic reasoning over
functional plots, and scientific reasoning with academic paper figures,
respectively. Overall, MathVista consists of 6,141 examples, with 736 of them
being newly curated
(Table [tab:statistics]). To facilitate
fine-grained evaluation, examples are annotated with metadata, including
question type, answer type, task category, grade level, visual context,
and required reasoning skills. Detailed descriptions of data collection
can be found in §[sec:dataset],
§[sec:collection_guideline],
and §[app:collection_details].
We conduct extensive experiments on MathVista to evaluate the reasoning abilities
of 12 foundation models known for their leading performance in
mathematical and multimodal reasoning. This ensemble includes three LLMs
(*i.e.*, ChatGPT, GPT-4, Claude-2), two proprietary LMMs (*i.e.*, GPT-4V,
Bard), and seven open-source LMMs. For LLMs, we examine zero-shot and
few-shot settings using two prompting strategies: chain-of-thought (CoT)
[wei2022chain](http://arxiv.org/pdf/2201.11903v6) and program-of-thought (PoT)
[chen2022program](http://arxiv.org/pdf/2211.12588v4). These LLMs can also be augmented with
off-the-shelf visual models for image captioning and OCR. We establish a
human performance baseline by engaging qualified human annotators with a
high school diploma or higher. We show that MathVista, featuring advanced topics
such as college curricula and scientific reasoning, is a very
challenging benchmark, with human performance reaching only 60.3%
accuracy.
Our results indicate that CoT GPT-4, the best-performing LLM without
visual tool augmentations, achieves an overall accuracy of 29.2%.
Multimodal Bard, the best-performing LMM, achieves 34.8%
(§[sec:results]), which attains only 58%
of human performance (34.8% vs 60.3%). When augmented with Bard captions
and OCR text, PoT GPT-4 obtains 33.9%, closely matching Multimodal Bard
(§[sec:fine_grained_results]).
Further analysis indicates that the Multimodal Bard model failures arise
from incorrect calculations and hallucinations caused by visual
perception and textual reasoning
(§[sec:qualitative_analysis]).
Examples of our newly annotated datasets: IQTest,
FunctionQA, and PaperQA.
# The Dataset [sec:dataset]
## Collection Guidelines
As discussed previously, there is a notable gap in existing benchmarks,
which primarily evaluate mathematical reasoning in textual contexts,
overlooking the intrinsic visual nature of many mathematical problems.
Our dataset, MathVista, is therefore motivated to bridge this gap, offering a
robust evaluation benchmark for mathematical reasoning intertwined with
visual understanding, thus pushing AI assistants towards general-purpose
capabilities. Our benchmark adheres to the following collection
guidelines: (1) it covers multiple tasks and topics to mirror real-world
applications; (2) it incorporates diverse visual contexts and
mathematical skills to foster a well-rounded evaluation; (3) it offers
varying levels of challenge to effectively probe and uncover the
potential limitations of current models; and (4) it provides robust
evaluation settings for deterministic evaluations.
The taxonomy for this work is introduced as follows: We identify seven
types of mathematical reasoning: *algebraic reasoning*, *arithmetic
reasoning*, *geometry reasoning*, *logical reasoning*, *numeric common
sense*, *scientific reasoning*, and *statistical reasoning*, with
detailed definitions provided in
§[sec:math_reasoning] and
examples shown in
§[app:math_examples]. We focus on
five primary tasks: *figure question answering* (FQA), which centers
around statistical reasoning over multiple charts and plots; *geometry
problem solving* (GPS), which deals with geometrical topics; *math word
problem* (MWP), which involves arithmetic reasoning in everyday
scenarios; *textbook question answering* (TQA), which usually entails
knowledge-intensive reasoning on scientific topics and figures; and
*visual question answering* (VQA). Furthermore, our objective is to
account for a diverse array of visual contexts, including natural
images, geometry diagrams, abstract scenes, synthetic scenes, multiple
charts and plots, scientific figures, tables, function plots, puzzle
test figures, and more, with examples shown in
§[app:visual_context].
## Data Collection [sec:data_collection]
#### Collection of MathQA datasets.
We collected nine MathQA datasets in multimodal settings, including four
for GPS, two for MWP with visual contexts of synthetic scenes, abstract
diagrams, and tables, and two for TQA on college curricula (see
§[sec:source_data]). Annotations
such as solutions, programs, parsing results, and grounded theorems are
also collected, providing demonstration examples for LLMs. Each source
dataset is limited to up to 400 examples to ensure a balanced
representation of each source in our final compiled benchmark. In total,
we collected 2,666 examples.
#### Review and collection of VQA datasets.
Many existing VQA datasets feature instances requiring mathematical
reasoning abilities, such as arithmetic operations or numeric common
sense. Incorporating these datasets enhances problem diversity in terms
of tasks, domains, visual contexts, and reasoning skills involved. We
reviewed more than 70 datasets, collecting 19 of them that contain
math-related instances and are publicly available, as listed in
§[sec:source_data]. Since these
datasets are not originally math-targeted, we initially designed
heuristic rules to automatically select examples likely to involve
mathematical reasoning from a large pool of candidates. Examples with
numeric answers or those containing quantity words (as listed in
§[sec:automatic_selection])
in the questions were selected. This automatic filtration yielded 4,949
VQA-format examples, though some false positive examples remained.
Therefore, we engaged three expert annotators to manually label these
examples to determine if they involve mathematical reasoning (more
details in § [sec:human_is_math]). Utilizing
majority voting and limiting each source dataset to 400 examples, we
finalized a collection of 2,739 examples.
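The following toy sketch illustrates this kind of pre-filtering heuristic; the quantity-word list is a small illustrative subset, not the authors' full list (which is given in their appendix).

```python
# Pre-filtering heuristic sketch: keep a VQA example if its answer is numeric
# or its question contains quantity-related words; manual review then removes
# the remaining false positives.
QUANTITY_WORDS = {"how many", "number", "total", "sum", "difference", "percent", "average"}

def looks_math_related(question: str, answer: str) -> bool:
    if answer.replace(".", "", 1).replace("-", "", 1).isdigit():
        return True                                 # numeric answer
    q = question.lower()
    return any(w in q for w in QUANTITY_WORDS)      # quantity word in the question

examples = [{"question": "How many bars exceed 20?", "answer": "3"},
            {"question": "What color is the bus?", "answer": "red"}]
candidates = [ex for ex in examples if looks_math_related(ex["question"], ex["answer"])]
```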
#### Collection of three new datasets.
While the source datasets we collected encompass multiple visual
contexts and mathematical reasoning abilities, certain scenarios remain
unaddressed: logical reasoning on puzzle test diagrams, statistical
reasoning on functional plots, and scientific reasoning on academic
figures. To address these gaps, we introduced three new datasets:
IQTest, FunctionQA, and PaperQA, with examples illustrated in
Figure [fig:our_new_3_datasets].
IQTest comprises 228 examples requiring inductive reasoning, abstract
thinking, pattern prediction, and calculations, sourced from puzzle test
figures on online learning platforms. FunctionQA, with 400 examples,
emphasizes subtle visual perceptions of functional plots and algebraic
reasoning concerning variables, expressions, equations, and functions.
PaperQA is a novel dataset featuring questions derived from informative
academic illustrations, including tables, figures, and charts from
online education resources, with 107 examples sourced from papers
released in August 2023 on Huggingface[^1].
To ensure data quality, all questions were manually annotated by
graduate students in STEM fields and further refined through a rigorous
review process. The GUI of the annotation tool is shown in
Figure [fig:gui_new_data_annotation]
in §[sec:annotate_new_data].
## Metadata Annotation
Fine-grained metadata facilitates a comprehensive analysis of models’
reasoning capabilities across various aspects. To this end, we annotate
the examples in with information including question type, answer type,
language, source, category, task, grade level, and visual context, which
can be accurately obtained from the details provided in the source
datasets. MathVista features seven different types of mathematical reasoning
abilities, as categorized in
Table [tab:math_definition]
(§[sec:math_reasoning]). Coarse
labels of mathematical reasoning can be automatically obtained from the
details of the source datasets. To verify the quality of automatic
annotation, expert annotators manually label the mathematical reasoning
categories from seven candidates for 1,000 examples, using the
annotation tool illustrated in
§[sec:human_math_reasoning].
The results show that 94.1% of the examples from automatic and human
annotations have the exact same set of reasoning types, while 98.79% of
the individual labels are identical, indicating that the automatic
annotation for the labeling of mathematical reasoning is highly
accurate.
## Data Preparation and Release
MathVista consists of 6,141 examples, divided into two subsets: *testmini* and
*test*. *testmini* contains 1,000 examples, intended for model
development validation or for those with limited computing resources.
The *test* set features the remaining 5,141 examples for standard
evaluation. Notably, the answer labels for *test* will not be publicly
released to prevent data contamination, and we will maintain an online
evaluation platform. To ensure that each source dataset is well
represented in *testmini* and to maintain a distribution in *testmini*
closely resembling the whole set, we adopted this sampling strategy: (1)
first, randomly sample questions with a threshold number of 4 for each
source dataset; (2) then, randomly sample the remaining questions for
each source dataset on its proportion in the entire set. The KL
Divergence and Total Variation (TV) distance between the *testmini* set
and the entire set are 0.008 and 0.035, respectively, suggesting that
*testmini* is close to the distribution of the whole set. We also
conducted several quality checks to address any unidentified errors.
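The sketch below re-implements the described two-stage sampling procedure under stated assumptions (budget of 1,000 examples, a floor of 4 per source); it is not the authors' code.

```python
# testmini sampling sketch: (1) take a fixed floor of 4 questions per source
# dataset, then (2) fill the remaining budget proportionally to each source's
# share of the full set, keeping testmini's distribution close to the whole set.
import random
from collections import defaultdict

def sample_testmini(examples, budget=1000, floor=4, seed=0):
    rng = random.Random(seed)
    by_source = defaultdict(list)
    for ex in examples:
        by_source[ex["source"]].append(ex)

    testmini = []
    for src, pool in by_source.items():            # stage 1: fixed floor per source
        picked = rng.sample(pool, min(floor, len(pool)))
        testmini.extend(picked)
        for ex in picked:
            pool.remove(ex)

    remaining = budget - len(testmini)
    total_left = sum(len(p) for p in by_source.values())
    for src, pool in by_source.items():            # stage 2: proportional fill
        k = min(len(pool), round(remaining * len(pool) / max(total_left, 1)))
        testmini.extend(rng.sample(pool, k))
    return testmini[:budget]
```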
## Data Analysis
| Statistic | Number |
|---|---|
| Total questions | 6,141 |
| - Multiple-choice questions | 3,392 (55.2%) |
| - Free-form questions | 2,749 (44.8%) |
| - Questions with annotations | 5,261 (85.6%) |
| - Questions newly annotated | 736 (12.0%) |
| Unique number of images | 5,487 |
| Unique number of questions | 4,746 |
| Unique number of answers | 1,464 |
| Source datasets | 31 |
| - Existing VQA datasets | 19 |
| - Existing MathQA datasets | 9 |
| - Our newly annotated datasets | 3 |
| Visual context (image) classes | 19 |
| Maximum question length | 213 |
| Maximum answer length | 27 |
| Maximum choice number | 8 |
| Average question length | 15.6 |
| Average answer length | 1.2 |
| Average choice number | 3.4 |
Source dataset distribution of MathVista. FQA: figure question
answering, GPS: geometry problem solving, MWP: math word problem, TQA:
textbook question answering, VQA: visual question
answering.
The main statistics of MathVista are presented in Table
[tab:statistics]. There are two
types of questions: multiple-choice and free-form. Answers to free-form
questions are categorized as integers, floating numbers, or lists. The
large unique number of images, questions, and answers ensures pattern
diversity in MathVista. It is derived from 31 source datasets, including three newly
annotated datasets to address the missing types of mathematical
reasoning over specific visual contexts. Dataset examples in Table
[tab:math_examples]
(§[app:math_examples] ) highlight
the richness of mathematical reasoning involved. Examples in
§[app:visual_context] demonstrate
the diverse visual contexts present in MathVista. Further details on data
analysis are available in
§[app:data_analysis].
[^1]:
# Experiments
## Evaluation Protocols [sec:evaluation_protocol]
Recent LLMs and LMMs are typically instructed to generate long responses
in conventional settings rather than short text. We therefore propose a new
strategy for benchmarking MathVista, rather than relying on human-designed or template
matching rules [lu2022learn](http://arxiv.org/pdf/2209.09513v2). The evaluation process
consists of three stages: *response generation*, *answer extraction*,
and *score calculation*. Initially, the baselines generate responses
given the input query, which incorporates the task description, the
question, the choices, and the metadata, using the template defined in
Table [tab:promt_response_generation]
(§[sec:promt_response_generation]).
Next, the short answer text is extracted from the detailed response. We
propose an answer extractor
(§[sec:promt_answer_extraction])
based on LLMs such as GPT-4, inspired by its remarkable ability for text
processing [wei2022chain](http://arxiv.org/pdf/2201.11903v6). A preliminary study of 200
examples shows that GPT-4 can extract the answer text with more than
99.5% accuracy. Finally, the extracted answer is normalized to a
required answer format (e.g., an option letter or an integer), and the
target metric scores are computed. Taking advantage of the fact that the
instances in MathVista are either multiple-choice questions for textual answers or
free-form questions for numerical answers, accuracy scores are used as
metrics for deterministic evaluation.
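A schematic sketch of this three-stage protocol is shown below; `model_respond` and `llm_extract_answer` are hypothetical placeholders for the evaluated model and the GPT-4-based answer extractor, with only the normalization and accuracy computation spelled out.

```python
# Three-stage evaluation sketch: response generation, answer extraction, and
# score calculation (accuracy over normalized answers).
def normalize(extracted: str, example: dict):
    if example["question_type"] == "multi_choice":
        return extracted.strip().upper()[:1]          # e.g. "(B) No" -> "B"
    try:
        return float(extracted.replace(",", ""))      # free-form numeric answer
    except ValueError:
        return None

def evaluate(examples, model_respond, llm_extract_answer):
    correct = 0
    for ex in examples:
        response = model_respond(ex["query"])          # 1. response generation
        extracted = llm_extract_answer(response, ex)   # 2. answer extraction
        prediction = normalize(extracted, ex)          # 3. score calculation
        correct += int(prediction == ex["answer"])
    return correct / len(examples)                     # accuracy
```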
## Experimental Setup [sec:experimental_setup]
We evaluate the models on MathVista under three setups: (a) *Text-Only LLMs*
including ChatGPT [openai2022chatgpt](https://openai.com/blog/chatgpt), GPT-4
[openai2023gpt4](None), and Claude-2 [claude2](https://www.anthropic.com/index/claude-2)
in zero-shot and two-shot settings with Chain-of-Thought (CoT)
[wei2022chain](http://arxiv.org/pdf/2201.11903v6) and Program-of-Thought (PoT)
[chen2022program](http://arxiv.org/pdf/2211.12588v4), (b) *Augmented-LLMs* where the LLMs
are provided with additional visual information including the generated
image captions from Multimodal Bard [google2023bard](https://bard.google.com/) and
the detected OCR text from EasyOCR
[jaidedai2020easyocr](https://github.com/JaidedAI/EasyOCR), (c) *LMMs* that include
open-source models such as IDEFICS-9B
[laurencon2023obelics](https://arxiv.org/pdf/2306.16527), mPLUG-OWL-LLaMA-7B
[ye2023mplug](None), miniGPT-4-LLaMA-2-7B
[zhu2023minigpt](None), LLaMA-Adapter-V2-7B
[gao2023llamaadapterv2](None), InstructBLIP-Vicuna-7B
[instructblip](https://arxiv.org/pdf/2305.06500), LLaVA-LLaMA-2-13B
[liu2023llava](http://arxiv.org/pdf/2402.11690v1), LLaVAR [zhang2023llavar](None),
and the two proprietary LMMs, Multimodal Bard and GPT-4V. We provide the prompts for LLMs and the hyperparameters used for
LMMs in §[app:setup].
## Experimental Results [sec:results]
We compare the performance of several models, including Text-only LLMs,
Augmented LLMs, and LMMs on MathVista in Table
[tab:mathvista]. We include random
chance (*i.e.*, one of the options in multiple-choice questions, and
empty in the free-form questions) and frequency guess
(§[sec:frequent_guess]) as naive
baselines. Additionally, we established a human performance baseline
using Amazon Mechanical Turk. Eligible human annotators must have a
satisfactory annotating history, successfully pass qualification
examples, and possess a high school degree or higher. We asked each
annotator to complete five questions within 20 minutes. Further details
can be found in
§[sec:human_performance].
Among text-only LLMs, all models outperform the random baselines, with
the 2-shot GPT-4 using chain-of-thought (CoT) prompting achieving 29.2%.
The limited performance of text-only LLMs suggests that our dataset
requires models to reason within visual contexts for optimal results.
When equipped with image captions and detected OCR text, augmented LLMs
exhibit superior performance compared to their text-only counterparts on
MathVista. Specifically, the best-performing augmented LLM is the 2-shot GPT-4
employing program-of-thought (PoT) prompting, which scores 33.9%.
The open-source models (IDEFICS to LLaVA) achieve underwhelming
performance on MathVista. This can be attributed to their lack of math reasoning
capabilities, text recognition (useful for math word problems), shape
detection (useful for geometrical problems), and chart understanding.
Notably, these models utilize different model architectures for
processing the vision (e.g., OpenCLIP, CLIP, ViT-G) and language (e.g.,
LLaMA-1, LLaMA-2), different alignment strategies (e.g., MLP projection
in LLaVA, Q-former in InstructBLIP, visual abstractor in mPLUG-Owl), and
instruction tuning data (e.g., 150K instruction-response pairs from
LLaVA data, 3,500 instruction-response pairs from miniGPT-4). While
fine-tuned with instruction-following data from text-rich images, LLaVAR
does not perform well, indicating that strong text recognition abilities
do not guarantee high performance on MathVista, which requires comprehensive
visual perception and mathematical reasoning. This underscores that
there are immense possibilities for innovations in model, data, or
training objectives to improve the zero-shot performance of LMMs on MathVista.
## Fine-grained Results [sec:fine_grained_results]
We also report fine-grained scores for a comprehensive study of the
capabilities of existing models across different tasks (Table
[tab:mathvista]), mathematical
reasoning abilities (Table
[tab:mathvista], Figures
[fig:tease_scores],
[fig:math_reasoning_bar_chart]),
visual context types (Figures
[fig:tease_scores],
[fig:visual_context_bar_chart]),
and grade levels (Figure
[fig:grade_level_bar_chart]).
We perform an ablation study on the augmented LLMs and present the
results in
Table [fig:llm_ablation_study]
(see §[sec:llm_ablation_study]).
The gap in the performance of the Augmented LLMs can be attributed to
poor image captions, which may not adequately describe the math in
visual contexts, the inability of the OCR to detect shapes useful for
geometrical reasoning, and the lack of mathematical reasoning
capabilities.
## Qualitative Analysis [sec:qualitative_analysis]
Error analysis of Bard results: (a) presents errors in
answers and explanations; (b) delves into the details of wrong
explanations. Notations: “Answer” is “Ans.”, “Explanation” is “Exp.”,
“Partially Correct” is “Partial”, and “Not applicable” refers to
unanswerable or indeterminate cases.
**Question:** Is the function ($f:$ $R$ to $R$) injective?
**Choices:**
(A) Yes
(B) No
**Response from Bard:**
The function $f(x) = x^2$ is not injective because it takes two
different inputs, $-1$ and $1$, to the same output, $1$. This means that
there is no one-to-one correspondence between the domain and the range
of the function.
In the image, we can see that the graph of the function passes through
the points $(-1, 1)$ and $(1, 1)$. This means that $f(-1) = f(1) = 1$.
So, if we have two different inputs, $-1$ and $1$, they both map to the
same output, $1$. This violates the definition of an injective
function.
Therefore, the answer is (B).
**Correct output:** (B) No
**Question:** Find $z$.
**Choices:**
(A) 7 (B) 9
(C) 12 (D) 15
**Response from Bard:**
The Law of Cosines states that for a triangle with side lengths $a$,
$b$, and $c$ and angle measure $\theta$ opposite side $c$, the following
equation holds: $c^2 = a^2 + b^2 - 2ab\cos(\theta)$
In this case, we have $a = 7$, $b = 9$, and $\theta = 90^\circ$. Plugging these values into
the Law of Cosines, we get: $z^2 = 7^2 + 9^2 - 2(7)(9)\cos(90^\circ)$.
Since $\cos(90^\circ) = 0$, we can simplify the equation to get: $z^2 = 49 + 81 = 130$.
Taking the square root of both sides, we get $z = \sqrt{130} \approx 11.4$. Since $\sqrt{130}$ is not
among the answer choices, we round $z$ to the nearest integer, which is
$\boxed{12}$.
**Correct output:** (C) 12
#### Success and failure analysis of Multimodal Bard.
In §1.3, we observe that Multimodal Bard
achieves the highest average accuracy on MathVista. Here, we analyze its
predictions through human evaluation to understand its mode of success
and failure. To do so, we ask the human workers, from Amazon Mechanical
Turk (AMT), to study Bard’s predictions given the math question, its
associated image, and the ground truth from the MathVista dataset for 250 instances.
Specifically, workers were instructed to decide whether the predictions
contained the correct answer with the correct explanation. If the
workers find that the model’s explanation is incorrect, they had to
choose whether the wrong explanation was due to various failure modes
such as incorrect reasoning with *hallucination* or wrong calculations.
In our setup, we define hallucination as an introduction of incorrect
facts, in the model explanation, that is not mentioned in the context of
the image or question (e.g., in Figure
[fig:visual_hallucination]
and Figure
[fig:textual_hallucination]).
More details can be found in
§[sec:human_study_bard].
We present the distribution of the quality of Bard’s predictions, judged
by the human annotators, in Figure
1 (a). We find that 44.6% of Bard's predictions had incorrect answers
with incorrect explanations. Interestingly, we observe that Bard responds
with partially (6.8%) or completely (8.1%) incorrect explanations despite
giving the correct answer to the input image and question, highlighting
its tendency to reach the correct answer for the wrong reasons. In Figure
1 (b), we present the
distribution over possible reasons when Bard provides incorrect
explanations. Notably, we find that 49.6% of its responses contain
hallucinations. Our analysis highlights that hallucination is a major
source of errors in the generative foundation
models [lu2023dl4math](http://arxiv.org/pdf/2212.10535v2), [ji2023survey](http://arxiv.org/pdf/2311.05232v1). We also observe
that the model responds with correct reasoning but either hallucinates
(18.6%) or performs wrong calculations (19.5%), resulting in an
explanation that is wrong overall.
#### Qualitative examples of Multimodal Bard.
We also present a few qualitative examples of Bard’s predictions. In
Figure [fig:bard_examples] (a), we find
that Bard generates the correct answer with the correct explanation,
including detecting the correct function (*i.e.*, $f(x) = x^2$) and
analyzing its properties (*i.e.*, injective) to answer the question.
However, in Figure
[fig:bard_examples] (b), we
observe that the model provides the correct answer (*i.e.*, 12) but with
an incorrect explanation (*i.e.*, using the law of cosines when the
question requires an understanding of the properties of isosceles
triangles). We present more examples in
§[sec:more_examples_bard].
Overall, our analysis of Bard highlights its modes of failure in detail,
which could guide future foundation model design to address these
issues.
**Question:** A race car driver kept track of how many laps he drove in
the past 5 days. What is the mode of the numbers?
**Detected OCR text:**
(\[55, 9\], Laps driven),
(\[21, 34\], Day),
(\[81, 37\], Number of laps),
(\[1, 64\], Thursday), (\[129, 67\], 53),
(\[3, 99\], Friday), (\[129, 99\], 53),
(\[1, 126\], Saturday), (\[129, 129\], 53),
(\[3, 161\], Sunday), (\[129, 161\], 55),
(\[1, 188\], Monday), (\[129, 191\], 55)
**Response from GPT-4:**
``` python
def mode(laps):
    return max(set(laps), key=laps.count)

laps = [53, 53, 53, 55, 55]
print(mode(laps))
```
**Correct output:**
**Question:** What is this function most likely be?
**Choices:**
(A) polynomial (B) trigonometric
(C) exponential (D) logarithmic
**Caption from Bard:**
The image shows a graph of a function $f(x) = c$ on a graph paper. The
$x$ axis is the horizontal axis and the $y$ axis is the vertical axis.
The graph is a horizontal line that passes through the point $(0, c)$.
... The text “$(0, c)$” means that the point $(0, c)$ is on the graph.
The text “$f(x) = c$” means that the function $f(x)$ is equal to $c$ for
all values of $x$. ... The graph of a constant function is a horizontal
line. ...
**Response from GPT-4:**
The function shown in the image is a constant function, which is a type
of polynomial function. Thus, the answer is A.
**Correct output:**
#### Qualitative examples of Augmented GPT-4.
Augmented with external visual models, CoT GPT-4 and PoT GPT-4 achieve
performance comparable to Multimodal Bard. As shown in
Figure [fig:gpt4_examples] (a), provided
with the accurate OCR text detected in the image, PoT GPT-4 accurately
understands the structural information of the image and generates a code
snippet to perform precise statistical reasoning. In Figure
[fig:gpt4_examples] (b), the
caption provides some accurate descriptions of the image (e.g.,
$f(x)=c$) along with hallucinations (e.g., $y=3$, the line passes through
$(0,3)$) introduced by the external Bard model. Although CoT GPT-4
predicts the correct answer from this partially correct information, the
quality of the visual information supplied by external models directly
affects visual perception and thus the final mathematical reasoning
performance. Examples in
§[sec:model_comparison] show
failure cases due to hallucination caused by external visual models.
# Related Work [sec:short_related_work]
Several benchmarks
[amini2019mathqa](http://arxiv.org/pdf/1905.13319v1), [cobbe2021training](http://arxiv.org/pdf/2110.14168v2), [mishra2022lila](None), [frieder2023mathematical](http://arxiv.org/pdf/2306.16282v1)
have emerged to assess the mathematical reasoning capabilities of LLMs,
but most focus solely on text-based tasks. Current benchmarks, such as
GSM-8K [cobbe2021training](http://arxiv.org/pdf/2110.14168v2), exhibit performance
saturation. Given the rise of LMMs [li2023multimodal](http://arxiv.org/pdf/2309.10020v1),
there is a need for robust multimodal benchmarks in scientific domains.
To address this gap, we introduce a math reasoning dataset that
incorporates visual contexts.
VQA datasets
[antol2015vqa](None), [gurari2018vizwiz](None), [mobasher101parsvqa](None)
gauge the visual reasoning abilities of LMMs. Recent studies explore
assessing LMMs beyond natural images, including abstract scenes,
geometry diagrams, figures, charts, documents, and synthetic images
[lu2021inter](None), [kahou2017figureqa](None), [masry2022chartqa](None). In
this work, we introduce new datasets (IQTest, FunctionQA, PaperQA) to
create a holistic benchmark for evaluating mathematical reasoning.
Generative foundation models like GPT-3, ChatGPT, GPT-4, Claude, and
LLaMA have enabled diverse task solutions without fine-tuning.
Specialized pretraining methods like
Pix2Struct [lee2023pix2struct](None),
MatCha [liu2022matcha](None), and
UniChart [masry2023unichart](None) enhance chart reasoning in
visual contexts. Models like LLaVA, miniGPT4, InstructBLIP, and Bard
leverage large-scale image-text data, while specialized versions, such
as LLaVAR [zhang2023llavar](None), [ye2023mplug](None), emphasize
document understanding and math comprehension. Recent
works [bitton2023visit](None), [yu2023mm](None) evaluate
instruction-following and reasoning capabilities, underscoring the
growing importance of generative foundation models in practical
applications. We introduce as a benchmark to evaluate their math
reasoning capabilities in varied visual contexts.
# Conclusion
# Detailed Related Work [sec:related_work]
#### Mathematical reasoning benchmarks.
Recently, numerous benchmarks
[amini2019mathqa](http://arxiv.org/pdf/1905.13319v1), [cobbe2021training](http://arxiv.org/pdf/2110.14168v2), [mishra2022lila](None), [frieder2023mathematical](http://arxiv.org/pdf/2306.16282v1)
have been proposed to evaluate the mathematical reasoning capabilities
of Large Language Models (LLMs). However, most of these are textual
only [lu2023dl4math](http://arxiv.org/pdf/2212.10535v2), despite a substantial amount of
mathematical information and reasoning being encapsulated in visual
modalities. Meanwhile, some datasets exhibit performance saturation; for
instance, GPT-4 achieves 92.0% accuracy on
GSM-8K [cobbe2021training](http://arxiv.org/pdf/2110.14168v2), a dataset of grade-school
mathematics questions. On the other hand, the recent rapid advancement
of Large Multimodal Models (LMMs) necessitates the establishment of
robust multimodal benchmarks. However, current multimodal reasoning
benchmarks provide limited coverage of rigorous and scientific domains
[antol2015vqa](None), [kembhavi2016diagram](http://arxiv.org/pdf/1603.07396v1), [kahou2017figureqa](None), [mathew2022infographicvqa](None),
which are key components for creating general-purpose AI assistants. To
bridge this gap, it is crucial to develop a robust math reasoning
dataset that integrates visual contexts.
#### Vision-language reasoning benchmarks.
High-quality evaluation datasets and benchmarks are a cornerstone for
assessing the progress of machine learning models to solve real-world
tasks [liao2021we](None). Prior studies such as
VQA [antol2015vqa](None), [goyal2017making](None),
VizWiz [gurari2018vizwiz](None), and
ParsVQA-Caps [mobasher101parsvqa](None) assess the
general-purpose visual question answering abilities of the LMMs, with or
without task-specific training, on open-ended questions about images. In
addition, there are several works that focus on evaluating specific
skills of the LMMs beyond natural scenes, such as abstract scenes and
shapes [antol2015vqa](None), [lu2021iconqa](None), [ji2022abstract](http://arxiv.org/pdf/2211.16492v1),
geometry diagrams
[seo2015solving](http://arxiv.org/pdf/2308.02823v1), [lu2021inter](None), [chen2022unigeo](None), [cao2022augmented](http://arxiv.org/pdf/2206.02978v1),
figures and
charts [methani2020plotqa](None), [masry2022chartqa](None), [kahou2017figureqa](None), [chang2022mapqa](None), [kafle2018dvqa](None),
documents (text in images)
[singh2019towards](None), [mathew2022infographicvqa](None), [liu2023hidden](None),
or synthetic
images [dahlgren2022clevr](None), [li2023super](None), [bitton2023breaking](None).
Besides, there has been significant progress on developing datasets to
judge LMMs on skills that require external knowledge
[schwenk2022okvqa](None), [shah2019kvqa](None), common sense
reasoning [zellers2019recognition](http://arxiv.org/pdf/2402.17213v1), [yin2021broaden](http://arxiv.org/pdf/2402.17213v1),
scientific-knowledge [lu2022learn](http://arxiv.org/pdf/2209.09513v2), [kembhavi2017you](None), [kembhavi2016diagram](http://arxiv.org/pdf/1603.07396v1),
medical understanding [zhang2023pmc](None), [lau2018dataset](http://arxiv.org/pdf/2311.18681v1). In
this work, we create new datasets (IQTest, FunctionQA, PaperQA) and
subsequently design a benchmark for holistic evaluation of the math
reasoning capabilities of the LMMs.
#### Generative foundation models and their evaluation.
Recently, there has been a surge of generative foundation
models [bommasani2021opportunities](http://arxiv.org/pdf/2110.10024v1) that are trained on
web-scale data, such as GPT-3, ChatGPT, GPT-4, Claude, LLaMA,
LLaMA-Adapter [brown2020language](http://arxiv.org/pdf/2112.07522v2), [openai2022chatgpt](https://openai.com/blog/chatgpt), [openai2023gpt4](None), [claude2](https://www.anthropic.com/index/claude-2), [touvron2023llama](None), [llamaadapter2023](None),
with the ability to solve a wide range of downstream
tasks [wei2022emergent](http://arxiv.org/pdf/2403.15796v2) without any task-specific
finetuning. Prior work has focused on evaluating their abilities to
respond to the queries from various disciplines, grounded in text, such
as QA, math, medicine, coding and
science [bubeck2023sparks](http://arxiv.org/pdf/2303.12712v5), [nori2023capabilities](None), [chen2021evaluating](http://arxiv.org/pdf/1810.11895v3), [fu2023codeapex](None), [sun2023scieval](None), [wang2023scibench](None), [huang2023c](http://arxiv.org/pdf/2305.08322v3), [huang2022language](http://arxiv.org/pdf/2404.04619v1), [liu2023agentbench](None), [llamaadapter2023](None).
Prior work, such as Pix2Struct [lee2023pix2struct](None),
MatCha [liu2022matcha](None), and
UniChart [masry2023unichart](None), has focused on developing
specialized pretraining recipes for improved math and chart reasoning in
visual contexts.
On the vision-language side, there are several generative foundation
models such as LLaVA, miniGPT4, InstructBLIP, Flamingo, LLaMA-Adapter
V2, Multimodal
Bard [liu2023llava](http://arxiv.org/pdf/2402.11690v1), [zhu2023minigpt](None), [instructblip](https://arxiv.org/pdf/2305.06500), [alayrac2022flamingo](http://arxiv.org/pdf/2205.07065v1), [awadalla2023openflamingo](None), [gao2023llamaadapterv2](None), [google2023bard](https://bard.google.com/)
that are trained on vast amounts of
paired [schuhmann2022laion](None), [sharma2018conceptual](None), [lin2014microsoft](None)
and interleaved image-text data [zhu2023multimodal](None). In
addition, there has been recent development on specialized versions of
these LMMs for document understanding where visual contexts require text
recognition, math understanding being one of
them [zhang2023llavar](None), [ye2023mplug](None). In recent times,
there have been several works, such as Visit-Bench, LVLM-eHub,
MMBench [bitton2023visit](None), [yu2023mm](None), [liu2023mmbench](None), [xu2023lvlm](None), [shao2023tiny](None),
that assess their instruction-following and reasoning capabilities. As
the generative foundation models become more relevant to real-world
applications, unlike prior work, we propose to benchmark their
capabilities of math reasoning (logical, arithmetic, statistical) on a
diverse set of visual contexts (word problems in images, natural scenes,
geometrical shapes, and plots).
# Limitations of the Benchmark
# Data Collection Guidelines [sec:collection_guideline]
## Mathematical Reasoning Definition [sec:math_reasoning]
Seven mathematical reasoning types are defined in Table
1.
| **Math Reasoning** | **Description** |
|:---|:---|
| Arithmetic Reasoning () | It covers the fundamental operations such as addition, subtraction, multiplication, division, and understanding of number properties. It may also include the ability to interpret numerical data in different forms. |
| Statistical Reasoning () | It focuses on data interpretation and analysis, including measures (mean, median, mode), dispersion metrics (standard deviation, range), probability concepts, regression, correlation, and data inferences. It also identifies trends, outliers, and patterns. |
| Algebraic Reasoning () | It encompasses understanding variables, equations, and the manipulation of expressions with polynomials and exponents. It also covers solving simple to complex equations, and grasping functions, their properties, and graphical depictions. |
| Geometry Reasoning () | It emphasizes spatial understanding, analysis of 2D and 3D figures, and reasoning about their shapes, sizes, and relationships. It includes symmetry, congruency, similarity, area, volume, and transformations. |
| Numeric Commonsense Reasoning () | It involves intuitive understanding of daily numerical concepts, including understanding time differences, numerical judgment, and estimates. It covers temporal reasoning, spatial numeric assessments, and practical uses like budgeting and time reading. |
| Scientific Reasoning () | It deals with the application of mathematical concepts in scientific contexts. This includes scientific notations, formula use, understanding rates, proportions, and percentages in practical situations, and problem-solving in scientific inquiries. |
| Logical Reasoning () | It focuses on critical thinking and deduction from provided information, including pattern recognition, sequence understanding, predictions, and statement evaluation. Key components include premises, conclusions, and the use of abstract reasoning. |

Definitions and proportions of seven mathematical reasoning categories
in .
| **Math** | **Examples** |
|:---|:---|
| ARI | |
| **Solution:** | |
| Find the cost of the silk scraps. Multiply: \$9.08 $\times$ 4 = \$36.32 | |
| Find the cost of the canvas scraps. Multiply: \$8.17 $\times$ 4 = \$32.68 | |
| Now find the total cost by adding: \$36.32 + \$32.68 = \$69 | |
| She spent \$69. | |
| **Answer:** 69 | |
| STA | |
| **Answer:** 2 | |
| ALG | |
| **Choices:** (A) larger than (B) equal to (C) smaller than | |
| **Answer:** (A) larger than | |
| | |
| **Question:** How many zeros does this function have? | |
| **Answer:** 1 | |
| | |
| **Question:** What is the value of $y$ at $x=1$? | |
| **Answer:** 0 | |
| | |
| GEO | |
| **Diagram logic forms:** | |
| `PointLiesOnLine(D, Line(B, A))` | |
| `PointLiesOnCircle(B, Circle(D, radius))` | |
| `PointLiesOnCircle(A, Circle(D, radius))` | |
| `PointLiesOnCircle(C, Circle(D, radius))` | |
| **Answer:** (C) 8.5 | |
| NUM | |
| **Named entities:** Winston Churchill, Charles de Gaulle | |
| **Wiki caption**: Winston Churchill and General de Gaulle at Marrakesh, January 1944 | |
| **Answer:** 16 | |
| SCI | |
| **Answer:** 5.77 | |
| LOG | |
| **Solution:** | |
| Circle + Square = 5, Triangle + Triangle = 8, | |
| Triangle = 4. | |
| Circle + Triangle = 7, Circle = 3. | |
| Therefore Square = 2 | |
| **Answer:** 2 | |
Examples of seven mathematical reasoning categories in .
## Visual Context Types [app:visual_context]
Figures in this section show examples of each visual context type:
geometry diagram, synthetic scene, bar chart, natural image, scientific
figure, table, function plot, abstract scene, puzzle test, scatter plot,
line plot, pie chart, document image, medical image, and other types
(word cloud, map chart, radar chart, violin plot, and heatmap chart).
## Source Dataset Summary [sec:source_data]
The source datasets are summarized in Table
3.
## Human Labeling of Mathematical Problems [sec:human_is_math]
GUI for labeling if a problem involves mathematical
reasoning.
We are compiling a dataset that incorporates image context and
involves mathematical reasoning (MathQA in visual contexts). We have
gathered a set of examples in which some involve mathematical reasoning,
while others do not.
In our task, a question can be classified as a mathematical problem if
it
- Involves numbers or symbols in the question text or the image
context, AND requires further operations or transformations to be
performed on them to reach a solution.
- Involves more complex forms of mathematical reasoning, including
logical reasoning, abstract thought, and understanding of patterns.
Based on the definition above, a problem is classified as a negative
example (NOT involving mathematical reasoning) if it:
- Does not involve any numbers or quantity words, OR
- Involves only counting, reading, or recognizing numbers, OR
- Relies solely on factual information, such as recalling years and
dates.
We developed an annotation tool, as illustrated in Figure
1, to enable expert
annotators to label problems that involve mathematical reasoning.
Annotators were trained using detailed instructions, as shown in Table
[tab:instruction_is_math],
along with a variety of examples—positive ones that involve mathematical
reasoning and negative ones that do not. We provided three labeling
options:
- *Yes* - This indicates that the problem involves mathematical
reasoning.
- *No* - This indicates that the problem does not involve mathematical
reasoning.
- *Unsure* - This option should be selected if it is uncertain whether
the problem involves mathematical reasoning. (Annotators are advised
to use this option sparingly.)
They may leave comments if they find anything incorrect or offensive for
removal at a later stage.
In our study, we employed the Fleiss Kappa score to conduct an
inter-annotator agreement analysis among three annotators tasked with
labeling examples based on mathematical reasoning. The Fleiss Kappa
score is a statistical measure used to evaluate the reliability of
agreement between multiple raters, providing a quantifiable metric to
assess the consistency across different annotators. A score of 1
indicates perfect agreement, while a score of 0 suggests no agreement
beyond what would be expected by chance. Our analysis yielded a Fleiss
Kappa score of 0.775, indicating a substantial level of consistency
among the annotators. This high degree of agreement underscores the
reliability of our annotation process and affirms the quality of the
labeled data generated for our study.
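As a rough illustration of how such a score can be computed, the sketch
below uses the `statsmodels` implementation of Fleiss' kappa on a toy
label matrix; the labels shown are hypothetical and do not reproduce our
actual annotations.

``` python
# A minimal sketch of computing Fleiss' kappa with statsmodels.
# The label matrix is illustrative only (0 = No, 1 = Yes, 2 = Unsure);
# it does not reproduce the actual annotations.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Rows are examples, columns are the three annotators.
labels = np.array([
    [1, 1, 1],  # unanimous "Yes"
    [0, 0, 1],  # two "No", one "Yes"
    [1, 1, 2],  # one annotator unsure
    [0, 0, 0],  # unanimous "No"
])

# aggregate_raters converts per-rater labels into per-category counts per example.
counts, _ = aggregate_raters(labels)
print(fleiss_kappa(counts, method="fleiss"))
```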
## Annotating Three New Datasets [sec:annotate_new_data]
GUI for annotating our new source datasets.
## Human Labeling of Mathematical Reasoning [sec:human_math_reasoning]
GUI for labeling mathematical reasoning skills.
# More Dataset Analysis [app:data_analysis]
#### Question distribution.
Apart from English questions, contains 6.57% non-English questions,
including languages such as Chinese and Persian. The multilingual
feature necessitates that models be capable of understanding and
processing multiple languages to ensure accurate results across the
dataset. As illustrated in Table
[fig:source_dataset], the
average number of words in English questions within is 15.58, while the
maximum number of words in a question reaches 213.
Figure 1 further elucidates the
distribution of word counts, highlighting the diverse patterns of
questions. features two types of questions: multiple-choice questions
and free-form questions. For multiple-choice questions, the average
number of choices is 3.4, while the maximum number of choices is 8. In
the case of free-form questions, answers can be integers, floating-point
numbers, or lists, which can be converted into a standard format. The
standard settings in question and answer types facilitate consistent
accuracy evaluation for existing models.
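As an illustration of such a standard format, the sketch below
normalizes raw answers by answer type before exact-match scoring; the
precision handling and list parsing rules here are assumptions rather
than the exact procedure used.

``` python
# Illustrative sketch of normalizing free-form answers into a standard form
# (integer, float with fixed precision, or list) before exact-match scoring.
# The answer-type names and parsing rules are assumptions.
import ast

def normalize_answer(raw: str, answer_type: str) -> str:
    raw = raw.strip()
    if answer_type == "integer":
        return str(int(float(raw)))
    if answer_type == "float_1":
        return f"{float(raw):.1f}"
    if answer_type == "float_2":
        return f"{float(raw):.2f}"
    if answer_type == "list":
        return str(ast.literal_eval(raw))  # e.g. "[1, 2,3]" -> "[1, 2, 3]"
    return raw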
The distribution of the number of words per question in .
Questions with a length greater than 60 are categorized as 61 for
visualization simplicity.
#### Dataset category and task type.
Source datasets in can be categorized into two types: math-targeted VQA
datasets, which are originally proposed for assessing mathematical
reasoning, and general VQA datasets, which address visual reasoning in
everyday scenarios. The distribution proportions of these two categories
(55.4% vs. 44.6%, as illustrated in Figure
2) within enable a balanced
examination of mathematical reasoning in both domain-specific and
general-purpose applications. The distribution of the five tasks
contained within is visualized in Figure
3. The relatively balanced distribution of
these tasks enhances the benchmarking robustness that our dataset
provides.
Category distribution of problems within .
Task type distribution of problems within .
#### Grade level.
The datasets within are categorized into four distinct grade levels:
*elementary school*, *high school*, *college*, and *not applicable*,
each representing a different level of reasoning complexity and
contextual application. The *elementary school* category aligns with the
typical mathematical curriculum of elementary education, introducing
basic topics such as arithmetic operations and introductory geometry.
*High school* level questions delve into more complex mathematical
concepts such as algebra, geometry, and introductory calculus. The
*college* category encapsulates the highest level of complexity,
featuring questions on advanced mathematical and scientific concepts
like calculus, linear algebra, and physics. Questions without specific
grade levels are categorized as *not applicable*.
The distribution of questions across these grade levels is visualized in
Figure 4. This structured categorization
enriches the diversity of , providing a meaningful framework for
evaluating and benchmarking the mathematical and visual reasoning
capabilities of various models across different educational contexts,
thereby assessing their practical utility and educational relevance.
Distribution of questions across different grade levels
within .
#### Visual context.
The datasets within encompass over 10 different visual contexts (with
the distribution shown in Figure
5), crucial for evaluating models’
ability to interpret and reason across diverse visual information.
Common visual contexts include geometry diagrams, synthetic scenes, bar
charts, natural images, and scientific figures as illustrated in Figure
[fig:2] to Figure
[fig:13]. Less frequent, yet equally
important visual contexts such as medical images, word clouds, map
charts, radar charts, violin plots, and heatmap charts are depicted in
Figure [fig:14] and Figure
[fig:15]. These visual contexts, ranging
from common to specialized representations, challenge the models to
decode and reason with varying visual information, contributing to a
more robust and comprehensive evaluation. The diversity in visual
contexts enriches , enhancing benchmarking robustness and providing a
solid foundation for understanding the practical utility and
domain-specific performance of various models across applications.
Visual context distribution within .
#### Mathematical reasoning ability.
The datasets within encompass a spectrum of seven distinct mathematical
reasoning types, facilitating a thorough evaluation of models’
mathematical reasoning capabilities. Figure
6 illustrates the portion of
each reasoning type involved in the problems, with arithmetic being the
most frequent and logical reasoning being the least frequent. This
distribution reflects the varying degrees of mathematical reasoning
required across different problems. Figure
7 further delineates the
distribution of the number of reasoning types per problem, showing a mean of 1.45. The sparse
distribution observed aids in the precise analysis of each type’s
performance by the models, providing a nuanced understanding of their
strengths and weaknesses across different mathematical reasoning
domains. This structured representation of mathematical reasoning types
within not only enriches the dataset but also significantly contributes
to a more robust and comprehensive evaluation of models, aiding in the
identification of areas for improvement and the development of more
proficient mathematical reasoning models.
Portion of each mathematical reasoning type involved in the problems of .
Distribution of the number of mathematical reasoning types within .
# More Details on the Setup [app:setup]
## Frequent Guess [sec:frequent_guess]
We employ a strategy where the most frequent answers in the *testmini*
set are utilized as predictions for various question and answer types.
For multiple-choice questions, the most frequent option is selected
based on the number of available options. For instance, option $B$ is
chosen for questions with two options, aligning with the answer
distribution in *testmini*. Similarly, for questions requiring an answer
type of integer, a floating number with one decimal place, a floating
number with two decimal places, or a list, we use $2$, $1.2$, $0.21$,
and $[0, 2, 0, 2, 1, 7, 1, 2, 0, 3, 0, 6]$ respectively, in accordance
with the answer distribution observed in *testmini*.
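The following sketch illustrates this heuristic under simplified
assumptions; only the specific values quoted above are taken from
*testmini*, while the helper structure (field names and the placeholder
option letters for other choice counts) is hypothetical.

``` python
# A minimal sketch of the "frequent guess" baseline described above.
# Only the values quoted in this section are real; the data structures
# and placeholder option letters are hypothetical.

# Most frequent free-form answer per answer type, as observed in testmini.
FREQUENT_FREE_FORM = {
    "integer": "2",
    "float_1": "1.2",
    "float_2": "0.21",
    "list": "[0, 2, 0, 2, 1, 7, 1, 2, 0, 3, 0, 6]",
}

# Most frequent correct option letter, keyed by the number of choices.
# "B" for two-option questions comes from the text; other entries are placeholders.
FREQUENT_OPTION = {2: "B", 3: "B", 4: "B", 5: "B"}

def frequent_guess(question_type, answer_type=None, num_choices=None):
    """Return a prediction without looking at the image or question text."""
    if question_type == "multi_choice":
        return FREQUENT_OPTION[num_choices]
    return FREQUENT_FREE_FORM[answer_type]
```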
## Prompt for Answer Extraction [sec:promt_answer_extraction]
The prompt used to instruct GPT-4 for answer extraction is illustrated
in Table
1.
| **Element** | **Prompt** |
|:--:|:---|
| Task description | Please read the following example. Then extract the answer from the model response and type it at the end of the prompt. |
| Example 1 | |
| **Question:** Which number is missing? | |
| **Model response:** The number missing in the sequence is 14. | |
| **Extracted answer:** | |
| | |
| Example 2 | |
| **Question:** What is the fraction of females facing the camera? | |
| **Model response:** The fraction of females facing the camera is 0.6, which means that six out of ten females in the group are facing the camera. | |
| **Extracted answer:** | |
| Example 3 | |
| **Question:** How much money does Luca need to buy a sour apple candy and a butterscotch candy? (Unit: \$) | |
| **Model response:** Luca needs \$1.45 to buy a sour apple candy and a butterscotch candy. | |
| **Extracted answer:** | |
| | |
| Example 4 | |
| **Question:** Between which two years does the line graph saw its maximum peak? | |
| **Model response:** The line graph saw its maximum peak between 2007 and 2008. | |
| **Extracted answer:** | |
| Example 5 | |
| **Question:** What fraction of the shape is blue? | |
| **Choices:** (A) 3/11 (B) 8/11 (C) 6/11 (D) 3/5 | |
| **Model response:** The correct answer is (B) 8/11. | |
| **Extracted answer:** | |
Task description along with five examples used to prompt GPT-4 for
answer extraction.
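As a rough sketch, the snippet below shows how such a few-shot
extraction prompt could be assembled from the task description and
demonstrations above; the demonstration strings and the downstream LLM
call are illustrative placeholders rather than the exact prompt or API
we used.

``` python
# A minimal sketch of assembling the few-shot answer-extraction prompt.
# The demonstration strings and any LLM call are hypothetical placeholders.
TASK_DESCRIPTION = (
    "Please read the following example. Then extract the answer from the "
    "model response and type it at the end of the prompt."
)

EXAMPLES = [
    "Question: Which number is missing?\n"
    "Model response: The number missing in the sequence is 14.\n"
    "Extracted answer: 14",
    # ... the remaining four demonstrations from the table above ...
]

def build_extraction_prompt(question: str, model_response: str) -> str:
    demos = "\n\n".join(EXAMPLES)
    query = (
        f"Question: {question}\n"
        f"Model response: {model_response}\n"
        "Extracted answer:"
    )
    return f"{TASK_DESCRIPTION}\n\n{demos}\n\n{query}"

# The resulting prompt is then sent to GPT-4, e.g. answer = call_llm(prompt).
print(build_extraction_prompt(
    "Which number is missing?",
    "The number missing in the sequence is 14.",
))
```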
## Prompts for Response Generation [sec:promt_response_generation]
| **Question type** | **Answer type** | **Task instruction** |
|:--:|:--:|:---|
| Multiple-choice | Text | Please answer the question and provide the correct option letter, e.g., A, B, C, D, at the end. |
| Free-form | Integer | Please answer the question requiring an integer answer and provide the final value, e.g., 1, 2, 3, at the end. |
| Free-form | Float (1) | Please answer the question requiring a floating-point number with one decimal place and provide the final value, e.g., 1.2, 1.3, 1.4, at the end. |
| Free-form | Float (2) | Please answer the question requiring a floating-point number with two decimal places and provide the final value, e.g., 1.23, 1.34, 1.45, at the end. |
| Free-form | List | Please answer the question requiring a Python list as an answer and provide the final list, e.g., \[1, 2, 3\], \[1.2, 1.3, 1.4\], at the end. |
The task instructions for different question and answer types in answer
extraction. Here, Float (1) refers to a floating-point number with one
decimal place, and Float (2) refers to a floating-point number with two
decimal places.
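A minimal sketch of how these instructions might be attached to a query
at generation time is shown below; the dictionary keys and function name
are hypothetical.

``` python
# Illustrative sketch: select the task instruction by question and answer type
# (keys are hypothetical field names) and prepend it to the question text.
TASK_INSTRUCTIONS = {
    ("multi_choice", "text"): "Please answer the question and provide the correct option letter, e.g., A, B, C, D, at the end.",
    ("free_form", "integer"): "Please answer the question requiring an integer answer and provide the final value, e.g., 1, 2, 3, at the end.",
    ("free_form", "float_1"): "Please answer the question requiring a floating-point number with one decimal place and provide the final value, e.g., 1.2, 1.3, 1.4, at the end.",
    ("free_form", "float_2"): "Please answer the question requiring a floating-point number with two decimal places and provide the final value, e.g., 1.23, 1.34, 1.45, at the end.",
    ("free_form", "list"): "Please answer the question requiring a Python list as an answer and provide the final list, e.g., [1, 2, 3], [1.2, 1.3, 1.4], at the end.",
}

def build_query(question: str, question_type: str, answer_type: str) -> str:
    return f"{TASK_INSTRUCTIONS[(question_type, answer_type)]}\n{question}"
```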
## Prompt for Caption Generation
We instruct Multimodal Bard to generate a detailed description for an
input image, aiming to augment current LLMs with visual understanding
capabilities. The prompt is shown in
Table [tab:prompt_bard_caption].
Describe the fine-grained content of the image or figure, including
scenes, objects, relationships, and any text present.
## Model Hyperparameters
The hyperparameters for the experiments in
§[sec:experimental_setup] are
set to their default values unless specified otherwise. Table
3 and Table
4 detail specific
generation parameters for the various large language models (LLMs) and
large multimodal models (LMMs) we evaluated, respectively.
| **Model** | **Generation Setup** |
|:----------|:------------------------------------------------------------|
| Claude-2 | model = `claude-2`, temperature = 0, max_tokens = 1024 |
| ChatGPT | model = `gpt-3.5-turbo`, temperature = 0, max_tokens = 1024 |
| GPT-4 | model = `gpt-4-0613`, temperature = 0, max_tokens = 1024 |
Generating parameters for various LLMs.
| **Model** | **Generation Setup** |
|:-----------------|:------------------------------------------------|
| IDEFICS-9B-Instruct | max_new_tokens = 256, temperature = 1.0 |
| mPLUG-Owl-LLaMA-7B | do_sample = True, top-k = 5, max_length = 512 |
| miniGPT4-LLaMA-2-7B | num_beams = 1, temperature = 1.0, max_new_tokens = 300, max_length = 1000 |
| | max_gen_len = 256, temperature = 0.1, top_p= 0.75 |
| LLaVAR | do_sample = True, temperature = 0.2, max_new_tokens = 1024 |
| InstructBLIP-Vicuna-7B | do_sample = False, num_beams = 5, max_length = 256, min_length = 1, top_p = 0.9, repetition_penalty = 1.0, temperature = 1 |
| | do_sample = True, temperature = 0.2, max_new_tokens = 1024 |
| | Chatbot URL: , evaluation dates range from Sep 8, 2023 to Sep 10, 2023 |
| | Chatbot URL: , evaluation dates range from Oct 7, 2023 to Oct 15, 2023 |
Generating parameters for various LMMs.
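For the open-source LMMs, these settings correspond to standard Hugging
Face generation arguments. The sketch below shows, under that
assumption, how a row such as the mPLUG-Owl one (do_sample = True,
top-k = 5, max_length = 512) would be passed to `generate()`; the
checkpoint name is a placeholder, not the exact model used.

``` python
# A minimal sketch mapping a table row to Hugging Face generation arguments.
# The checkpoint name is a placeholder for illustration only.
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained("gpt2")

# e.g., the mPLUG-Owl-LLaMA-7B row: do_sample = True, top-k = 5, max_length = 512
gen_config = GenerationConfig(do_sample=True, top_k=5, max_length=512)

inputs = tokenizer("Please answer the question ...", return_tensors="pt")
outputs = model.generate(**inputs, generation_config=gen_config)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```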
## Human Performance [sec:human_performance]
We conducted a study to evaluate human performance on the *testmini*
subset of the , utilizing Amazon Mechanical Turk (AMT). Each question
from the *testmini* subset was assigned to five annotators, all of whom
have a history of completing more than 5,000 HIT tasks and boast an
acceptance score higher than 0.99, to ensure the quality of the results.
The study comprised five test questions and two qualification questions,
which were to be answered within a 20-minute timeframe. The
qualification questions consisted of elementary math word problems
requiring basic arithmetic operations (e.g., addition and subtraction).
Only annotators who successfully answered the qualification questions
were deemed eligible for the study, and their responses were included in
the final analysis. Additionally, annotators were requested to provide
information regarding their highest level of educational attainment. We
retained the results exclusively from annotators who had achieved a high
school diploma or higher, as 30.9% of the problems in are of high-school
level difficulty and 10.8% correspond to college-level curricula.
## Multimodal Bard Assessment Task [sec:human_study_bard]
A screenshot of our AMT worker interface, utilized for the Multimodal
Bard assessment task, is provided in Figure
1. The workers were compensated at
a rate of \$18 per hour.
Screenshot of the Multimodal Bard assessment task
interface.
# More Experimental Results
## Results on the Test Set
Table [tab:mathvista_test_result]
reports the accuracy scores of two heuristic baselines, two leading
augmented LLMs (CoT GPT-4, PoT GPT-4), and one leading LMM
(LLaVA-LLaMA-2-13B) on the *test* subset. The minor differences between
scores on the *test* subset and the *testmini* subset, as shown in Table
[tab:mathvista], suggest that
*testmini* effectively mirrors the *test* subset, serving as a valuable
evaluation subset for model development, especially for those who have
limited computing resources.
## Scores for Math Reasoning Types [sec:scores_math_reasoning]
## Scores for Various Visual Contexts [sec:scores_visual_contexts]
Figure 1 illustrates the
accuracy scores of leading baselines on across a diverse range of visual
contexts. Other foundation models trail behind humans in visual
perception and reasoning across most visual context categories.
Multimodal Bard demonstrates comparable performance to humans in
questions with a visual context of geometry diagrams, showcasing its
promising capabilities in recognizing geometric shapes and
relationships. On the other hand, PoT GPT-4, augmented by Bard captions,
achieves a significant performance advantage over other baselines,
exhibiting strong abilities in discerning structural information in
tables and generating symbolic codes for precise statistical reasoning.
Accuracy scores of leading baselines across various visual
contexts in .
## Scores Across Different Grade Levels [sec:grade_level_bar_chart]
Figure 2 displays the average
accuracy scores across different grade levels (*elementary school*,
*high school*, and *college*) for the leading foundation models. Humans
exhibit the highest performance on questions at the elementary school
level, while they fare the worst on college-level questions within .
Foundation model baselines exhibit varying performance behaviors: they
achieve better accuracy scores on high school level questions compared
to the other two categories.
Average accuracy scores across different grade levels for
.
## Ablation Study for LLMs [sec:llm_ablation_study]
Table 3 presents an ablation study
conducted on LLMs, examining their performance under varying visual
information inputs.
Average accuracy scores of LLM baselines under various
visual inputs.
## LLMs with Different Shots [sec:llm_fewshot]
## LMMs with Different Shots [sec:lmm_fewshot]
## Hallucinations in Model Explanations [sec:hallucination_in_explanation]
**Question:** What is the highest amount this glass measures?
**Response from Bard:**
.
The markings on the side of the beaker are in milliliters (ml). .
The other markings on the beaker are 300 ml, 200 ml, 100 ml, and . .
**Wrong output:**
**Ground truth:**
**Question:** As shown in the figure, if $CB = 4.0$, $DB = 7.0$, and $D$
is the midpoint of $AC$, then the length of $AC$ is ().
**Choices:** (A) $6$cm (B) $7$cm (C) $4$cm (D) $5$cm
**Response from Bard:**
Since $D$ is the midpoint of $AC$, then $AD = DC = (AC/2)$.
We know that $CB = 4.0$ and $DB = 7.0$, and that $AD = DC$. So we can
write the equation: $$\red{AD + DC = 4.0 + 7.0}$$ $$2AD = 11.0$$
$$AD = 5.5$$ $$AC = 2AD = 11.0 \text{cm}$$
Therefore, the length of $AC$ is $11.0$ cm, which is answer choice
(D).
**Wrong output:**
**Ground truth:**
## More Examples for Multimodal Bard [sec:more_examples_bard]
**Question:** As shown in the figure: In Rt$\triangle$ABC, $\angle$C =
90.0, AC = 8.0, AB = 10.0, then the value of $\text{sin} B$ is equal to
().
**Choices:** (A) $\frac{3}{5}$ (B) $\frac{4}{5}$ (C) $\frac{3}{4}$ (D)
$\frac{4}{3}$
**Response from Bard:**
In a right triangle, the sine of an angle is equal to the ratio of the
opposite side to the hypotenuse. In this case, the opposite side is
$AC = 8$ and the hypotenuse is $AB = 10$. Therefore, $\text{sin} B$ =
$8/10$ = .
Here is the detailed solution:
Given:
\* $\triangle ABC$ is a right triangle with right angle at $C$
\* $AC = 8$
\* $AB = 10$
To find: $\text{sin} B$
$$\text{sin} B = \text{opposite}/\text{hypotenuse} = AC/AB = 8/10 = \red{3/5}$$
**Wrong output:**
**Ground truth:**
**Question:** What is the age gap between these two people in image?
(unit: years)
**Response from Bard:**
**Wrong output:**
**Ground truth:**
**Question:** What is the accuracy of the algorithm with lowest
accuracy?
**Response from Bard:**
The other algorithms are *desert*, *pine*, and *globe*.
**Wrong output:**
**Ground truth:**
**Question:** What is the age gap between these two people in image?
(unit: years)
**Response from Bard:**
The dissatisfied people are 78%, the satisfied people (20%).
**Wrong output:**
**Ground truth:**
**Question:** Is the number of cars in front of the tiny metal thing
less than the number of large matte things in front of the cyan rubber
road bike?
**Choice:** (A) Yes (B) No
**Response from Bard:**
The answer is (B), No.
There are no cars in front of the tiny metal thing (the motorcycle).
There are two large matte things in front of the cyan rubber road bike
(the van and the bicycle).
So the number of large matte things in front of the cyan rubber road
bike is greater than the number of cars in front of the tiny metal
thing.
**Wrong output:**
**Ground truth:**
**Question:** What is the missing number in the picture?
**Choice:** (A) 6 (B) 8 (C) 10 (D) 11
**Response from Bard:**
.
2
10
(8)
/12
**Correct output:**
**Question:** The magnitude of the acceleration vector a is
$10 \mathrm{~cm} / \mathrm{s}^2$. Use the figure to estimate the normal
components of $\mathbf{a}$.
**Response from Bard:**
The normal component of the acceleration is given by
$$a_n = a \cos \theta = 10 \text{ cm}/s^2 \cdot \cos 30^\circ = \dfrac{10 \sqrt{3}}{2} \text{ cm}/s^2$$
The normal component of the acceleration is approximately
$7.07 \text{ cm}/s^2$.
Here is a breakdown of the steps involved in finding the normal
component of the acceleration:
1. Find the angle between the acceleration vector and the radius
vector.
2. Use the cosine function to find the cosine of the angle.
3. Multiply the magnitude of the acceleration vector by the cosine of
the angle to find the normal component of the acceleration.
**Wrong output:**
**Ground truth:**
**Question:** The derivative of $y$ at $x=6$ is
------------------------------------------------------------------------
that at $x=8$
**Choices:** (A) larger than (B) equal to (C) smaller than
**Response from Bard:**
The graph of the function $y=\log_2 x$ is a concave up curve. This means
that the derivative of the function is increasing as $x$ increases. So,
.
Here is a table of the values of the derivative of $y$ at different
values of $x$:
$x \mid dy/dx$
–$\mid$–
As you can see, . .
**Wrong output:**
**Ground truth:**
**Question:** The derivative of $f(x)$ at $x=2$ is
------------------------------------------------------------------------
that at $x=5$
**Choices:** (A) larger than (B) equal to (C) smaller than
**Response from Bard:**
The derivative of a function at a point gives the slope of the line
tangent to the function’s graph at that point. The steeper the line, the
larger the derivative.
In mathematical terms, the derivative of $f(x)$ is given by the
expression . Plugging in $x=2$ and $x=5$, . Therefore, .
**Wrong output:**
**Ground truth:**
## Comparisons of Different Models [sec:model_comparison]
**Multimodal Bard:**
The sequence progresses by adding consecutive odd numbers to each
number. The first number is 4, and we add 1 to get 5. , and so on. The
missing number is therefore the next number after 32, .
Here is the complete sequence: $$4 + 1 = 5$$ $$5 + \red{3 = 7}$$
$$7 + \red{5 = 10}$$ $$10 + \red{7 = 14}$$ $$14 + \red{9 = 19}$$
$$19 + \red{11 = 25}$$ $$25 + \red{13 = 32}$$ $$32 + \red{15 = 28}$$
**GPT-4V:**
4, 5, 7, 10, 14, 19, 25, 32, ?, 52
Answer: The missing number is 40.
This sequence follows a pattern where the difference between each number
increases by one. For example, the difference between 4 and 5 is 1, the
difference between 5 and 7 is 2, the difference between 7 and 10 is 3,
and so on. .
**Correct output:**
**Augmented Caption from Bard:**
The image shows a set of 8 numbers in circles on a white background. The
numbers are 4, 5, 7, 10, 14, 19, 25, and 32. There is a question mark in
the last circle. The numbers are arranged in increasing order, .
Therefore, .
The image is a simple math problem, and it is likely that it is being
used as part of an IQ test or other kind of intelligence test. The image
is also consistent with the webpages that you mentioned, which are all
related to IQ tests.
**Augmented OCR Text from EasyOCR:**
(\[300, 80\], 5),
(\[83, 281\], 10), (\[285, 281\], 14), (\[491, 279\], 19),
(\[81, 485\], 25), (\[285, 485\], 32),
**CoT GPT-4:**
The pattern in the numbers is that . For example, the second number (5)
is 1 more than the first number (4) . The third number (7) is .
Following this pattern, . $$32 + \red{9 + 1 = 42}$$ So, .
**Question:** What fraction of the shape is blue?
**Choices:** (A) 3/11 (B) 8/11 (C) 6/11 (D) 3/5
**Ground truth:**
**IDEFICS:**
**mPLUG-Owl:**
**miniGPT4:**
**LLaMA-Adapter-V2:**
**LLaVAR:**
**InstructBLIP:**
**LLaVA:**
**Multimodal Bard:**
The shape is divided into 11 equal parts. . Therefore, .
The other choices are incorrect. Option (A), . Option (B), 8/11, is the
fraction of the shape that is not blue. Option (D), 3/5, is not a
fraction of the shape.
**GPT-4V:**
**Question:** What fraction of the shape is blue?
**Choices:** (A) 3/11 (B) 8/11 (C) 6/11 (D) 3/5
**Ground truth:**
**Augmented Caption from Bard:**
The image shows a blue circle with a white stripe in the middle. . The
colors are blue, . There is no text present in the image.
The image could be used to represent a variety of concepts, such as
fractions, colors, or the four elements. It could also be used as a
decorative element or a symbol.
**CoT GPT-4:**
**PoT GPT-4:**
``` python
def fraction_of_blue(description, choices):
    parts = {
        'blue': 1,
        'green': 1,
        'yellow': 1,
        'red': 1
    }
    total_parts = sum(parts.values())
    blue_fraction = parts['blue'] / total_parts

    for choice in choices:
        fraction = choice.split('/')
        if int(fraction[0]) / int(fraction[1]) == blue_fraction:
            return choice

    return ""

choices = ['3/11', '8/11', '6/11', '3/5']
description = "The image shows a blue circle with a white stripe in the middle. The circle is divided into four parts, each with a different color. The colors are blue, green, yellow, and red. There is no text present in the image."

print(fraction_of_blue(description, choices))
```
**Output:**
**Question:** If you add the two visible numbers, on the jerseys, what
is the total sum?
**Ground truth:**
**Solution:** 2 + 1 = 3
**IDEFICS:**
**mPLUG-Owl:**
**miniGPT4:**
**LLaMA-Adapter-V2:**
**LLaVAR:**
**InstructBLIP:**
**LLaVA:** The two visible numbers on the jerseys are 2 and 1. .
**Correct output:**
**Multimodal Bard:**
**GPT-4V:**
**Augmented OCR Text from EasyOCR:**
(\[110, 154\], 21)
**CoT GPT-4:**
The image only shows one visible number on the jerseys, which is 21.
**Question:** How much money does Luca need to buy a sour apple candy
and a butterscotch candy? (Unit: \$)
**Ground truth:**
**Solution:** Add the price of a sour apple candy and the price of a
butterscotch candy: \$0.06 + \$0.07 = \$0.13. Luca needs \$0.13.
**IDEFICS:** .
**mPLUG-Owl:**
**miniGPT4:** The answer to the question is:
Explanation:
\* The price of the sour apple candy is .
\* The price of the butterscotch candy is .
\* Therefore, to buy both candies.
**LLaMA-Adapter-V2:** .
**LLaVAR:** .
**InstructBLIP:**
**LLaVA:** .
**Multimodal Bard:**
The price of a sour apple candy is \$0.06 and the price of a
butterscotch candy is \$0.07. So, .
In two decimal places, .
**Correct output:**
**GPT-4V:**
**Correct output:**
**Question:** How much money does Luca need to buy a sour apple candy
and a butterscotch candy? (Unit: \$)
**Ground truth:**
**Solution:** Add the price of a sour apple candy and the price of a
butterscotch candy: \$0.06 + \$0.07 = \$0.13. Luca needs \$0.13.
**Augmented Caption from Bard:**
The image shows a table with the prices of different types of candy. The
table is on a pink background. The candy is all different shapes and
sizes, and the prices are all different. The candy is all different
colors, and the prices are all different.
The following text is present in the image:
\* Sour apple candy \$0.06
\* Piece of gum \$0.07
\* Gummy worm \$0.09
\* Lemon drop \$0.05
\* Piece of licorice \$0.07
\* Butterscotch candy \$0.07
**Augmented OCR Text from EasyOCR:**
(\[9, 7\], sour apple candy), (\[195, 5\], \$0.06),
(\[7, 39\], piece of gum), (\[195, 37\], \$0.07),
(\[9, 73\], gummy worm), (\[195, 69\], \$0.09),
(\[7, 105\], lemon drop), (\[195, 101\], \$0.05),
(\[6, 132\], piece of licorice), (\[195, 133\], \$0.07),
(\[9, 167\], butterscotch candy), (\[195, 165\], \$0.07)
**CoT GPT-4:**
Find the total cost of a sour apple candy and a butterscotch candy.
\$0.06 + \$0.07 = \$0.13
Luca needs \$0.13.
**Correct output:**
# A Comparative Study of GPT-4V, Bard, and Other Models [app:gpt4v_study]
GPT-4 with vision (GPT-4V) is the multimodal version of GPT-4 that is
instructed to understand multiple modalities such as texts and images.
Due to its remarkable improvements over other AI models
(§[sec:results] and
§[sec:fine_grained_results]),
we have conducted a comprehensive evaluation to understand its
capabilities, strengths, and areas for improvement. Our findings not
only validate GPT-4V’s various problem-solving skills but also shed
light on developing general-purpose multimodal AI agents.
## GPT-4V Playground for Manual Evaluation
Given that GPT-4V does not offer API access, we have performed manual
evaluations using the playground platform[^1]. For a fair comparison, we
used the same input queries as those for all the other LMMs and recorded
the responses in a single round of chat without additional feedback
(Figure 1).
The GPT-4V playground for our manual
evaluation.
## Leaderboard Scores [sec:gpt4v_leaderboard]
The leaderboard in Figure
2 highlights GPT-4V’s
substantial advancements over the current LLM and LMM baselines.
Notably, there is a 15.1% improvement over the second-best performing
Multimodal Bard model. However, a significant gap of 10.4% still exists
between GPT-4V and human performance, indicating plenty of room for
further improvement by developing new LMMs and tool-augmented LLMs.
[^1]:
## Abilities in Mathematical Reasoning [sec:gpt4v_math]
This section compares the mathematical reasoning ability of GPT-4V with
that of other LMMs on , including LLaMA-Adapter-V2-7B (LLaMA-Adapter-V2
for simplification), LLaVA-LLaMA-2-13B (LLaVA for simplification), and
Multimodal Bard.
### Algebraic Reasoning
Algebraic reasoning problems on require understanding the function plots
from figures and inferring their properties. As shown in Figure
[fig:tease_scores], GPT-4V
demonstrates outstanding capabilities in algebraic reasoning, surpassing
all competing models and even humans. For instance, GPT-4V accurately
identifies the function plot by its equation and subsequently infers its
correct properties (Figure
[fig:example_926]). However, both
GPT-4V and the other LLMs face challenges in comprehending
low-resolution figures (Figure
[fig:example_346]) and those that
depict multiple functions (Figure
[fig:example_963]).
**ID:** 926
**Question:** What is the size of the shaded area under the curve? Round
the answer to 2 decimal places.
**Ground truth:**
**LLaMA-Adapter-V2:** .
**LLaVA:**
**Multimodal Bard:**
The given curve is a circle with center at $(3, 0)$ and radius 3. The
shaded area is a quarter of the circle’s area. $$\begin{aligned}
\text{Area of the circle} &= \pi r^2 \\
&= 3.14 \times 3^2 \\
&= 28.26.\\
\text{Area of the shaded region} &= \frac{28.26}{4} \\
&= 7.065.
\end{aligned}$$ Rounding to two decimal places, the area is $7.07$.
**Correct output:**
**GPT-4V:**
To find the size of the shaded area under the curve
$f(x) = \sqrt{9 - (x - 3)^2}$, we need to integrate the function over
the given interval.
The given function represents the upper half of a circle with a radius
of 3 and centered at the point (3,0). The shaded region corresponds to
the area of a quarter circle with a radius of 3. The formula for the
area of a circle is $\pi r^2$. For a quarter of the circle, the area is
$\frac{1}{4} \pi r^2$.
Given: $r = 3$ $$\begin{aligned}
\text{Area} &= \frac{1}{4} \pi (3^2) \\
\text{Area} &= \frac{1}{4} \pi (9) \\
\text{Area} &= \frac{9}{4} \pi \\
\text{Area} &= 7.07 \text{~(rounded to 2 decimal places)}
\end{aligned}$$ So, the size of the shaded area under the curve is
approximately $7.07$ square units.
**Correct output:**
**ID:** 346
**Question:** What is the $y$-intercept of this function?
**Ground truth:**
**LLaMA-Adapter-V2:** .
**LLaVA:** .
**Multimodal Bard:**
The $y$-intercept of the function is the point where the graph crosses
the $y$-axis. Since the $y$-axis is also the line $x=0$, the
$x$-coordinate of the $y$-intercept will always be 0. In the image, .
Therefore, .
Here are the steps on how to find the $y$-intercept of a function:
1. Set the $x$-coordinate to 0.
2. Solve for the $y$-coordinate.
In this case, we set $x=0$ and get $y=3$. Therefore, .
**GPT-4V:**
The $y$-intercept of a function is the point where the graph of the
function crosses the $y$-axis.
From the provided graph, .
Thus, .
**ID:** 963
**Question:** For what $x$ does $f$ reach its local maximum?
**Ground truth:**
**LLaMA-Adapter-V2:** .
**LLaVA:** .
**Multimodal Bard:**
The local maximum of the function is the point where the graph of the
function changes from increasing to decreasing. In the given graph, this
point is at $x = 3$. So the answer is **3**.
Here is a more detailed explanation:
- A local maximum point of a function is a point on the graph of the
function where the function value is higher than all the function
values in a small neighborhood of the point.
- . However, for values of $x$ greater than 3, the function value
starts to decrease. .
- Therefore, the local maximum point of the function is at $x = 3$.
**Correct output:**
**GPT-4V:**
The function $f(x)$ reaches its local maximum where its first
derivative, $f'(x)$, changes from positive to negative.
Looking at the graph, .
So, .
### Arithmetic Reasoning
Arithmetic reasoning problems in require accurate fundamental operations
in conjunction with understanding diverse visual contexts. As
illustrated in Figure
[fig:tease_scores], GPT-4V
exhibits a significant improvement in arithmetic reasoning compared to
existing models. For instance, some LLMs struggle with basic arithmetic
tasks, such as determining the difference between two values in a bar
chart (Figure [fig:example_74]) or computing the
probability based on simple statistical data (Figure
[fig:example_718]).
**ID:** 74
**Question:** What is the difference of largest and smallest bar?
**Ground truth:**
**LLaMA-Adapter-V2:** .
**LLaVA:** .
**Multimodal Bard:**
The largest bar in the image is Iceland with 100%, and the smallest bar
is Cambodia with 52.4%. So the difference between the largest and
smallest bar is 100 - 52.4 = 47.6%.
However, the question asks for the answer in one decimal place. .
Therefore, .
**GPT-4V:**
From the graph:
Largest bar (Iceland) = 100%
Smallest bar (Cambodia) = 52.4%
Difference = 100% - 52.4% = 47.6%
**Correct output:**
**ID:** 718
**Question:** Jeffrey is the proud owner of an eclectic bow tie
collection. He keeps track of how many bow ties he has, and organizes
them by pattern and material. What is the probability that a randomly
selected bow tie is designed with swirls and is made of velvet? Simplify
any fractions.
**Ground truth:**
**LLaMA-Adapter-V2:**
.
**LLaVA:**
To determine the probability of a randomly selected bow tie being
designed with swirls and made of velvet, we need to divide the total
number of bow ties with the specific characteristics (swirls and velvet)
by the total number of bow ties in Jeffrey’s collection.
In the image, .
So, the probability of a randomly selected bow tie being designed with
swirls and made of velvet is:
**Multimodal Bard:**
The probability that a randomly selected bow tie is designed with swirls
and is made of velvet is:
$$(\text{Number of swirls and velvet bow ties}) / (\text{Total number of bow ties})$$
.
We can find the total number of bow ties by adding the values in all
three columns. This gives us $5 + 3 + 1 + 5 = 14$ bow ties.
Therefore, the probability that a randomly selected bow tie is designed
with swirls and is made of velvet is:
$$\red{(8) / (14) = 0.57 = 57.14\%}$$
To two decimal places, .
**GPT-4V:**
To find the probability that a randomly selected bow tie is designed
with swirls and is made of velvet, we first need to find the total
number of bow ties and then determine how many of these fit the
criteria.
Total number of bow ties: $$\text{Zigzags (Velvet)} = 5$$
$$\text{Zigzags (Cotton)} = 1$$ $$\text{Swirls (Velvet)} = 3$$
$$\text{Swirls (Cotton)} = 5$$ $$\text{Total} = 5 + 1 + 3 + 5 = 14$$
Number of bow ties that are designed with swirls and are made of velvet
= 3
Probability = Number of desired outcomes / Total number of possible
outcomes Probability = $3/14$
Now, converting the fraction to a decimal (rounded to two decimal
places): $3 ÷ 14 = 0.21$
**Correct output:**
### Geometry Reasoning
In geometry reasoning, the performance of GPT-4V is comparable to that
of humans on , as demonstrated in Figure
[fig:tease_scores]. Figure
[fig:example_606] and Figure
[fig:example_640], respectively,
present two geometry reasoning problems: one at an elementary level and
the other at a college level. For both problems, GPT-4V produces the
correct answers accompanied by detailed explanations.
**ID:** 606
**Question:** What is the perimeter of the rectangle?
**Ground truth:**
**LLaMA-Adapter-V2:** The perimeter of the rectangle is .
**Correct output:**
**LLaVA:** The perimeter of the rectangle is 10 cm.
**Correct output:**
**Multimodal Bard:**
I see the image. The rectangle has a width of 3 cm and a length of 2 cm.
.
**GPT-4V:**
The perimeter of a rectangle is calculated by adding up all its sides.
Using the provided dimensions:
Perimeter = 2 × (length + width)
Perimeter = 2 × (3 cm + 2 cm)
Perimeter = 2 × 5 cm
Perimeter = 10 cm
The perimeter of the rectangle is 10 cm.
**Correct output:**
**LLaVA:**
Given that $\triangle \text{IWN}$ is equilateral, we can use the cosine
rule to find the length of the side opposite .
where $a$ and $b$ are the lengths of the adjacent sides of the
equilateral triangle. $$a = 5$$ $$b = 5$$
Substituting the values, we get:
$$\red{\cos \angle \text{WN} = \frac{5}{5} = 1}$$
So, the length of the side opposite .
Therefore,
**Multimodal Bard:**
The correct answer is (B).
Since $\triangle IWN$ is equilateral,
$\widehat{WIN} = \widehat{WNI} = 60^\circ$. Therefore,
$\red{\widehat{WN} = \pi - 2 \cdot 60^\circ} = \boxed{\frac{5}{3}\pi}$.
**Correct output:**
**GPT-4V:**
For an equilateral triangle, each interior angle is $60^{\circ}$.
The arc corresponding to a central angle in a circle is given by:
$\text{Arc length} = \text{central angle in radians} \times \text{radius}$
Given that $WI = IN = NW = 5$ (since $\triangle IWN$ is equilateral),
the radius of the circle is 5.
Using the formula above:
$\widehat{WN} = 60^{\circ} \times \frac{\pi}{180^{\circ}} \times 5 = \frac{\pi}{3} \times 5 = \frac{5\pi}{3}$
The correct option is: (B) $\frac{5}{3} \pi$.
**Correct output:**
### Logical Reasoning
Logical reasoning problems represent a different type of question in .
Solving these problems requires abstract thinking to deduce the
underlying patterns of numbers or shapes from figures. Current
foundation models struggle to effectively tackle logical reasoning
problems: GPT-4V achieves only 21.6% accuracy in logical reasoning,
which is a modest improvement of 8.1% over random chance, as shown in
Table [tab:mathvista]. The challenges that
logical reasoning problems present to current LMMs are further
highlighted in Figures
[fig:example_98],
[fig:example_778], and
[fig:example_750].
**ID:** 98
**Question:** Find the value of the square in the figure.
**Ground truth:**
**ID:** 778
**Question:** Can you find the missing shape in this picture puzzle?
**Choices:** (A) A (B) B (C) C (D) D
**Ground truth:**
**LLaMA-Adapter-V2:** .
**LLaVA:** .
**Multimodal Bard:**
The missing shape in the picture puzzle is **(D)**.
The shapes in the puzzle are all regular polygons, and . , the second
shape is a hexagon, , and . .
The answer choice A is a pentagon, . . . .
Therefore, .
**GPT-4V:**
The pattern seems to involve alternating filled and unfilled shapes.
Following the pattern:
- The first shape is a filled .
- The second shape is an unfilled hexagon.
Thus, the missing shape should be a filled .
.
**ID:** 750
**Question:** Which number is missing?
**Ground truth:**
**Solution:** $5 \times 2 = 10$, $6 \times 2 = 12$, $9 \times 2 = 18$
**LLaMA-Adapter-V2:** .
**LLaVA:** .
**Multimodal Bard:**
The missing number in the puzzle is **7**.
The numbers in the puzzle are arranged in a circle, and . For example,
.
**GPT-4V:**
From the given image, : $$\red{12 - 10 = 2}$$ $$\red{9 - 5 = 4}$$
$$\red{6 - ? = 2}$$
.
.
### Numeric Commonsense Reasoning
Problems involving numeric commonsense reasoning on MathVista require commonsense
knowledge about daily objects and celebrities to answer visual
questions. However, these problems present significant challenges to
existing foundation models, including GPT-4V, as depicted in Figure
[fig:tease_scores]. For instance,
Multimodal Bard struggles to understand the optical illusion in an image
(Figure [fig:example_693]) and to infer the
age gap between two celebrities from another image (Figure
[fig:example_29]). Figure
[fig:example_19] poses a question
about the maximum volume a beaker can measure. However, GPT-4V lacks
commonsense knowledge regarding the use of a beaker, resulting in an
incorrect prediction.
**ID:** 693
**Question:** Is the orange larger than the car?
**Choices:** (A) Yes (B) No
**Ground truth:**
**LLaMA-Adapter-V2:** No, the orange is not larger than the car.
**Correct output:**
**LLaVA:** .
**Multimodal Bard:**
The answer is (B), No. The orange is not larger than the car. The car is
clearly much larger than the orange. Even if it is a large orange, it is
still much smaller than the car. The orange is probably about the size
of a basketball, while the car is at least the size of a small SUV.
.
**Correct output:**
**GPT-4V:**
The orange is not larger than the car; .
**Correct output:**
**ID:** 29
**Question:** What is the age gap between these two people in image?
(unit: years)
**Ground truth:**
**LLaMA-Adapter-V2:** .
**LLaVA:** .
**Multimodal Bard:**
Based on the image you sent, .
The king is Richard III of England, who was born in 1452. The queen is
Anne Neville, who was born in 1456. So, .
The stained glass window is likely depicting the couple after their
marriage, so .
**GPT-4V:**
The two figures depicted in the stained glass are King Richard III and
Anne Neville. Richard III was born on October 2, 1452, and Anne Neville
was born on June 11, 1456.
Calculating the age gap: $$1456 - 1452 = 4$$
So, the age gap between Richard III and Anne Neville is 4 years.
**Correct output:**
**ID:** 19
**Question:** What is the highest amount this glass measures?
**Ground truth:**
**LLaMA-Adapter-V2:** The highest amount this glass measures is 400 ml.
**Correct output:**
**LLaVA:** .
**Multimodal Bard:**
.
The markings on the side of the beaker are in milliliters (ml). .
. The negative markings indicate the amount of liquid that is below the
zero mark.
**GPT-4V:**
.
### Scientific Reasoning
Scientific reasoning represents a distinct mathematical reasoning
ability within MathVista. To tackle problems in this area, a model must not
only accurately interpret domain-specific information from figures, but
also possess the necessary in-domain knowledge to reason rigorously on
scientific topics. Figure
[fig:tease_scores] shows that
GPT-4V substantially outperforms the other foundation models. This
superiority is further illustrated by the examples in Figures
[fig:example_807] and
[fig:example_278]. However, the
failure of GPT-4V, as shown in Figure
[fig:example_725], indicates that
there is considerable room for improvement.
**ID:** 807
**Question:** Which of the following leaf shapes would have the least
amount of wind resistance and water loss?
**Choices:**
(A) Truncate (B) Acuminate
(C) Rounded (D) Sagittate
**Ground truth:**
**LLaMA-Adapter-V2:** .
**LLaVA:** .
**Multimodal Bard:**
. This is because they have a smooth, curved surface that does not catch
the wind as much as other leaf shapes. Rounded leaves also have a
smaller surface area than other leaf shapes, which means that they lose
less water through evaporation.
. Truncate leaves have a flat end, which can catch the wind. . Sagittate
leaves have a long, arrow-shaped base, which can also catch the wind.
**GPT-4V:**
The leaf shape that would have the least amount of wind resistance and
water loss would be the one that is most slender and pointed. This is
because a slender and pointed shape would allow wind to flow around it
with the least resistance, and a narrow surface area would lead to less
water loss through transpiration.
Among the options, the “acuminate” leaf shape is the most slender and
pointed.
**Correct output:**
**ID:** 278
**Question:** Among the following objects, which one has the best PSNR
score?
**Choices:** (A) Lego (B) Mats (C) Mic (D) Ship
**Ground truth:**
**LLaMA-Adapter-V2:**
.
**LLaVA:** .
**Multimodal Bard:**
The answer is **(D) Ship**.
. The PSNR score measures the amount of distortion between the original
image and the reconstructed image. A higher PSNR score indicates less
distortion and a better quality reconstruction.
The other objects have the following PSNR scores:
- Lego: 29.0 dB
- Mats: 27.0 dB
- Mic: 30.7 dB
.
**GPT-4V:**
Based on the table for PSNR (dB) scores:
Lego:
DMTet: 29.0
FLeXiCubes: 28.8
Mats:
DMTet: 27.0
FLeXiCubes: 26.7
Mic:
DMTet: 30.7
FLeXiCubes: 30.8
Ship:
DMTet: 26.0
FLeXiCubes: 25.9
Among the choices, “Mic” has the highest PSNR scores with both DMTet and
FLeXiCubes, so the answer is (C) Mic.
**Correct output:**
**ID:** 725
**Question:** In Figure, suppose that Barbara’s velocity relative to
Alex is a constant $v_{B A}=52 \mathrm{~km} / \mathrm{h}$ and car $P$ is
moving in the negative direction of the $x$ axis. If Alex measures a
constant $v_{P A}=-78 \mathrm{~km} / \mathrm{h}$ for car $P$, what
velocity $v_{P B}$ will Barbara measure?
**Ground truth:**
**LLaMA-Adapter-V2:** .
**LLaVA:**
.
**Multimodal Bard:**
The velocity of car $P$ relative to Barbara is given by:
$$v_{PB} = v_{PA} - v_{BA} = -78 \mathrm{~km} / \mathrm{h} - 52 \mathrm{~km} / \mathrm{h} = -130 \mathrm{~km} / \mathrm{h}$$
Therefore, Barbara will measure a velocity of **-130**
$\mathrm{~km} / \mathrm{h}$ for car $P$.
Here is the explanation:
- The velocity of car $P$ relative to Alex, $v_{PA}$, is measured by
Alex, who is stationary in frame $A$.
- The velocity of Barbara relative to Alex, $v_{BA}$, is measured by
Alex, who is also stationary in frame $A$.
- The velocity of car $P$ relative to Barbara, $v_{PB}$, is measured by
Barbara, who is moving in frame $B$ with a velocity of $v_{BA}$.
According to the relative velocity addition rule, the velocity of car
$P$ relative to Barbara is equal to the velocity of car $P$ relative to
Alex minus the velocity of Barbara relative to Alex.
In this case, the velocity of car $P$ relative to Alex is -78
$\mathrm{~km} / \mathrm{h}$ and the velocity of Barbara relative to Alex
is 52 $\mathrm{~km} / \mathrm{h}$. Therefore, the velocity of car $P$
relative to Barbara is -130 $\mathrm{~km} / \mathrm{h}$.
**Correct output:**
**GPT-4V:**
Given:
$v_{BA}$ = 52 $\mathrm{~km} / \mathrm{h}$ (velocity of Barbara relative
to Alex)
$v_{PA}$ = -78 $\mathrm{~km} / \mathrm{h}$ (velocity of car $P$ as
measured by Alex)
We need to find $v_{PB}$ (velocity of car $P$ as measured by Barbara).
Using the Galilean transformation for velocities in one dimension:
$$\begin{aligned}
v_{PB} &= \red{v_{PA} + v_{BA}} \\
v_{PB} &= \red{-78 \text{ $\mathrm{~km} / \mathrm{h}$} + 52 \text{ $\mathrm{~km} / \mathrm{h}$}} \\
v_{PB} &= \red{-26 \text{ $\mathrm{~km} / \mathrm{h}$}}
\end{aligned}$$
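For reference (not part of the quoted transcripts), a minimal numeric check of the one-dimensional relative-velocity relation $v_{PB} = v_{PA} - v_{BA}$ used in Multimodal Bard's correct solution above; GPT-4V's mistake, highlighted in red, is adding $v_{BA}$ instead of subtracting it:

``` python
# One-dimensional Galilean relative velocity: v_PB = v_PA - v_BA (km/h along x).
v_BA = 52.0    # Barbara's velocity relative to Alex
v_PA = -78.0   # car P's velocity as measured by Alex

v_PB = v_PA - v_BA   # car P's velocity as measured by Barbara
print(v_PB)          # -130.0 km/h, matching Multimodal Bard's answer
```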
### Statistical Reasoning
In MathVista, problems encompass a variety of charts, plots, and graphs designed
to assess the statistical reasoning capabilities of foundation models.
As demonstrated in Figure
[fig:tease_scores], GPT-4V shows
strong statistical reasoning ability. For instance, GPT-4V produces
accurate answers for the format-rich table in Figure
[fig:example_649] and the data
analysis table in Figure
[fig:example_389].
**ID:** 649
**Question:** How many values are below 30 in Mainly are incidents of
individual misconduct?
**Ground truth:**
**LLaMA-Adapter-V2:** .
**LLaVA:** .
**Multimodal Bard:**
. , which is the percentage of women who believe that recent allegations
of sexual harassment and assault are mainly incidents of individual
misconduct.
.
**GPT-4V:**
Based on the “Mainly are incidents of individual misconduct” column, the
following values are below 30:
- 28 (Total)
- 22 (Women)
- 19 (College grad+)
- 25 (Dem/Lean Dem)
That’s a total of 4 values.
So, the answer is 4.
**Correct output:**
**ID:** 389
**Question:** Look at the table. Then answer the question. At a price of
\$320, is there a shortage or a surplus?
**Choices:** (A) shortage (B) surplus
**Ground truth:**
**LLaMA-Adapter-V2:** .
**LLaVA:** .
**Multimodal Bard:**
.
At a price of \$320, .
Here is a table summarizing the supply and demand at each price:
**GPT-4V:**
At a price of \$320:
Quantity demanded = 20,000
Quantity supplied = 12,500
Since the quantity demanded (20,000) is greater than the quantity
supplied (12,500), there is a shortage.
**Correct output:**
## Abilities Across Visual Contexts [sec:gpt4v_visual]
This section compares the reasoning abilities of GPT-4V with other large
multimodal models (LMMs) on MathVista, considering various types of visual
contexts. Models used for comparison include LLaMA-Adapter-V2-7B
(simplified as LLaMA-Adapter-V2), LLaVA-LLaMA-2-13B (simplified as
LLaVA), and Multimodal Bard.
### Abstract Scene
Based on Figure [fig:tease_scores], current
foundation models lag behind human performance in mathematical reasoning
in abstract scenes by a substantial margin. Consider the problems in
Figures [fig:example_742] and
[fig:example_7] that are derived from
math word problems found in elementary school curricula. Despite their
advanced capabilities, foundation models such as Multimodal Bard and
GPT-4V fail to produce the correct responses.
**ID:** 742
**Question:** Move the ruler to measure the length of the twig to the
nearest inch. The twig is about ( ) inches long.
**Ground truth:**
**LLaMA-Adapter-V2:** .
**LLaVA:** .
**Multimodal Bard:**
According to the image, . . Therefore, .
**GPT-4V:** .
**ID:** 7
**Question:** What time is shown? Answer by typing a time word, not a
number. It is ( ) after eight.
**Choices:** (A) half (B) quarter (C) o’clock (D) quarter to (E) quarter
past
**Ground truth:**
**LLaMA-Adapter-V2:**
**LLaVA:** .
**Multimodal Bard:**
The image you sent shows a clock with the hands pointing to 8:15.
The hour hand is between 8 and 9, which means it is past 8 o’clock. The
minute hand is pointing to the 3, which is 15 minutes past. Therefore,
the time is quarter past eight.
**GPT-4V:**
, which indicates that it is 45 minutes past the hour. The hour hand is
pointing a little past the 8, which means it’s after 8 o’clock.
So,
The correct answer is .
### Bar Chart
As shown in Figure [fig:tease_scores], foundation
models, including GPT-4V, significantly underperform humans in
mathematical reasoning when bar charts serve as the visual context.
Neither Multimodal Bard nor GPT-4V can solve the problems depicted in
Figures [fig:example_497] and
[fig:example_629], which do not
need complex understanding and reasoning.
**ID:** 497
**Question:** What is the value of the largest bar?
**Ground truth:**
**LLaMA-Adapter-V2:** .
**LLaVA:** .
**Multimodal Bard:** .
**GPT-4V:** .
**ID:** 629
**Question:** What percentage of people prefer the object hut?
**Ground truth:**
**LLaMA-Adapter-V2:** .
**LLaVA:** .
**Multimodal Bard:** Based on the image, . .
**GPT-4V:** .
### Function Plot
GPT-4V outperforms other baselines on problems related to function plots
and even exceeds human performance. Figures
[fig:example_767] and
[fig:example_39] show questions with
digital and hand-drawn function plots, respectively. In both cases,
GPT-4V accurately identifies their functions and infers the correct
properties.
**ID:** 767
**Question:** What is the maximum value of $y$?
**Ground truth:**
**LLaMA-Adapter-V2:** .
**LLaVA:** .
**Multimodal Bard:**
.
. This is the equation of a circle with center $(0, 0)$ and . .
**GPT-4V:**
The equation given is for a circle: $x^2 + y^2 = 25$.
The radius of this circle is the square root of 25, which is 5.
The maximum value of $y$ corresponds to the highest point of the circle,
which is equal to the radius.
So, the maximum value of $y$ is 5.
**Correct output:**
**ID:** 39
**Question:** Is this function odd or even?
**Choices:** (A) odd (B) even
**Ground truth:**
**LLaMA-Adapter-V2:** The function is odd.
**Correct output:**
**LLaVA:** The function in the image is odd.
**Correct output:**
**Multimodal Bard:**
The function in the image is **odd**.
An even function is a function where $f(-x) = f(x)$. An odd function is
a function where $f(-x) = -f(x)$.
, if we substitute $x = -x$, we get: $$\begin{aligned}
\red{y} &= \red{a_2(-x) - a_1} \\
&= \red{-a_2x + a_1}
\end{aligned}$$
This is not equal to $f(x)$, so the function is **odd**.
The correct choice is (A).
**Correct output:**
**GPT-4V:**
The given function is $y = a^x$ where $a > 1$.
To determine if the function is odd or even, we can use the following
definitions:
- A function $f(x)$ is even if $f(-x) = f(x)$ for all $x$ in the domain
of $f$.
- A function $f(x)$ is odd if $f(-x) = -f(x)$ for all $x$ in the domain
of $f$.
For the given function $y = a^x$, when we plug in $-x$ for $x$:
$$y = a^{-x}$$
This is not equal to $-a^x$ and it’s not equal to $a^x$ either.
Thus, the function is neither odd nor even. However, since the given
choices only include “odd” and “even”, the best answer is neither. But
since I must choose one of the provided options (A) odd.
**Correct output:**
### Geometry Diagram
Geometry diagrams are a distinct type of visual context in MathVista. To answer
questions involving these diagrams, a model must comprehend the
fine-grained details, including symbols, variables, and relations from
the figures. Additionally, it should apply associated theorems before
executing calculations to produce final responses. GPT-4V surpasses
other models and even humans due to its superior capabilities in
geometry recognition and reasoning. In the examples shown in Figures
[fig:example_478] and
[fig:example_410], GPT-4V delivers
the correct results through the application of relevant theorems and
subsequent calculations.
**ID:** 478
**Question:** As shown in the figure, in the diamond $ABCD$, two
diagonal lines $AC = 12.0$, $BD = 16.0$, then the edge length of this
diamond is ().
**Choices:** (A) 10 (B) 8 (C) 6 (D) 5
**Ground truth:**
**LLaMA-Adapter-V2:** .
**LLaVA:** .
**Multimodal Bard:**
The answer is (A). 10.
In a rhombus, the diagonals bisect each other at right angles. So, we
can draw two right triangles within the rhombus, each with a leg of
length 6 and a hypotenuse of length 10.
By the Pythagorean Theorem, .
So the answer is (A).
**Correct output:**
**GPT-4V:**
Given that $AC = 12.0$ and $BD = 16.0$ in the rhombus (diamond)
$ABCD$.
Since all sides of a rhombus are congruent, the diagonals of a rhombus
are perpendicular bisectors of each other. This means that each half of
the diagonal $BD$ is 8.0 and each half of diagonal $AC$ is 6.0.
Using the Pythagorean theorem for one of the right triangles (formed by
the diagonals): $$\begin{aligned}
\text{Side}^2 &= \left(\text{Half of $BD$}\right)^2 + \left(\text{Half of $AC$}\right)^2 \\
\text{Side}^2 &= 8^2 + 6^2 \\
\text{Side}^2 &= 64 + 36 \\
\text{Side}^2 &= 100 \\
\text{Side} &= 10
\end{aligned}$$ **Correct output:**
**ID:** 410
**Question:** Find the perimeter of $\triangle D E F,$ if
$\triangle D E F \sim \triangle C B F,$ perimeter of
$\triangle C B F=27, D F=6,$ and $F C=8$.
**Choices:** (A) 20.25 (B) 21 (C) 27 (D) 36
**Ground truth:**
**LLaMA-Adapter-V2:** The perimeter of $\triangle D E F$ is 20.25, as
shown in the image.
**Correct output:**
**LLaVA:**
The perimeter of $\triangle D E F$ can be calculated using the
similarity theorem, which states that if two triangles are similar,
their corresponding sides are proportional. In this case, we know that
$\triangle D E F \sim \triangle C B F$ and the perimeter of
$\triangle C B F$ is 27.
Since the ratio of the corresponding sides of similar triangles is the
same, we can use the perimeter of $\triangle C B F$ to find the
perimeter of $\triangle D E F$. The perimeter of $\triangle D E F$ is
proportional to the perimeter of $\triangle C B F$, so we can multiply
the perimeter of $\triangle C B F$ by the ratio of the corresponding
sides.
.
Therefore, .
**Multimodal Bard:**
Since $\triangle DEF \sim \triangle CBF$, we have
$\frac{DE}{CB} = \frac{DF}{FC}$. Since $DE = 6$ and $FC = 8$, we have
$\frac{DE}{CB} = \frac{6}{8} = \frac{3}{4}$.
. Substituting $\frac{3}{4}$ for $\frac{CB}{BF}$, .
The perimeter of $\triangle DEF$ is
$DE + EF + DF = \red{6 + EF + 6 = EF + 12}$. Substituting $\frac{3}{4}$
for $\frac{EF}{BF}$, .
Therefore, .
**GPT-4V:**
Given $\triangle D E F \sim \triangle C B F$ and $DF = 6$ and $FC = 8$,
we can find the scale factor of the two similar triangles.
Scale factor = $\frac{DF}{FC} = \frac{6}{8} = \frac{3}{4}$.
Given the perimeter of $\triangle CBF = 27$, we can find the perimeter
of $\triangle DEF$ by multiplying by the scale factor.
Perimeter of $\triangle DEF$ = $27 \times \frac{3}{4} = 20.25$.
**Correct output:**
### Line Plot
As evidenced by Figure
[fig:tease_scores], current models
such as GPT-4V do not perform as well as humans in mathematical
reasoning involving line plots. We speculate that the low performance is
mainly due to the difficulty in detecting OCR text in the figures and
accurately grounding the values, as illustrated by the examples in
Figures [fig:example_981] and
[fig:example_209].
**ID:** 981
**Question:** What is the sum of 2002, 2003 and 2004?
**Ground truth:**
**LLaMA-Adapter-V2:** .
**LLaVA:** .
**Multimodal Bard:**
I see the image.
. .
**GPT-4V:**
The sum for the years 2002, 2003, and 2004 from the graph are:
2002: 19.4%
2003: 23.2%
**ID:** 209
**Question:** What is the highest value in black line chart?
**Ground truth:**
**LLaMA-Adapter-V2:** .
**LLaVA:** .
**Multimodal Bard:** .
**GPT-4V:** .
### Natural Image
MathVista includes questions that require numeric and spatial reasoning based on
text and objects in natural images. If models have limited abilities to
recognize text (OCR), as shown in Figure
[fig:example_980], or to identify
visual objects, as in Figure
[fig:example_884], they are
unlikely to generate correct answers to visual questions.
**ID:** 980
**Question:** What is the highest number shown?
**Ground truth:**
**LLaMA-Adapter-V2:** The highest number shown is 12.
**Correct output:**
**LLaVA:** .
**Multimodal Bard:** I see the image. .
**GPT-4V:** The highest number shown on the clock is 12.
**Correct output:**
**ID:** 884
**Question:** What is the blue kite in the lower right corner shaped
like?
**Choices:**
(A) ferret (B) cat (C) cloud (D) octopus
**Ground truth:**
**LLaMA-Adapter-V2:** .
**LLaVA:** .
**Multimodal Bard:** .
**GPT-4V:** The blue kite in the lower right corner is shaped like (D)
octopus.
**Correct output:**
### Puzzle Test
Math reasoning with puzzle text figures is challenging for current AI
foundation models because interpreting these figures requires discerning
underlying patterns from sets of shapes, as illustrated in Figure
[fig:example_123], and numbers, as
in Figure [fig:example_941]. There is plenty
of room for improvement.
**ID:** 123
**Question:** The diagram represents successive rotations, starting from
the top down. Which shape comes next?
**Choices:** (A) A (B) B (C) C (D) D (E) E
**Ground truth:**
**LLaMA-Adapter-V2:**
The diagram represents successive rotations, starting from the top down.
The next shape that comes next is letter D.
**Correct output:**
**LLaVA:** .
**Multimodal Bard:**
.
.
To find the next shape, we need to rotate the cube 90 degrees on the
z-axis. .
**GPT-4V:**
The diagram shows a pattern of rotating the shape in a specific
sequence. To determine the next shape:
Starting with the first shape, when we rotate it 90 degrees
counterclockwise, we get the second shape.
Rotating the second shape 90 degrees counterclockwise gives the third
shape.
Thus, rotating the third shape 90 degrees counterclockwise will give us
the next shape.
.
**ID:** 941
**Question:** Which is the next number in the series?
**Ground truth:**
**Solution:**
$(3968 + 1) = 63^2$
$(63 + 1) = 8^2$
$(8 + 1) = 3^2$
$(3 + 1) = 2^2$
**LLaMA-Adapter-V2:** .
**LLaVA:** .
**Multimodal Bard:**
.
The series is based on the square root of the number plus 1. So, . Then,
63 + 1 = 64, and its square root is 8. Continuing the pattern, 8 + 1 =
9, and its square root is 3.
Therefore, .
**GPT-4V:**
The pattern in the ladder involves dividing the number by the step
number to get the next number.
$\text{Step 1:} \quad \red{3968 \div 1 = 3968}$
$\text{Step 2:} \quad \red{3968 \div 2 = 1984}$
$\text{Step 3:} \quad \red{63 \div 3 = 21}$
$\text{Step 4:} \quad \red{8 \div 4 = 2}$
$\text{Step 5:} \quad \red{3 \div 5 = 0.6}$
.
However, if we assume : $$\begin{aligned}
\red{3968 \div 63} & \red{\approx 63} \\
\red{63 \div 8} & \red{= 7.875 \text{ which isn't a round number.}} \\
\red{8 \div 3} & \red{= 2.6666 \text{ again not a round number.}}
\end{aligned}$$
Without a clear and consistent pattern, it’s challenging to determine
the next number in the sequence.
### Scatter Plot
A scatter plot is a graphical representation of data points on a
two-dimensional axis, where each point represents a value from the
dataset. MathVista includes reasoning tasks that require comprehending scatter
plots taken from daily life and academic papers, as shown in Figures
[fig:example_105] and
[fig:example_542]. Although GPT-4V
outperforms other LMMs, such as Multimodal Bard, and even humans in
overall accuracy (Figure
[fig:tease_scores]), it often
fails in the cases where fine-grained understanding is required, as in
Figure [fig:example_542].
**ID:** 105
**Question:** Does Dark Violet have the minimum area under the curve?
**Choices:** (A) yes (B) no
**Ground truth:**
**LLaMA-Adapter-V2:** .
**LLaVA:** .
**Multimodal Bard:**
.
.
**GPT-4V:**
Based on the scatter plot provided, “area under the curve” is not
represented since there is no curve shown. However, if by “area under
the curve” we mean the overall distribution and position of the points
for "Dark Violet" compared to others, we can analyze the positions of
the points.
From the scatter plot, the color “Dark Violet” consistently appears to
be the lowest on the y-axis across the entire range of the x-axis.
Given the choices: (A) yes (B) no
The correct answer is: (A) yes.
**Correct output:**
**ID:** 542
**Question:** How many models in the figure achieve an Acc score greater
than 60?
**Ground truth:**
**LLaMA-Adapter-V2:** .
**LLaVA:** There are two models in the figure that achieve an Acc score
greater than 60.
**Correct output:**
**Multimodal Bard:**
. These models are:
- GPT-4
-
-
-
-
-
-
-
- LLAMA2-70B
The Acc score is the accuracy of the model, which is the percentage of
test cases that the model correctly classifies. An Acc score of 60 means
that the model correctly classifies 60% of the test cases.
**GPT-4V:** .
### Scientific Scene
Answering questions based on scientific scenes poses a challenge in
aligning the scientific concepts present in the question text and those
in the accompanying figures. GPT-4V demonstrates its superior ability to
reason about scientific scenes compared to other models, as evidenced in
Figure [fig:tease_scores]. In the example
of Figure [fig:example_685], GPT-4V adeptly
identifies two organisms in the food web and elucidates their
relationship. In another instance, shown in Figures
[fig:example_862] and
[fig:example_862_2], both
Multimodal Bard and GPT-4V are able to use knowledge in the physical
domain to effectively ground the symbols and variables depicted in the
figure.
**ID:** 685
**Question:** What would happen to organism c if organism b increased?
**Choices:** (A) decrease (B) increase
(C) can’t predict (D) stay same
**Ground truth:**
**LLaMA-Adapter-V2:**
. This is because organism c is dependent on organism b for its food
supply, and .
**LLaVA:**
. . As a result, .
**Multimodal Bard:**
.
In the food chain, . This means that . .
.
**GPT-4V:**
If Organism B (which appears to be some kind of insect) increased in
population, it would mean that there is more available food for Organism
C (the fish).
Therefore, the population of Organism C would likely: (B) increase.
**Correct output:**
**ID:** 862
**Question:** One of the most dramatic videos on the web (but entirely
fictitious) supposedly shows a man sliding along a long water slide and
then being launched into the air to land in a water pool. Let’s attach
some reasonable numbers to such a flight to calculate the velocity with
which the man would have hit the water. Figure indicates the launch and
landing sites and includes a superimposed coordinate system with its
origin conveniently located at the launch site. From the video we take
the horizontal flight distance as $D=20.0 \mathrm{~m}$, the flight time
as $t=2.50 \mathrm{~s}$, and the launch angle as
$\theta_0=40.0^{\circ}$. Find the magnitude of the velocity at launch
and at landing.
**Ground truth:**
**Multimodal Bard:**
Given the following information:
- Horizontal flight distance, $D = 20.0 \text{ m}$
- Flight time, $t = 2.50 \text{ s}$
- Launch angle, $\theta_0 = 40.0^{\circ}$
We can find the magnitude of the velocity at launch using the following
equation:
$$v_0 = \sqrt{v_{0x}^2 + v_{0y}^2}$$
where $v_{0x}$ is the horizontal component of the launch velocity and
$v_{0y}$ is the vertical component of the launch velocity.
The horizontal component of the launch velocity can be found using the
following equation:
$$v_{0x} = \frac{D}{t} = \frac{20.0 \text{ m}}{2.50 \text{ s}} = 8.00 \text{ m/s}$$
The vertical component of the launch velocity can be found using the
following equation:
$$v_{0y} = v_0 \sin \theta_0 = 8.00 \text{ m/s} \sin 40.0^{\circ} = 5.83 \text{ m/s}$$
Therefore, the magnitude of the velocity at launch is:
$$v_0 = \sqrt{v_{0x}^2 + v_{0y}^2} = \sqrt{(8.00 \text{ m/s})^2 + (5.83 \text{ m/s})^2} = 10.44 \text{ m/s} \approx 10.4 \text{ m/s}$$
The magnitude of the velocity at landing will be the same as the
magnitude of the velocity at launch, since there is no air resistance in
this case.
.
**GPT-4V:**
Given:
- $D = 20.0$ m
- $t = 2.50$ s
- $\theta_0 = 40.0^{\circ}$
Using kinematic equations, the horizontal motion is described as:
$$D = V_{0x} \times t$$
Where $V_{0x}$ is the horizontal component of the initial velocity.
$$V_{0x} = V_0 \times \cos(\theta_0)$$
From the first equation: $$V_0 \times \cos(\theta_0) = \frac{D}{t}$$
Rearranging for $V_0$: $$V_0 = \frac{D}{t \times \cos(\theta_0)}$$
Substitute the given values to get $V_0$:
$$V_0 = \frac{20.0 \mathrm{~m}}{2.50 \mathrm{~s} \times \cos(40.0^{\circ})}$$
Using a calculator:
$$V_0 = \frac{20.0}{2.50 \times 0.766}$$ $$V_0 = \frac{20.0}{1.915}$$
$$V_0 = 10.44 \mathrm{~m/s}$$
Thus, the magnitude of the launch velocity (which will be the same as
the landing velocity, ignoring air resistance) is $10.44 \mathrm{~m/s}$.
**Correct output:**
### Synthetic Scene
Problems involving synthetic scenes require a nuanced understanding of
visual objects, such as the numbers, attributes, and positions of these
objects, as shown in Figures
[fig:example_52] and
[fig:example_177]. Although GPT-4V
demonstrates notable advancements over other models, such as Multimodal
Bard, it still falls significantly short of human performance, as shown
in Figure [fig:tease_scores].
**ID:** 52
**Question:** Subtract all big red metallic spheres. Subtract all big
brown matte things. How many objects are left?
**Ground truth:**
**LLaMA-Adapter-V2:**
.
**LLaVA:**
.
**Multimodal Bard:**
:
- 1 big red metallic sphere
- 1 big brown matte cube
- 1 small red metallic sphere
- 1 small brown matte sphere
:
- 1 big brown matte cube
- 1 small red metallic sphere
- 1 small brown matte sphere
:
- 1 small red metallic sphere
- 1 small brown matte sphere
Therefore, .
**GPT-4V:**
There are no big red metallic spheres in the image. There is 1 big brown
matte object (the cube).
Total objects to be subtracted = 0 + 1 = 1
Total objects in the image = 7
Objects left = 7 - 1 = 6
There are 6 objects left.
**Correct output:**
**ID:** 177
**Question:** Is the number of brown tandem bikes that are to the left
of the small blue matte car greater than the number of tiny blue
biplanes?
**Choices:** (A) Yes (B) No
**Ground truth:**
**LLaMA-Adapter-V2:**
Yes, the number of brown tandem bikes that are to the left of the small
blue matte car is greater than the number of tiny blue biplanes.
**Correct output:**
**LLaVA:**
.
**Multimodal Bard:**
. , and . .
**GPT-4V:**
Looking at the image, there is 1 brown tandem bike to the left of the
small blue matte car. There are no tiny blue biplanes in the image.
Thus, the number of brown tandem bikes that are to the left of the small
blue matte car is (A) Yes greater than the number of tiny blue biplanes.
**Correct output:**
### Table
Tables serve as a powerful tool to present and summarize large amounts
of data in a comprehensible manner. In particular, GPT-4V has shown
significant advancements over other foundation models and even surpasses
human performance on table-related reasoning tasks, as shown in Figure
[fig:tease_scores]. The example in
Figure [fig:example_923] shows a complex
table taken from an academic paper. GPT-4V can accurately pinpoint the
target cells among numerous rows and columns. Figure
[fig:example_265] shows a QA task
in which the answer needs to be derived from the table regarding the
push-up competition. GPT-4V is the only model that can produce the
correct answer.
**ID:** 923
**Question:** Which model achieves the highest score in terms of Rec?
**Choices:**
(A) Transformers Agent (GPT-4)
(B) LLaMA-Adapter v2-7B
(C) LLaVA-7B
(D) Otter-9B
(E) MM-ReAct-GPT-3.5
(F) LLaVA-13B (LLaMA-2)
(G) MM-ReAct-GPT-4
**Ground truth:**
**LLaMA-Adapter-V2:** .
**LLaVA:**
.
**Multimodal Bard:**
The answer is **(G) MM-ReAct-GPT-4**.
MM-ReAct-GPT-4 achieves the highest score of **33.1** in terms of Rec,
which is the ability to recognize and understand text. The other models
have the following scores in terms of Rec:
- Transformers Agent (GPT-4): 18.2
- LLaMA-Adapter v2-7B: 16.8
- LLaVA-7B: 28.0
- Otter-9B: 28.4
- MM-ReAct-GPT-3.5: 24.2
-
**GPT-4V:**
The model that achieves the highest score in terms of Rec is “LLaVA-13B
(LLaMA-2)” with a score of 39.2%. The correct option is: (F) LLaVA-13B
(LLaMA-2)
**Correct output:**
**ID:** 265
**Question:** Justine’s P.E. class participated in a push-up
competition, and Justine wrote down how many push-ups each person could
do. How many people did at least 60 push-ups? (Unit: people)
**Ground truth:**
**LLaMA-Adapter-V2:**
Justine’s P.E. class participated in a push-up competition, and Justine
wrote down how many push-ups each person could do. . .
**LLaVA:**
.
**Multimodal Bard:**
There are 4 people who did at least 60 push-ups.
, so .
. .
**GPT-4V:**
For the stem-and-leaf plot provided:
6 represents the tens value, and the digits 3, 6, 7, and 7 represent the
ones value. So, the number of push-ups represented by the 6 row are 63,
66, 67, and 67.
7 represents the tens value, and the digits 1, 3, 3, 5, 7, 7, and 8
represent the ones value. So, the number of push-ups represented by the
7 row are 71, 73, 73, 75, 77, 77, and 78.
Counting these numbers: 4 from the 6 row + 7 from the 7 row = 11.
**Correct output:**
### Other Visual Contexts
On the reasoning tasks using *other* visual contexts, GPT-4V achieves a
higher overall accuracy than all the other models, as depicted in Figure
[fig:tease_scores]. For instance,
GPT-4V is the only model that is capable of generating the correct
answer to the question regarding a violin plot, as shown in Figure
[fig:example_895].
**ID:** 895
**Question:** Which model has the lowest Audio-Audio Similarity and
Text-Audio Similarity scores overall?
**Choices:**
(A) MusicLDM (mix-up)
(B) MusicLDM (original)
(C) MusicLDM (BLM)
(D) MusicLDM (BAM)
(E) MuBERT
**Ground truth:**
**LLaMA-Adapter-V2:**
.
**LLaVA:**
Based on the image, .
**Multimodal Bard:**
.
The violin plot shows that the distribution of audio-audio similarity
scores for MusicLDM (BLM) is the most spread out, indicating that there
is a lot of variation in the scores. The distribution of text-audio
similarity scores for MusicLDM (BLM) is also the most spread out. This
suggests that MusicLDM (BLM) is not very consistent in its ability to
capture the similarity between audio and text.
**GPT-4V:**
The model with the lowest Audio-Audio Similarity and Text-Audio
Similarity scores overall is MuBERT.
**Correct output:**
## Self-Verification in GPT-4V [app:gpt4v_self]
Self-verification is a social psychological theory asserting that people
desire others to perceive them as they see themselves. Consequently,
individuals will take active measures to ensure that others view them in
ways that confirm their stable self-concepts
[talaifar2020self](http://arxiv.org/pdf/2405.14152v1).
Interestingly, in our experiments, GPT-4V demonstrates an ability
similar to *self-verification*. The model can inspect its own behaviors
during the course of reasoning and can take active actions to correct
its mistakes. Note that the self-verification we discuss here differs from
several recent works on improving the model’s outputs based on external
feedback [peng2023check](http://arxiv.org/pdf/2302.12813v3) or additional generations
[thedawn2023yang](None). The examples in Figures
[fig:example_6] and
[fig:example_28] show that GPT-4V,
on its own, can inspect a set of candidate answers and identify the one
that is valid and meets all the given constraints. The multi-step
reasoning example in Figure
[fig:example_615] shows that GPT-4V
can verify the validity of (the result of) each reasoning step, and
explore alternative approaches if any invalid (intermediate) result is
detected (*e.g.,* a negative value for length).
Although self-verification does not guarantee an accurate response even
after multiple tries, especially when applying GPT-4V to visual
perception or mathematical reasoning in intricate scenarios (see Figure
[fig:example_644]), it is
instrumental in improving model performance on MathVista. We also found that
GPT-4V’s self-verification is weaker for non-English tasks, such as
Mandarin, as shown in Figure
[fig:example_32]. It is also worth
noting that self-verification does not emerge in other foundation models
we studied, or at least it is not as robust as that of GPT-4V. As shown
in Figure [fig:example_44], Multimodal Bard
first attempts a natural language solution, followed by a
program-assisted one for verification. However, the program-aided
solution leads to a different and incorrect prediction.
The emergent ability of self-verification highlights GPT-4V’s potential
in solving rigorous reasoning and theorem-proving tasks. One of the most
exciting research topics for future work is to develop a mechanism that
allows the model to activate self-verification consistently at the right
time and to use a set of alternative approaches that maximize the
success rate of task completion.
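Self-verification is described above as an emergent, in-context behavior of GPT-4V rather than an external mechanism. Purely as an illustration of the control flow it suggests, a minimal external verify-and-retry loop might look like the sketch below; `ask_model` and `check_answer` are caller-supplied placeholders, not part of any existing API:

``` python
def solve_with_verification(question, image, ask_model, check_answer, max_tries=3):
    """Sketch of an external generate-then-verify loop (illustrative only).

    `ask_model` queries a multimodal model; `check_answer` returns a list of
    detected problems (e.g., ["negative length"]) or an empty list if the
    answer passes the sanity checks.
    """
    feedback = ""
    answer = None
    for _ in range(max_tries):
        answer = ask_model(question + feedback, image)
        problems = check_answer(answer)
        if not problems:
            return answer
        # Feed the detected issues back so the next attempt can explore an
        # alternative approach, mirroring the behavior discussed above.
        feedback = "\nThe previous attempt was invalid: " + "; ".join(problems)
    return answer
```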
**ID:** 615
**Question:** If two sides of a triangle measure 12 and 7, which of the
following cannot be the perimeter of the triangle?
**Choices:** (A) 29 (B) 34 (C) 37 (D) 38
**Ground truth:**
**GPT-4V:**
The sum of the lengths of any two sides of a triangle must be greater
than the length of the third side. Let $x$ be the length of the third
side.
From the triangle inequality: $$12 + 7 > x => x < 19$$ Similarly:
$$7 + x > 12 => x > 5$$ And:
$$12 + x > 7 => x > -5 \text{~(\blue{This will always be true since the length cannot be negative})}$$
So, $$5 < x < 19$$
**Correct output:**
**ID:** 6
**Question:** As shown in the figure, if $CB = 4.0$, $DB = 7.0$, and $D$
is the midpoint of $AC$, then the length of $AC$ is ().
**Choices:** (A) $6$ cm (B) $7$ cm (C) $4$ cm (D) $5$ cm
**Ground truth:**
**GPT-4V:**
From the figure, we can deduce:
$$AD = DC \text{~(since $D$ is the midpoint of $AC$)}$$
Now, using the entire line segment $AB$: $$AB = AD + DB$$
$$AB = DC + DB$$
Given $DB = 7.0$ cm, we can rearrange to find:
$$AD~(\text{or}~DC) = AB - DB$$ $$AD = AB - 7.0$$
Now, let’s find $AB$:
From the given information, $$AB = AC + CB$$ $$AB = AC + 4.0$$
Using the previously deduced equation: $$AD = AC + 4.0 - 7.0$$
$$AD = AC - 3.0$$
Since $D$ is the midpoint of $AC$, $AD$ will be half the length of $AC$:
$$AC = 2(AD)$$
From the above: $$\red{AC - 3.0 = 2(AD)}$$
Given that $AD$ and $DC$ are equal, we can represent $DC$ as:
$$DC = AC/2$$
Using the relation $CB = 4.0$ cm and $DB = 7.0$ cm:
$$\red{AC/2 + 7.0 = 4.0}$$ $$\red{AC/2 = -3.0}$$
From the original figure: $$AB = AD + DB$$ $$AB = DC + DB$$
Given $DB = 7.0$ cm, we have: $$AD = AB - 7.0$$ $$AD = AC/2 - 7.0$$
Using the relation $AB = AC + CB$: $$AC/2 + 7.0 = AC + 4.0$$
Subtracting $AC/2$ from both sides: $$7.0 = AC/2 + 4.0$$
Subtracting $4$ from both sides: $$AC/2 = 3.0$$
Multiplying both sides by $2$: $$AC = 6.0 \text{~cm}$$
**Correct output:**
**ID:** 28
**Question:** $\overline{AB}$ is a diameter, $AC=8$ inches, and $BC=15$
inches. Find the radius of the circle.
**Choices:** (A) 7.5 (B) 8 (C) 8.5 (D) 17
**Ground truth:**
**GPT-4V:**
Given that $AB$ is a diameter and $AC = 8$ inches and $BC = 15$ inches,
we can use the Pythagorean theorem to determine the length of $AB$.
The right triangle is formed by $A$, the center of the circle, and $C$
makes $AC$ the shorter leg, the radius the longer leg, and $BC$ the
hypotenuse.
Let $r$ be the radius of the circle. Using the Pythagorean theorem:
Substituting the given lengths: $$\red{8^2 + r^2 = 15^2}$$
$$\red{64 + r^2 = 225}$$ $$\red{r^2 = 161}$$
Taking the square root of both sides: $$\red{r = \sqrt{161}}$$
$$\red{AB = 2r = 2\sqrt{161}}$$
$$r = \sqrt{161}/2 = 8.5 \text{~(approximately)}$$
**Correct output:**
**ID:** 644
**Question:** What is the missing number in the picture?
**Choices:** (A) 6 (B) 8 (C) 10 (D) 11
**Ground truth:**
**Solution:** $8 + 4 -2 = 10$, $3 + 2 - 3 = 2$, $12 + 1 - 5 = 8$
**GPT-4V:**
Let’s analyze the pattern in the given triangles:
1. For the first triangle: $$\red{(8 - 4) \times 2 = 8}$$
2. For the second triangle: $$\red{(3 - 2) \times 3 = 3}$$
Applying the same pattern to the third triangle:
$$\red{(12 - 1) \times 5 = 55}$$
However, since the answer is in single digits, it seems we take only the
unit’s place of the result.
So, the missing number is:
1. For the first triangle: $$\red{(8 + 4) / 2 = 6}$$
2. For the second triangle:
$$\red{(3 + 2) / 3 = 5/3 = 1.67} \text{~(This does not match the number inside)}$$
But let’s continue with this pattern for the third triangle:
$$\red{(12 + 1) / 5 = 13/5 = 2.6}$$
1. For the first triangle: $$\red{(8 \times 4) / 2 = 16}$$
2. For the second triangle: $$\red{(3 \times 2) / 3 = 2}$$
3. For the third triangle: $$\red{(12 \times 1) / 5 = 12/5 = 2.4}$$
**ID:** 32
**Question:** (In the figure, within $ABCD$, $AB = AC$, and $\angle CAB = 40^\circ$,
then the measure of $\angle D$ is ().)
**Choices:** (A) 40° (B) 50° (C) 60° (D) 70°
**Ground truth:**
**GPT-4V:**
给定 $AB = AC$,$\triangle ABC$ 是等腰三角形。由于
$\angle CAB = 40^\circ$,所以
(因为在等腰三角形中,两个底角是相等的)。
(Given $AB = AC$, $\triangle ABC$ is an isosceles triangle. Since
$\angle CAB = 40^\circ$, (because in an isosceles triangle, the two base
angles are equal).)
考虑四边形 $ABCD$,其内角和为 $360^\circ$。已知
$\angle CAB = 40^\circ$,,并且 。
(Considering the quadrilateral $ABCD$, its interior angles sum up to
$360^\circ$. We know that $\angle CAB = 40^\circ$, , and .)
我们可以找到 $\angle BCA$ 的角度为:
(We can find the angle of $\angle BCA$ as:)
,我们有:
(, we have:)
()
注意到 。所以:
(Notice that . So:)
所以 。。
(Thus, . .)
()
。
(.)
因此, ,但是 。
(Therefore, , but .)
所以 (So) $$\red{\angle D = 100^\circ - 40^\circ = 60^\circ}$$
**ID:** 44
**Question:** Chase wants to buy 4 kilograms of oval beads and 5
kilograms of star-shaped beads. How much will he spend? (Unit: \$)
**Ground truth:**
**Multimodal Bard:**
The cost of 4 kilograms of oval beads is $4 * 2 = \$8$.
The cost of 5 kilograms of star-shaped beads is $5 * 2 = \$10$.
So the total cost is $\$8 + \$10 = \$18$.
.
``` python
def calculate_cost(oval_beads, star_shaped_beads):
oval_bead_cost = 2
star_shaped_bead_cost = 2
total_cost = oval_beads * oval_bead_cost + star_shaped_beads * star_shaped_bead_cost
return total_cost
print(calculate_cost(4, 5))
# 35
```
## Self-Consistency for GPT-4V [app:gpt4v_consistency]
Self-consistency [wang2022self](http://arxiv.org/pdf/2404.13149v1) is a decoding strategy
for chain-of-thought prompting [wei2022chain](http://arxiv.org/pdf/2201.11903v6). A diverse
set of reasoning paths is sampled, and the most consistent answer is
selected as the final prediction. Moving beyond vanilla greedy decoding,
this method resorts to the inherent coherence and reliability of
multiple reasoning trajectories to produce a more trustworthy
conclusion. Self-consistency has been widely employed in LLMs for
complex reasoning tasks, such as math word problems and commonsense
reasoning.
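As a concrete illustration (a minimal sketch, not the evaluation code used here), self-consistency amounts to sampling several reasoning paths and taking a majority vote over their final answers; `sample_path` and `extract_answer` are caller-supplied placeholders (e.g., a temperature-sampled model query and a parser for the final answer):

``` python
from collections import Counter

def self_consistency_answer(question, image, sample_path, extract_answer, n_paths=5):
    """Sample several reasoning paths and majority-vote over their final answers."""
    answers = []
    for _ in range(n_paths):
        path = sample_path(question, image)    # one sampled chain of thought
        answers.append(extract_answer(path))   # its final answer
    # Most frequent answer wins; ties fall back to the earliest-seen answer.
    return Counter(answers).most_common(1)[0][0]
```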
In our experiments, we validated the effectiveness of using
self-consistency for GPT-4V on MathVista. Given a question and context, we ran
GPT-4V multiple times to obtain a set of different reasoning paths and
then selected the most frequent answer as the final prediction. We found
that self-consistency is instrumental in rectifying visual perception
errors (Figure [fig:example_648]), correcting
calculation mistakes (Figure
[fig:example_331]), and mitigating
hallucinations (Figure
[fig:example_430]). In comparison,
self-consistency is less effective when GPT-4V has difficulties in
interpreting complex visual contexts (Figures
[fig:example_157],
[fig:example_551]) or extracting
salient information from images (Figure
[fig:example_965]).
**ID:** 648
**Question:** What is the minimum value of this function?
**Ground truth:**
**GPT-4V (Reasoning Path 1):**
The minimum value of the function is observed where the graph is at its
lowest point along the $y$-axis.
From the graph, the minimum value lies on the $y$-axis between -0.5 and
0. Based on the visual representation, it appears that the minimum value
of $y$ is: -0.5.
So, the answer is -0.5.
**Wrong output:**
**GPT-4V (Reasoning Path 2):**
The graph shown represents a function with circular data points plotted
across the $x$-axis. To determine the minimum value of the function, we
need to locate the lowest point on the graph.
From the graph, it is evident that the lowest point (or the minimum
value of the function) is at a $y$-coordinate of approximately -1.
Therefore, the minimum value of the function is -1.
**Correct output:**
**GPT-4V (Reasoning Path 3):**
The graph depicts the function using circular points. The minimum value
of the function can be observed by identifying the lowest point on the
graph.
From the graph, the lowest point seems to be at $y$ = -1.
Therefore, the minimum value of this function is -1.
**Correct output:**
**ID:** 331
**Question:** Several people compared how many Web pages they had
visited. What is the mean of the numbers?
**Ground truth:**
**GPT-4V (Reasoning Path 1):**
To find the mean of the numbers, you add up all the numbers and then
divide by the quantity of numbers. $$\begin{aligned}
\text{Mean} &= (84 + 56 + 70 + 68 + 60 + 47 + 75 + 52) ÷ 8\\
&= 462 ÷ 8\\
&= 57.75
\end{aligned}$$ However, since the question requires an integer answer,
the mean number of web pages visited is: 58.
**Wrong output:**
**GPT-4V (Reasoning Path 2):**
To find the mean number of web pages visited:
First, sum the number of web pages for all the individuals:
Bonnie: 84, Winston: 56, Percy: 70, Derek: 68, Jake: 60, Finn: 47,
Carter: 75, Kurt: 52
$$\text{Total} = 84 + 56 + 70 + 68 + 60 + 47 + 75 + 52 = 512$$
Next, divide by the number of individuals (8):
$$\text{Mean} = 512 ÷ 8 = 64$$
The mean number of web pages visited is 64.
**Correct output:**
**GPT-4V (Reasoning Path 3):**
First, I’ll list the number of web pages each person visited:
Bonnie: 84, Winston: 56, Percy: 70, Derek: 68, Jake: 60, Finn: 47,
Carter: 75, Kurt: 52
Next, I’ll group some numbers together for easier addition:
$$\begin{aligned}
(84 + 56) &= 140\\
(70 + 68) &= 138\\
(60 + 47) &= 107\\
(75 + 52) &= 127
\end{aligned}$$
Now, I’ll sum these results:
$$140 + 138 + 107 + 127 = 512$$
Now, I’ll divide the sum by the number of individuals to get the mean:
$$512 ÷ 8 = 64$$
So, the mean number of web pages visited is 64.
**Correct output:**
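As an aside (not one of the sampled reasoning paths), a quick numeric check confirms the majority answer of 64 and shows that Reasoning Path 1's error was in the summation rather than the division:

``` python
# The eight values listed in the reasoning paths above.
pages = [84, 56, 70, 68, 60, 47, 75, 52]
print(sum(pages))                # 512 (Reasoning Path 1 mistakenly summed to 462)
print(sum(pages) / len(pages))   # 64.0, the majority answer
```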
**ID:** 430
**Question:** The derivative of $f(x)$ at $x=0$ is ( ) that at $x=5$.
**Choices:** (A) larger than (B) equal to (C) smaller than
**Ground truth:**
**GPT-4V (Reasoning Path 1):**
The graph represents the function $f(x) = |2x - 3| + 1$.
The slope (or the derivative) of $f(x)$ is represented by the
inclination of the line.
At $x = 0$, the slope is negative (the line is declining), and at
$x = 5$, the slope is positive (the line is inclining). The magnitude of
the slopes appears to be the same; only their directions differ.
However, when considering the magnitude of the derivative, the value at
$x = 0$ is negative, and the value at $x = 5$ is positive, but they have
the same absolute value.
**Wrong output:**
**GPT-4V (Reasoning Path 2):**
The given function is $f(x) = |2x - 3| + 1$.
The absolute value function is piecewise defined, so its derivative will
have different values depending on where you take it.
For $2x - 3 > 0$, which is $x > \frac{3}{2}$ or 1.5, $f(x) = 2x - 3$.
The derivative is $f'(x) = 2$.
For $2x - 3 < 0$, which is $x < \frac{3}{2}$ or 1.5, $f(x) = -2x + 3$.
The derivative is $f'(x) = -2$.
At $x = 0$, the derivative $f'(x) = -2$.
At $x = 5$, the derivative $f'(x) = 2$.
Therefore, the derivative of $f(x)$ at $x = 0$ is smaller than that at
$x = 5$.
**Correct output:**
**GPT-4V (Reasoning Path 3):**
The function provided is $f(x) = |2x - 3| + 1$.
The absolute value function has a V-shaped graph, as shown in the image.
The slope (or derivative) of the line to the left of the vertex is
negative, and the slope of the line to the right of the vertex is
positive.
The derivative of $f(x)$ at $x=0$ corresponds to the slope of the line
on the left side of the vertex, which is negative (downward slope). The
derivative of $f(x)$ at $x=5$ corresponds to the slope of the line on
the right side of the vertex, which is positive (upward slope).
Since a positive value is always larger than a negative value, the
derivative of $f(x)$ at $x=0$ is smaller than that at $x=5$.
**Correct output:**
**ID:** 157
**Question:** What number is shown?
**Ground truth:**
**GPT-4V (Reasoning Path 1):**
Let’s break down the number shown using base-ten blocks:
- There are 6 large cubes, each representing $1,000$. So,
$6 \times 1,000 = 6,000$.
- There are 7 flats, each representing 100. So, $7 \times 100 = 700$.
- There are 4 rods, each representing 10. So, $4 \times 10 = 40$.
- There are 3 units, each representing 1. So, $3 \times 1 = 3$.
Adding all of these values together: $$6,000 + 700 + 40 + 3 = 6,743$$
The number shown is $6,743$.
**Wrong output:**
**GPT-4V (Reasoning Path 2):**
The image displays a series of cubes that appear to represent numbers in
base 10 using place value.
From left to right:
1. There are six large cubes, which represent a value of
$6 \times 1000 = 6000$.
2. There are three medium-sized cubes, which represent a value of
$3 \times 100 = 300$.
3. There are two small cubes, which represent a value of
$2 \times 10 = 20$.
4. The vertical stack of single unit cubes shows nine cubes, which
represent a value of $9 \times 1 = 9$.
Adding up these values: $$6000 + 300 + 20 + 9 = 6329$$
The number shown is $6329$.
**Wrong output:**
**GPT-4V (Reasoning Path 3):**
The image displays a series of cubes representing numbers in base 10
using place value.
From left to right:
1. There are seven large cubes, which represent a value of
$7 \times 1000 = 7000$.
2. There are two medium-sized cubes, which represent a value of
$2 \times 100 = 200$.
3. There are nine small cubes, which represent a value of
$9 \times 10 = 90$.
4. The vertical stack of single unit cubes shows eight cubes, which
represent a value of $8 \times 1 = 8$.
Adding up these values: $$7000 + 200 + 90 + 8 = 7298$$
The number shown is $7298$.
**Wrong output:**
**ID:** 551
**Question:** What is the difference between the highest tattoos in male
and the least in female?
**Ground truth:**
**Solution:** 27% (male in 2015) - 13% (female in 2008) = 14%
**GPT-4V (Reasoning Path 1):**
The highest percentage of tattoos in males is 31% (in 2015), and the
least percentage of tattoos in females is 15% (in 2008).
Difference = 31% - 15% = 16%
**Wrong output:**
**GPT-4V (Reasoning Path 2):**
The highest percentage of males with at least one tattoo is in 2015 at
31%.
The lowest percentage of females with at least one tattoo is in 2003,
2008, and 2012, all at 15%.
The difference between these values is: 31% - 15% = 16%
**Wrong output:**
**GPT-4V (Reasoning Path 3):**
To determine the difference between the highest tattoos in males and the
least in females:
Highest tattoos in males: 31% (in 2015)
Least tattoos in females: 15% (in 2003 and 2008)
Difference: 31% - 15% = 16%
**Wrong output:**
**Prompt:** What is the percentage of tattoos in females in 2008?
**GPT-4V:** .
**Prompt:** What is the percentage of tattoos in males in 2015?
**GPT-4V:**
**ID:** 965
**Question:** What’s the lowest number yard line that you can see?
**Ground truth:**
**GPT-4V (Reasoning Path 1):**
The lowest number yard line visible in the image is .
**GPT-4V (Reasoning Path 2):**
From the image provided, the lowest number yard line that is clearly
visible is .
**GPT-4V (Reasoning Path 3):**
From the image provided, the lowest number yard line that can be seen is
.
**GPT-4V (Reasoning Path 4):**
From the image, the lowest number yard line that is visible is .
**GPT-4V (Reasoning Path 5):**
From the image, the lowest number yard line that is visible is .
**Prompt:** List the yard line numbers that are visible in the image.
**GPT-4V:**
**Prompt:** Can you see the yard numbers 40 and 50 in the image?
**GPT-4V:**
## Abstaining from Responding [sec:gpt4v_abstain]
We observed that both Multimodal Bard and GPT-4V often abstain from
responding. Specifically, Multimodal Bard returns “Sorry, I can’t answer
this question” (7.0%) or “I can’t process this file” (0.8%), while
GPT-4V responds with “Sorry, I cannot help with that.” (7.0%). As
illustrated in Figures
[fig:example_699] through
[fig:example_more_safety],
these refusals can be attributed to what has been characterized in large
language models as “exaggerated safety behaviors”
[rottger2023xstest](None). Further research is necessary to
explore these behaviors in large multimodal models.
**ID:** 699
**Question:** What is the age gap between the center and the rightmost
person? (unit: years)
**Ground truth:**
**Multimodal Bard:**
**GPT-4V:**
**ID:** 740
**Question:** What percent of the stands are full?
**Choices:** (A) 15 (B) 100 (C) 50
**Ground truth:**
**Multimodal Bard:**
**GPT-4V:**
**ID:** 779
**Question:** Is the cardiac silhouette less than half the diameter of
the diaphragm?
**Choices:** (A) Yes (B) No
**Ground truth:**
**Multimodal Bard:**
**GPT-4V:**
**ID:** 821
**Question:** Are most of the people young men?
**Choices:** (A) Yes (B) No
**Ground truth:**
**Multimodal Bard:**
**GPT-4V:**
**ID:** 352
**Question:** How many people will fit in the smaller vehicle?
**Ground truth:**
**ID:** 516
**Question:** Are most the stepping stones square?
**Choices:** (A) Yes (B) No
**Ground truth:**
**GPT-4V:**
## GPT-4V for Multi-Turn Human-AI Interaction [app:gpt4v_chatbot]
This section investigates the use of GPT-4V for multi-turn human-AI
interaction on MathVista, as exemplified in the goal-directed dialog in Figure
1.
We found that GPT-4V is effective in engaging multi-turn goal-directed
conversations with users. In particular, GPT-4V can make good use of
*hints* (*e.g.,* user feedback or responses) to guide the conversation to
generate desirable results. For instance, it can (1) rectify visual
perception errors based on hints (Figure
2), (2) reassess
reasoning steps and calculations (Figure
3), (3) correct
misinformation using user-provided domain-specific knowledge (Figure
4), and (4) aggregate
intricate contexts over multiple turns in a human-AI conversation
(Figures
5 and
6).
We also observed failure cases in our evaluation. For instance, GPT-4V
struggles to generate correct responses when questions and user hints
are ambiguous (Figure 7), or when the model fails to
understand abstract shapes and concepts visually (Figure
8). These failures
motivate the development of more powerful, conversational foundation
models.
Figure captions for the multi-turn interaction examples referenced above:
- A multi-turn dialog between a human user and GPT-4V.
- GPT-4V initially produces an incorrect solution due to errors in
visually perceiving the number of baseballs. After receiving user
feedback, it recalibrates and generates the accurate answer.
- GPT-4V correctly identifies the sequence of numbers in the images but
struggles to discern the underlying pattern. With feedback from the
user, GPT-4V re-evaluates the pattern and produces the correct response
to the logical reasoning question.
- Initially, GPT-4V applies an incorrect theorem to address the
scientific reasoning query. However, after receiving the appropriate
domain knowledge hint from the user, it utilizes the correct theorem,
combined with key visual cues from the images, to produce the right
answer.
- Answering this question requires comprehending a complex table
regarding AI models (such as identifying relevant rows and columns),
up-to-date knowledge of AI research, and the ability to perform advanced
arithmetic reasoning. Without a human in the loop, GPT-4V fails to
produce the correct answer.
- (Continued) In this multi-turn human-AI conversation, the user acts as
a tutor to help GPT-4V generate the correct answer. (1) Initially, the
user directs GPT-4V to the “Model Size” column after observing its
misdirection. Despite the guidance, GPT-4V still struggles. (2)
Recognizing the model’s difficulty in identifying the “Model Size”
column, the user provides more detailed instructions. This time, GPT-4V
is on the right track but still makes some errors in its response. (3)
With further information from the user, GPT-4V corrects the remaining
errors and generates the correct answer.
- In this example, even after the user offers helpful hints, GPT-4V
struggles to decipher the number of triangles in the given geometric
shape. The task is challenging for AI models, including GPT-4V, which
have a limited ability to understand abstract shapes visually.
## Future Directions [sec:gpt4v_future]
ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots
2022-09-16
Yu-Chung Hsiao, Fedir Zubach, Maria Wang, Jindong Chen
We present a new task and dataset, ScreenQA, for screen content understanding via question answering. The existing screen datasets are focused either on structure and component-level understanding, or on a much higher-level composite task such as navigation and task completion. We attempt to bridge the gap between these two by annotating 86K question-answer pairs over the RICO dataset in hope to benchmark the screen reading comprehension capacity.
# Introduction
Mobile app screenshots have been analyzed using machine learning from
multiple aspects. These analyses range from pixel level understanding,
e.g., layout structural analyses, UI issue detection and
correction [liLearningDenoiseRaw2022](https://doi.org/10.1145/3491102.3502042), to UI element
semantics, e.g., icon recognition, button action
prediction [sunkaraBetterSemanticUnderstanding2022](None), to
even higher-level functional analyses such as accessibility
support [liWidgetCaptioningGenerating2020](https://doi.org/10.18653/v1/2020.emnlp-main.443), screen
description [wangScreen2WordsAutomaticMobile2021](https://doi.org/10.1145/3472749.3474765), and
screen type classification [dekaRicoMobileApp2017](https://doi.org/10.1145/3126594.3126651).
Comparatively, the content understanding aspect is relatively
understudied. By content, we mean the information displayed on the
screen to convey and satisfy the purpose of using the app. Examples
include star ratings from restaurant reviews, messages from chat apps,
cuisine ingredients from recipe apps, flight status and in-flight
amenities from travel planner apps, etc. Having this capacity of
understanding is important for two reasons: First, the sole reason for
many apps and app categories to exist is to satisfy users’ information
need, e.g., weather, map navigation, and news apps. Second, for task
completion[^1], which requires the eyes-free agent capacity, the two
types of screen understandings — content and action understanding — are
inseparable in order to carry out a task successfully. Without knowing a
screen state properly, a machine learning agent is unable to self-assess
if the action is performed as expected, or unable to provide sufficient
feedback to the user to achieve true eyes-free user experience. More
intrinsically, from a pure research perspective, we are interested in
knowing the limit of machine screen content understanding[^2] and what
constitutes the challenges, given that app screenshots are entirely
human artifacts made for convenient comprehension.
Accordingly, we annotated the RICO
dataset [dekaRicoMobileApp2017](https://doi.org/10.1145/3126594.3126651) with 86,025
question-answer pairs, referred to as Screen Question Answering, or, in
short, **ScreenQA** annotations later in this work, and released the
dataset in the public domain[^3]. The ScreenQA task requires an agent to
answer a user’s question by selecting one or multiple UI elements from
the given screenshot, as will be formulated in
Section [sec:problem_setting]. Question
answering is employed as a touchstone to sparsely[^4] verify the quality
of screen content understanding. To the best of our knowledge, this is
the first large-scale question answering dataset over mobile app
screenshots, and the first one to be publicly available. Much inspired
by the SQuAD dataset [rajpurkarSQuAD1000002016a](https://doi.org/10.18653/v1/D16-1264), we
hope, by releasing this set of annotations, to encourage the community
to advance technologies toward better screen content understanding. We
anticipate that the advance of such technologies will benefit beyond
just the screen UI and the human computer interaction (HCI) domains. As
we will discuss in
Section [sec:related_work], other
vision-language related multimodal domains share similar challenges with
different emphases on respective modalities and contexts. Comparatively,
ScreenQA is language and layout heavy, but it also includes visual
ingredients such as icons and symbols as concise representations in
place of texts, to declutter the UI. It may also include images or art
designs that pose challenges to language centric machine learning
agents.
The remaining paper is organized in the following way:
Section [sec:problem_setting]
formulates the problem, including the problem description and the
evaluation metrics. We discuss relevant prior datasets and annotations
in Section [sec:related_work] to put this
work into perspective.
Section [sec:annotation_method]
describes our annotation method. The annotations are then analyzed in
Section [sec:annotation_analysis]
to provide readers both the qualitative and quantitative views. The
paper is concluded in
Section [conclusion] with a summary and a remark
on future works.
[^1]: Also referred to as automation or app control.
[^2]: This term is analogous to machine reading comprehension from
natural language processing.
[^3]: ScreenQA dataset is released at
.
[^4]: Because questions are not exhaustively asked against a given
screenshot.
# Problem Setting [sec:problem_setting]
Figure captions: (1) Ambiguous UI element boundaries; three possibilities are
annotated. (2) Answers to “What’s the temperature on Saturday?” (3) Semantic
groups are the basic unit for ordering, not element coordinates.
We state the problem and define the evaluation metrics in this section.
## Problem statement [sec:problem_statement]
The ScreenQA task requires an agent to answer a user’s question by
selecting relevant UI elements from a given single screenshot. When
multiple UI elements are relevant, a list of such UI elements
whose contents *minimally* satisfy the question should be selected and
ranked in descending order of relevance to the question, if applicable,
or following the common reading order by semantic groups, as will be
described in
Section 1.2. This assumes
that answers are directly selectable from the screen and that logical
reasoning and calculation are not needed. If the screenshot does not
contain the answer to the question, the agent should return an empty answer list. This is summarized in
Task [task:sreenqa].
**Input:** a question $Q$ and a screenshot $S$
**Output:** an answer list $A$ of UI elements selected from $S$ such
that their contents minimally satisfy $Q$. The order of $A$ is further
required to be
- Ranked in descending order of relevance to $Q$, if applicable.
- Otherwise, following the common reading order by semantic groups.
If no contents in $S$ can satisfy $Q$, then returns an empty list $A$.
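
To make the input/output contract concrete, a minimal sketch of the task interface as Python data structures is shown below; the class and field names are illustrative assumptions, not part of the ScreenQA release.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class UIElement:
    """A UI element selected from the screenshot (hypothetical structure)."""
    text: str                          # text content of the element
    bbox: Tuple[int, int, int, int]    # (left, top, right, bottom) in pixels


@dataclass
class ScreenQAExample:
    question: str                      # the question Q
    screenshot_path: str               # the screenshot S
    answer: List[UIElement]            # ordered answer list A; empty if unanswerable
```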
## Properties and terminologies [sec:properties_and_terminologies]
The mobile app UI comes with some nuances. It is worth mentioning a few
properties below.
- View hierarchy, or the structural representation used to render the
screen, is not required in
Task [task:sreenqa], to be consistent
with the human annotation process in
Section [sec:annotation_method].
View hierarchy usually provides useful UI element candidates, but it
may not always be reliable, for example, when using WebView or
screen overlays. In such cases, a human annotator can still answer
screen questions entirely from pixels without an issue, so we want
to benchmark similarly. We leave the choice of dependency on view
hierarchies to the modelers and, hence, do not require it. However,
this comes with an ambiguity for UI element boundaries. See an
example in
Figure 1. We devise a more flexible
answer matching process to mitigate such an impact, as will be
discussed in
Section 1.3.3.
- Avoid question answering over long paragraphs. Although it is
permissive by Task [task:sreenqa], we discourage
annotators from asking such questions during the annotation process.
For ScreenQA, we want to focus on learning the relationships between
text segments arranged two-dimensionally on the screen, and leave
the long paragraph question answering, which investigates the
relationships between words, to the traditional NLP domain.
- Avoid logical reasoning. This task assumes answers can be directly
extracted from the screenshot without reasoning, entailment,
counting, or comparing numbers. This further excludes yes/no and why
questions whose answers are not explicitly displayed on the screen. The reason is
that we want to separate “able to read” from “able to reason” and
focus on the former first without creating an overly challenging
dataset. A few such excluded examples are: counting items, asking
about the weather a few days from now, asking which items are cheaper
than X dollars, etc.
- Ordered by relevance. The task is designed to enable the eyes-free
user experience. That is, a user may not be fully aware of how many
relevant answers are displayed on the screen. For example, in
Figure 2, when a user asks
“What’s the temperature on Saturday?”, there are actually two
temperatures, high and low, for each day and two Saturdays on the
screen. In this case, the two temperatures should just follow the
reading order, and the two Saturdays follow the relevance order as a
user usually refers to the upcoming Saturday. For a well-designed
mobile app, these two usually overlap well and we do not expect a
large ambiguity here.
- Reading order by semantic groups. Sometimes some UI elements are
designed as semantic groups and should be referred to together to
keep their semantic meaning. For example, in
Figure 3, when a user asks
“What are the first two movements of deep squat?”, then the answer
should be “Deep Squat, 3 sets, 15x”, followed by “Lunge, 3 sets,
10x”. In other words, the common reading order should be based on
semantic groups as the unit, rather than simply sorted by the
coordinates of UI elements.
Note that we set up the problem this way strategically in order to
prioritize its solvability considering the progress of current
technologies. However, practically, long vs. short texts and retrieval
vs. reasoning are naturally mixed together in the daily usage of mobile
apps. We will leave this type of composite problem to future work.
## Evaluation metrics [sec:metrics]
We consider two types of metrics: 1) Average normalized Discounted
Cumulative Gain (Average
nDCG) [jarvelinCumulatedGainbasedEvaluation2002](https://doi.org/10.1145/582415.582418), which
is commonly used in information retrieval and ranking systems, and 2)
Average F1 score, which has been employed in closed-domain question
answering problems, such as the SQuAD
dataset [rajpurkarSQuAD1000002016a](https://doi.org/10.18653/v1/D16-1264).
One major difference between our metrics described below and the
commonly used definitions is the unit of predictions. We use the element
in the answer list $A$, described in
Task [task:sreenqa], as the unit to
determine a hit or a miss for both metrics. Besides, as UI elements can
be ambiguous as mentioned in
Section 1.2, we will
describe an answer matching algorithm that mitigates such an impact in
Section 1.3.3.
### Average nDCG [sec:avg_ndcg]
We use a variant of nDCG that allows varying positions (number of
returns) as opposed to a typical fixed position. This is because, unlike
the search problem, which is fair to evaluate, say, top-10 retrieved
documents across queries, ScreenQA can have different needs of answer
lengths across different questions. For example, a question like “what
is the address of the store” expects a single returned result. A
question like “what are the login options?” expects an enumeration of
options on the screen that easily go beyond five. Accordingly, we allow
varying positions as follows: Given a 1-indexed list $A$, which is the
predicted answer for the screen-question pair $(S, Q)$, and a ground
truth answer list $A^g$ for $(S, Q)$, the Discounted Cumulative Gain at
varying positions (DCG$_v$) is computed by: $$\label{eq:dcg}
\mbox{DCG}_v = \sum^{\|A\|}_{i=1} \frac{r_i}{\log_2{(i+1)}},$$ where
$\|\cdot\|$ is the size of the list argument, $r_i$ is the relevance
score for the $i$-th item of $A$. We assign the relevance score 1 for a
hit and 0 for a miss compared with the ground truth $A^g$. The
corresponding Ideal Discounted Cumulative Gain (IDCG$_v$) is computed
by: $$\label{eq:idcg}
\mbox{IDCG}_v = \sum^{\|A^g\|}_{i=1} \frac{1}{\log_2{(i+1)}}.$$ The
nDCG$_v$ is then $$\label{eq:ndcg}
\mbox{nDCG}_v = \frac{\mbox{DCG}_v}{\mbox{IDCG}_v}.$$ Note that
nDCG$_v$ is still between 0 and 1, hence, convenient for comparing
scores and computing the average.
For a dataset of $N$ examples, each of which is indexed by $i$ and has a
predicted answer $A_i$ and $K$ ground truth annotations
$A^g_{i, j=1 \dots K}$, the average nDCG$_v$ can be computed by
$$\label{eq:avg_ndcg}
\mbox{avg}(\mbox{nDCG}_v) = \frac{1}{N}\sum_{i=1}^N \mbox{max}_j [ \mbox{nDCG}_v(A_i, A^g_{i,j} ) ].$$
We choose a variant of nDCG as the metric because 1) we want to measure the
quality of the ranking. For example, if the single relevant
result is predicted at the third position instead of the first, the discount factor brings
down the score from 1.0 to only 0.5. 2) nDCG has an orthogonal design,
which is easier to tweak toward a specific need than the mean average
precision (mAP) metric. For example, one can choose to discount faster
or slower by changing the base of the denominator $\log_2(i+1)$ and can
choose to penalize irrelevant predictions by assigning negative scores.
Mean reciprocal rank (MRR) and mAP are much less controllable in these
two aspects.
One known drawback of nDCG is that it does not naturally penalize
excessive predictions after the last relevant item. We therefore use the
average F$_1$ score as a complementary view of the agent performance.
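
The formulas above translate directly into a few lines of code. The following sketch assumes binary relevance (1 for a HIT, 0 for a MISS) against the ground truth and takes the per-example maximum over the $K$ annotations; it is an illustrative implementation under those assumptions, not the official evaluation script.

```python
import math
from typing import List


def dcg(relevances: List[int]) -> float:
    """DCG_v: sum of r_i / log2(i + 1) over 1-indexed positions i."""
    return sum(r / math.log2(i + 1) for i, r in enumerate(relevances, start=1))


def ndcg_v(pred_hits: List[int], num_gold: int) -> float:
    """nDCG_v for one prediction.

    pred_hits: 0/1 relevance of each item in the predicted list A
               (1 for a HIT, 0 for a MISS against the ground truth A^g).
    num_gold:  |A^g|, the length of the ground-truth list.
    """
    if num_gold == 0:
        # Assumption: an empty prediction against an empty ground truth is
        # perfect; any non-empty prediction against it scores zero.
        return 1.0 if not pred_hits else 0.0
    ideal = dcg([1] * num_gold)          # IDCG_v
    return dcg(pred_hits) / ideal


def average_ndcg(all_scores: List[List[float]]) -> float:
    """avg(nDCG_v): per example, take the max over its K annotations."""
    return sum(max(scores) for scores in all_scores) / len(all_scores)
```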
### Average F$_1$ [sec:avg_f1]
Similar to the definition in SQuAD, the average F$_1$ score is computed
as below, following the same notation as in
[eq:avg_ndcg]:
$$\label{eq:avg_f1}
\mbox{avg}(\mbox{F}_1) = \frac{1}{N}\sum_{i=1}^N \mbox{max}_j [ \mbox{F}_1(A_i, A^g_{i,j} ) ].$$
Note that F$_1$ does not concern ranking. For some cases, such as
enumeration questions, this is desirable: the ranking order is merely
the reading order, so even if the item order is permuted, the answer
quality is generally not compromised, and it is reasonable to assign
the same evaluation score. On the contrary, if relevance ranking is
important, such as in
Figure 2, then nDCG provides a
better view. Since both types of questions exist in the ScreenQA
annotations, it is more complete to evaluate against both metrics. Also
note that the unit of precision and recall computation is based on items
in $A$, unlike SQuAD, which uses words as the unit instead. We describe
how to compare items in an answer $A$ with the ground truth $A^g$ in the
next section.
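
For completeness, a matching sketch of the item-level F$_1$ follows. It takes the 0/1 HIT lists for both the predicted and the ground-truth items (as produced by the answer matching procedure in the next section); how empty lists are scored is an assumption here and may differ from the official evaluation.

```python
from typing import List


def f1_items(pred_hits: List[int], gold_hits: List[int]) -> float:
    """Item-level F1 over answer items (not words, unlike SQuAD).

    pred_hits: 0/1 HIT marks for items of the predicted list A.
    gold_hits: 0/1 HIT marks for items of the ground-truth list A^g.
    """
    if not pred_hits and not gold_hits:
        return 1.0        # assumption: empty vs. empty counts as correct
    if not pred_hits or not gold_hits:
        return 0.0
    precision = sum(pred_hits) / len(pred_hits)
    recall = sum(gold_hits) / len(gold_hits)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```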
### Answer matching [sec:answer_matching]
As mentioned in
Section 1.2, the
segmentation of UI elements provided in the predicted answer list $A$
may not coincide with the UI elements in the ground truth list $A^g$.
Yet, if the overall answers are the same, the segmentation difference
should not affect the evaluation score. Therefore, we use the following
empirical procedure to mitigate such an impact, using an illustrated
example (each capitalized character is a word token): $$\begin{aligned}
A &= ["AB", "B", "BB", "CBA"] \\
A^g &= ["AB", "BC", "AB"],
\end{aligned}$$
1. Concatenate items in $A$ into a single item list
$A^c = [``ABBBBCBA"]$.
2. Iterate through each $g \in A^g$ and check if $g$ is contained in
any item in $A^c$. If so, mark the $g$ as HIT ($\cmark$) and mark
the corresponding matched word token in the original $A$ and remove
the matched part and split the remaining parts in $A^c$. Otherwise,
mark the $g$ as MISS ($\xmark$). In this example, when $g = "AB"$,
it is a HIT: $$\begin{aligned}
A &= [``A_\cmark B_\cmark", ``B", ``BB", ``CBA"] \\
A^c &= [``BBBCBA"] \\
A^g &= [``AB"_\cmark, ``BC", ``AB"].
\end{aligned}$$ Then when $g = ``BC"$, it is a HIT. Note that the
item in $A^c$ is split into two because of matching in the middle:
$$\begin{aligned}
A &= [``A_\cmark B_\cmark", ``B", ``BB_\cmark", ``C_\cmark BA"] \\
A^c &= [``BB", ``BA"] \\
A^g &= [``AB"_\cmark, ``BC"_\cmark, ``AB"].
\end{aligned}$$ Last, when $g = ``AB"$ again, it is a MISS, $A$ and
$A^c$ unchanged, hence, omitted: $$\begin{aligned}
A^g &= [``AB"_\cmark, ``BC"_\cmark, ``AB"_\xmark].
\end{aligned}$$
3. Finally, iterate through each $a \in A$. If any $a$ has at least one
word token marked as HIT, then the whole $a$ is a HIT, otherwise, a
MISS. $$\begin{aligned}
A &= [``AB"_\cmark, ``B"_\xmark, ``BB"_\cmark, ``CBA"_\cmark].
\end{aligned}$$
This procedure converts $A$ and $A^g$ into lists of HITs and MISSes.
Then the evaluation metrics
in [eq:avg_ndcg]
and [eq:avg_f1] can be applied. Note that
this procedure is not order invariant. This in turn makes the F$_1$
score not entirely order independent if any UI element ambiguity
happens. This choice is to avoid the permutation complexity in
evaluation. In practice, this is rarely an issue because when the
ambiguity happens, the UI elements involved are almost always tightly
close to each other, making their order practically fixed. See Case 3 in
Figure 1 as an example.
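The three-step procedure above can be sketched at the word-token level as follows; `match_answers` and its greedy left-to-right matching strategy are an illustrative re-implementation, not the released evaluation code.

```python
from typing import List, Tuple


def match_answers(pred: List[List[str]], gold: List[List[str]]) -> Tuple[List[int], List[int]]:
    """Greedy, token-level sketch of the answer matching procedure.

    pred / gold are lists of answer items, each item a list of word tokens.
    Returns 0/1 HIT lists for the predicted items and the gold items.
    """
    # Step 1: concatenate predicted items into one token sequence, remembering
    # which predicted item each token came from.
    tokens = [tok for item in pred for tok in item]
    owners = [i for i, item in enumerate(pred) for _ in item]
    segments = [(0, len(tokens))]          # still-unmatched spans of A^c
    pred_token_hit = [False] * len(tokens)
    gold_hits = []

    # Step 2: greedily look for each gold item as a contiguous token run.
    for g in gold:
        matched = False
        for si, (lo, hi) in enumerate(segments):
            for start in range(lo, hi - len(g) + 1):
                if tokens[start:start + len(g)] == g:
                    for k in range(start, start + len(g)):
                        pred_token_hit[k] = True
                    # Remove the matched part and split the remaining segment.
                    new_segments = []
                    if lo < start:
                        new_segments.append((lo, start))
                    if start + len(g) < hi:
                        new_segments.append((start + len(g), hi))
                    segments[si:si + 1] = new_segments
                    matched = True
                    break
            if matched:
                break
        gold_hits.append(1 if matched else 0)

    # Step 3: a predicted item is a HIT if any of its tokens was matched.
    pred_hits = [0] * len(pred)
    for tok_idx, hit in enumerate(pred_token_hit):
        if hit:
            pred_hits[owners[tok_idx]] = 1
    return pred_hits, gold_hits
```

On the worked example above, this yields `pred_hits = [1, 0, 1, 1]` and `gold_hits = [1, 1, 0]`, matching the HIT/MISS markings shown in the procedure.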
# Related Datasets and Annotations [sec:related_work]
ScreenQA has two aspects: multimodality and question answering. We
discuss related problems and datasets from these two aspects and focus
our survey on datasets that are 1) human annotated and 2) released to
the public domain.
## Multimodality
Mobile app screenshots contain nearly all possible representations of
information through pixels. Most commonly, the information is conveyed mainly by
text, blended with icons, symbols, and images.[^1] We discuss three
related multimodal domains.
### Screen UI for mobile apps
For data released in the public domain, the RICO
dataset [dekaRicoMobileApp2017](https://doi.org/10.1145/3126594.3126651) is, to the best of our
knowledge, still the largest collection of mobile app
screenshots [dekaEarlyRicoRetrospective2021](https://doi.org/10.1007/978-3-030-82681-9_8). It
contains 66k unique screenshots and their corresponding view hierarchies
from 9.7k Android apps spanning 27 app categories. Its overall approach
extended ERICA [dekaERICAInteractionMining2016](https://doi.org/10.1145/2984511.2984581), which is
an interactive trace recording tool and also released 3k traces for 18k
unique screenshots from 1k Android apps for the search intent.
LabelDroid [chenUnblindYourApps2020](https://doi.org/10.1145/3377811.3380327)
and [chenWireframebasedUIDesign2020](https://doi.org/10.1145/3391613) by the same authors
released a dataset of 55k UI screenshots from 25 categories of 7.7k
top-downloaded Android apps.
Annotations and the corresponding problems can be roughly categorized by
the scope of the contexts. At the UI element
level, [sunkaraBetterSemanticUnderstanding2022](None)
annotated 77 icon types by shape, 15 out of which are additionally
annotated with 38 semantic types, reaching a total of about 500k unique
annotations. This work is further concerned with how UI elements are
associated with companion labels such that the screen understanding
between UI elements can be established.
CLAY [liLearningDenoiseRaw2022](https://doi.org/10.1145/3491102.3502042) attempted to resolve the
layout and view hierarchy denoising problem, annotating 60k RICO
screenshots, a total of 1.4M UI elements with bounding boxes and types.
[liWidgetCaptioningGenerating2020](https://doi.org/10.18653/v1/2020.emnlp-main.443) annotated 163k
free-form descriptions for 61k UI elements from 22k RICO screenshots. At
the single-screen
level, [wangScreen2WordsAutomaticMobile2021](https://doi.org/10.1145/3472749.3474765) collected
text summarizations for screens, consisting of 112k screen descriptions
across 22k RICO screenshots.
At the multi-screen level, one challenging direction is screen
navigation, which requires the understanding of screen states, feasible
action spaces of the current screen, and overall task goals. Since
multiple types of understandings are involved, this problem is not
strictly focused on screen content understanding.
PixelHelp [liMappingNaturalLanguage2020b](https://doi.org/10.18653/v1/2020.acl-main.729) contains 187
multi-step instructions over 780 screenshots for four task categories.
MoTIF [burnsDatasetInteractiveVisionLanguage2022](https://doi.org/10.48550/arXiv.2202.02312)
contains 6k fine-grained instructions mixed with infeasible ones, over
125 apps spanning 15 app categories. From the data perspective,
annotating this type of problem is labor intensive and usually does not
scale well.
In comparison, the ScreenQA dataset is single-screen, focused on screen
contents, and based on the RICO screenshots.
### Document image understanding
Document image understanding[^2] concerns understanding documents
represented in pixels or scanned, photographed formats. This domain is
similar to mobile app screens for its text-heavy and non-sequential
nature. The most noticeable dataset is
RVL-CDIP [harleyEvaluationDeepConvolutional2015](https://doi.org/10.1109/ICDAR.2015.7333910),
a 400k-image subset from
IIT-CDIP [lewisBuildingTestCollection2006](https://doi.org/10.1145/1148170.1148307), a collection
of low-resolution noisy documents, with balanced 16 document-level
classes. FUNSD [jaumeFUNSDDatasetForm2019](https://arxiv.org/pdf/1905.13538) extracted
199 scanned form images from RVL-CDIP and annotated them with bounding
boxes and 4 text-segment-level classes.
SROIE [huangICDAR2019CompetitionScanned2019](https://doi.org/10.1109/ICDAR.2019.00244) has 1k
scanned receipt images for text localization, OCR, and key information
extraction of 4 entity types.
CORD [parkCORDConsolidatedReceipt2019](None) contains 11k
scanned receipt images, annotated with 9 classes and 54 subclasses for
text segments in OCR boxes. These earlier works are more about
classification for text segments or for the whole document image.
A more recent work, DocVQA [mathewDocVQADatasetVQA2021](https://doi.org/10.1109/WACV48630.2021.00225),
uses a question answering format for span/segment extraction, with an
annotation of 50k questions over 12k rectified, higher resolution
document images. DocVQA is highly related to ScreenQA for its 2D
arrangement of texts and for its extractive question answering format.
We believe that the techniques developed for screens and document images
are cross applicable.
### Visual question answering
Visual question
answering (VQA) [antolVQAVisualQuestion2015](https://doi.org/10.1109/ICCV.2015.279) and screen
UI are oftentimes mentioned together, especially in the latter
community, because of their vision-language multimodal nature. However,
VQA is distinctively different from screen understanding for two
reasons: 1) The visual context for VQA is usually light in, or even free
from, any text, while screen UI is the opposite, and 2) The images for
VQA are typically photos of natural or daily scenes with objects, while
screen UIs are information oriented and arranged in a certain visual
structure. There are some VQA variants comparatively closer to screen
UI, to mention a few: VQA for texts on objects in photos, e.g.,
VizWiz [gurariVizWizGrandChallenge2018](https://doi.org/10.1109/CVPR.2018.00380) and
TextVQA [singhVQAModelsThat2019](https://doi.org/10.1109/CVPR.2019.00851), and VQA for figures and
charts, e.g., DVQA [kafleDVQAUnderstandingData2018](https://doi.org/10.1109/CVPR.2018.00592),
FigureQA [kahouFigureQAAnnotatedFigure2018](None), and
LEAF-QA [chaudhryLEAFQALocateEncode2020](https://doi.org/10.1109/WACV45572.2020.9093269). These VQA tasks
may appear as part of screens but in general are different problems.
## Question answering
Question answering tasks can be categorized by 1) open- or
closed-domain, 2) answer formats and 3) main capacities to evaluate.[^3]
The common answer formats include
span [rajpurkarSQuAD1000002016a](https://doi.org/10.18653/v1/D16-1264),
entity [talmorWebKnowledgeBaseAnswering2018](https://doi.org/10.18653/v1/N18-1059), multiple
choice [mihaylovCanSuitArmor2018](https://doi.org/10.18653/v1/D18-1260), and
generation [xiongTWEETQASocialMedia2019](https://doi.org/10.18653/v1/P19-1496). The capacities
to evaluate range from reading
comprehension [yangWikiQAChallengeDataset2015](https://doi.org/10.18653/v1/D15-1237), multi-hop
reasoning [yangHotpotQADatasetDiverse2018](https://doi.org/10.18653/v1/D18-1259), [chenFinQADatasetNumerical2021](https://doi.org/10.18653/v1/2021.emnlp-main.300),
logic reasoning [yuReClorReadingComprehension2020](None), and
commonsense
reasoning [talmorCommonsenseQAQuestionAnswering2019](https://doi.org/10.18653/v1/N19-1421).
From this question answering perspective, ScreenQA is a closed-domain
question answering task that expects answers by span (or UI element
phrase) selection for screen reading comprehension. As described in
Section [sec:problem_setting], we
instructed the data annotators to avoid multi-hop, mathematical
counting, and logic reasoning, in order to focus on the fundamental
screen comprehension capacity.
[^1]: Also videos, if we consider consecutive screenshots. We leave out
the video modality here in the context of annotating the underlying
RICO screenshots.
[^2]: Also referred to as document analysis and recognition (DAR) or
simply document understanding.
[^3]: Here we only include one or two examples per format and per
capacity. This is by no means to be comprehensive.
# Annotation Method [sec:annotation_method]
ScreenQA annotation process.
We perform several steps to collect the ScreenQA annotations, as
depicted in Figure 1. Each step is described below.
Figure captions: (a) In-sync VH with occluded UI elements. (b) In-sync VH for
main content, with ghosting VH from the hamburger menu. (c) Out-of-sync VH for
main content, though the top bar VH is in-sync.
## Pre-filtering [sec:pre-filtering]
The pre-filtering stage filters out 1) screenshots from non-English
apps[^1], and 2) screenshots whose view hierarchies (VHs) are out of
sync with the main contents. It is a known issue that in the RICO
dataset, some screenshots and their corresponding view hierarchies are
not perfectly in sync: there exists certain time difference between view
hierarchy extraction and screenshot capturing. We want to remove those
screenshots to ensure that all ScreenQA annotations are not subject to
such data noises.
Classifying the sync quality is tricky, even for human readers. One may
not be able to differentiate between occlusion, ghosting, and the actual
out-of-sync. See Figure [fig:vh-sync] for examples.
Accordingly, we instructed the annotators to focus on the main content
area of the screen and make sure the bounding boxes in that area are not
corrupted, as this is where most contents of interest and questions come
from.
We use 27 annotators to perform this step. Among RICO’s 66k unique
screenshots, about 11k screenshots are from non-English apps, and
about 13k screenshots have out-of-sync view hierarchies.[^2] With the
union of these two filtered out, there remains about 51k screenshots
from English apps with in-sync VHs.
## Question annotations [sec:question-annotation]
For question annotation, we asked the annotators to frame questions
given a screenshot as the context. The annotators were expected to
compose 1) natural, daily-life questions as if they were using the app; 2) the
composed questions should ask for information that can be directly read off
the screen; and 3) the questions should not require logical reasoning, counting,
calculation, or mathematical comparison. We further required the
annotators 4) not to ask questions about any advertisement on the
screen.
The annotation UI is depicted in
Appendix [appendix:question_annotation_ui].
We asked the annotators to compose up to five questions given a
screenshot in the first pass. In the second pass, we asked for up to
three questions given a screenshot and the questions previously
composed. Each pass involved one annotator for each screenshot and
whoever annotated the screenshot before is excluded from being assigned
to the same screenshot. This ensures that every screenshot is assigned
precisely two annotators to compose questions. We chose this sequential
process 1) to avoid tricky deduplication of similar questions, and 2) to
encourage annotators to diversify their questions. Note that the same
set of annotators was involved in both passes, so each
annotator had an opportunity to develop their own question style in the
first pass before seeing others’ in the second pass. This ensures
that a certain diversity of question styles is preserved in the dataset
before they converge to each other in repeated passes.
We again involved the 27 annotators. The first pass of question
annotation generated 46k questions. The second pass added additional 36k
questions. These amount to a total of 82k questions, leaving about 15k
screenshots with no questions annotated, due to a lack of interesting
contents.
## Answer annotations [sec:answer-annotation]
We used the total of 82k questions over 35k screenshots from the previous
two-pass question annotation step to further annotate the corresponding
answers. The annotator who composed a question is excluded from
annotating its answer to avoid potential biases. The answer
annotation UI is shown in
Appendix [appendix:answer_annotation_ui].
Given an example, which contains a screenshot and a question, the
annotators are tasked to
1. Fix any grammatical errors or typos of the given question without
altering its intention.
2. Answer the question, based on the context of the given screenshot,
by 1) selecting bounding boxes from the underlying view hierarchy
leaf nodes that contain the relevant answers, or drawing bounding
boxes if no suitable leaf nodes can be used, and 2) ranking the
answers in descending order of relevance if applicable, or by the
common reading order.
3. Additionally also provide a full-sentence answer to the question.
4. Consider two exceptions: 1) The question may be incomprehensible or
2) the screenshot does not contain the answer to the question, due
to the questioner’s lack of understanding of the app. Then the
example should be marked as “invalid question” and “not answerable
from the screenshot”, respectively.
5. One answer is annotated for the train split, and three for the
validation and the test splits. This is to improve the evaluation
quality. The data split details will be described in
Section 1.5.
The “invalid question” annotations are then filtered out, and the
questions that have no other answer annotations are excluded from the
overall ScreenQA dataset, as they are considered incorrectly annotated
during the question annotation phase.
## Not-answerable question annotations [sec:not-answerable-question-annotation]
Figure caption: Chart showing the fraction of questions with answers and
not answerable. Note that validation and test splits on average have
roughly 3 answers per question, so there are cases when some annotators
considered a question to be not answerable, while others provided an
answer to that same question. Specifically, the validation and the test
splits have 2.18% and 3.23% of such questions (the segments in red).
Example not-answerable questions: ‘What is the name of the application?’,
‘What is the contact email for tech support?’, ‘What is the date of version 1.3.1?’
The questions marked as “not answerable from the screenshot” represent a
special category of questions that check model
overtriggering (attempting to answer those which are not supposed to be
answered). Being able to come to a conclusion that the answer is not
present on the screen is an important aspect of screen understanding.
Note that it is possible that one annotator considered a question to be
not answerable, and another provided an answer to that same question.
As described in
Section 1.2, the first two passes
of question annotations aimed to compose questions that can be answered
from the screen, so as expected, the fraction of not answerable
questions was small. We then had a third pass of question annotation to
raise this fraction to nearly 10%, see
Figure 5. For this, we used nearly 5k
screenshots selected randomly from those where there were no such
questions yet. In this pass, we asked annotators for exactly one
additional question per screenshot that had some relation to the
information there, but could not be answered. See examples in
Figure [fig:no-answer-examples].
Answer annotation was not used for these 5k questions.
## Dataset statistics [sec:dataset-statistics]
The ScreenQA dataset contains 35,352 screenshots and 86,025 questions.
It is split into train, validation and test sets in approximately
80-10-10 ratio, see
Table 1. Note that all questions for
the same screenshot belong to only one split.
[^1]: This is different from “non-English screenshots”, as translation
and dictionary apps could pose confusion.
[^2]: This out-of-sync number is different
from [liMappingNaturalLanguage2020a](https://doi.org/10.18653/v1/2020.acl-main.729) because we focus
on the main content area.
# Annotation Analysis [sec:annotation_analysis]
We analyze the annotations of questions and answers in this section.
| Category | % | Example | Example |
|:---|---:|:---|:---|
| UI selection & config | 18.1 | Which option is selected? | What is the selected ringtone? |
| Quantity number | 11.7 | How many unread messages? | How many pictures are there in Western Europe? |
| App name | 10.4 | What is the name of the application? | What is the app name? |
| Date time | 9.4 | When was “Heal the Living” released? | When is happy hour? |
| Price | 3.4 | How much is the gift bonus in 3rd place? | What is the price? |
| Name of item | 3.3 | What is the name of the drug? | What is the name of chef? |
| User name | 2.8 | What is the name of the user? | What is the username on telegram? |
| Duration | 2.5 | What is the duration of video? | How long is the song? |
| Enum. of avail. options | 2.5 | Which social media options are given there? | What are the options available for logging in? |
| Address and direction | 2.4 | What is the current location? | What is the service zip code? |
| Email address | 2.4 | What is an email address? | What is customer service email? |
| Person’s name | 2.1 | Who sang the song? | What is the last name? |
| Signup/login | 1.6 | Which application can be used to sign up / login? | What are the alternative choices for signing up? |
| Version information | 1.6 | What is the version number? | What is the new feature in version v3.1.3? |
| Weather | 1.5 | What is the range of temperature shown on Sunday? | What is the weather forecast for Sunday? |
| Score & value | 1.4 | What is height/weight of the person? | What is the score? |
| Yes/No | 1.1 | Is there any travel plans? | Is there any favorite? |
| Phone number | 1.0 | What is the phone number? | What is the prefix for the international mobile number? |
| \# of Stars | 0.8 | What is the star rating? | How many stars are given to the product? |
| Share/sharing | 0.8 | Which application can be used to share? | Where can I share this application? |
| Age | 0.8 | How old is ...? | What is the age? |
| Percentage | 0.7 | What is the percentage of ... ? | What is the brightness percentage for foreground? |
| Settings | 0.6 | What is the setting of ... ? | Which settings are switched on? |
| Quantity amount | 0.6 | How much fat is there? | What is the amount? |
| Permission | 0.5 | Which application is asking for permissions? | What permissions are required for MyCarTracks? |
| \# of Likes | 0.5 | How many likes for ... ? | How many likes does ... get? |
| Country | 0.5 | What is the name of the country? | Which country has the +54 code? |
| Distance | 0.5 | What is the visibility distance? | How far is it from ... ? |
| \# of Reviews | 0.4 | What is the number of comments on ... ? | How many comments? |
| Website | 0.3 | What is the url? | What’s the website address? |
| Gender | 0.3 | What is the gender? | Which gender is displayed on the screen? |
| How to | 0.3 | How to start on boot? | How to pronounce his name? |
| Currency | 0.3 | What is the currency? | What is the currency for the price? |
| Unit of measurement | 0.2 | What is the unit of temperature? | What is the unit of weight and length? |
| Language | 0.1 | Which language is used in the setting? | Which language is being translated into which language? |
| Color | 0.0 | What is the UI color? | What is the amount of green color? |
| Others | 12.8 | What’s the average speed? | What is the user’s middle initial? |
| | | What is the spending limit? | Which team has 41 points? |
| Total | 100.0 | | |
Figure captions: (1) Number of composed questions per screenshot. (2) Number of bounding boxes used to answer the question.
## Question analysis [sec:question-analysis]
We collected overall 86k questions over 35k unique screenshots from
RICO. Among the 86k questions, there are 47.5k unique questions.[^1]
Some screenshots receive more questions because they usually contain
more information to be asked about. Yet, the histogram still exhibits a
reasonable exponential decay with a mild slope, as depicted in
Figure 1.
To further understand what questions have been asked, we categorize the
questions using regular expressions based on a list of empirically
determined question categories. The categories are meant to provide a
rough overview of the question annotations and by no means to provide a
precise categorization. The distribution and examples by these
categories are tabulated in
Table [tab:q_cate]. Note that the questions
were not composed at the annotators’ full discretion: They are
conditioned on the given screenshots. That is to say, the distribution
is implicitly influenced by the RICO crawling process. For example, as
RICO crawled screen traces from freshly installed apps and did not log in
to an account, a noticeable number of the screen traces end at a login
page. This in turn translates to a higher percentage of questions asked
about app names, email addresses, permissions to login, etc.
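
As a rough illustration of such regex-based bucketing, the sketch below uses a handful of hypothetical patterns and category names; the actual category list and regular expressions behind Table [tab:q_cate] are not published.

```python
import re

# Hypothetical patterns; they only illustrate the general idea.
QUESTION_CATEGORIES = [
    ("App name", re.compile(r"\bname of the (app|application)\b|\bapp name\b", re.I)),
    ("Quantity number", re.compile(r"^how many\b", re.I)),
    ("Date time", re.compile(r"^when\b|\bdate\b|\btime\b", re.I)),
    ("Price", re.compile(r"^how much\b|\bprice\b|\bcost\b", re.I)),
    ("Email address", re.compile(r"\be-?mail\b", re.I)),
]


def categorize(question: str) -> str:
    """Return the first matching category, or 'Others' if none match."""
    for name, pattern in QUESTION_CATEGORIES:
        if pattern.search(question):
            return name
    return "Others"


print(categorize("What is the name of the application?"))  # -> App name
print(categorize("How many unread messages?"))             # -> Quantity number
```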
## Answer analysis [sec:answer-analysis]
We analyze the answer annotations in two aspects: 1) How often do we
need more than one bounding box and its text to answer the question,
and 2) How often do human annotators find the view hierarchy useful to
provide a minimal answer to the question.
Figure 2 illustrates the histogram of number of
bounding boxes used in each answer. About 84% of answers contain a
single bounding box. Among these single-bounding-box answers, 51% use a
VH leaf node directly, while 49% use a manually drawn bounding box. If
we consider all answers together, excluding cases when there is no
answer, 51% still use VH leaf nodes entirely, while 48% use manually
drawn bounding boxes. That is, for about half of the total number of
screenshots, human annotators preferred to manually draw the bounding
boxes in order to provide answers that minimally satisfy the question.
This observation reflects the necessity not to require the view
hierarchy input for ScreenQA as described in
Task [task:sreenqa].
Interestingly, there exist some cases, about 0.8% of the questions, in which
the human annotators used a mixture of VH leaf nodes and manually drawn
bounding boxes as their full answer. By inspecting those cases, we found
that these usually happen 1) when the answer is an enumeration of
“inhomogeneous” options that are organized differently on the screen,
such as using email vs. other APIs to login, and 2) when an answer needs
multiple parts to be complete, such as a date consisting of year, month,
and day scattered on the calendar UI, and a temperature or a measurement
requiring a number followed by the corresponding unit. These parts may
not be displayed in the same way, resulting in a lack of useful VH leaf
nodes for some of the parts.
[^1]: Note that it is natural and valid to ask the same common questions
over various screenshots, for example, “Which option is selected on
the screen?” and “What is the email address?”
# Baselines [sec:baseline]
# Results [sec:result]
# Conclusion
In this work, we proposed the ScreenQA task. We annotated a large-scale
ScreenQA dataset, which contains 86,025 question-answer pairs. Compared
to other vision-language multimodal problems, such as document image
understanding and visual question answering, ScreenQA poses its unique
challenges: rich in text, diverse in apps, and blended with icons and
symbols. We hope to use the ScreenQA task and the dataset to encourage
the community to look into this screen content understanding problem, as
it enables new technologies and new user experiences.
# Acknowledgements
The authors would like to thank Srinivas Sunkara for his valuable
discussions and comments on this manuscript.
# Data annotation interfaces for question and answer collection [appendix:annotation_ui]
## Question annotation interface [appendix:question_annotation_ui]
The question annotation interface is shown in
Figure [fig:question-annotation-ui].
Question annotation was performed in a sequential manner: the later,
non-overlapping annotators could see all previous questions, to diversify
question framing and avoid duplication. We also used the sequential
process to provide more feedback and training to the annotators for
quality improvement.
## Answer annotation interface [appendix:answer_annotation_ui]
The answer annotation interface is shown in
Figure [fig:answer-annotation-ui].
Answer annotators were tasked to determine if the question is valid and
if the question is answerable from the screen context. If both are
positive, the annotators need to answer the questions by 1) selecting or
drawing the bounding boxes of UI elements, 2) filling the text for each
selected/drawn bounding box on the right, and 3) ranking them
appropriately. The annotators were also tasked to review and make
necessary corrections if the question has grammatical errors or typos.
DocVQA: A Dataset for VQA on Document Images
2020-07-01
Minesh Mathew, Dimosthenis Karatzas, C. V. Jawahar
We present a new dataset for Visual Question Answering (VQA) on document images called DocVQA. The dataset consists of 50,000 questions defined on 12,000+ document images. Detailed analysis of the dataset in comparison with similar datasets for VQA and reading comprehension is presented. We report several baseline results by adopting existing VQA and reading comprehension models. Although the existing models perform reasonably well on certain types of questions, there is large performance gap compared to human performance (94.36% accuracy). The models need to improve specifically on questions where understanding structure of the document is crucial. The dataset, code and leaderboard are available at docvqa.org
Show Paper Content
# Introduction
Research in Document Analysis and Recognition (DAR) is generally focused
on information extraction tasks that aim to convert information in
document images into machine readable form, such as character
recognition [doermann2014handbook](http://arxiv.org/pdf/1509.03456v1), table
extraction [kavasidis2019saliency](http://arxiv.org/pdf/1804.06236v1) or key-value pair
extraction [palm2017cloudscan](http://arxiv.org/pdf/1708.07403v1). Such algorithms tend to
be designed as task specific blocks, blind to the end-purpose the
extracted information will be used for.
Progressing independently in such information extraction processes has
been quite successful, although it is not necessarily true that holistic
document image understanding can be achieved through a simple
constructionist approach, building upon such modules. The scale and
complexity of the task introduce difficulties that require a different
point of view.
In this article we introduce Document Visual Question Answering
(DocVQA), as a high-level task dynamically driving DAR algorithms to
conditionally interpret document images. By doing so, we seek to inspire
a “purpose-driven” point of view in DAR research.
Example question-answer pairs from DocVQA (figure):
- Q: Mention the ZIP code written? A: 80202
- Q: What date is seen on the seal at the top of the letter? A: 23 sep 1970
- Q: Which company address is mentioned on the letter? A: Great western sugar Co.
Answering questions in the new dataset requires models not just to read text but
to interpret it within the layout/structure of the document.
In case of Document VQA, as illustrated in Figure
1, an intelligent reading
system is expected to respond to ad-hoc requests for information,
expressed in natural language questions by human users. To do so,
reading systems should not only extract and interpret the textual
(handwritten, typewritten or printed) content of the document images,
but exploit numerous other visual cues including layout (page structure,
forms, tables), non-textual elements (marks, tick boxes, separators,
diagrams) and style (font, colours, highlighting), to mention just a
few.
Departing from generic VQA [vqa2](https://arxiv.org/pdf/1612.00837) and Scene Text VQA
[textvqa](http://arxiv.org/pdf/1811.11903v1), [stvqa_iccv](http://arxiv.org/pdf/2304.01603v1) approaches, the document images
warrants a different approach to exploit all the above visual cues,
making use of prior knowledge of the implicit written communication
conventions used, and dealing with the high-density semantic information
conveyed in such images. Answers in case of document VQA cannot be
sourced from a closed dictionary, but they are inherently open ended.
Previous approaches on bringing VQA to the documents domain have either
focused on specific document elements such as data
visualisations [dvqa](http://arxiv.org/pdf/1810.02358v2), [kahou2017figureqa](http://arxiv.org/pdf/2109.02226v1) or on specific
collections such as book covers [mishra2019ocr](None). In
contrast to such approaches, we recast the problem to its generic form,
and put forward a large scale, varied collection of real documents.
Main contributions of this work can be summarized as following:
- We introduce DocVQA, a large scale dataset of $12,767$ document
images of varied types and content, over which we have defined
$50,000$ questions and answers. The questions defined are
categorised based on their reasoning requirements, allowing us to
analyze how DocVQA methods fare for different question types.
- We define and evaluate various baseline methods over the DocVQA
dataset, ranging from simple heuristic methods and human performance
analysis that allow us to define upper performance bounds given
different assumptions, to state of the art Scene Text VQA models and
NLP models.
# Related Datasets and Tasks
Machine reading comprehension (MRC) and open-domain question answering
(QA) are two problems which are being actively pursued by Natural
Language Processing (NLP) and Information Retrieval (IR) communities. In
MRC the task is to answer a natural language question given the question
and a paragraph (or a single document) as the context. In case of open
domain QA, no specific context is given and the answer needs to be found from
a large collection (say Wikipedia) or from the Web. MRC is often modelled as
an extractive QA problem where the answer is defined as a span of the
context on which the question is defined. Examples of datasets for
extractive QA include SQuAD 1.1 [squad](http://arxiv.org/pdf/1606.02270v2),
NewsQA [newsqa](None) and Natural
Questions [naturalquestions](http://arxiv.org/pdf/2105.00811v1). MS
MARCO [ms_marco](http://arxiv.org/pdf/1611.09268v3) is an example of a QA dataset for
abstractive QA where answers need to be generated not extracted.
Recently Transformer based pretraining methods like Bidirectional
Encoder Representations from Transformers (BERT) [bert](None)
and XLNet [xlnet](http://arxiv.org/pdf/1906.08237v2) have helped to build QA models
outperforming Humans on reading comprehension on
SQuAD [squad](http://arxiv.org/pdf/1606.02270v2). In contrast to QA in NLP where context is
given as computer readable strings, contexts in case of DocVQA are
document images.
Visual Question Answering (VQA) aims to provide an accurate natural
language answer given an image and a natural language question. VQA has
attracted an intense research effort over the past few
years [vqa2](https://arxiv.org/pdf/1612.00837), [agrawal2017c](None), [johnson2017clevr](http://arxiv.org/pdf/1612.06890v1). Out of a
large body of work on VQA, scene text VQA branch is the most related to
our work. Scene text VQA refers to VQA systems aiming to deal with cases
where understanding scene text instances is necessary to respond to the
questions posed. The ST-VQA [stvqa_iccv](http://arxiv.org/pdf/2304.01603v1) and
TextVQA [textvqa](http://arxiv.org/pdf/1811.11903v1) datasets were introduced in parallel in
2019 and were quickly followed by more
research [singh2019strings](http://arxiv.org/pdf/1904.08920v2), [gao2020multi](http://arxiv.org/pdf/2003.13962v1), [wang2020general](http://arxiv.org/pdf/2002.10215v2).
The ST-VQA dataset [stvqa_iccv](http://arxiv.org/pdf/2304.01603v1) has $31,000\texttt{+}$
questions over $23,000\texttt{+}$ images collected from different public
data sets. The TextVQA dataset [textvqa](http://arxiv.org/pdf/1811.11903v1) has
$45,000\texttt{+}$ questions over $28,000\texttt{+}$ images sampled from
specific categories of the OpenImages
dataset [OpenImages2](http://arxiv.org/pdf/1809.05929v7) that are expected to contain text.
Another dataset named OCR-VQA [mishra2019ocr](None) comprises
more than 1 million question-answer pairs over 207K+ images of book
covers. The questions in this dataset are domain specific, generated
based on template questions and answers extracted from available
metadata.
Figure captions: (1) Industry-wise distribution of the documents. (2) Year-wise distribution of the documents. (3) Various types of documents used.
Scene text VQA
methods [m4c](http://arxiv.org/pdf/1911.06258v3), [gao2020multi](http://arxiv.org/pdf/2003.13962v1), [textvqa](http://arxiv.org/pdf/1811.11903v1), [gomez2020multimodal](http://arxiv.org/pdf/2006.00923v2)
typically make use of pointer mechanisms in order to deal with
out-of-vocabulary (OOV) words appearing in the image and provide the
open answer space required. This goes hand in hand with the use of word
embeddings capable of encoding OOV words into a pre-defined semantic
space, such as FastText [bojanowski2017enriching](http://arxiv.org/pdf/2102.02270v2) or
BERT [bert](None). More recent, top-performing methods in this
space include M4C [m4c](http://arxiv.org/pdf/1911.06258v3) and
MM-GNN [gao2020multi](http://arxiv.org/pdf/2003.13962v1) models.
In parallel, there has been work on certain domain-specific VQA tasks
which require reading and understanding text in the images. The DVQA
dataset presented by Kafle et al. [kafle2020answering](http://arxiv.org/pdf/1908.01801v2), [dvqa](http://arxiv.org/pdf/1810.02358v2)
comprises synthetically generated images of bar charts and template
questions defined automatically based on the bar chart metadata. The
dataset contains more than three million question-answer pairs over
300,000 images.
FigureQA [kahou2017figureqa](http://arxiv.org/pdf/2109.02226v1) comprises over one million
yes or no questions, grounded on over 100,000 images. Three different
types of charts are used: bar, pie and line charts. Similar to DVQA,
images are synthetically generated and questions are generated from
templates. Another related QA task is Textbook Question Answering
(TQA) [textbookqa](http://arxiv.org/pdf/2010.00562v1) where multiple choice questions are
asked on multimodal context, including text, diagrams and images. Here
textual information is provided in computer readable format.
Compared to these existing datasets either concerning VQA on real word
images, or domain specific VQA for charts or book covers, the proposed
DocVQA comprises document images. The dataset covers a multitude of
different document types that include elements like tables, forms and
figures, as well as a range of different textual, graphical and
structural elements.
# DocVQA
In this section we explain the data collection and annotation process and
present statistics and an analysis of DocVQA.
## Data Collection
**Document Images:** Images in the dataset are sourced from documents in
UCSF Industry Documents Library[^1]. The documents are organized under
different industries and further under different collections. We
downloaded documents from different collections and hand picked pages
from these documents for use in the dataset. The majority of documents in
the library are binarized, and the binarization has taken a toll on
the image quality. We tried to minimize binarized images in DocVQA since
we did not want poor image quality to be a bottleneck for VQA. We also
prioritized pages with tables, forms, lists and figures over pages which
only have running text.
The final set of images in the dataset are drawn from pages of $6,071$
industry documents. We made use of documents from as early as 1900 to as
recent as 2018
(see [fig:doc_year_distr]). Most of
the documents are from the 1960-2000 period and they include
typewritten, printed, handwritten and born-digital text. There are
documents from all 5 major industries for which the library hosts
documents — tobacco, food, drug, fossil fuel and chemical. We use many
documents from food and nutrition related collections, as they have a
good number of non-binarized images. See
[fig:industry_distr] for
industry wise distribution of the $6071$ documents used. The documents
comprise a wide variety of document types as shown
in [fig:doc_type_distr].
**Questions and Answers:** Questions and answers on the selected
document images are collected with the help of remote workers, using a
Web based annotation tool. The annotation process was organized in three
stages. In stage 1, workers were shown a document image and asked to
define at most 10 question-answer pairs on it. We encouraged the workers
to add more than one ground truth answer per question in cases where it
is warranted.
The 9 question types and share of questions in each
type.
Workers were instructed to ask questions which can be answered using
text present in the image and to enter the answer verbatim from the
document. This makes VQA on the DocVQA dataset an extractive QA problem
similar to extractive QA tasks in NLP [squad](http://arxiv.org/pdf/1606.02270v2), [newsqa](None) and
VQA in case of ST-VQA [stvqa_iccv](http://arxiv.org/pdf/2304.01603v1).
Figure captions: (1) Top 15 most frequent questions. (2) Top 15 most frequent answers. (3) Top 15 non-numeric answers. (4) Questions with a particular length. (5) Answers with a particular length. (6) Contexts with a particular length.
The second annotation stage aims to verify the data collected in the
first stage. Here a worker was shown an image and questions defined on
it in the first stage (but not the answers from the first stage), and
was required to enter answers for the questions. In this stage workers
were also required to assign one or more question types to each
question. The different question types in DocVQA are discussed
in [sec:stats_analysis]. During the
second stage, if the worker finds a question inapt owing to language
issues or ambiguity, an option to flag the question was provided. Such
questions are not included in the dataset.
If none of the answers entered in the first stage match exactly with any
of the answers from the second stage, the particular question is sent
for review in a third stage. Here questions and answers are editable and
the reviewer either accepts the question-answer (after editing if
necessary) or ignores it. The third stage review is done by the authors
themselves.
## Statistics and Analysis [sec:stats_analysis]
DocVQA comprises $50,000$ questions framed on $12,767$ images. The
data is split randomly in an $80-10-10$ ratio to train, validation and
test splits. The train split has $39,463$ questions and $10,194$ images,
the validation split has $5,349$ questions and $1,286$ images and the
test split has $5,188$ questions and $1,287$ images.
As mentioned before, questions are tagged with question type(s) during
the second stage of the annotation process.
[fig:question_types] shows the 9
question types and percentage of questions under each type. A question
type signifies the type of data where the question is grounded. For
example, ‘table/list’ is assigned if answering the question requires
understanding of a table or a list. If the information is in the form of
a key:value, the ‘form’ type is assigned. ‘Layout’ is assigned for
questions which require spatial/layout information to find the answer.
For example, questions asking for a title or heading, require one to
understand structure of the document. If answer for a question is based
on information in the form of sentences/paragraphs type assigned is
‘running text’. For all questions where answer is based on handwritten
text, ‘handwritten’ type is assigned. Note that a question can have more
than one type associated with it. (Examples from DocVQA for each
question type are given in the supplementary.)
In the following analysis we compare statistics of questions, answers
and OCR tokens with other similar datasets for VQA — VQA
2.0 [vqa2](https://arxiv.org/pdf/1612.00837), TextVQA [textvqa](http://arxiv.org/pdf/1811.11903v1) and
ST-VQA [stvqa_iccv](http://arxiv.org/pdf/2304.01603v1) and SQuAD 1.1 [squad](http://arxiv.org/pdf/1606.02270v2)
dataset for reading comprehension. Statistics for other datasets are
computed based on their publicly available data splits. For statistics
on OCR tokens, for DocVQA we use OCR tokens generated by a commercial
OCR solution. For VQA 2.0, TextVQA and ST-VQA we use OCR tokens made
available by authors of LoRRA [textvqa](http://arxiv.org/pdf/1811.11903v1) and
M4C [m4c](http://arxiv.org/pdf/1911.06258v3) as part of the MMF [mmf](https://github.com/facebookresearch/mmf)
framework.
Figure caption: Word clouds of words in answers (left) and words spotted on
the document images in the dataset (right).
[fig:compare_question_lengths]
shows the distribution of question lengths for questions in DocVQA compared
to other similar datasets. The average question length is $8.12$,
which is the second highest among the compared datasets. In DocVQA, $35,362$
($70.72\%$) questions are unique.
[fig:top_questions] shows the top
$15$ most frequent questions and their frequencies. Questions about
dates, titles and page numbers are asked repeatedly. A sunburst
of the first 4 words of the questions is shown
in [fig:sunburst_4grams].
It can be seen that a large majority of questions start with “what is
the”, asking for date, title, total, amount or name.
Distribution of answer lengths is shown
in [fig:compare_answer_lengths].
We observe in the figure that both DocVQA and SQuAD 1.1 have a higher
number of longer answers compared to the VQA datasets. The average
answer length is $2.17$.
$63.2\%$ of the answers are unique, which is second only to SQuAD
1.1 ($72.5\%$). The top $15$ answers in the dataset are shown
in [fig:top_anwers].
We observe that almost all of the top answers are numeric values, which
is expected since there are a good number of document images of reports
and invoices.
In [fig:top_non_numeric_Answers]
we show the top $15$ non-numeric answers. These include named entities
such as names of people, institutions and places. The word cloud on the
left in [fig:wordcloud] shows frequent words
in answers; the most common are names of people and names of calendar
months.
In [fig:compare_document_lengths]
we show the number of images (or ‘contexts’ in the case of SQuAD 1.1)
containing a particular number of text tokens. The average number of text
tokens in an image or context is highest for DocVQA ($182.75$). It is
considerably higher than for SQuAD 1.1, where
contexts are usually short paragraphs with an average length of $117.23$.
For the VQA datasets, which comprise real-world images, the average number
of OCR tokens is no more than $13$. The word cloud on the right
in [fig:wordcloud] shows the most common
words spotted by the OCR on the images in DocVQA. We observe that there
is a high overlap between common OCR tokens and words in answers.
Distribution of questions by their starting 4-grams. Most
questions aim to retrieve common data points in documents such as date,
title, total amount and page number.
# Baselines [sec:baselines]
In this section we explain the baselines we use, including heuristics
and trained models.
## Heuristics and Upper Bounds [sec:heuristics]
The heuristics we evaluate are: (i) **Random answer:** measures performance
when we pick a random answer from the answers in the train split. (ii)
**Random OCR token:** performance when a random OCR token from the given
document image is picked as the answer. (iii) **Longest OCR token** is
the case when the longest OCR token in the given document is selected as
the answer. (iv) **Majority answer** measures the performance when the
most frequent answer in the train split is considered as the answer.
We also compute the following upper bounds: (i) **Vocab UB:** the best
performance one can get by predicting the correct answer whenever it is
present in a vocabulary of answers comprising all answers which occur more
than once in the train split. (ii) **OCR substring UB:** the upper bound on
predicting the correct answer provided the answer can be found as a
substring of the serialized OCR output. The serialization joins the OCR
tokens recognized in the document into a space-separated sequence, in
top-left to bottom-right order. (iii) **OCR subsequence UB:** the upper
bound on predicting the correct answer, provided the answer is a
subsequence of the OCR tokens’ sequence.
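To make the two OCR-based upper bounds concrete, here is a minimal sketch of how such checks could be computed; the helper names and the word-level reading of “subsequence” are our assumptions, not the authors' exact implementation.

```python
def serialize_ocr(ocr_tokens):
    """Join OCR tokens into one space-separated string, assuming they are
    already ordered top-left to bottom-right."""
    return " ".join(ocr_tokens)

def substring_ub_hit(answer, ocr_tokens):
    """OCR substring UB: the answer appears verbatim inside the serialized OCR string."""
    return answer.lower() in serialize_ocr(ocr_tokens).lower()

def subsequence_ub_hit(answer, ocr_tokens):
    """OCR subsequence UB (assumed word-level reading): the answer's words appear
    in order, though not necessarily contiguously, among the OCR tokens."""
    remaining = iter(tok.lower() for tok in ocr_tokens)
    return all(word in remaining for word in answer.lower().split())

# Toy example: "total 2020" is a subsequence but not a substring of the OCR output.
ocr = ["Grand", "total", "for", "2020"]
print(substring_ub_hit("total 2020", ocr), subsequence_ub_hit("total 2020", ocr))  # False True
```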
## VQA Models [sec: vqa models]
For evaluating performance of existing VQA models on DocVQA we employ
two models which take the text present in the images into consideration
while answering questions – Look, Read, Reason & Answer
(LoRRA) [textvqa](http://arxiv.org/pdf/1811.11903v1) and Multimodal Multi-Copy
Mesh (M4C) [m4c](http://arxiv.org/pdf/1911.06258v3).
**LoRRA:** follows a bottom-up and top-down
attention [topdown_bottomup](https://arxiv.org/pdf/1707.07998) scheme with additional
bottom-up attention over OCR tokens from the images. In LoRRA, tokens in
a question are first embedded using a pre-trained embedding
(GloVe [glove](http://arxiv.org/pdf/1608.02094v1)) and then these tokens are iteratively
encoded using an LSTM [lstm](http://arxiv.org/pdf/2103.15232v1) encoder. The model uses two
types of spatial features to represent the visual information from the
images - (i) grid convolutional features from a
Resnet-152 [resnet](https://arxiv.org/pdf/1512.03385) which is pre-trained on
ImageNet [imagenet](http://arxiv.org/pdf/1903.10412v1) and (ii) features extracted from
bounding box proposals from a Faster R-CNN [faster-r-cnn](http://arxiv.org/pdf/1506.01497v3)
object detection model, pre-trained on Visual
Genome [visual_genome](http://arxiv.org/pdf/1602.07332v1). OCR tokens from the image are
embedded using a pre-trained word embedding
(FastText [fasttext](http://arxiv.org/pdf/2102.02270v2)). An attention mechanism is used to
compute an attention-weighted average of the image features as well as of the
OCR tokens’ embeddings. These averaged features are combined and fed
into an output module. The classification layer of the model predicts
an answer either from a fixed vocabulary (made from answers in the train
set) or copies an answer from a dynamic vocabulary, which is essentially
the list of OCR tokens in the image. The copy mechanism can copy
only one OCR token from the image; consequently it cannot output
an answer which is a combination of two or more OCR tokens.
**M4C**: uses a multimodal transformer and an iterative answer
prediction module. Here tokens in questions are embedded using a BERT
model [bert](None). Images are represented using (i) appearance
features of the objects detected using a Faster-RCNN pretrained on
Visual Genome [visual_genome](http://arxiv.org/pdf/1602.07332v1) and (ii) location
information - bounding box coordinates of the detected objects. Each OCR
token recognized from the image is represented using (i) a pretrained
word embedding (FastText), (ii) appearance feature of the token’s
bounding box from the same Faster R-CNN which is used for appearance
features of objects, (iii) a PHOC [phoc](http://arxiv.org/pdf/1712.07487v1) representation of
the token and (iv) the bounding box coordinates of the token. These
feature representations of the three entities (question tokens, objects
and OCR tokens) are projected to a common, learned embedding space, and
a stack of Transformer [attention_is_all_you_need](http://arxiv.org/pdf/2107.08000v1) layers
is applied over them. The
multi-head self-attention in the transformer enables both inter-entity and
intra-entity attention. Finally, answers are predicted through iterative
decoding in an auto-regressive manner. Here the fixed vocabulary used
for the closed answer space is made up of the most common answer words
in the train split; note that in this case the fixed vocabulary
comprises answer words, not whole answers as in the case of LoRRA.
At each step of the decoding, the decoded word is either an OCR token
from the image or a word from the fixed vocabulary of common answer
words.
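As a rough illustration of the decoding scheme shared by these models (a fixed vocabulary plus a copy mechanism over the image's OCR tokens), the sketch below shows a single greedy decoding step; the toy scores and function names are ours, and this is not the actual M4C implementation.

```python
import numpy as np

# Conceptual sketch (not the actual M4C code): at each decoding step the model scores
# a fixed vocabulary of common answer words and, via a copy mechanism, the OCR tokens
# of the image (the dynamic vocabulary); the predicted word is the overall argmax.

def decode_step(fixed_vocab_scores, copy_scores, fixed_vocab, ocr_tokens):
    scores = np.concatenate([fixed_vocab_scores, copy_scores])
    idx = int(np.argmax(scores))
    if idx < len(fixed_vocab):
        return fixed_vocab[idx]                 # word from the fixed vocabulary
    return ocr_tokens[idx - len(fixed_vocab)]   # copied OCR token

# Toy example: the copy score for the OCR token "ltd." wins over the fixed vocabulary.
fixed_vocab = ["the", "report", "total"]
ocr_tokens = ["acme", "ltd.", "2020"]
print(decode_step(np.array([0.1, 0.3, 0.2]),
                  np.array([0.05, 0.9, 0.1]),
                  fixed_vocab, ocr_tokens))     # -> "ltd."
```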
In our experiments we use the original LoRRA and M4C models and a few variants
of them. Document images in DocVQA usually contain a higher number
of text tokens than images in scene text VQA datasets, hence we
try out larger dynamic vocabularies (i.e. more OCR tokens are considered
from the images) for both LoRRA and M4C. For both models we also
evaluate performance when no fixed vocabulary is used.
Since the notion of visual objects in real-world images does not directly
apply to document images, we also try out variants of LoRRA
and M4C in which the object features are omitted.
## Reading Comprehension Models [sec:RC_models]
In addition to the VQA models which can read text, we try out extractive
question answering / reading comprehension models from NLP. In
particular, we use BERT [bert](None) question answering models.
BERT is a method of pre-training language representations from
unlabelled text using
transformers [attention_is_all_you_need](http://arxiv.org/pdf/2107.08000v1). These
pretrained models can then be used for downstream tasks with just an
additional output layer. In the case of extractive Question Answering,
this is an output layer to predict start and end indices of the answer
span.
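To illustrate the extractive setup, the snippet below runs a SQuAD-finetuned BERT model over a serialized OCR string with the Transformers question-answering pipeline; treating the OCR string as the ‘context’ mirrors the setup described later, while the toy context itself is invented for illustration.

```python
from transformers import pipeline

# Extractive QA: the model predicts start/end indices of an answer span in the context.
qa = pipeline("question-answering",
              model="bert-large-uncased-whole-word-masking-finetuned-squad")

# Here the "context" is the serialized OCR output of a document image
# (space-separated tokens in top-left to bottom-right order).
context = "REPORT OF EXPENSES Date : 03/17/98 Total amount : $35,864 Approved by R. Smith"
print(qa(question="What is the total amount?", context=context))
# -> a dict with the predicted span, e.g. {'answer': ..., 'start': ..., 'end': ..., 'score': ...}
```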
# Experiments [sec:experiments]
In this section we explain evaluation metrics and our experimental
settings and report results of experiments.
## Evaluation Metrics [sec:evaluation]
The two evaluation metrics we use are Average Normalized Levenshtein
Similarity (ANLS) and Accuracy (Acc.). ANLS was originally proposed for
evaluation of VQA on ST-VQA [st-vqa_challenge](None). The
Accuracy metric measures the percentage of questions for which the predicted
answer matches exactly with any of the target answers for the question;
it awards a zero score even when the prediction differs only
slightly from the target answer. Since no OCR is perfect, we
propose to use ANLS as our primary evaluation metric, so that minor
answer mismatches stemming from OCR errors are not severely penalized.
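For reference, ANLS can be computed as follows: per question, take the best normalized Levenshtein similarity of the prediction against any target answer, zeroing scores whose normalized distance exceeds the usual 0.5 threshold, then average over questions. The sketch below writes out the edit distance for self-containment; it follows the ST-VQA formulation rather than any official evaluation script.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (simple dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def anls(predictions, target_answers, threshold=0.5):
    """Average Normalized Levenshtein Similarity.
    target_answers is a list of lists: several valid answers per question."""
    scores = []
    for pred, targets in zip(predictions, target_answers):
        best = 0.0
        for gt in targets:
            nl = levenshtein(pred.lower(), gt.lower()) / max(len(pred), len(gt), 1)
            best = max(best, 1.0 - nl if nl < threshold else 0.0)
        scores.append(best)
    return sum(scores) / len(scores)

print(anls(["$35 , 864"], [["$35,864"]]))  # ~0.78: a minor OCR-style mismatch is only mildly penalized
```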
| Baseline | val ANLS | val Acc. | test ANLS | test Acc. |
|:-------------------|:-----:|:-----:|:-----:|:-----:|
| Human | \- | \- | 0.981 | 94.36 |
| Random answer | 0.003 | 0.00 | 0.003 | 0.00 |
| Random OCR token | 0.013 | 0.52 | 0.014 | 0.58 |
| Longest OCR token | 0.002 | 0.05 | 0.003 | 0.07 |
| Majority answer | 0.017 | 0.90 | 0.017 | 0.89 |
| Vocab UB | \- | 31.31 | \- | 33.78 |
| OCR substring UB | \- | 85.64 | \- | 87.00 |
| OCR subsequence UB | \- | 76.37 | \- | 77.00 |
Evaluation of different heuristics and upper bounds. Predicting random
answers or the majority answer does not even yield 1% accuracy. Answers are a
substring of the serialized OCR output in more than 85% of the cases.
## Experimental setup [sec: experimental setup]
For measuring human performance, we collect answers for all questions
in the test split with the help of a few volunteers from our institution.
In all our experiments, including heuristics and trained baselines, the OCR
tokens we use are extracted using a commercial OCR application. For the
heuristics and upper bounds we use a vocabulary of $4,341$ answers which
occur more than once in the train split.
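A small sketch of how such a vocabulary could be built; simple lower-cased counting is our assumption here, not necessarily the exact preprocessing used.

```python
from collections import Counter

def build_answer_vocab(train_answers):
    """Keep every answer string that occurs more than once in the train split."""
    counts = Counter(a.strip().lower() for a in train_answers)
    return {ans for ans, c in counts.items() if c > 1}

# Vocab UB then simply checks whether any target answer for a question is in this set.
```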
For the LoRRA and M4C models we use the official implementations available as
part of the MMF framework [mmf](https://github.com/facebookresearch/mmf). The training settings
and hyperparameters are the same as those reported in the original
works. The fixed vocabulary we use for LoRRA is the same as the vocabulary
used for computing the vocabulary-based heuristics and upper bounds. For
M4C the fixed vocabulary is a vocabulary of the $5,000$ most
frequent words from the answers in the train split.
For QA using BERT, three pre-trained BERT models[^2] from the
Transformers library [huggingface](http://arxiv.org/pdf/1910.03771v5) are used:
bert-base-uncased, bert-large-uncased-whole-word-masking and
bert-large-uncased-whole-word-masking-finetuned-squad, which we abbreviate
as bert-base, bert-large and bert-large-squad respectively.
Among these, bert-large-squad is a pre-trained model which is also
finetuned on SQuAD 1.1 for question answering. In extractive
question answering or reading comprehension datasets, the ‘contexts’ on which
questions are asked are passages of electronic text, whereas in
DocVQA the ‘contexts’ are document images. Hence, to finetune the BERT QA
models on DocVQA we need to prepare the data in SQuAD-style format, where
the answer to a question is a ‘span’ of the context defined by start
and end indices. To this end we first serialize the OCR
tokens recognized on the document images into a single string, separated
by spaces, in top-left to bottom-right order. To approximate the answer
spans we follow the approach proposed in
TriviaQA [triviaqa](None), which is to take the first match of
the answer string in the serialized OCR string.
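A sketch of that conversion: serialize the OCR tokens, then take the first match of the answer string as the span, stored as SQuAD-style character offsets. The function name and the handling of unmatched answers are our assumptions.

```python
def to_squad_example(question, answer, ocr_tokens):
    """Build a SQuAD-style example from a DocVQA question.
    ocr_tokens are assumed to be ordered top-left to bottom-right."""
    context = " ".join(ocr_tokens)
    start = context.lower().find(answer.lower())  # first match, as in the TriviaQA-style approximation
    if start == -1:
        return None  # answer not found verbatim in the OCR output
    return {
        "question": question,
        "context": context,
        "answers": {"text": [context[start:start + len(answer)]],
                    "answer_start": [start]},
    }
```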
The bert-base model is finetuned on DocVQA on 2 Nvidia GeForce 1080 Ti
GPUs, for 2 epochs, with a batch size of 32. We use the Adam
optimizer [adam](None) with a learning rate of $5e-05$. The
bert-large and bert-large-squad models are finetuned on 4 GPUs for 6
epochs with a batch size of 8 and a learning rate of $2e-05$.
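For reference, these hyper-parameters roughly translate to a Transformers Trainer configuration like the sketch below; the per-device batch sizes are our back-calculation from the reported totals and GPU counts, and the output directory names are placeholders.

```python
from transformers import TrainingArguments

# bert-base: 2 epochs, total batch size 32 on 2 GPUs (16 per device), learning rate 5e-5.
args_base = TrainingArguments(
    output_dir="bert-base-docvqa",
    num_train_epochs=2,
    per_device_train_batch_size=16,
    learning_rate=5e-5,
)

# bert-large / bert-large-squad: 6 epochs, total batch size 8 on 4 GPUs (2 per device), learning rate 2e-5.
args_large = TrainingArguments(
    output_dir="bert-large-docvqa",
    num_train_epochs=6,
    per_device_train_batch_size=2,
    learning_rate=2e-5,
)
```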
| Pretrained model | DocVQA finetune | val ANLS | val Acc. | test ANLS | test Acc. |
|:-----------------|:---------------:|:---------:|:-----:|:---------:|:-----:|
| bert-base | ✓ | 0.556 | 45.6 | 0.574 | 47.6 |
| bert-large | ✓ | 0.594 | 49.28 | 0.610 | 51.08 |
| bert-large-squad | | 0.462 | 36.72 | 0.475 | 38.26 |
| bert-large-squad | ✓ | **0.655** | 54.48 | **0.665** | 55.77 |
Performance of the BERT question answering models. A BERT-large model which
is fine-tuned on both SQuAD 1.1 [squad](http://arxiv.org/pdf/1606.02270v2) and DocVQA
performs the best.
## Results [sec:results]
Results of all heuristic approaches and upper bounds are reported
in [tab:human_heuristics]. We can
see that none of the heuristics get even a $1\%$ accuracy on the
validation or test splits.
Qualitative examples (question; ground truth; best M4C prediction; best BERT prediction; human answer):
- Q: What is the underlined heading just above the table? GT: Indications for implantation; M4C best: indications for implantation; BERT best: total aneurism; Human: indications for implantation
- Q: What is the Extension Number as per the voucher? GT: (910) 741-0673; M4C best: 963.12; BERT best: (910) 741-0673; Human: (910) 741-0673
- Q: How many boxed illustrations are there? GT: 9; M4C best: 4; BERT best: 4; Human: 9
*OCR substring UB* yields more than $85\%$ accuracy on both the validation
and test splits. A downside is that a substring match need not always be
an actual answer match: for example, if the answer is "2",
which is the most common answer in the dataset, it will match with the "2"
in "2020" or the "2" in "2pac". This is why we also evaluate the
*OCR subsequence UB*. An answer is a subsequence of the serialized OCR
output for around $76\%$ of the questions in both the validation and test
splits.
Results of our trained VQA baselines are shown
in [tab:vqa_results]. The first row for
each method reports results of the original model proposed by the
respective authors. In the case of LoRRA the original setting
yields the best results compared to the variants we try out.
With no fixed vocabulary, the performance of the model drops sharply,
suggesting that the model primarily relies on the fixed vocabulary to
output answers. A larger dynamic vocabulary results in a slight
performance drop, suggesting that incorporating more OCR tokens from the
document images helps little. Unlike LoRRA, M4C benefits from a
larger dynamic vocabulary: increasing its size
from $50$ to $500$ improves the ANLS by around $50\%$. In the case of
M4C, the setting where object features are omitted performs
slightly better than the original setting.
Best baselines from the VQA space and the reading comprehension
space pitted against human performance for different question types.
We need models which can understand figures and text on photographs
better. We need better handwriting recognizers too!
Results of the BERT question answering models are reported
in [tab:bert_results]. We observe
that all BERT models perform better than the best VQA baseline using
M4C (last row in [tab:vqa_results]). The best
performing model out of all the baselines analysed is the
bert-large-squad model, finetuned on DocVQA. Answers predicted by this
model match one of the target answers exactly for around $55\%$ of the
questions.
In [fig:performance_question_type]
we show performance by question type. We compare the best models among
VQA models and BERT question answering models against the human
performance on the test split. We observe that human performance is
uniform across question types, while the models’ performance varies.
In [fig:qualitative_results]
we show a few qualitative results from our experiments.
# Conclusion
We introduce a new dataset and an associated VQA task with the aim to
inspire a “purpose-driven” approach in document image analysis and
recognition research. Our baselines and initial results motivate
simultaneous use of visual and textual cues for answering questions
asked on document images. This could drive methods that use the
low-level cues (text, layout, arrangements) and high-level goals
(purpose, relationship, domain knowledge) in solving problems of
practical importance.
**Acknowledgements**
We thank Amazon for supporting the annotation effort, and Dr. R.
Manmatha for many useful discussions and inputs. This work is partly
supported by MeitY, Government of India, the project TIN2017-89779-P, an
Amazon AWS Research Award and the CERCA Programme.
# Screen grabs of Annotation Tool [appendix:screen grabs]
As mentioned in Section 3.1 of the main paper, the annotation process
involves three stages. In
[fig:ann_stage1],
[fig:ann_stage2]
and [fig:ann_stage3] we show screen
grabs from stage 1, stage 2 and stage 3 of the annotation process
respectively.
# Examples of Question Types [appendix:question_types]
We define 9 question types, based on the kind of reasoning required to
answer a question. Question types are assigned at the second stage of
the annotation. We discuss the question types in Section 3.2. in the
main paper.
Examples for the types *form*, *yes/no* and *layout* are shown
in [fig:question_type_examples yesno and layout]. Examples of a question
based on a handwritten date in a form (types *form* and *handwritten*) are
shown in [fig:question_type_examples handwritten date form]. An example of a
question based on information in the form of sentences or paragraphs (type
*running text*) is shown in [fig:question_type running text]. Examples for
the types *photograph* and *table* are shown
in [fig:question_types photo and table]. An example of a question based on a
plot (type *figure*) is shown in [fig:question_type_examples figure]. In all
examples a crop of the original image is shown below
the original image, for better viewing of the image region on which the
question is based.
Examples of each question type (question; assigned question types; answer):
- Q: Is it an existing item? Question types: form and yes/no; A: yes
- Q: What is the date given at the top left? Question types: layout; A: 03/17/98
- Q: What is the date written next to RSM approval? Question types: form and handwritten; A: 3-17-98
- Q: If the request needs to be warehoused by RJR, what needs to be done? Question types: running text; A: write to RJR
- Q: Whose picture is given? Question types: photograph and layout; A: Dr. Dwayne G. Westfall
- Q: What is the average sucrose % for N level 501+? Question types: table; A: 15.9
- Q: What is the highest value for "Intake, mg/1000kcal" plotted on the ‘X’ axis of the graph? Question types: figure; A: 300
# Additional Qualitative Examples [appendix:Additional Qualitative Examples]
Here we show more qualitative results from our baseline experiments.
These results supplement the Results section (Section 5.3 ) in the main
paper.
Recall that the BERT [bert](None) question answering model is
designed to answer questions asked on sentences or paragraphs of text
(reading comprehension).
In [fig:qual : bert wins] we
show two examples where the model answers questions outside the ambit of
reading-comprehension-style question answering.
In [fig:qual : m4c_wins] we show
examples where the M4C [m4c](http://arxiv.org/pdf/1911.06258v3) model outperforms the BERT
model in answering questions based on text seen on pictures or photographs.
Such questions are similar to those in the
TextVQA [textvqa](http://arxiv.org/pdf/1811.11903v1) and ST-VQA [stvqa_iccv](http://arxiv.org/pdf/2304.01603v1)
datasets, where the M4C model yields state-of-the-art results.
In [fig:qual : inconsistent]
we show an example where both models yield inconsistent results when
posed with questions of a similar nature, highlighting a lack of reasoning
behind the answers.
In [fig: qual: reasoning] we show
two examples where both the M4C and BERT models fail to answer questions
which require understanding of a figure or a diagram.
In [fig: qual: ocr error] we show
how OCR errors have resulted in wrong answers although the models manage
to ground the questions correctly.
Additional qualitative examples (question; ground truth; best M4C prediction; best BERT prediction; human answer):
- Q: What is the total cost for Fat cell size (Mt. SInai) in the -05 year? GT: $35,864; M4C best: 4400; BERT best: $35 , 864; Human: $35,864
- Q: What is the first recipe on the page? GT: hawaiian fruit cake; M4C best: island desserts (continued from cake; BERT best: hawaiian fruit cake; Human: hawaiian fruit cake
- Q: What is written inside the logo at the bottom of the document? GT: let yourself grow!; M4C best: yourself grow!; BERT best: <no prediction>; Human: let yourself grow!
- Q: What Tobacco brand of GPI is shown in the picture? GT: Prince; M4C best: prince; BERT best: <no prediction>; Human: prince
- Q: What was the committee strength for the first meeting? GT: 6; M4C best: 6; BERT best: 6; Human: 6
- Q: What was the committee strength for the last meeting? GT: 5; M4C best: 6; BERT best: 6; Human: 5
- Q: What is the position above "vice chairman"? GT: chairman; M4C best: legal counsel; BERT best: legal counsel; Human: chairman
- Q: What is the highest value shown on the vertical axis?