- Authors: Jane Swingler, Andrew Wihardja, Arif Syraj
- Organisation: University of San Francisco
- Available for training on Hugging Face here.
This repository contains MISCHA-QA, a dataset of synthetic charts specifically designed to demonstrate various forms of misleading visual representations. The dataset includes both misleading and non-misleading examples of bar charts, line graphs, and pie charts. In total, the dataset comprises 8205 charts. The purpose of this dataset is to aid in the development and fine-tuning of models that can detect misinformation in charts.
The dataset includes charts with various types of misleading features. Below are descriptions of each type, along with examples:
Non-Zero Baseline
The y-axis does not start at zero, which can exaggerate differences between data points. For example, Figure 1 displays a bar chart showing support for gun control legislation, where the y-axis starts at fifty percent instead of zero. This makes the differences between years appear more significant than they actually are, exaggerating the upward trend in support for gun control legislation.
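The exaggeration can be quantified by comparing the ratio of the drawn bar heights with the ratio of the underlying values. The following is a minimal sketch; the function name `apparent_ratio` and the values 55 and 60 are illustrative assumptions, not data taken from Figure 1.

```python
def apparent_ratio(a: float, b: float, baseline: float = 0.0) -> float:
    """Ratio of the drawn bar heights for values a and b when the axis starts at `baseline`."""
    if a <= baseline or b <= baseline:
        raise ValueError("both values must lie above the baseline")
    return (b - baseline) / (a - baseline)


# Hypothetical values: support rising from 55% to 60%.
true_ratio = apparent_ratio(55, 60)                # y-axis starts at zero
drawn_ratio = apparent_ratio(55, 60, baseline=50)  # y-axis starts at 50

print(f"true ratio:  {true_ratio:.2f}")
print(f"drawn ratio: {drawn_ratio:.2f}")
```

With these assumed values the second bar is about 9% larger than the first in the data, but it is drawn twice as tall once the axis starts at fifty.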
Inconsistent Time Intervals
The x-axis has inconsistent intervals, which can distort the perception of trends over time. For example, Figure 3 displays a line graph showing annual CO2 emissions with inconsistent time intervals. This may exaggerate or obscure the trends in CO2 emissions. In contrast, Figure 4 shows the same data with consistent time intervals, providing a more accurate representation of the trend.
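One simple way to flag this feature from the x-axis values alone is to check whether consecutive values are evenly spaced. The sketch below is an assumption about how such a check could look, not the method used to build the dataset; `has_consistent_intervals` and the sample years are hypothetical.

```python
def has_consistent_intervals(xs, tol=1e-9):
    """Return True when consecutive x-axis values are evenly spaced."""
    if len(xs) < 3:
        return True  # fewer than two steps cannot disagree
    steps = [b - a for a, b in zip(xs, xs[1:])]
    return all(abs(step - steps[0]) <= tol for step in steps)


# Hypothetical tick values for a time series chart.
even_years = [2000, 2005, 2010, 2015, 2020]
uneven_years = [2000, 2001, 2005, 2015, 2020]

print(has_consistent_intervals(even_years))
print(has_consistent_intervals(uneven_years))
```

The check only looks at the tick values. A chart is misleading in this sense when unevenly spaced values such as `uneven_years` are drawn at equal distances along the axis.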
Non-Sum to 100%
The segments of the pie chart do not sum to 100%, which can misrepresent the proportions of the categories. For example, Figure 5 displays a pie chart showing global energy consumption by source, where the sum of all segments is less than 100%, giving a false representation of the total energy distribution.
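The check behind this feature is plain arithmetic: add up the segment percentages and compare the total with 100. A minimal sketch follows; the function name, the tolerance of 0.5 (to allow for rounding), and the segment values are all assumptions made for illustration.

```python
def sums_to_100(segments, tol=0.5):
    """Return (is_valid, total) for a list of pie segment percentages."""
    total = sum(segments)
    return abs(total - 100.0) <= tol, total


# Hypothetical segment percentages.
print(sums_to_100([25.0, 25.0, 50.0]))  # adds up to a whole
print(sums_to_100([40, 50, 30]))        # exceeds 100
print(sums_to_100([30, 25, 20]))        # falls short of 100
```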
Over-Segmentation
The pie chart includes too many small segments, making it difficult to accurately compare the sizes of the segments, which can obscure the true distribution of the data. For example, Figure 6 displays a pie chart showing the voter support distribution among U.S. presidential candidates in 2024, with many small segments that make it difficult to interpret the overall distribution of voter support.
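The description above does not fix a threshold for "too many small segments", so any automatic check has to pick one. The sketch below uses two arbitrary assumptions, a segment counts as small below 5% and more than four small segments counts as over-segmented, purely to illustrate the idea. These values are not taken from the dataset.

```python
def is_over_segmented(segments, small_share=5.0, max_small=4):
    """Heuristic: flag a pie chart with more than `max_small` segments below `small_share` percent.

    Both thresholds are illustrative assumptions, not values defined by the dataset.
    """
    small = [s for s in segments if s < small_share]
    return len(small) > max_small


# Hypothetical segment percentages.
few_segments = [50, 30, 20]
many_small = [30, 25, 20, 4, 4, 4, 4, 3, 3, 3]

print(is_over_segmented(few_segments))
print(is_over_segmented(many_small))
```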
A total of 4000 chart datasets were created using a large language model (GPT-4). These datasets were designed to simulate realistic but potentially misleading visual representations. The scripts used for LLM dataset generation can be found here.
A total of 4005 charts were created using Python scripts and algorithms. The raw data points, including values and x-axis data for time series, were programmatically generated to diversify the dataset, as LLM-generated data points were often repetitive and limited in range. The titles, sources, and x-axis categories for categorical data were enriched using GPT-4 to enhance the dataset. The scripts used for Python data generation can be found here.
| Chart Type | Total Charts | Misleading | Non-Misleading | Misleading Feature | Count |
|---|---|---|---|---|---|
| Bar Charts | 4,759 | 2,343 | 2,416 | Non-Zero Baseline | 1,653 |
| | | | | Inconsistent Time Intervals | 690 |
| Time Series Line Charts | 1,150 | 730 | 420 | Inconsistent Time Intervals | 730 |
| Pie Charts | 2,296 | 1,150 | 1,146 | Non-Sum to 100% | 239 |
| | | | | Over-Segmentation | 662 |
| Overall Summary | 8,205 | 4,223 | 3,982 | | |
Each chart's annotation JSON file is structured to provide detailed information about the chart, including its characteristics, whether it's misleading, and the context in which it was created. Below is the structure of the annotation JSON file:
- ID: A unique identifier for the chart.
- Title: A descriptive title for the chart.
- Image: The file path to the image of the chart.
- Chart Type: Specifies the type of chart (e.g., bar, line, pie). In the example below this field is stored under the key `visualisation_type`.
- Domain: The domain or context of the chart (e.g., economics, health, etc.). This field can be empty if the domain is unspecified.
- Is Misleading: A `Yes` or `No` value indicating whether the chart is misleading. It is stored as a string rather than a JSON boolean.
- Misleading Feature: Describes the specific misleading feature present in the chart, if applicable.
- Conversations: A list of query-label pairs that provide insights or explanations about the chart. Each conversation entry contains:
  - Query: A question or prompt related to the chart.
  - Label: The answer or explanation corresponding to the query.
Below is an example of a chart's annotation in JSON format:
{
  "id": "pie2239",
  "title": "Distribution of Educational Resources",
  "image": "output_charts\\pie2239.png",
  "visualisation_type": "pie",
  "domain": "Education",
  "is_misleading": "Yes",
  "misleading_feature": "Non-Sum to 100",
  "conversations": [
    {
      "query": "What are the segments of the pie chart?",
      "label": "Textbooks, Online Courses, Tutoring Services"
    },
    {
      "query": "What is the sum of the segments of the pie chart?",
      "label": "120"
    },
    {
      "query": "Is this pie chart misleading? Explain",
      "label": "Yes, this chart is misleading because the sum of the segments is 120, which exceeds 100. Pie charts should have segments that sum to 100 to accurately represent the distribution of the whole."
    }
  ]
}
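An annotation like the one above can be read with Python's standard `json` module. The sketch below is one possible way to do so and rests on a few assumptions: the helper name `parse_annotation` is hypothetical, the set of keys treated as required is a guess based on this single example, and the derived field `image_name` is added only for convenience. The `image` path in the example uses a Windows-style backslash, so `PureWindowsPath` is used to extract the file name on any platform.

```python
import json
from pathlib import PureWindowsPath

# Assumption: these keys are present in every annotation file.
REQUIRED_KEYS = {"id", "image", "is_misleading", "conversations"}


def parse_annotation(text: str) -> dict:
    """Parse one annotation and normalise the fields that need it."""
    record = json.loads(text)
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        raise KeyError(f"annotation is missing keys: {sorted(missing)}")
    # "Yes"/"No" strings become a real boolean.
    record["is_misleading"] = record["is_misleading"].strip().lower() == "yes"
    # PureWindowsPath accepts both backslash and forward-slash separators.
    record["image_name"] = PureWindowsPath(record["image"]).name
    return record


# A shortened copy of the example annotation, serialised for the demo.
sample = json.dumps({
    "id": "pie2239",
    "image": "output_charts\\pie2239.png",
    "is_misleading": "Yes",
    "misleading_feature": "Non-Sum to 100",
    "conversations": [
        {"query": "What is the sum of the segments of the pie chart?", "label": "120"},
    ],
})

record = parse_annotation(sample)
for turn in record["conversations"]:
    print(turn["query"], "->", turn["label"])
```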
MISCHA-QA/
├── bar_charts/
│ ├── images/
│ │ ├── bar1.png
│ │ ├── bar2.png
│ │ ├── ...
│ ├── annotations/
│ │ ├── bar1.json
│ │ ├── bar2.json
│ │ ├── ...
├── line_graphs/
│ ├── images/
│ │ ├── line1.png
│ │ ├── line2.png
│ │ ├── ...
│ ├── annotations/
│ │ ├── line1.json
│ │ ├── line2.json
│ │ ├── ...
├── pie_charts/
│ ├── images/
│ │ ├── pie1.png
│ │ ├── pie2.png
│ │ ├── ...
│ ├── annotations/
│ │ ├── pie1.json
│ │ ├── pie2.json
│ │ ├── ...
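Given this layout, each image can be matched to its annotation by file stem. The following sketch assumes that every chart type folder contains `images/` and `annotations/` subfolders as shown and that images use the `.png` extension; the function name `pair_charts` is hypothetical.

```python
from pathlib import Path


def pair_charts(root):
    """Return sorted (image_path, annotation_path) pairs found under `root`.

    Annotations without a matching PNG in the sibling images/ folder are skipped.
    """
    root = Path(root)
    pairs = []
    for annotation in sorted(root.glob("*/annotations/*.json")):
        image = annotation.parent.parent / "images" / f"{annotation.stem}.png"
        if image.exists():
            pairs.append((image, annotation))
    return pairs
```

Calling `pair_charts("MISCHA-QA")` from the directory that contains the dataset would then yield pairs such as `bar_charts/images/bar1.png` and `bar_charts/annotations/bar1.json`.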
This dataset is intended for research and educational purposes, particularly in the fields of data visualization, misinformation detection, and machine learning. The dataset is also available in Parquet format, ready for training here: Hugging Face.
Contributions to expand the dataset, improve annotations, or add new types of misleading features are welcome. Please submit a pull request or open an issue to discuss your contribution.
For any questions or inquiries, please contact Jane Swingler at jeswingler@usfca.edu.





