MISCHA-QA Dataset

Authors: Jane Swingler, Andrew Wihardja, Arif Syraj
Organisation: University of San Francisco
Available for training on Hugging Face here.

What is MISCHA-QA?

This repository contains MISCHA-QA, a dataset of synthetic charts designed to demonstrate common forms of misleading visual representation. It includes both misleading and non-misleading examples of bar charts, line graphs, and pie charts, 8,205 charts in total. The dataset is intended to support the development and fine-tuning of models that detect misinformation in charts.

Misleading Features and Examples

The dataset includes charts with various types of misleading features. Below are descriptions of each type, along with examples:

Bar Charts

Non-Zero Baseline

The y-axis does not start at zero, which can exaggerate differences between data points. For example, Figure 1 displays a bar chart showing support for gun control legislation, where the y-axis starts at 50% instead of zero. This makes the differences between years appear larger than they actually are, exaggerating the upward trend in support. Figure 2 shows the same data with a zero baseline.

Figure 1: Non-Zero Baseline Example (GPT-4 Generated Context)

Figure 2: Same Data with a Zero Baseline (GPT-4 Generated Context)
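
The effect of a truncated baseline can be put into numbers. The sketch below is illustrative only and is not part of the dataset's scripts: the function names and the sample values (55% and 60% support, axis starting at 50%) are assumptions chosen to mirror the example above.

```python
def apparent_ratio(a, b, baseline=0.0):
    """Ratio of the drawn heights of bars b and a when the y-axis starts at baseline."""
    if min(a, b) <= baseline:
        raise ValueError("both values must exceed the baseline")
    return (b - baseline) / (a - baseline)


def exaggeration_factor(a, b, baseline):
    """How much larger the drawn ratio is than the true ratio (1.0 means no distortion)."""
    return apparent_ratio(a, b, baseline) / apparent_ratio(a, b, 0.0)


# Hypothetical values: support rises from 55% to 60%.
true_ratio = apparent_ratio(55, 60)                 # bars drawn from zero
drawn_ratio = apparent_ratio(55, 60, baseline=50)   # bars drawn from 50%
print(true_ratio, drawn_ratio, exaggeration_factor(55, 60, 50))
```

With these values the second bar is drawn twice as tall as the first, although the underlying value is only about 9% larger.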

Inconsistent Time Intervals

The x-axis has inconsistent intervals, which can distort the perception of trends over time. For example, Figure 3 displays a line graph showing annual CO2 emissions with inconsistent time intervals. This may exaggerate or obscure the trends in CO2 emissions. In contrast, Figure 4 shows the same data with consistent time intervals, providing a more accurate representation of the trend.

Figure 3: Inconsistent Time Intervals Example (GPT-4 Generated Context)

Figure 4: Same Data with Consistent Time Intervals (GPT-4 Generated Context)
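
A simple check for this feature is to compare the gaps between consecutive x-axis values. This is a minimal sketch under the assumption that the x-axis values are numeric (for example, years); `has_consistent_intervals` is a hypothetical helper, not a function from this repository.

```python
def has_consistent_intervals(x_values, tol=1e-9):
    """Return True if consecutive x-axis values are evenly spaced."""
    if len(x_values) < 3:
        # Fewer than two gaps: nothing to compare.
        return True
    steps = [b - a for a, b in zip(x_values, x_values[1:])]
    return all(abs(step - steps[0]) <= tol for step in steps)


print(has_consistent_intervals([2000, 2005, 2010, 2015]))  # evenly spaced years
print(has_consistent_intervals([2000, 2001, 2005, 2015]))  # uneven gaps
```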

Pie Charts

Non-Sum to 100%

The segments of the pie chart do not sum to 100%, which can misrepresent the proportions of the categories. For example, Figure 5 displays a pie chart showing global energy consumption by source, where the sum of all segments is less than 100%, giving a false representation of the total energy distribution.

Figure 5: Non-Sum to 100% Example (GPT-4 Generated Context)
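
The sum check itself is straightforward. The sketch below assumes segment values are given as percentages; the function name, the rounding tolerance, and the sample values are illustrative assumptions rather than part of the dataset's code.

```python
def pie_sum_check(segments, tol=0.5):
    """Return (total, is_valid) for a mapping of segment name -> percentage.

    tol allows for small rounding differences in the displayed percentages.
    """
    total = sum(segments.values())
    return total, abs(total - 100) <= tol


# Hypothetical segment values that add up to more than 100.
total, ok = pie_sum_check(
    {"Textbooks": 50, "Online Courses": 40, "Tutoring Services": 30}
)
print(total, ok)
```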

Over-Segmentation

The pie chart includes too many small segments, which makes segment sizes hard to compare and can obscure the true distribution of the data. For example, Figure 6 displays a pie chart showing the voter support distribution among U.S. presidential candidates in 2024, with many small segments that make the overall distribution difficult to interpret.

Figure 6: Over-Segmentation Example (GPT-4 Generated Context)
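
One way to flag over-segmentation is to count the segments that fall below a size threshold. The README does not state what threshold the dataset uses, so the values below (segments under 5%, more than three of them) are assumptions for illustration only.

```python
def is_over_segmented(shares, small_share=5.0, max_small=3):
    """Return True if more than max_small segments are smaller than small_share percent.

    Both thresholds are illustrative assumptions, not values taken from the dataset.
    """
    small_segments = [share for share in shares if share < small_share]
    return len(small_segments) > max_small


print(is_over_segmented([40, 30, 20, 10]))
print(is_over_segmented([35, 25, 20, 4, 4, 4, 4, 4]))
```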

Chart Generation

LLM-Generated Charts

A total of 4000 chart datasets were created using a large language model (GPT-4). These datasets were designed to simulate realistic but potentially misleading visual representations. The scripts used for LLM dataset generation can be found here.

Python-Generated Charts

A total of 4005 charts were created using Python scripts and algorithms. The raw data points, including values and x-axis data for time series, were generated programmatically to diversify the dataset, because LLM-generated data points were often repetitive and limited in range. GPT-4 was then used to supply the titles, sources, and x-axis categories (for categorical data). The scripts used for Python data generation can be found here.
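
As a rough illustration of what programmatic generation of time-series data can look like, the sketch below builds a year axis with either even or uneven gaps and draws values from a wide range. It is not the repository's generation script: the function name, the gap choices, and the value range are all assumptions.

```python
import random


def make_time_series(start_year, n_points, inconsistent=False, seed=None):
    """Return (years, values) for a synthetic time series.

    With inconsistent=True the gaps between years vary, producing the
    "Inconsistent Time Intervals" distortion described earlier.
    """
    if n_points < 3:
        raise ValueError("need at least 3 points to show uneven spacing")
    rng = random.Random(seed)
    if inconsistent:
        gaps = [rng.choice([1, 2, 5]) for _ in range(n_points - 1)]
        if len(set(gaps)) == 1:
            # Force at least one gap to differ so the axis is truly uneven.
            gaps[-1] += 1
    else:
        gaps = [1] * (n_points - 1)
    years = [start_year]
    for gap in gaps:
        years.append(years[-1] + gap)
    # Assumed value range; chosen wide to avoid narrow, repetitive values.
    values = [round(rng.uniform(10, 1000), 1) for _ in years]
    return years, values


years, values = make_time_series(2000, 6, inconsistent=True, seed=0)
print(years, values)
```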

Dataset Breakdown

Chart Type	Total Charts	Misleading	Non-Misleading	Misleading Feature	Count
Bar Charts	4,759	2,343	2,416	Non-Zero Baseline	1,653
				Inconsistent Time Intervals	690
Time Series Line Charts	1,150	730	420	Inconsistent Time Intervals	730
Pie Charts	2,296	1,150	1,146	Non-Sum to 100%	239
				Over-Segmentation	662
Overall Summary	8,205	4,223	3,982
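
The per-type counts in the table can be cross-checked against the overall summary row. The short script below copies the figures from the table and confirms that, for each chart type, misleading plus non-misleading equals the total, and that the columns add up to the summary row.

```python
# (total, misleading, non-misleading) per chart type, copied from the table above.
rows = {
    "Bar Charts": (4759, 2343, 2416),
    "Time Series Line Charts": (1150, 730, 420),
    "Pie Charts": (2296, 1150, 1146),
}
overall = (8205, 4223, 3982)

for name, (total, misleading, non_misleading) in rows.items():
    assert misleading + non_misleading == total, name

# Sum each column across chart types and compare with the summary row.
column_sums = tuple(sum(column) for column in zip(*rows.values()))
assert column_sums == overall

misleading_share = round(100 * overall[1] / overall[0], 1)
print(column_sums, misleading_share)  # (8205, 4223, 3982) 51.5
```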

Data Annotation

Each chart's annotation JSON file is structured to provide detailed information about the chart, including its characteristics, whether it's misleading, and the context in which it was created. Below is the structure of the annotation JSON file:

ID: A unique identifier for the chart.
Title: A descriptive title for the chart.
Image: The file path to the image of the chart.
Chart Type: Specifies the type of chart (e.g., bar, line, pie). Stored under the key visualisation_type in the JSON.
Domain: The domain or context of the chart (e.g., economics, health). This field can be empty if the domain is unspecified.
Is Misleading: A "Yes" or "No" string indicating whether the chart is misleading.
Misleading Feature: Describes the specific misleading feature present in the chart, if applicable.
Conversations: A list of query-label pairs that provide insights or explanations about the chart. Each conversation entry contains:
- Query: A question or prompt related to the chart.
- Label: The answer or explanation corresponding to the query.
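
A small validator can catch malformed annotation files before training. The key names below are taken from the example annotation in this README; whether every file in the dataset uses exactly these keys is an assumption, and `validate_annotation` is a hypothetical helper.

```python
REQUIRED_KEYS = {
    "id", "title", "image", "visualisation_type",
    "domain", "is_misleading", "misleading_feature", "conversations",
}


def validate_annotation(record):
    """Return a list of human-readable problems found in one annotation dict."""
    problems = []
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    if record.get("is_misleading") not in ("Yes", "No"):
        problems.append('is_misleading must be "Yes" or "No"')
    for index, turn in enumerate(record.get("conversations", [])):
        if not {"query", "label"} <= turn.keys():
            problems.append(f"conversation {index} lacks query or label")
    return problems
```

An empty list means the record passed all checks.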

Example Annotation JSON

Below is an example of a chart's annotation in JSON format:

{
    "id": "pie2239",
    "title": "Distribution of Educational Resources",
    "image": "output_charts\\pie2239.png",
    "visualisation_type": "pie",
    "domain": "Education",
    "is_misleading": "Yes",
    "misleading_feature": "Non-Sum to 100",
    "conversations": [
        {
            "query": "What are the segments of the pie chart?",
            "label": "Textbooks, Online Courses, Tutoring Services"
        },
        {
            "query": "What is the sum of the segments of the pie chart?",
            "label": "120"
        },
        {
            "query": "Is this pie chart misleading? Explain",
            "label": "Yes, this chart is misleading because the sum of the segments is 120, which exceeds 100. Pie charts should have segments that sum to 100 to accurately represent the distribution of the whole."
        }
    ]
}
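
For training, an annotation like the one above can be flattened into one record per query-label pair. The sketch below is one possible approach and makes two assumptions: that the image field may contain Windows-style backslashes (as in the example), and that only the file name is needed to locate the image. The function name and output field names are hypothetical.

```python
import json
from pathlib import PureWindowsPath


def to_qa_records(annotation_text):
    """Turn one annotation JSON string into a list of flat question-answer records."""
    record = json.loads(annotation_text)
    # PureWindowsPath splits on both backslashes and forward slashes,
    # so this works regardless of the platform running the code.
    image_name = PureWindowsPath(record["image"]).name
    is_misleading = record["is_misleading"] == "Yes"
    return [
        {
            "image": image_name,
            "query": turn["query"],
            "label": turn["label"],
            "is_misleading": is_misleading,
        }
        for turn in record["conversations"]
    ]
```

Applied to the example annotation, this yields three records, each pointing at pie2239.png.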

Directory Structure

MISCHA-QA/
├── bar_charts/
│   ├── images/
│   │   ├── bar1.png
│   │   ├── bar2.png
│   │   ├── ...
│   ├── annotations/
│   │   ├── bar1.json
│   │   ├── bar2.json
│   │   ├── ...
├── line_graphs/
│   ├── images/
│   │   ├── line1.png
│   │   ├── line2.png
│   │   ├── ...
│   ├── annotations/
│   │   ├── line1.json
│   │   ├── line2.json
│   │   ├── ...
├── pie_charts/
│   ├── images/
│   │   ├── pie1.png
│   │   ├── pie2.png
│   │   ├── ...
│   ├── annotations/
│   │   ├── pie1.json
│   │   ├── pie2.json
│   │   ├── ...
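
Given this layout, each image can be matched to its annotation by file stem. The following is a minimal sketch that assumes every chart folder contains images/ and annotations/ subfolders with matching names, as shown in the tree; `pair_files` is a hypothetical helper.

```python
from pathlib import Path


def pair_files(root):
    """Return sorted (image_path, annotation_path) pairs found under root."""
    root = Path(root)
    pairs = []
    for chart_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        images = chart_dir / "images"
        annotations = chart_dir / "annotations"
        if not (images.is_dir() and annotations.is_dir()):
            # Skip folders that do not follow the expected layout.
            continue
        for annotation in sorted(annotations.glob("*.json")):
            image = images / f"{annotation.stem}.png"
            if image.exists():
                pairs.append((image, annotation))
    return pairs
```

Annotations without a matching image are skipped silently; a stricter version could report them instead.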

Usage

This dataset is intended for research and educational purposes, particularly in the fields of data visualization, misinformation detection, and machine learning. It is also available in Parquet format, ready for training, on Hugging Face.

Contributing

Contributions to expand the dataset, improve annotations, or add new types of misleading features are welcome. Please submit a pull request or open an issue to discuss your contribution.

Contact

For any questions or inquiries, please contact Jane Swingler at jeswingler@usfca.edu.
