janeswingler/MISCHA-QA

MISCHA-QA Dataset


  • Authors: Jane Swingler, Andrew Wihardja, Arif Syraj
  • Organisation: University of San Francisco
  • Available for training on Hugging Face here.

What is MISCHA-QA?

This repository contains MISCHA-QA, a dataset of synthetic charts designed to demonstrate common forms of misleading visual representation. It includes both misleading and non-misleading examples of bar charts, line graphs, and pie charts, for a total of 8,205 charts. The dataset is intended to support the development and fine-tuning of models that detect misinformation in charts.

Misleading Features and Examples

The dataset includes charts with various types of misleading features. Below are descriptions of each type, along with examples:

Bar Charts

Non-Zero Baseline

The y-axis does not start at zero, which can exaggerate differences between data points. For example, Figure 1 displays a bar chart showing support for gun control legislation, where the y-axis starts at 50% instead of zero. This makes the differences between years appear larger than they actually are, exaggerating the upward trend in support. Figure 2 shows the same data with a zero baseline for comparison.

Figure 1: Non-Zero Baseline Example (GPT-4 Generated Context)

Figure 2: Same Data with a Zero Baseline (GPT-4 Generated Context)
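
The size of the distortion can be estimated by comparing the ratio of the drawn bar heights with the ratio of the underlying values. The following is a minimal sketch of that comparison; the function name `visual_ratio` and the sample values are illustrative assumptions and are not taken from the dataset.

```python
def visual_ratio(a: float, b: float, baseline: float = 0.0) -> float:
    """Ratio of the drawn heights of two bars when the y-axis starts at `baseline`."""
    if a <= baseline or b <= baseline:
        raise ValueError("both values must lie above the baseline")
    return (b - baseline) / (a - baseline)


# Hypothetical values: percent support in an earlier and a later year.
early, late = 55.0, 60.0

true_ratio = visual_ratio(early, late)                # y-axis starts at 0
truncated_ratio = visual_ratio(early, late, 50.0)     # y-axis starts at 50

print(f"zero baseline: later bar is {true_ratio:.2f}x as tall")
print(f"baseline at 50: later bar is {truncated_ratio:.2f}x as tall")
```

With these made-up numbers, an increase of roughly 9% is drawn as a bar twice as tall once the axis starts at 50.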

Inconsistent Time Intervals

The x-axis has inconsistent intervals, which can distort the perception of trends over time. For example, Figure 3 displays a line graph showing annual CO2 emissions with inconsistent time intervals. This may exaggerate or obscure the trends in CO2 emissions. In contrast, Figure 4 shows the same data with consistent time intervals, providing a more accurate representation of the trend.

Figure 3: Inconsistent Time Intervals Example (GPT-4 Generated Context)

Figure 4: Same Data with Consistent Time Intervals (GPT-4 Generated Context)
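
A simple check for this feature is to test whether the gaps between consecutive x-axis values are all equal. The sketch below assumes the x-axis values are integer years; the function name and sample lists are hypothetical and are not part of the dataset's scripts.

```python
def has_consistent_intervals(years: list[int]) -> bool:
    """Return True when consecutive x-axis values are evenly spaced."""
    # Collect the distinct gaps between neighbouring values.
    gaps = {later - earlier for earlier, later in zip(years, years[1:])}
    # Zero or one distinct gap means the spacing is uniform.
    return len(gaps) <= 1


consistent = [2000, 2005, 2010, 2015, 2020]
inconsistent = [2000, 2001, 2005, 2015, 2020]

print(has_consistent_intervals(consistent))
print(has_consistent_intervals(inconsistent))
```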

Pie Charts

Non-Sum to 100%

The segments of the pie chart do not sum to 100%, which can misrepresent the proportions of the categories. For example, Figure 5 displays a pie chart showing global energy consumption by source, where the sum of all segments is less than 100%, giving a false representation of the total energy distribution.

Figure 5: Non-Sum to 100% Example (GPT-4 Generated Context)
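
This feature can be checked by summing the segment values and comparing the total with 100. The sketch below is one possible check; the tolerance of 0.5 (to allow for rounding in labels) and the sample segment values are assumptions made for illustration.

```python
import math


def sums_to_100(segments: dict[str, float], tolerance: float = 0.5) -> bool:
    """Return True when the segment percentages add up to 100 within `tolerance`."""
    total = sum(segments.values())
    return math.isclose(total, 100.0, abs_tol=tolerance)


# Hypothetical energy shares that add up to 85, leaving part of the whole unaccounted for.
energy = {"Oil": 30.0, "Coal": 25.0, "Gas": 20.0, "Renewables": 10.0}

print(sum(energy.values()), sums_to_100(energy))
```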

Over-Segmentation

The pie chart includes too many small segments, which makes their sizes hard to compare and obscures the true distribution of the data. For example, Figure 6 displays a pie chart showing the distribution of voter support among U.S. presidential candidates in 2024, with so many small segments that the overall distribution is difficult to interpret.

Figure 6: Over-Segmentation Example (GPT-4 Generated Context)
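
The dataset description does not state a numeric cut-off for over-segmentation, so any automated check has to pick its own. The sketch below is a heuristic under assumed thresholds: `max_segments`, `min_share`, and `max_small` are hypothetical parameters chosen for illustration, as are the sample shares.

```python
def is_over_segmented(
    shares: list[float],
    max_segments: int = 8,    # assumed limit on the number of slices
    min_share: float = 3.0,   # assumed size (in percent) below which a slice counts as small
    max_small: int = 3,       # assumed limit on the number of small slices
) -> bool:
    """Heuristic: flag a pie chart with too many slices or too many small slices."""
    small = sum(1 for share in shares if share < min_share)
    return len(shares) > max_segments or small > max_small


# Hypothetical charts: one with twelve slices, one with four.
crowded = [30.0, 22.0, 15.0, 10.0, 6.0, 5.0, 4.0, 2.5, 2.0, 1.5, 1.0, 1.0]
simple = [40.0, 30.0, 20.0, 10.0]

print(is_over_segmented(crowded))
print(is_over_segmented(simple))
```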

Chart Generation

LLM-Generated Charts

A total of 4000 chart datasets were created using a large language model (GPT-4). These datasets were designed to simulate realistic but potentially misleading visual representations. The scripts used for LLM dataset generation can be found here.

Python-Generated Charts

A total of 4005 charts were created using Python scripts and algorithms. The raw data points, including values and x-axis data for time series, were programmatically generated to diversify the dataset, as LLM-generated data points were often repetitive and limited in range. The titles, sources, and x-axis categories for categorical data were enriched using GPT-4 to enhance the dataset. The scripts used for Python data generation can be found here.

Dataset Breakdown

| Chart Type | Total Charts | Misleading | Non-Misleading | Misleading Feature | Count |
|---|---|---|---|---|---|
| Bar Charts | 4,759 | 2,343 | 2,416 | Non-Zero Baseline | 1,653 |
| | | | | Inconsistent Time Intervals | 690 |
| Time Series Line Charts | 1,150 | 730 | 420 | Inconsistent Time Intervals | 730 |
| Pie Charts | 2,296 | 1,150 | 1,146 | Non-Sum to 100% | 239 |
| | | | | Over-Segmentation | 662 |
| Overall Summary | 8,205 | 4,223 | 3,982 | | |
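
The per-type totals in the breakdown can be cross-checked against the overall summary with a few lines of arithmetic. The sketch below copies the counts from the breakdown; the variable names are illustrative.

```python
# (total, misleading, non-misleading) per chart type, copied from the breakdown.
rows = {
    "Bar Charts": (4759, 2343, 2416),
    "Time Series Line Charts": (1150, 730, 420),
    "Pie Charts": (2296, 1150, 1146),
}

# Each row's misleading and non-misleading counts should add up to its total.
for name, (total, misleading, non_misleading) in rows.items():
    assert misleading + non_misleading == total, name

# Column-wise sums should reproduce the overall summary row.
overall = tuple(sum(column) for column in zip(*rows.values()))
print(overall)  # (8205, 4223, 3982)
```

By these counts, slightly more than half of the charts (4,223 of 8,205) are labelled misleading, so the two classes are close to balanced.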

Data Annotation

Each chart's annotation JSON file is structured to provide detailed information about the chart, including its characteristics, whether it's misleading, and the context in which it was created. Below is the structure of the annotation JSON file:

  • ID (id): A unique identifier for the chart.
  • Title (title): A descriptive title for the chart.
  • Image (image): The file path to the image of the chart.
  • Chart Type (visualisation_type): The type of chart (e.g., bar, line, pie).
  • Domain (domain): The domain or context of the chart (e.g., economics, health). This field can be empty if the domain is unspecified.
  • Is Misleading (is_misleading): A string flag, "Yes" or "No", indicating whether the chart is misleading.
  • Misleading Feature (misleading_feature): The specific misleading feature present in the chart, if applicable.
  • Conversations (conversations): A list of query-label pairs that provide insights or explanations about the chart. Each conversation entry contains:
    • Query (query): A question or prompt related to the chart.
    • Label (label): The answer or explanation corresponding to the query.

Example Annotation JSON

Below is an example of a chart's annotation in JSON format:

{
    "id": "pie2239",
    "title": "Distribution of Educational Resources",
    "image": "output_charts\\pie2239.png",
    "visualisation_type": "pie",
    "domain": "Education",
    "is_misleading": "Yes",
    "misleading_feature": "Non-Sum to 100",
    "conversations": [
        {
            "query": "What are the segments of the pie chart?",
            "label": "Textbooks, Online Courses, Tutoring Services"
        },
        {
            "query": "What is the sum of the segments of the pie chart?",
            "label": "120"
        },
        {
            "query": "Is this pie chart misleading? Explain",
            "label": "Yes, this chart is misleading because the sum of the segments is 120, which exceeds 100. Pie charts should have segments that sum to 100 to accurately represent the distribution of the whole."
        }
    ]
}
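
An annotation in this format can be read with Python's standard `json` module. The sketch below parses an abridged copy of the example above and converts the "Yes"/"No" flag to a boolean. The helper name `parse_annotation` and the choice of required keys are assumptions; in particular, it treats `misleading_feature` as optional, since the field is described as present only if applicable.

```python
import json

# Keys assumed to be present in every annotation (an assumption based on the example).
REQUIRED_KEYS = {"id", "title", "image", "visualisation_type",
                 "domain", "is_misleading", "conversations"}


def parse_annotation(raw: str) -> dict:
    """Parse one annotation and normalise the misleading flag to a bool."""
    record = json.loads(raw)
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        raise KeyError(f"annotation is missing keys: {sorted(missing)}")
    # The flag is stored as the string "Yes" or "No".
    record["is_misleading"] = record["is_misleading"] == "Yes"
    return record


# Abridged copy of the example annotation; a raw string keeps the JSON escapes intact.
raw = r"""
{
    "id": "pie2239",
    "title": "Distribution of Educational Resources",
    "image": "output_charts\\pie2239.png",
    "visualisation_type": "pie",
    "domain": "Education",
    "is_misleading": "Yes",
    "misleading_feature": "Non-Sum to 100",
    "conversations": [
        {"query": "What is the sum of the segments of the pie chart?", "label": "120"}
    ]
}
"""

record = parse_annotation(raw)
print(record["id"], record["is_misleading"], record.get("misleading_feature"))
for turn in record["conversations"]:
    print(turn["query"], "->", turn["label"])
```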

Directory Structure

MISCHA-QA/
├── bar_charts/
│   ├── images/
│   │   ├── bar1.png
│   │   ├── bar2.png
│   │   ├── ...
│   ├── annotations/
│   │   ├── bar1.json
│   │   ├── bar2.json
│   │   ├── ...
├── line_graphs/
│   ├── images/
│   │   ├── line1.png
│   │   ├── line2.png
│   │   ├── ...
│   ├── annotations/
│   │   ├── line1.json
│   │   ├── line2.json
│   │   ├── ...
├── pie_charts/
│   ├── images/
│   │   ├── pie1.png
│   │   ├── pie2.png
│   │   ├── ...
│   ├── annotations/
│   │   ├── pie1.json
│   │   ├── pie2.json
│   │   ├── ...
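
Given this layout, each image can be matched to its annotation by file stem. The sketch below walks the tree with `pathlib`; the function name `pair_files` is hypothetical, and the root path `MISCHA-QA` assumes the repository has been cloned into the current working directory.

```python
from pathlib import Path


def pair_files(root: Path) -> list[tuple[Path, Path]]:
    """Pair each <type>/images/<name>.png with <type>/annotations/<name>.json."""
    pairs = []
    for images_dir in sorted(root.glob("*/images")):
        annotations_dir = images_dir.parent / "annotations"
        for image in sorted(images_dir.glob("*.png")):
            annotation = annotations_dir / f"{image.stem}.json"
            # Skip images that have no matching annotation file.
            if annotation.exists():
                pairs.append((image, annotation))
    return pairs


if __name__ == "__main__":
    # Assumed location of a local clone; prints nothing if the folder is absent.
    for image, annotation in pair_files(Path("MISCHA-QA")):
        print(image, annotation)
```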

Usage

This dataset is intended for research and educational purposes, particularly in the fields of data visualization, misinformation detection, and machine learning. The dataset is also available in Parquet format, ready for training here: Hugging Face.

Contributing

Contributions to expand the dataset, improve annotations, or add new types of misleading features are welcome. Please submit a pull request or open an issue to discuss your contribution.

Contact

For any questions or inquiries, please contact Jane Swingler at jeswingler@usfca.edu.
