SciDraw-6K

A curated dataset of 6,291 scientific illustrations synthesized by Google Gemini, with aligned prompts in 11 languages, powering sci-draw.com — a public scientific drawing service.

This repository contains the code and documentation that accompany the dataset. The dataset itself (metadata + images) is hosted on the Hugging Face Hub and archived on Zenodo; this repository does not redistribute any data files.

At a glance


Rows	6,291
Categories	8 (biomedical, chemistry, materials, electronics, environment, ai_system, physics, other)
Languages	11 (en, zh, ja, ko, de, fr, es, pt_br, zh_tw, it, ru)
Source model	Google Gemini (`gemini-2.5-flash-image`, `gemini-3-pro-image-preview`, `gemini-3.1-flash-image-preview`)
Time span	2026-01 – 2026-04
Total size	~19 GB images + 48 MB Parquet metadata
Data license	CC BY 4.0
Code license	MIT

Quickstart

Install dependencies:

pip install -r requirements.txt

Load the metadata (no image download) via the datasets library:

from datasets import load_dataset

ds = load_dataset("SciDrawAI/SciDraw-6K", split="train")
print(ds)
print(ds[0]["prompts"]["en"])

Or read the Parquet metadata directly with pandas, without pulling images (`hf://` paths require the `huggingface_hub` package to be installed):

import pandas as pd

df = pd.read_parquet(
    "hf://datasets/SciDrawAI/SciDraw-6K/metadata.parquet"
)
print(df["release_category"].value_counts())
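As a sketch of category-wise subsetting, the snippet below filters metadata rows by `release_category` and pulls one language out of the `prompts` object. It assumes rows are plain dictionaries shaped like the schema in this README (for example, parsed from `metadata.jsonl`). The helper names `load_rows` and `prompts_for` are hypothetical and are not part of this repository's scripts.

```python
import json
from pathlib import Path
from typing import Iterable, Iterator


def load_rows(jsonl_path: str) -> Iterator[dict]:
    """Yield one metadata row per non-empty line of a JSON Lines file."""
    with Path(jsonl_path).open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


def prompts_for(rows: Iterable[dict], category: str, lang: str = "en") -> list[str]:
    """Return the prompts in `lang` for rows whose release_category matches."""
    selected = []
    for row in rows:
        if row.get("release_category") != category:
            continue
        prompt = (row.get("prompts") or {}).get(lang)
        if prompt:  # skip rows with a missing or empty translation
            selected.append(prompt)
    return selected
```

A call such as `prompts_for(load_rows("./scidraw6k/metadata.jsonl"), "chemistry", lang="de")` would then list the German chemistry prompts; the file location under the download directory is an assumption.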

Download the full dataset (images included) to local disk:

python scripts/download.py --output ./scidraw6k

Regenerate the paper's statistical figures from metadata:

python scripts/compute_stats.py --metadata ./scidraw6k/metadata.parquet --out ./stats

Verify image integrity against published SHA-256 hashes:

python scripts/verify_integrity.py --root ./scidraw6k
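The integrity check amounts to re-hashing each image and comparing the digest with the `image_sha256` field. The following is a minimal sketch of that idea, not the code in `scripts/verify_integrity.py`. The function names are illustrative, and it assumes that `image` paths are relative to the download root and that digests are stored as hex strings.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large images are never loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(root: str, rows: list[dict]) -> list[str]:
    """Return ids of rows whose image is missing or whose hash differs."""
    bad = []
    for row in rows:
        image_path = Path(root) / row["image"]
        if not image_path.is_file():
            bad.append(row["id"])  # file was never downloaded or was removed
        elif sha256_of_file(image_path) != row["image_sha256"].lower():
            bad.append(row["id"])  # bytes on disk differ from the published hash
    return bad
```

An empty return value means every listed image is present and matches its recorded hash.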

Repository layout

scidraw-6k/
├── README.md
├── LICENSE                       # MIT (code)
├── CITATION.cff                  # cite via Zenodo DOI
├── requirements.txt
├── schema/
│   └── metadata_schema.json      # JSON Schema for metadata rows
├── scripts/
│   ├── download.py               # fetch metadata + images from the HF Hub
│   ├── compute_stats.py          # reproduce category / language / length / time / model figures
│   ├── make_splits.py            # reproduce prompt-grouped train/val/test splits
│   └── verify_integrity.py       # re-hash images and check against metadata SHA-256 column
├── examples/
│   ├── load_dataset.py           # smallest possible load example
│   ├── filter_by_category.py     # category-wise subsetting
│   └── retrieval_demo.py         # few-shot prompt retrieval with sentence-transformers
└── docs/
    └── construction_pipeline.md  # extended version of paper Section 3
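
A prompt-grouped split keeps every row that shares a prompt in the same partition, so near-identical generations cannot leak between train and test. One common way to get this, sketched below, is to hash the prompt text and map the hash to a bucket. This illustrates the general technique and is not a description of `scripts/make_splits.py`; the grouping key (the `en` prompt), the 80/10/10 ratios, and the function names are all assumptions.

```python
import hashlib


def assign_split(prompt: str, val_pct: int = 10, test_pct: int = 10) -> str:
    """Deterministically map a prompt to 'train', 'val', or 'test'."""
    # Normalise so trivial whitespace/case differences share a group.
    key = " ".join(prompt.lower().split())
    # Use a stable hash; Python's built-in hash() is salted per process.
    bucket = int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "val"
    return "train"


def split_rows(rows: list[dict]) -> dict[str, list[dict]]:
    """Group metadata rows into splits keyed by their English prompt."""
    splits = {"train": [], "val": [], "test": []}
    for row in rows:
        splits[assign_split(row["prompts"]["en"])].append(row)
    return splits
```

Because the assignment depends only on the prompt text, rerunning the split on a later release keeps existing prompts in the partition they already had.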

Metadata schema

Each row in metadata.parquet / metadata.jsonl contains:

Field	Type	Description
`id`	string	Unique image identifier
`image`	string	Relative path to the image file (e.g. `images/biomedical/gal_xxx.png`)
`image_ext`	string	File extension (usually `png`)
`raw_category`	string	Original fine-grained category label
`release_category`	string	Normalized 8-class category
`category`	string	Alias of `release_category`
`prompts`	object	Prompt object keyed by `original` plus the 11 language codes: `en`, `zh`, `ja`, `ko`, `de`, `fr`, `es`, `pt_br`, `zh_tw`, `it`, `ru`
`gemini_model`	string | null	Gemini model identifier (null for ~7% of rows)
`generation_type`	string | null	Generation type (e.g. `text_to_image`)
`created_at`	string	ISO 8601 timestamp
`image_sha256`	string	SHA-256 of the image bytes (used by `verify_integrity.py`)

The canonical machine-readable schema is schema/metadata_schema.json.
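
For quick sanity checks without a JSON Schema validator, the field table above can be mirrored in a few lines of plain Python. This sketch only encodes what this README states (field names, string versus nullable string, and the 8 release categories listed under "At a glance"). It is not a substitute for `schema/metadata_schema.json`, whose exact constraints are not reproduced here, and `check_row` is a hypothetical helper.

```python
REQUIRED_STRINGS = ("id", "image", "image_ext", "raw_category",
                    "release_category", "category", "created_at", "image_sha256")
NULLABLE_STRINGS = ("gemini_model", "generation_type")
CATEGORIES = {"biomedical", "chemistry", "materials", "electronics",
              "environment", "ai_system", "physics", "other"}


def check_row(row: dict) -> list[str]:
    """Return a list of readable problems; an empty list means the row looks fine."""
    problems = []
    for field in REQUIRED_STRINGS:
        if not isinstance(row.get(field), str):
            problems.append(f"{field}: expected string")
    for field in NULLABLE_STRINGS:
        if row.get(field) is not None and not isinstance(row[field], str):
            problems.append(f"{field}: expected string or null")
    if not isinstance(row.get("prompts"), dict):
        problems.append("prompts: expected object")
    if row.get("release_category") not in CATEGORIES:
        problems.append("release_category: not one of the 8 classes")
    # The table describes `category` as an alias of `release_category`.
    if row.get("category") != row.get("release_category"):
        problems.append("category: should alias release_category")
    return problems
```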

Application: sci-draw.com

SciDraw-6K underpins sci-draw.com, a public scientific drawing service. The live service uses the dataset in three ways:

Template seeding — curated prompts serve as one-click templates, organized by category, with aligned 11-language prompts letting users start in their preferred language.
Few-shot prompt rewriting — user requests are rewritten using nearest-neighbour prompts from SciDraw-6K as in-context exemplars before dispatch to Gemini.
Regression evaluation — a held-out slice is used as a regression suite when the underlying image-generation model is upgraded.

See examples/retrieval_demo.py for a minimal implementation of the few-shot retrieval pattern.
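
To illustrate the nearest-neighbour step without any model download, the sketch below ranks dataset prompts by cosine similarity over bag-of-words counts and formats the top matches as in-context exemplars. This deliberately swaps in a much cruder similarity than the sentence-transformers embeddings used in `examples/retrieval_demo.py`, and the function names and the exemplar template are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lower-case the text and count alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_prompts(query: str, prompts: list[str], k: int = 3) -> list[str]:
    """Return the k prompts most similar to the query, best first."""
    query_vec = bag_of_words(query)
    ranked = sorted(prompts, key=lambda p: cosine(query_vec, bag_of_words(p)),
                    reverse=True)
    return ranked[:k]


def build_rewrite_context(query: str, prompts: list[str], k: int = 3) -> str:
    """Format retrieved prompts as numbered exemplars ahead of the user request."""
    lines = [f"Example {i}: {p}"
             for i, p in enumerate(nearest_prompts(query, prompts, k), 1)]
    lines.append(f"User request: {query}")
    return "\n".join(lines)
```

The string returned by `build_rewrite_context` is the kind of context a rewriting model could receive before producing the final image prompt. Swapping `bag_of_words` and `cosine` for embedding-based similarity changes the ranking quality but not the overall shape of the pipeline.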

Visit sci-draw.com to try the service.

Citation

Please cite the Zenodo record when using SciDraw-6K:

@misc{chen2026scidraw6k,
  author       = {Chen, Davie},
  title        = {{SciDraw-6K}: A Multilingual Scientific Illustration
                  Dataset Generated by {Google Gemini}},
  year         = {2026},
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.19642870},
  url          = {https://doi.org/10.5281/zenodo.19642870}
}

GitHub also exposes a "Cite this repository" button sourced from CITATION.cff.

Related resources

Service: sci-draw.com — public scientific drawing platform powered by this dataset
Dataset: SciDrawAI/SciDraw-6K on Hugging Face
Archive / DOI: 10.5281/zenodo.19642870

License

Data (images and prompts, hosted on Hugging Face and Zenodo): Creative Commons Attribution 4.0 International (CC BY 4.0).
Code (this repository): MIT License.

Downstream users are responsible for independently verifying current licensing and redistribution constraints of Google Gemini's outputs before republishing mirrors or derived artefacts.
