---
license: cc-by-nc-4.0
task_categories:
language:
tags:
size_categories:
---
Bridging the semantic gap in AI for Science: A massive dataset of 15.5M+ image-text pairs across 9 STEM disciplines, featuring AI-enhanced captions for superior cross-modal alignment.
Multimodal learning has revolutionized general domain tasks, yet its application in scientific discovery is hindered by the profound semantic gap between complex scientific imagery and sparse textual descriptions.
S1-MMAlign aims to bridge this gap. Unlike simple "image-reading," scientific understanding requires traversing multiple semantic layers involving variables, structures, hypotheses, and inferences. This dataset is built to address this shortcoming in current data resources.
- Total Image-Text Pairs: > 15,500,000
- Source Papers: ~ 2,500,000
- Disciplines Covered: 9 Major STEM Fields
- Alignment Improvement: +18.21% (CLIP Score vs. Raw Data)
- License: CC BY-NC 4.0
To address the pervasive weak alignment of raw scientific captions, we introduce an AI-driven semantic enhancement pipeline: the Qwen-VL series of multimodal large models recaptions each image by synthesizing context from the paper's abstract and citation contexts.
Technical validation demonstrates comprehensive quality improvements across intrinsic metrics and downstream tasks. SciBERT-based pseudo-perplexity metrics verify reduced semantic ambiguity and improved scientific linguistic fluency; CLIP scores show an 18.21% uplift in image-text alignment (with a 27.77% decrease in score variance); and fine-tuning on S1-MMAlign consistently boosts performance on scientific multimodal benchmarks, including zero-shot captioning, visual question answering, and cross-modal scientific reasoning.
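For reference, the CLIP score used above is the cosine similarity between an image embedding and a text embedding. A minimal sketch with synthetic vectors (the 512-dimensional embeddings here are placeholders, not outputs of an actual CLIP model):

```python
import numpy as np

def clip_score(image_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """Cosine similarity between an image embedding and a text embedding."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_emb = text_emb / np.linalg.norm(text_emb)
    return float(np.dot(image_emb, text_emb))

# Synthetic illustration: an embedding close to the image embedding
# (standing in for a recaption) scores higher than an unrelated one
# (standing in for a sparse raw caption).
rng = np.random.default_rng(0)
image = rng.normal(size=512)
aligned = image + 0.1 * rng.normal(size=512)
noisy = rng.normal(size=512)
assert clip_score(image, aligned) > clip_score(image, noisy)
```

In practice the embeddings would come from a CLIP-style encoder; the +18.21% figure refers to this metric averaged over the dataset.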
Recommendation: Please use the `recaption` field for model pre-training.
- `image_path`: The relative path to the image file.
- `recaption` (Recommended): The AI-enhanced caption generated by our pipeline (Qwen-VL). It synthesizes context from the paper abstract and citations to provide a semantically rich description, significantly outperforming the raw caption in alignment and quality.
- `caption`: The original, raw caption extracted from the paper figures (often noisy or sparse).
- `metadata`: Additional information, including the source paper's `arxiv_id` and title.
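A minimal sketch of reading one metadata shard and pairing each image path with its recommended `recaption` (the shard filename and dataset root below are hypothetical; the field names follow the schema above):

```python
import json
from pathlib import Path

def load_pairs(jsonl_path, root="."):
    """Yield (absolute image path, recaption) pairs from one JSONL shard."""
    root = Path(root)
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            # `image_path` is relative to the dataset root; `recaption`
            # is the AI-enhanced caption recommended for pre-training.
            yield root / record["image_path"], record["recaption"]

# Hypothetical usage:
# for img, text in load_pairs("s1_mmalign/shard_000.jsonl", root="s1_mmalign"):
#     ...
```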
The relative image paths in the JSONL files resolve against the file structure we provide. After downloading and decompressing the dataset, keep the directory hierarchy intact: do not flatten the folder structure, as the metadata relies on these relative paths.
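As a quick integrity check after decompression, one can verify that every `image_path` in a shard resolves under the dataset root; this catches an accidentally flattened folder structure before training starts (a sketch, with a hypothetical shard filename):

```python
import json
from pathlib import Path

def missing_images(jsonl_path, root):
    """Return image_path values from a shard that do not exist under root."""
    root = Path(root)
    missing = []
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            rel = json.loads(line)["image_path"]
            if not (root / rel).is_file():
                missing.append(rel)
    return missing

# Hypothetical usage:
# bad = missing_images("s1_mmalign/shard_000.jsonl", "s1_mmalign")
# if bad:
#     print(f"{len(bad)} images missing; check the directory hierarchy")
```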
If you find this dataset useful, please cite our work:
@article{s1mmalign2026,
title={S1-MMAlign: A Large-Scale, Multi-Disciplinary Dataset for Scientific Figure–Text Understanding},
author={He Wang and Longteng Guo and Pengkang Huo and Xuanxu Lin and Yichen Yuan and Jie Jiang and Jing Liu},
journal={ArXiv preprint},
url={https://arxiv.org/abs/2601.00264},
year={2026}
}

This dataset is released under the CC BY-NC 4.0 license for research and non-commercial use only.
- Non-Commercial: Commercial use of the dataset or any images is strictly prohibited.
- Copyrights: The images contained in this dataset are extracted from publicly accessible scientific publications. All copyrights of the original figures remain with their original authors or publishers.
- Compliance: Users must ensure their use complies with the copyrights of the original publications.
