immersioncoolingdatabase

Tools for auto-generating a database of thermophysical properties of immersion cooling fluids. The pipeline scrapes scientific literature from Elsevier, RSC, and Springer, then applies a customised ChemDataExtractor (v1.5) to extract properties including dielectric constant, thermal conductivity, dynamic viscosity, and flash point for candidate coolant compounds.

Installation

Install the public ChemDataExtractor base (v1.3):

conda install -c chemdataextractor chemdataextractor

Download the required data files (ML models, dictionaries, etc.):

cde data download

Install the dependency packages for the bespoke immersion-cooling version (chemdataextractor_immersion v1.5):

pip install -r requirements.txt

Web Scraping

All scrapers live in web-scrap/. They retrieve full-text articles from each publisher and save them locally for downstream extraction. Search queries are driven by keywords.txt (one query per line; lines starting with # are ignored).

keywords.txt format example:

("dielectric fluid" OR "heat transfer fluid") AND ("dielectric constant" OR "dielectric strength")
("biobased hydrocarbons" OR "natural esters") AND ("thermal conductivity" OR viscosity OR "flash point")
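
The file format above can be read with a few lines of Python. This is a minimal sketch, not the scrapers' actual loader: the function name `load_queries` is hypothetical, and skipping blank lines is an assumption beyond the documented `#` rule.

```python
from pathlib import Path


def load_queries(path):
    """Return the search queries in a keywords file, one per non-comment line."""
    queries = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        stripped = line.strip()
        # Lines starting with '#' are comments; blank lines are skipped (assumption).
        if not stripped or stripped.startswith("#"):
            continue
        queries.append(stripped)
    return queries
```

Calling `load_queries("keywords.txt")` on the example above would return the two Boolean query strings unchanged, ready to pass to a publisher search.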

Elsevier

Uses the Elsevier ScienceDirect API. Requires an API key set in web-scrap/elsevier_scraper.py.

Run from inside web-scrap/:

python elsevier_scraper.py

Reads queries from keywords.txt (must be present in web-scrap/)
Searches years 2019–2025 for each query
Downloads full-text XML files to ../data/elsevier_papers/<query_folder>/
Completed queries are checkpointed in completed_queries.txt so interrupted runs can resume
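
The resume behaviour can be pictured as follows. This is an illustrative sketch only: it assumes completed_queries.txt stores one finished query per line, and the helper names `pending_queries` and `mark_completed` are hypothetical rather than taken from elsevier_scraper.py.

```python
from pathlib import Path


def pending_queries(queries, checkpoint_path):
    """Return the queries that are not yet listed in the checkpoint file."""
    checkpoint = Path(checkpoint_path)
    done = set()
    if checkpoint.exists():
        lines = checkpoint.read_text(encoding="utf-8").splitlines()
        done = {line.strip() for line in lines if line.strip()}
    # Preserve the original query order so runs are reproducible.
    return [q for q in queries if q not in done]


def mark_completed(query, checkpoint_path):
    """Append a finished query to the checkpoint file."""
    with open(checkpoint_path, "a", encoding="utf-8") as handle:
        handle.write(query + "\n")
```

Appending to the checkpoint only after a query finishes means an interrupted query is retried in full on the next run.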

RSC

Uses Selenium with a headless Chrome browser to search and download from the Royal Society of Chemistry. Requires chromedriver (managed automatically via webdriver-manager).

Run from inside web-scrap/ or the repository root:

# Use queries from keywords.txt (default: ../keywords.txt)
python rsc_scraper.py

# Pass queries directly on the command line
python rsc_scraper.py "immersion cooling" "dielectric fluid battery"

# Set page range and output directory explicitly
python rsc_scraper.py --page-start 1 --page-end 10 --output-dir ../data/rsc_papers

# All options
python rsc_scraper.py --help

Option	Short	Default	Description
`--keywords-file`	`-k`	`../keywords.txt`	Path to the keywords file
`--output-dir`	`-o`	`../data`	Root directory for saved HTML and metadata
`--page-start`	`-s`	`1`	First search-results page to fetch
`--page-end`	`-e`	`10`	Last search-results page to fetch (inclusive)
`--verbose`	`-v`	—	Enable DEBUG-level logging
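
For readers who want to see how such an interface is typically declared, the sketch below mirrors the option table with `argparse`. It is a reconstruction from the table, not the code in rsc_scraper.py; the positional `queries` argument is inferred from the command-line example above, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser():
    """Build a parser whose options and defaults mirror the table above."""
    parser = argparse.ArgumentParser(description="RSC scraper options (sketch)")
    # Optional queries given directly on the command line.
    parser.add_argument("queries", nargs="*", help="search queries")
    parser.add_argument("-k", "--keywords-file", default="../keywords.txt",
                        help="path to the keywords file")
    parser.add_argument("-o", "--output-dir", default="../data",
                        help="root directory for saved HTML and metadata")
    parser.add_argument("-s", "--page-start", type=int, default=1,
                        help="first search-results page to fetch")
    parser.add_argument("-e", "--page-end", type=int, default=10,
                        help="last search-results page to fetch (inclusive)")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="enable DEBUG-level logging")
    return parser
```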

Each article is saved as <output_dir>/<query_folder>/<safe_doi>/article.html alongside a metadata.json file. Already-downloaded articles are skipped automatically.
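
The layout and skip check might look like the sketch below. The sanitising rule in `safe_name` (replace every run of characters other than letters, digits, dots, and hyphens with an underscore) is an assumption for illustration; the scraper's real rule for building `<query_folder>` and `<safe_doi>` may differ, and all three function names are hypothetical.

```python
import re
from pathlib import Path


def safe_name(text):
    """Turn a query or DOI into a filesystem-friendly folder name (assumed rule)."""
    return re.sub(r"[^A-Za-z0-9.\-]+", "_", text).strip("_")


def article_path(output_dir, query, doi):
    """Return <output_dir>/<query_folder>/<safe_doi>/article.html."""
    return Path(output_dir) / safe_name(query) / safe_name(doi) / "article.html"


def needs_download(output_dir, query, doi):
    """An article is skipped when its article.html already exists on disk."""
    return not article_path(output_dir, query, doi).exists()
```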

Flattening RSC output with `article_doi.py`

After the RSC scrape, the articles are nested inside per-query and per-DOI subdirectories. article_doi.py flattens this into a single folder of DOI-named HTML files, ready for bulk extraction.

python article_doi.py --source rsc_filtered --output rsc_articles

Option	Default	Description
`--source`	`rsc_filtered`	Root folder containing category/DOI subdirectories
`--output`	`rsc_articles`	Destination folder for the flat DOI HTML files

Duplicate DOI filenames (same article found under multiple queries) are disambiguated automatically with a numeric suffix.
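
The flattening step, including the numeric-suffix rule, can be sketched as below. This is an approximation of the idea rather than the contents of article_doi.py: it assumes each DOI folder holds a file named article.html, and the `_1`, `_2` suffix format and the name `flatten_articles` are illustrative choices.

```python
import shutil
from pathlib import Path


def flatten_articles(source, output):
    """Copy <source>/<query>/<doi>/article.html to <output>/<doi>.html."""
    source, output = Path(source), Path(output)
    output.mkdir(parents=True, exist_ok=True)
    copied = []
    # Sort so that suffix assignment is deterministic across runs.
    for html in sorted(source.glob("*/*/article.html")):
        stem = html.parent.name  # the DOI-derived folder name
        target = output / f"{stem}.html"
        counter = 1
        # Same DOI found under another query: add a numeric suffix.
        while target.exists():
            target = output / f"{stem}_{counter}.html"
            counter += 1
        shutil.copy2(html, target)
        copied.append(target)
    return copied
```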

Springer

Uses the Springer Nature API. Requires an API key and the target query/year configured directly in web-scrap/springer_scraper.py:

QUERY = "immersion cooling"
YEARS = [2023, 2024, 2025]
API_KEY = "your_springer_api_key"

Then run from inside web-scrap/:

python springer_scraper.py

Downloads article XML files to data/springer_papers/.

Property Extraction

Once articles have been downloaded, run extract.py to extract thermophysical property records using ChemDataExtractor. Provide the folder of HTML/XML files, an output directory, a slice range over the paper list, and a filename stem for the output JSON lines file.

python extract.py --input_dir <paper_folder> --output_dir <output_folder> --start 0 --end 100 --save_name raw_data

Argument	Description
`--input_dir`	Folder containing `.html` or `.xml` article files (required)
`--output_dir`	Folder where the output JSON lines file will be saved (required)
`--start`	Start index into the sorted paper list (default: 0)
`--end`	End index into the sorted paper list (default: 1)
`--save_name`	Stem of the output file, saved as `<save_name>.json` (default: `raw_data`)
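
To make the `--start`/`--end` semantics concrete, the sketch below selects papers the way a Python slice would. It assumes `--end` is exclusive, as in standard slicing, which the description above suggests but does not state outright; `select_papers` is a hypothetical helper, not a function from extract.py.

```python
from pathlib import Path


def select_papers(input_dir, start=0, end=1):
    """Return the slice [start:end] of the sorted .html/.xml files in a folder."""
    papers = sorted(
        p for p in Path(input_dir).iterdir()
        if p.suffix.lower() in {".html", ".xml"}
    )
    # Slicing past the end of the list is safe; it simply returns fewer files.
    return papers[start:end]
```

Under this reading, the defaults (`--start 0 --end 1`) process only the first paper, so a large `--end` is needed to cover a whole folder.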

For example, to process all RSC articles collected above:

python extract.py --input_dir rsc_articles/ --output_dir outputs/ --start 0 --end 500 --save_name raw_data_rsc

The output is a JSON lines file (one record per line) where each line contains a serialised property record with compound name, value, units, specifier, and article metadata (DOI, title, journal, date).
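
Because each line is an independent JSON document, the output can be inspected without ChemDataExtractor installed. The reader below is a minimal sketch; `read_records` is a hypothetical helper, and it makes no assumption about the keys inside each record.

```python
import json


def read_records(path):
    """Load a JSON lines file into a list, one parsed record per non-empty line."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records
```

For example, `len(read_records("outputs/raw_data_rsc.json"))` would give the number of extracted records from the RSC run shown above.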

Data Cleaning

clean_immersion.py reads the JSON lines output from extract.py, converts it to a flat CSV, removes duplicates, and filters out physically implausible values.

Edit the input and output paths at the bottom of the script, then run:

python clean_immersion.py

The script:

Converts nested CDE records into a flat table with columns: Property, Name, Specifier, Raw_value, Raw_unit, Value, Conditions, Unit, DOI, Title, Journal, Date, Warning
Removes duplicate rows on (Property, DOI, Name, Value)
Filters DielectricConstant records to the physically valid range (1.0–200.0)
Saves the result as a CSV file
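
The deduplication and range filter can be sketched with plain dictionaries. This is not the code in clean_immersion.py: it assumes each row is a dict keyed by the column names listed above, that the range bounds are inclusive, and that DielectricConstant rows with a non-numeric Value are dropped. `clean_rows` is a hypothetical name.

```python
def clean_rows(rows, low=1.0, high=200.0):
    """Drop duplicate rows, then drop out-of-range DielectricConstant rows."""
    seen = set()
    cleaned = []
    for row in rows:
        # Duplicates are identified by (Property, DOI, Name, Value).
        key = (row["Property"], row["DOI"], row["Name"], row["Value"])
        if key in seen:
            continue
        seen.add(key)
        if row["Property"] == "DielectricConstant":
            try:
                value = float(row["Value"])
            except (TypeError, ValueError):
                continue  # non-numeric values are dropped (assumption)
            if not low <= value <= high:
                continue
        cleaned.append(row)
    return cleaned
```

The returned list can then be written out with `csv.DictWriter` using the column order given above.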

Acknowledgements

This project was supported by the Department of Mechanical Engineering at the University of Michigan. The author gratefully acknowledges the technical assistance and guidance provided by the publishers during this research, as well as the research resources made available through the University of Michigan's institutional facilities.

Citation

The ChemDataExtractor framework underlying the property extraction is described in:

@article{huang2020database,
  title={A database of battery materials auto-generated using ChemDataExtractor},
  author={Huang, Shu and Cole, Jacqueline M},
  journal={Scientific Data},
  volume={7},
  number={1},
  pages={1--13},
  year={2020},
  publisher={Nature Publishing Group}
}
