This repository contains the corpus preprocessing and training instructions for the models and corpora from the paper "Enhancing Clinical Models with Pseudo Data for De-identification".
- Install the MIMIC-III database as described in the mimic package install section.
- Uncompress the pseudo sources:
  `tar jxf pseudo-source.tar.bz2`
- Install Python dependencies:
  `pip install -r src/python/requirements-all.txt`
- Load the SQLite DB from downloaded lists:
  `./cpbert load`
- Create the admission files:
  `./cpbert admids`
- Create the masked and pseudo corpora files:
  `./cpbert process <admission ID file>`
- Create admission IDs to process:
  `./cpbert adms --shuffle -s 10 -o adm-ids`
- Process the first set of 10:
  `./cpbert process adm-ids/0000 -d pseudos`
- Create the corpus file:
  `find pseudos -name \*-pseudo.txt -exec cat {} >> pseudo-corpus.txt \;`
- Confirm the corpus size as newline, word, and byte counts:
  `wc pseudo-corpus.txt`
- Follow the instructions to reproduce the de-identification results.
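
The corpus-file step above appends with `>>`, so running it a second time duplicates the text already in `pseudo-corpus.txt`. The following is a minimal sketch of one way to rebuild the file from scratch on every run. The `build_corpus` function is a hypothetical helper and not part of `cpbert`; it assumes the pseudo files end in `-pseudo.txt`, as in the `find` command above, and that file names contain no newlines.

```shell
#!/bin/sh
# Hypothetical helper (not part of cpbert): recreate the corpus file on each
# run instead of appending to it, so repeated runs do not duplicate text.
build_corpus() {
    src_dir=$1    # directory that holds the *-pseudo.txt files
    out_file=$2   # corpus file to create or overwrite

    # Sort file names in the C locale so the corpus order is the same on
    # every run, then write all file contents to the output in one pass.
    # The single '>' truncates any previous corpus file.
    find "$src_dir" -name '*-pseudo.txt' -print | LC_ALL=C sort |
        while IFS= read -r f; do
            cat "$f"
        done > "$out_file"
}
```

With the layout used above, the call would be `build_corpus pseudos pseudo-corpus.txt`, followed by `wc pseudo-corpus.txt` to check the counts. Keep the output file outside the source directory, or give it a name that does not end in `-pseudo.txt`, so it is never read as its own input.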
The pretrained de-identification models and the pseudo corpus are available upon request. All requests require documentation of PhysioNet certification, as explained in the paper.
Copyright (c) 2025 Paul Landes