Frequently Asked Questions
This page answers common questions about setting up and using pat2vec.
pat2vec is a Python-based tool designed to transform raw electronic health records (EHR) into structured, time-series feature vectors. This process makes the data suitable for machine learning tasks, particularly binary classification. It can aggregate data at the patient level or construct detailed longitudinal timelines.
The install_pat2vec.sh script for Unix/Linux includes a --proxy flag specifically for this purpose. This flag tells pip to use your organization's internal package mirror.
./install_pat2vec.sh --proxy
If you are using Windows or the basic install.sh script, you will need to configure pip to use your proxy manually. This is typically done by setting the http_proxy and https_proxy environment variables or by creating and configuring a pip.conf (Unix) or pip.ini (Windows) file.
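As a rough illustration of the manual route, the sketch below sets the two proxy variables for a single pip process instead of exporting them globally. The helper name `build_pip_env` and the proxy URL are hypothetical placeholders; substitute the address your organization provides.

```python
import os


def build_pip_env(proxy_url, base_env=None):
    """Return a copy of the environment with the proxy variables set.

    The original mapping is left untouched, so the proxy settings only
    apply to processes started with the returned environment.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["http_proxy"] = proxy_url
    env["https_proxy"] = proxy_url
    return env


# Hypothetical usage (the URL is a placeholder, not a real endpoint):
# import subprocess, sys
# env = build_pip_env("http://proxy.example.org:8080")
# subprocess.run(
#     [sys.executable, "-m", "pip", "install", "--upgrade", "pip"],
#     env=env,
#     check=True,
# )
```

Scoping the variables to one subprocess avoids routing unrelated tools through the proxy by accident.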
You need a pre-trained MedCAT model pack (.zip file). These are typically pretrained and then fine-tuned with exports from MedCATtrainer for your specific use case and data.
Once you have the model pack, place it in the medcat_models/ directory, which should be in the same parent folder as your pat2vec repository clone. The installation script creates this directory for you. See https://github.com/CogStack/MedCAT.
your_project_folder/
├── medcat_models/
│ └── your_model.zip <-- Place it here
└── pat2vec/
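To confirm the model pack is where the layout above expects it, a small check along these lines may help. This is only a sketch: `find_model_packs` is a hypothetical helper, not part of pat2vec, and it assumes the model pack is the only kind of .zip file in medcat_models/.

```python
from pathlib import Path


def find_model_packs(project_folder):
    """List .zip model packs in <project_folder>/medcat_models/, sorted by name.

    Returns an empty list if the directory does not exist.
    """
    models_dir = Path(project_folder) / "medcat_models"
    if not models_dir.is_dir():
        return []
    return sorted(p for p in models_dir.glob("*.zip") if p.is_file())


# Hypothetical usage, run from inside the pat2vec clone:
# packs = find_model_packs("..")
# print([p.name for p in packs])
```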
Your credentials should be placed in a file named credentials.py in the parent directory of your pat2vec clone. The install_pat2vec.sh script automatically copies a template for you. If you installed manually, you can copy pat2vec/pat2vec/config/credentials_template.py to the parent directory and edit it.
IMPORTANT: This file contains sensitive information and should never be committed to version control. The root .gitignore file of this project should already be configured to ignore credentials.py.
The structure should look like this:
your_project_folder/
├── credentials.py <-- Edit this file
└── pat2vec/
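For a manual installation, the copy step described above could be scripted roughly as follows. `ensure_credentials` is a hypothetical helper name; the paths are the ones given in this answer, and the function deliberately never overwrites an existing credentials.py.

```python
import shutil
from pathlib import Path


def ensure_credentials(project_folder):
    """Copy the credentials template next to the pat2vec clone if it is missing.

    Returns the path to credentials.py. An existing file is left untouched.
    """
    project = Path(project_folder)
    target = project / "credentials.py"
    template = project / "pat2vec" / "pat2vec" / "config" / "credentials_template.py"
    if target.exists():
        return target
    if not template.is_file():
        raise FileNotFoundError(f"Template not found: {template}")
    shutil.copyfile(template, target)
    return target
```

After the copy, open credentials.py and fill in your own values before running anything that needs them.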
snomed_methods is a helper repository containing utility functions and methods related to SNOMED-CT and other clinical terminologies used in conjunction with this project. It is a dependency for certain feature extraction methods and is cloned automatically by the install_pat2vec.sh script.
- Check Python Version: Ensure you are using Python 3.10 or higher.
- Check venv: Make sure the python3-venv package (or your OS equivalent) is installed.
- Run with --force: If you have a partially completed or corrupted installation, try running the script again with the --force flag. This will remove the existing pat2vec_env directory and perform a clean installation.
  ./install_pat2vec.sh --force
- Check Permissions: Ensure you have write permissions in the directory where you are running the script. The script needs to create directories and files one level above the pat2vec directory.
- Review Logs: Read the error messages in the terminal carefully. They often point to the exact package or command that failed.
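The first few checks in this list can be automated before re-running the installer. The sketch below is an assumption about how one might do that, not something shipped with pat2vec: `preflight` is a hypothetical helper, and it uses the availability of the standard `venv` and `ensurepip` modules as a proxy for the python3-venv package being installed.

```python
import importlib.util
import os
import sys


def preflight(pat2vec_dir="."):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []

    # Python 3.10 or higher is required.
    if sys.version_info < (3, 10):
        problems.append(
            "Python 3.10 or higher is required, found " + sys.version.split()[0]
        )

    # On some Linux distributions these modules are missing until
    # python3-venv (or the OS equivalent) is installed.
    for module in ("venv", "ensurepip"):
        if importlib.util.find_spec(module) is None:
            problems.append(f"Standard module '{module}' is not available")

    # The installer writes one level above the pat2vec directory.
    parent = os.path.dirname(os.path.abspath(pat2vec_dir))
    if not os.access(parent, os.W_OK):
        problems.append(f"No write permission in {parent}")

    return problems


# Hypothetical usage, run from inside the pat2vec clone:
# for problem in preflight("."):
#     print("WARNING:", problem)
```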
Your primary input should be a CSV file. The only strict requirement is that this file must contain a column named client_idcode which holds the unique identifiers for each patient in your cohort.
If you are performing time-series analysis, you will also need a column containing the reference date for each patient (e.g., a diagnosis date) to align the data correctly.
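A quick header check can catch a missing column before a long run starts. This is a minimal sketch using only the standard library; `check_cohort_csv` is a hypothetical helper, and the date column name in the usage comment (`diagnosis_date`) is an example, since the name of the reference date column depends on your own data.

```python
import csv


def check_cohort_csv(path, date_column=None):
    """Return the CSV header after checking that required columns are present.

    client_idcode is always required; date_column is checked only if given.
    """
    # utf-8-sig also accepts files that start with a byte-order mark.
    with open(path, newline="", encoding="utf-8-sig") as handle:
        header = next(csv.reader(handle), [])
    required = ["client_idcode"] + ([date_column] if date_column else [])
    missing = [name for name in required if name not in header]
    if missing:
        raise ValueError(f"Missing required column(s): {', '.join(missing)}")
    return header


# Hypothetical usage:
# check_cohort_csv("cohort.csv", date_column="diagnosis_date")
```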
How do I choose which features to extract?
Feature extraction is controlled via the main_options_dict dictionary in your configuration file. Each feature type is enabled or disabled by setting its key to True or False, so you can select only the features your research needs. Example configuration snippet:
main_options_dict = {
'demo': True, # Enable demographic information
'bmi': True, # Enable BMI information
'bloods': True, # Enable blood-related information
'drugs': True, # Enable drug-related information
'diagnostics': True, # Enable diagnostic information
'core_02': True, # Enable core_02 information
'bed': True, # Enable bed information
'vte_status': True, # Enable VTE status information
'hosp_site': True, # Enable hospital site information
'core_resus': True, # Enable core resuscitation information
'news': True, # Enable NEWS (National Early Warning Score)
'smoking': True, # Enable smoking-related information
'annotations': True, # Enable EPR document annotations via MedCAT
'annotations_mrc': True, # Enable MRC annotations via MedCAT
'negated_presence_annotations': False, # Disable negated presence annotations
'appointments': False, # Disable appointments information
'annotations_reports': False, # Disable reports information
'textual_obs': False, # Disable textual observations
}

config_obj = config_class(
    main_options=main_options_dict,
    # ... other configuration parameters
)

This dictionary is then passed to the config_class constructor via the main_options parameter to control which features are extracted during processing.
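Before passing the dictionary on, it can be useful to list which features are actually switched on and to catch values that are not booleans. The helper below is a hypothetical convenience, not part of pat2vec, and the dictionary in the usage comment is a shortened example.

```python
def enabled_features(options):
    """Return the sorted names of all features set to True.

    Raises TypeError if any value is not a bool, which catches mistakes
    such as writing the string 'False' instead of False.
    """
    for name, value in options.items():
        if not isinstance(value, bool):
            raise TypeError(f"Option '{name}' must be True or False, got {value!r}")
    return sorted(name for name, value in options.items() if value)


# Hypothetical usage with a shortened options dictionary:
# enabled_features({'demo': True, 'bmi': False, 'bloods': True})
```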