MedCAT can be used to extract information from Electronic Health Records (EHRs) and link it to biomedical ontologies like SNOMED-CT and UMLS. The preprint is available on arXiv.
A demo application is available at MedCAT. Please note that this was trained on MedMentions and contains a small portion of UMLS.
Please use Discussions as an interest group, or as a place to ask questions and make suggestions without opening an Issue.
A guide on how to use MedCAT is available in the tutorial folder. Read more about MedCAT on Towards Data Science.
- Treatment with ACE-inhibitors is not associated with early severe SARS-Covid-19 infection in a multi-site UK acute Hospital Trust
- Supplementing the National Early Warning Score (NEWS2) for anticipating early deterioration among patients with COVID-19 infection
- Comparative Analysis of Text Classification Approaches in Electronic Health Records
- Experimental Evaluation and Development of a Silver-Standard for the MIMIC-III Clinical Coding Dataset
- MedCATtrainer - an interface for building, improving and customising a given Named Entity Recognition and Linking (NER+L) model (MedCAT) for biomedical domain text.
- MedCATservice - implements the MedCAT NLP application as a service behind a REST API (a request sketch follows this list).
- iCAT - A docker container for CogStack/MedCAT/HuggingFace development in isolated environments.
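For illustration, here is a minimal sketch of querying a running MedCATservice instance from Python. The host, port, endpoint path and payload shape are assumptions made for this example only; check the MedCATservice documentation for the actual API.

```python
import requests

# Assumed host/port and endpoint path - adjust to your MedCATservice deployment.
SERVICE_URL = "http://localhost:5000/api/process"

# Assumed payload shape: a JSON body carrying the text to annotate.
payload = {"content": {"text": "Patient presents with kidney failure"}}

resp = requests.post(SERVICE_URL, json=payload)
resp.raise_for_status()

# The service returns the annotations produced by MedCAT as JSON.
print(resp.json())
```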
- Install MedCAT
pip install --upgrade medcat
- Get the scispacy models:
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_core_sci_md-0.2.4.tar.gz
- Download the Vocabulary and CDB from the Models section below
- Quickstart:
from medcat.cat import CAT
from medcat.utils.vocab import Vocab
from medcat.cdb import CDB
vocab = Vocab()
# Load the vocab model you downloaded
vocab.load_dict('<path to the vocab file>')
# Load the cdb model you downloaded
cdb = CDB()
cdb.load_dict('<path to the cdb file>')
# create cat
cat = CAT(cdb=cdb, vocab=vocab)
# Test it
text = "My simple document with kidney failure"
doc_spacy = cat(text)
# Print detected entities
print(doc_spacy.ents)
# Or get an array of entities; this returns much more information
# and is usually easier to use unless you know a lot about spaCy
doc = cat.get_entities(text)
print(doc)

A basic trained model is made public for the Vocabulary and CDB. It is trained on the ~35K concepts available in MedMentions, so it is quite limited and the performance might not be the best.
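Because the public model is limited, it can help to run MedCAT's self-supervised training over your own clinical documents before use. The sketch below is based on the older MedCAT API used in the quickstart above; the cat.train flag and overall flow are assumptions drawn from earlier tutorials, and the exact training entry point may differ in your version.

```python
# Example documents - replace with your own iterable of EHR texts.
my_documents = [
    "Patient admitted with acute kidney failure and hypertension.",
    "No evidence of diabetes mellitus on this admission.",
]

cat.train = True           # assumption: enables online training while annotating
for text in my_documents:  # annotate each document so the model updates its concepts
    _ = cat(text)
cat.train = False          # switch back to inference mode
```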
Vocabulary Download - Built from MedMentions
CDB Download - Built from MedMentions
(Note: this was compiled from MedMentions and does not have any data from NLM, as that data is not publicly available.)
If you have access to UMLS or SNOMED-CT and can provide some proof (a screenshot of the UMLS profile page is perfect, feel free to redact all information you do not want to share), contact us - we are happy to share the pre-built CDB and Vocab for those databases.
Entity extraction was trained on MedMentions. In total it covers ~35K entities from UMLS.
The vocabulary was compiled from Wiktionary. In total it contains ~800K unique words.
A big thank you goes to spaCy and Hugging Face - who made life a million times easier.
@misc{kraljevic2020multidomain,
title={Multi-domain Clinical Natural Language Processing with MedCAT: the Medical Concept Annotation Toolkit},
author={Zeljko Kraljevic and Thomas Searle and Anthony Shek and Lukasz Roguski and Kawsar Noor and Daniel Bean and Aurelie Mascio and Leilei Zhu and Amos A Folarin and Angus Roberts and Rebecca Bendayan and Mark P Richardson and Robert Stewart and Anoop D Shah and Wai Keong Wong and Zina Ibrahim and James T Teo and Richard JB Dobson},
year={2020},
eprint={2010.01165},
archivePrefix={arXiv},
primaryClass={cs.CL}
}