This is a Streamlit application that helps match variable names in a dataset to those in a target codebook, dramatically speeding up the first and often most tedious step in developing a common data model.
The Metadata Harmonisation Interface provides a convenient portal for matching variables from an incoming dataset to a target set of ontologies. In this way, the tool plays a similar role to the White Rabbit tool used by the OHDSI community.
This tool differentiates itself by using Large Language Models to:
- Generate variable descriptions where none have been provided.
- Recommend the most likely target variable to map to.
- Support the creation and testing of variable transformation instructions.
- Provide a confidence score alongside mapping recommendations.
This dramatically speeds up the mapping process.
The easiest way to run the Metadata Harmonisation Tool is with Docker and Docker Compose. This method handles all Python dependencies and configuration for you.
- Option A: Docker (recommended)
- Option B: No Docker
- Docker: Install Docker Desktop for your operating system (Windows, Mac, or Linux).
- Choose how Ollama is provided (two supported options):
- Option A (Recommended): Ollama runs in Docker Compose
- No local Ollama install required.
- Compose will start an ollama container and pre-download the default models on first run.
- If models are not present after startup, pull them inside the container:
docker exec -it ollama-server ollama pull llama3.1:8b
docker exec -it ollama-server ollama pull nomic-embed-text
docker exec -it ollama-server ollama ls
- Option B: Ollama runs on your host machine (outside Docker)
- Install Ollama from ollama.ai and ensure it is running.
- Pull example models:
ollama pull llama3.1:8b
ollama pull nomic-embed-text
- Optional cloud providers (configured in the sidebar: AI Configuration):
- OpenAI: Requires an OPENAI_API_KEY.
- Anthropic (chat-only): Requires an ANTHROPIC_API_KEY.
- Azure OpenAI: Requires an Azure endpoint + API key.
git clone https://github.com/atwine/Metadata-Harmonisation-Tool.git
cd Metadata-Harmonisation-Tool
Create a .env file from the example template. This file will hold your local configuration.
# On Linux or macOS
cp .env.example .env
# On Windows
copy .env.example .env
Open the .env file and set OLLAMA_BASE_URL depending on your chosen Ollama option:
- Option A (Ollama in Docker Compose): set OLLAMA_BASE_URL=http://ollama:11434
- Option B (host Ollama, app in Docker on Windows/Mac): set OLLAMA_BASE_URL=http://host.docker.internal:11434
Note: when using Option A, the Ollama container is published on host port 11435 by default to avoid conflicts with an existing host Ollama (11434). You normally don't need to change this.
Important
- For stability, set the Ollama Base URL in your .env file. Due to Streamlit's page reruns, the sidebar Base URL field can reset when navigating between pages. Defining OLLAMA_BASE_URL in .env keeps the URL consistent across the app.
- Recommended values:
  - Local Ollama (no Docker): http://localhost:11434
  - Docker Compose (container-to-container): http://ollama:11434
  - App in Docker, Ollama on host (Windows/Mac): http://host.docker.internal:11434
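The way these settings interact can be sketched in a few lines of Python. This is only an illustration, not the app's actual code: the function name `resolve_ollama_base_url`, the `DEFAULT_OLLAMA_URL` fallback, and the exact precedence (sidebar value first, then `OLLAMA_BASE_URL`, then a local default) are assumptions based on the description above.

```python
import os

# Assumed fallback for a local, non-Docker Ollama (taken from the recommended values).
DEFAULT_OLLAMA_URL = "http://localhost:11434"


def resolve_ollama_base_url(sidebar_value=None, env=None):
    """Pick the Ollama base URL: sidebar override, then OLLAMA_BASE_URL, then default."""
    env = os.environ if env is None else env
    for candidate in (sidebar_value, env.get("OLLAMA_BASE_URL")):
        if candidate and candidate.strip():
            # Drop surrounding whitespace and a trailing slash so paths can be appended.
            return candidate.strip().rstrip("/")
    return DEFAULT_OLLAMA_URL
```

Because the sidebar value is lost on a page rerun, a value defined in .env is what such a lookup would fall back to, which is why setting it there is the more stable choice.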
This app supports multiple AI providers. Configure it in the sidebar (AI Configuration):
- Ollama (Local): set Base URL and choose local chat + embedding models.
  - Recommended: set OLLAMA_BASE_URL in .env for persistence; the sidebar Base URL is a temporary override and may reset on page changes.
- OpenAI: provide OPENAI_API_KEY, choose chat model + embedding model.
- Anthropic (chat-only): provide ANTHROPIC_API_KEY and choose a Claude model.
  - Note: Anthropic does not provide embeddings; features that need embeddings require Ollama/OpenAI/Azure.
- Azure OpenAI: provide your Azure endpoint + API key, and set deployment/model names.
Use Docker Compose to build the image and start the application.
- Option A (Ollama in Docker Compose):
docker compose -f docker/docker-compose.yml up --build -d
- Option B (host Ollama, start only the app container):
  - Ensure host Ollama is running and models are downloaded.
  - Set OLLAMA_BASE_URL=http://host.docker.internal:11434 in .env.
  - Start only the app service (no Docker Ollama):
[src: https://docs.docker.com/compose/how-tos/production/]
docker compose -f docker/docker-compose.yml up --build --no-deps -d metadata-harmonisation-tool
- --build: Builds the Docker image from the Dockerfile. You only need to do this the first time or when the code changes.
- -d: Runs the container in detached mode (in the background).
Once the container is running, open your web browser and navigate to:
You should now see the Metadata Harmonisation Tool interface.
If you prefer not to use Docker, you can set up a local Python environment using Conda or Micromamba.
Note: If you just installed Conda and the conda or conda activate commands are not recognized, initialize your shell once and then restart your terminal:
- Windows PowerShell: conda init powershell
- Windows Command Prompt (cmd): conda init cmd.exe
- macOS/Linux (bash): conda init bash
# Create and activate environment
conda env create -f environment.yml
conda activate harmonisation_env
pip install -r requirements.txt
# Configure the app
# 1) Copy .env and set OLLAMA_BASE_URL as described above (Quick Start → Step 2)
# 2) If using local Ollama, ensure models are present
ollama pull llama3.1:8b
ollama pull nomic-embed-text
# Run the app
cd app/
streamlit run app.py
Verify your setup:
- In the sidebar, open "AI Configuration" → run the Connection Test
- Or via CLI:
python validate_config.py --all-providers
Open http://localhost:8501 in your browser.
The Docker image is built using a multi-stage Dockerfile to create an optimized and secure production image. You can build it manually with:
docker build -f docker/Dockerfile -t atwine/metadata-harmonisation-tool:latest .
Scripts are provided to simplify pushing the image to Docker Hub.
- Login to Docker Hub:
docker login
- Run the push script:
# On Linux or macOS
chmod +x docker/push-to-dockerhub.sh
./docker/push-to-dockerhub.sh
# On Windows
.\docker\push-to-dockerhub.bat
A validation script is included to check your configuration and test AI provider connectivity.
python validate_config.py --all-providers
Use the sidebar AI Configuration panel:
- Request timeout (seconds) controls the maximum time to wait for AI provider responses before timing out.
This platform is built to harmonise incoming datasets to a single set of target_variables (codebook). An example codebook is included by default. A new codebook can be uploaded under the Upload Codebook tab. New codebooks should be in .csv format and contain two columns: variable_name and description. It is recommended, but not required by this tool, that these variables be linked to standardised ontologies.
From here, incoming study data that needs to be mapped to the target codebook can be uploaded. The following items can be provided:
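Before uploading a new codebook, it can help to check that the file has the expected shape. The following is a minimal sketch of such a check, not part of the tool itself; the helper name `load_codebook` is hypothetical, and only the two column names come from the description above.

```python
import csv
import io

REQUIRED_COLUMNS = ("variable_name", "description")


def load_codebook(csv_text):
    """Parse codebook CSV text into a list of (variable_name, description) pairs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"codebook is missing column(s): {', '.join(missing)}")
    rows = []
    for row in reader:
        name = (row["variable_name"] or "").strip()
        if not name:
            continue  # skip rows without a variable name
        rows.append((name, (row["description"] or "").strip()))
    return rows
```

For a file on disk, the text can be read first with `open(path, newline="", encoding="utf-8").read()` and passed to the function.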
- Study Name (required)
- Study Description (optional)
- Variables Table (required)
- File Format: .csv
- File Contains: variable names or descriptions from the incoming dataset
- 2 columns with headers: variable_name, description
- Example_data (optional)
- File Format: .csv
- Column headers should correspond to the variable names in the dataset variables table.
- Contextual Documents (optional)
- File Format: .pdf
- If the uploaded variables table contains missing variable descriptions, a large language model will be used to populate the descriptions. Uploading a study protocol or some other relevant documentation can help enhance this process.
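The relationship between the variables table and the example data can be checked before upload. The sketch below is illustrative only (the helper name `check_study_files` is hypothetical): it lists the variables that still lack a description, which the LLM would need to fill in, and the example-data columns that do not appear in the variables table.

```python
import csv
import io


def check_study_files(variables_csv, example_csv):
    """Return (variables lacking a description, example columns not in the variables table)."""
    variables = list(csv.DictReader(io.StringIO(variables_csv)))
    names = {(row.get("variable_name") or "").strip() for row in variables}
    needs_description = [
        (row.get("variable_name") or "").strip()
        for row in variables
        if not (row.get("description") or "").strip()
    ]
    # Only the header row of the example data matters for this check.
    example_header = next(csv.reader(io.StringIO(example_csv)), [])
    unknown_columns = [col.strip() for col in example_header if col.strip() not in names]
    return needs_description, unknown_columns
```

An empty second list means every example-data column can be tied back to a declared variable.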
Once studies have been uploaded, you can run the variable description completion and ontology recommendation engines. You will be given the option to fine-tune the LLM prompt used by the description completion engine.
Before running the Recommendation Engine, ensure an AI provider is configured and connected in the sidebar (AI Configuration). If you use Ollama, ensure it is running and models are available.
Once steps 1 and 2 have been completed, a recommendation algorithm suggests the most likely variable mappings for each added dataset. You are then presented with an interface for selecting the correct mapping from a list of suggestions, so the final mapping decision remains manual.
Once the mapping process has been completed, each fully mapped study is available for download as a .csv file. The mapping result is simply a table mapping each dataset variable name to a corresponding codebook variable name.
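One way to use the downloaded mapping is to rename the columns of the original dataset. The sketch below assumes the mapping file has columns named `dataset_variable` and `codebook_variable`; these names are hypothetical, so check the header of your downloaded file and adjust the arguments accordingly.

```python
import csv
import io


def load_mapping(mapping_csv, source_col="dataset_variable", target_col="codebook_variable"):
    """Read a mapping table into a {dataset variable: codebook variable} dict."""
    reader = csv.DictReader(io.StringIO(mapping_csv))
    return {row[source_col]: row[target_col] for row in reader if row[target_col]}


def rename_header(header, mapping):
    """Rename mapped columns; unmapped columns keep their original name."""
    return [mapping.get(col, col) for col in header]
```

Columns without a mapping are left untouched, so they remain visible for follow-up rather than being silently dropped.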
The Metadata Harmonisation Interface comprises two key parts:
First, the LLM-based description generator provides a way to quickly and easily extract variable description information from complex free-text documents such as study protocols or journal articles. In an ideal world, descriptions would come from a codebook and match standardised ontologies, but this is often not the case. The description generator works by taking in a PDF document and converting it to plain text using the pdfminer Python package. Next, we use a text splitter from the LangChain suite of Python functions. This works by recursively splitting the text on special characters (\n\n, \n, and a space) until a text length of 1000 characters is reached. An overlap of 20 characters between chunks is preserved so that information at chunk boundaries is not lost. An embedding model is then used to get a vector representation of each chunk. This information is stored as a simple NumPy array. A prompt is constructed by taking an already completed variable and description pair and retrieving the most relevant context, calculated as the spatial distance between the chunk embeddings and the variable name embedding.
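The chunking and retrieval steps can be illustrated with a small standard-library sketch. It is not the tool's implementation: `split_text` is a simplified greedy stand-in for LangChain's recursive splitter, cosine distance is an assumed choice of "spatial distance", and the embeddings are expected to be supplied by whichever embedding model is configured.

```python
import math


def split_text(text, chunk_size=1000, overlap=20, separators=("\n\n", "\n", " ")):
    """Greedy splitter: cut at the last preferred separator, carry `overlap` chars forward."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            window = text[start:end]
            for sep in separators:
                cut = window.rfind(sep)
                if cut > overlap:  # guarantees progress beyond the overlap
                    end = start + cut
                    break
        chunks.append(text[start:end])
        if end >= len(text):
            break
        start = end - overlap
    return chunks


def cosine_distance(a, b):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def most_relevant_chunks(query_vec, chunk_vecs, chunks, k=3):
    """Return the k chunks whose embeddings are closest to the query embedding."""
    ranked = sorted(range(len(chunks)), key=lambda i: cosine_distance(query_vec, chunk_vecs[i]))
    return [chunks[i] for i in ranked[:k]]
```

In this sketch the query vector would be the embedding of the variable name, and the returned chunks are the context that goes into the prompt.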
The second part is the ontology recommendation engine. This again uses text embeddings to obtain vector representations of variable names and descriptions for both the target codebook and incoming datasets. Recommendations are then calculated using the spatial distance between vectors, weighted 80/20 in favour of descriptions. The interface uses DuckDB to retrieve these recommendations from plain CSV files.
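The weighting can be made concrete with a short sketch. It assumes cosine distance and a simple linear combination (0.8 for descriptions, 0.2 for names); the function and key names are hypothetical, and the tool's actual scoring may differ in detail.

```python
import math

DESCRIPTION_WEIGHT = 0.8
NAME_WEIGHT = 0.2


def cosine_distance(a, b):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def recommend(source, targets, top_n=5):
    """Rank codebook variables by weighted distance to one incoming variable.

    `source` and each value in `targets` are dicts with "name_vec" and "desc_vec".
    """
    scored = []
    for target_name, vecs in targets.items():
        distance = (
            DESCRIPTION_WEIGHT * cosine_distance(source["desc_vec"], vecs["desc_vec"])
            + NAME_WEIGHT * cosine_distance(source["name_vec"], vecs["name_vec"])
        )
        scored.append((distance, target_name))
    scored.sort()  # smallest combined distance first
    return [name for _, name in scored[:top_n]]
```

With this weighting, a target whose description matches closely will outrank one that only shares a similar name.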
- Connection Failed Error: If the application in Docker can't connect to Ollama, ensure OLLAMA_BASE_URL in your .env file is set correctly:
  - Option A (Ollama in Docker Compose): http://ollama:11434
  - Option B (host Ollama on Windows/Mac): http://host.docker.internal:11434
- Connection Test fails but Ollama is reachable: If /api/tags responds but the app's Connection Test fails, Ollama likely has no models yet (first run download). You can pull the starter models inside the Ollama container and verify:
docker exec -it ollama-server ollama pull llama3.1:8b
docker exec -it ollama-server ollama pull nomic-embed-text
docker exec -it ollama-server ollama ls
- Check Container Logs: If the app fails to start, check the logs for errors:
docker logs metadata-harmonisation-tool
- Is Ollama running?: Verify the Ollama application is running on your host system.
- Models not found?: Run ollama ls to confirm llama3.1:8b and nomic-embed-text are downloaded. [src: https://docs.ollama.com/cli]
- Firewall: Ensure no firewall or antivirus software is blocking the connection to http://localhost:11434.
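The "reachable but no models" case can also be checked programmatically. The sketch below parses the JSON returned by Ollama's /api/tags endpoint and reports which of the starter models are absent. The helper name `missing_models` is hypothetical, and the function takes the response body as text so it can be tried without a running server.

```python
import json

REQUIRED_MODELS = ("llama3.1:8b", "nomic-embed-text")


def missing_models(tags_payload, required=REQUIRED_MODELS):
    """Return the required models absent from an Ollama /api/tags JSON response."""
    installed = {model["name"] for model in json.loads(tags_payload).get("models", [])}
    # Index names without their tag so "nomic-embed-text" matches "nomic-embed-text:latest".
    untagged = {name.split(":", 1)[0] for name in installed}
    return [
        name
        for name in required
        if name not in installed and (":" in name or name not in untagged)
    ]
```

To run it against a live server, the payload can be fetched with `urllib.request.urlopen(base_url + "/api/tags").read()`, where `base_url` is whichever of the URLs above applies to your setup.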
- Direct transformations are evaluated using a restricted evaluator (simple arithmetic on variable x only).
- Categorical mappings are parsed using ast.literal_eval() (no eval()), and must be Python dict literals.
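To show what such restrictions look like in practice, here is a minimal sketch of a restricted evaluator built on Python's ast module. It is not the tool's actual code: the function names are hypothetical, and the set of allowed operators (+, -, *, / and unary signs) is an assumption about what "simple arithmetic" covers.

```python
import ast
import operator

_BINARY = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def evaluate_transformation(expression, x):
    """Evaluate simple arithmetic on the single variable `x`; reject anything else."""

    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant):
            # Only plain numbers are allowed (bool is excluded explicitly).
            if isinstance(node.value, (int, float)) and not isinstance(node.value, bool):
                return node.value
        elif isinstance(node, ast.Name) and node.id == "x":
            return x
        elif isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](walk(node.left), walk(node.right))
        elif isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
            return _UNARY[type(node.op)](walk(node.operand))
        raise ValueError(f"unsupported expression element: {type(node).__name__}")

    return walk(ast.parse(expression, mode="eval"))


def parse_categorical_mapping(text):
    """Parse a categorical mapping written as a Python dict literal."""
    mapping = ast.literal_eval(text)  # accepts literals only, never executes code
    if not isinstance(mapping, dict):
        raise ValueError("categorical mapping must be a Python dict literal")
    return mapping
```

Under these assumptions, an instruction such as `(x - 32) * 5 / 9` is accepted, while function calls, attribute access, and names other than x raise a ValueError instead of being executed.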
Please report any issues to the GitHub repository. For more information or support, contact:
- Peter Marsh: peter.marsh@uct.ac.za
- Atwine Mugume: twinmugume@gmail.com
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.


