Code repository for the Master's thesis by Martin Richter.
This thesis investigates whether mechanisms of protein-ligand binding can be learnt from synthetically generated traces. First, a framework for generating synthetic Biolayer Interferometry (BLI)-like data is created. This data is used to train and evaluate a machine-learning-based classifier. The approach is evaluated on synthetic data from three reaction mechanisms and achieves over 90 % accuracy on an independent test set. To evaluate real-world performance, a case study is performed on experimental data that was incorrectly fitted in the database.
In MATLAB, open the BatchDatagen.mlx live script to see how batches of synthetic data can be generated for predefined mechanisms and a given number of samples. Use the function generateDataBatch() to generate batches of data, or the function generateData() to generate single randomized samples.
The experimental data used in this project comes from Proteinbase. Using the index file provided at proteinbase.com/download, the experimental data can be downloaded with the DownloadProteinbase.ipynb Python notebook.
This work used Proteinbase by Adaptyv Bio under the ODC-BY license.
In MATLAB, use the function classifyTraces() to classify an experiment. To see how to classify experimental data, view ExploreProteinBaseClassification.mlx.
In Python, see the TrainForProteinbase.ipynb notebook for the full model training, classification and evaluation pipeline.
The code is split into two main folders by programming language: MATLAB and Python. In both, files are separated by role: .m and .py files contain reusable functions and structures, while live scripts and Jupyter notebooks (.mlx and .ipynb) call those functions to explore and analyse data, save results and create visualisations. These files are organised into folders based on the purpose of each script.
MATLAB
The MATLAB section contains scripts for synthetic data generation and the baseline classifier. MATLAB R2023b with the System Identification Toolbox was used.
Python
The Python section focuses on preprocessing experimental data, creating machine-learning-based algorithms and visualizing results. Python version 3.12 was used. The full list of packages used can be found in the file pythonPackages.txt. The core packages are listed below:
- PyTorch 2.6.0 (+ CUDA 12.6)
- NumPy 2.1.2
- Optuna 4.7.0
- Jupyter
The root folder is split into three main folders based on their purpose: experimental data, MATLAB code and Python code.
The data folder contains the experimental data used in the project. In this archive, it contains only the Proteinbase index file, which can be used with the DownloadProteinbase.ipynb notebook to download the raw data. The data in this folder is referenced from the code and shared between Python and MATLAB.
The MATLAB folder contains all MATLAB code used in this project. This includes scripts that explore the mathematics, scripts that run and evaluate the written functions, and other miscellaneous scripts. There are also multiple subfolders:
- Functions — contains all functions written for the project, which are called from the live scripts.
- GeneratedData — contains the synthetic data generated by the batch generation algorithms.
- ExportImg — contains visualisations used in progress presentations and this thesis.
The Python folder contains all Python code used for this project. The Functions folder stores packages for working with kinetic data, utilities for training neural networks, and the networks themselves (all created specifically for this project). The root folder contains the Jupyter notebooks that call these functions, run the analyses, and save results and visualisations. Additional folders contain cached datasets, trained models and other output files.
The end goal is to generate batches of data for training a classifier with a single function call. Several wrapper layers are therefore used to simplify usage. The structure is as follows:
- SPRgenerateDataBatch — Organises generated experiments: manages output directories, mechanisms and the design matrix / table with labels, and calls the functions for generating data.
- generateSPRtraces — Mechanism-nonspecific parts of generation on the level of one experiment: generates a single experiment by calling the relevant mechanism-specific functions. Sets times and values shared among mechanisms. After generation, normalizes the result and adds noise.
- SPRgenerateXXX (where XXX is replaced with a specific mechanism) — Mechanism-specific parts of generation on the level of one experiment: holds parameter ranges specific to a reaction mechanism, handles random parameter selection, and joins the association and dissociation phases.
- SPRXXXEquation — Generates one phase of an experiment: handles the numerical integration and evaluation of the system of differential equations with specified parameters at specific timesteps. Computes the observable signal.
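The layering above can be illustrated with a small Python sketch. It is not a port of the repository's MATLAB functions: every function name, the parameter ranges, the time grid, the noise level and the simple closed-form 1:1 binding model are assumptions chosen only to show how the four layers call each other.

```python
import math
import random


def one_to_one_phase(times, conc, k_on, k_off, r0):
    """Equation layer: closed-form 1:1 binding signal for one phase.

    conc > 0 gives an association phase, conc == 0 a dissociation phase.
    r0 is the fractional occupancy at the start of the phase.
    """
    k_obs = k_on * conc + k_off
    r_eq = k_on * conc / k_obs  # fractional occupancy at equilibrium
    return [r_eq + (r0 - r_eq) * math.exp(-k_obs * t) for t in times]


def generate_one_to_one(rng, concs, t_assoc, t_dissoc):
    """Mechanism layer: draw random rate constants and join both phases."""
    k_on = 10 ** rng.uniform(4, 6)     # 1/(M*s), assumed range
    k_off = 10 ** rng.uniform(-4, -2)  # 1/s, assumed range
    traces = []
    for c in concs:
        assoc = one_to_one_phase(t_assoc, c, k_on, k_off, 0.0)
        dissoc = one_to_one_phase(t_dissoc, 0.0, k_on, k_off, assoc[-1])
        traces.append(assoc + dissoc)
    return traces


# Hypothetical registry mapping a mechanism label to its generator.
MECHANISMS = {"one_to_one": generate_one_to_one}


def generate_traces(rng, mechanism, noise_sd=0.01):
    """Experiment layer: shared time grid, normalisation and noise."""
    t_assoc = [float(t) for t in range(0, 300, 5)]
    # Dissociation times are measured from the last association sample.
    t_dissoc = [float(t) for t in range(5, 305, 5)]
    concs = [1e-8, 1e-7, 1e-6]  # molar, assumed
    traces = MECHANISMS[mechanism](rng, concs, t_assoc, t_dissoc)
    peak = max(max(trace) for trace in traces)
    return [[v / peak + rng.gauss(0.0, noise_sd) for v in trace]
            for trace in traces]


def generate_batch(mechanisms, n_per_mechanism, seed=0):
    """Batch layer: loop over mechanisms and collect (label, experiment) pairs."""
    rng = random.Random(seed)
    return [(name, generate_traces(rng, name))
            for name in mechanisms
            for _ in range(n_per_mechanism)]


batch = generate_batch(["one_to_one"], n_per_mechanism=4)
```

Keeping the random parameter selection in the mechanism layer and the noise in the experiment layer means a new mechanism only needs its own generator function and an entry in the registry.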
The systems of equations for the mechanisms are coded into MATLAB. First, a template is created, for example:

    inducedFitEquationTemplate = @(t, y, L0_i) [
        -k_p1*y(1)*L0_i + k_m1*y(2);                            % d[E]/dt
        +k_p1*y(1)*L0_i - k_m1*y(2) - k_p2*y(2) + k_m2*y(3);    % d[EL]/dt
        +k_p2*y(2) - k_m2*y(3);                                 % d[E*L]/dt
    ];

This template gets reused for every concentration in the experiment:

    inducedFitEquations = @(t, y) inducedFitEquationTemplate(t, y, L0_i);

And numerically evaluated at specified timesteps for the specified initial conditions:

    y0 = [E0_i, EL0_i, EsL0_i];
    resEq = ode45(inducedFitEquations, tspan, y0);
    resNum = deval(resEq, inputTimestamps)';

Finally, the observable is computed as a linear combination of the results, weighted by the effect of each species on the observable signal.
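For readers working on the Python side, the integration and observable steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's implementation: it swaps MATLAB's adaptive ode45 solver for a hand-written fixed-step classical Runge-Kutta (RK4) integrator, and the rate constants, ligand concentration, initial conditions and observable weights are all hypothetical values.

```python
def induced_fit_rhs(y, l0, k_p1, k_m1, k_p2, k_m2):
    """Right-hand side for [E], [EL], [E*L] at constant ligand concentration l0."""
    e, el, esl = y
    return [
        -k_p1 * e * l0 + k_m1 * el,                          # d[E]/dt
        k_p1 * e * l0 - k_m1 * el - k_p2 * el + k_m2 * esl,  # d[EL]/dt
        k_p2 * el - k_m2 * esl,                              # d[E*L]/dt
    ]


def rk4_solve(rhs, y0, t_end, n_steps):
    """Fixed-step classical Runge-Kutta; returns the state after every step."""
    h = t_end / n_steps
    y = list(y0)
    states = [y]
    for _ in range(n_steps):
        k1 = rhs(y)
        k2 = rhs([yi + 0.5 * h * ki for yi, ki in zip(y, k1)])
        k3 = rhs([yi + 0.5 * h * ki for yi, ki in zip(y, k2)])
        k4 = rhs([yi + h * ki for yi, ki in zip(y, k3)])
        y = [yi + h / 6.0 * (a + 2.0 * b + 2.0 * c + d)
             for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]
        states.append(y)
    return states


# Hypothetical parameter values, chosen only for illustration.
k_p1, k_m1, k_p2, k_m2 = 1e5, 1e-2, 1e-2, 1e-3
l0 = 1e-7            # ligand concentration (M)
y0 = [1.0, 0.0, 0.0]  # all protein starts in the free state E


def rhs(y):
    return induced_fit_rhs(y, l0, k_p1, k_m1, k_p2, k_m2)


states = rk4_solve(rhs, y0, t_end=300.0, n_steps=3000)

# Observable as a weighted sum of species; the weights are an assumption
# (here both bound species contribute equally and free E contributes nothing).
weights = [0.0, 1.0, 1.0]
signal = [sum(w * s for w, s in zip(weights, y)) for y in states]
```

Because the three derivatives sum to zero, the total amount of protein is conserved, which gives a quick sanity check on any integrator used for this system.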
The commonly used functions have full documentation written in code, which describes input parameters and outputs.