Using machine learning to predict reaction mechanisms in Ligand-binding synthetic data

Code repository for the Master's thesis by Martin Richter.

This thesis investigates whether mechanisms of protein-ligand binding can be learnt from synthetically generated traces. First, a framework for generating synthetic Biolayer Interferometry (BLI)-like data is created. This data is used to train and evaluate a machine-learning-based classifier. The approach is evaluated on synthetic data from three reaction mechanisms and reaches above 90 % accuracy on an independent test set. To evaluate real-world performance, a case study is performed on experimental data that was wrongly fit in the database.

Where to start

Data generation

In MATLAB, use the BatchDatagen.mlx live script to see how batches of synthetic data can be generated for predefined mechanisms and a chosen number of samples. Use the function generateDataBatch() to generate batches of data, or the function generateData() to generate single randomized samples.

Experimental data

Experimental data used in this project comes from Proteinbase. Using the index file provided at proteinbase.com/download, experimental data can be downloaded with the DownloadProteinbase.ipynb Jupyter notebook.

This work used Proteinbase by Adaptyv Bio under ODC-BY license.

Baseline classifier

In MATLAB, use the function classifyTraces() to classify an experiment. To see how to classify experimental data, view ExploreProteinBaseClassification.mlx.

Machine learning classifier

In Python, see the TrainForProteinbase.ipynb notebook for the full model training, classification and evaluation pipeline.

Code structure details

The code is split into two main folders by programming language: MATLAB and Python. In both, files are separated by role: .m and .py files contain reusable functions and structures, while live scripts and Jupyter notebooks (.mlx and .ipynb) call those functions to explore and analyse data, save results and create visualisations. These files are organised into folders by purpose.

MATLAB

The MATLAB section contains scripts for synthetic data generation and the baseline classifier. MATLAB R2023b with the System Identification Toolbox was used.

Python

The Python section focuses on preprocessing experimental data, building machine-learning-based algorithms and visualizing results. Python version 3.12 was used. The full list of packages can be found in the file pythonEnvironmentPackageList.txt. The core packages are listed below:

PyTorch 2.6.0 (+ CUDA 12.6)
NumPy 2.1.2
Optuna 4.7.0
Jupyter

Folder Structure

The root folder is split into 3 main folders based on their purpose:

Experimental Data

This folder contains experimental data used in the project. In this archive it contains only the Proteinbase index file, which can be used with the DownloadProteinbase.ipynb notebook to download the raw data. The data in this folder is referenced from the code and shared by both Python and MATLAB.

MATLAB

The MATLAB folder contains all MATLAB code used in this project. This includes scripts exploring the mathematics, scripts running and evaluating written functions and other miscellaneous scripts. There are also multiple subfolders:

Functions: contains all functions written for the project, which are called from the live scripts.
GeneratedData: contains the synthetic data generated by the batch generation algorithms.
ExportImg: contains visualisations used in progress presentations and this thesis.

Python

The Python folder contains all Python code used for this project. The Functions folder stores packages for working with kinetic data, utilities for training neural networks, and the networks themselves, all created specifically for this project. The root folder holds the Jupyter notebooks that use these functions to run the analyses and save results and visuals. Additional folders contain cached datasets, trained models and other output files.

Synthetic data generation

The end goal is to generate batches of training data for a classifier with a single function call. Several wrapper layers are therefore present to simplify usage. The structure is as follows:

SPRgenerateDataBatch — Organise generated experiments: Manage output directories, mechanisms, design matrix / table with labels, call functions for generating data.
generateSPRtraces — Mechanism-nonspecific parts of generation on the level of one experiment: Generates a single experiment by calling the relevant mechanism-specific functions. Sets times and values shared among mechanisms. After generation, normalizes the result and adds noise.
SPRgenerateXXX (where XXX is replaced with a specific mechanism) — Mechanism-specific parts of generation on the level of one experiment: Holds parameter ranges specific to a reaction mechanism, handles random parameter selection, joins association and dissociation phase.
SPRXXXEquation — Generates one phase of experiment: Handles the numerical integration and evaluation of the system of differential equations with specified parameters at specific timesteps. Computes the observable signal.
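The layering above can be illustrated with a minimal Python sketch. It is not the repository's MATLAB implementation: all function names, parameter ranges, the time grid and the concentrations are hypothetical, and a simple one-to-one binding model with a closed-form solution is assumed as the example mechanism so that no ODE solver is needed. The sketch also skips output directory handling and keeps the batch in memory.

```python
import math
import random

def one_to_one_phase(times, r0, ligand, k_on, k_off, r_max=1.0):
    """Equation level: closed-form signal of dR/dt = k_on*L*(r_max - R) - k_off*R."""
    k_obs = k_on * ligand + k_off
    r_eq = r_max * k_on * ligand / k_obs  # equilibrium signal (0 when ligand == 0)
    return [r_eq + (r0 - r_eq) * math.exp(-k_obs * t) for t in times]

def generate_one_to_one(times, concentrations, rng):
    """Mechanism level: draw rate constants once per experiment, join both phases."""
    k_on = 10 ** rng.uniform(-3, -1)   # assumed log-uniform range
    k_off = 10 ** rng.uniform(-3, -1)  # assumed log-uniform range
    traces = []
    for ligand in concentrations:
        assoc = one_to_one_phase(times, 0.0, ligand, k_on, k_off)
        # Dissociation starts from the last association value with no free ligand.
        dissoc = one_to_one_phase(times, assoc[-1], 0.0, k_on, k_off)
        traces.append(assoc + dissoc[1:])  # drop the duplicated junction point
    return traces

MECHANISMS = {"one_to_one": generate_one_to_one}

def generate_traces(mechanism, rng, noise_sd=0.01):
    """Experiment level: shared times and concentrations, then normalise and add noise."""
    times = [float(t) for t in range(101)]
    concentrations = [1.0, 3.0, 10.0]
    clean = MECHANISMS[mechanism](times, concentrations, rng)
    peak = max(max(trace) for trace in clean)
    return [[v / peak + rng.gauss(0.0, noise_sd) for v in trace] for trace in clean]

def generate_data_batch(mechanisms, n_samples, seed=0):
    """Batch level: loop over mechanisms and attach a label to every experiment."""
    rng = random.Random(seed)
    batch = []
    for mechanism in mechanisms:
        for _ in range(n_samples):
            batch.append({"label": mechanism, "traces": generate_traces(mechanism, rng)})
    return batch

batch = generate_data_batch(["one_to_one"], n_samples=4)
```

Passing a single seeded random generator down through the layers keeps a batch reproducible, and adding a mechanism only requires a new entry in the dictionary of mechanism-level functions.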

The systems of equations for the mechanisms are coded into MATLAB. First, a template is created, for example:

inducedFitEquationTemplate = @(t, y, L0_i) [
    -k_p1*y(1)*L0_i + k_m1*y(2);  % d[E]/dt
    +k_p1*y(1)*L0_i - k_m1*y(2) - k_p2*y(2) + k_m2*y(3); % d[EL]/dt
    +k_p2*y(2) - k_m2*y(3);  % d[E*L]/dt
];

This template gets reused for every concentration in the experiment:

inducedFitEquations = @(t, y) inducedFitEquationTemplate(t, y, L0_i);

The system is then solved numerically and evaluated at the specified timesteps for the given initial conditions:

y0 = [E0_i, EL0_i, EsL0_i];
resEq = ode45(inducedFitEquations, tspan, y0);
resNum = deval(resEq,inputTimestamps)';

Finally, the observable is computed as a linear combination of the results multiplied by the effect of each species on the observable signal.
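For readers working on the Python side, the MATLAB steps above can be sketched as follows. This is an illustration under stated assumptions, not the thesis code: a fixed-step fourth-order Runge-Kutta integrator stands in for MATLAB's adaptive ode45, and the rate constants, concentrations, initial conditions and observable weights are made-up values.

```python
def induced_fit_rhs(y, ligand, k_p1, k_m1, k_p2, k_m2):
    """Right-hand side for E + L <-> EL <-> E*L with constant free ligand."""
    e, el, esl = y
    return [
        -k_p1 * e * ligand + k_m1 * el,                          # d[E]/dt
        k_p1 * e * ligand - k_m1 * el - k_p2 * el + k_m2 * esl,  # d[EL]/dt
        k_p2 * el - k_m2 * esl,                                  # d[E*L]/dt
    ]

def rk4(rhs, y0, t_end, n_steps):
    """Integrate dy/dt = rhs(y) from 0 to t_end and return the state at every step."""
    h = t_end / n_steps
    y = list(y0)
    states = [list(y)]
    for _ in range(n_steps):
        k1 = rhs(y)
        k2 = rhs([yi + 0.5 * h * ki for yi, ki in zip(y, k1)])
        k3 = rhs([yi + 0.5 * h * ki for yi, ki in zip(y, k2)])
        k4 = rhs([yi + h * ki for yi, ki in zip(y, k3)])
        y = [yi + h / 6.0 * (a + 2.0 * b + 2.0 * c + d)
             for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]
        states.append(list(y))
    return states

def simulate_association(ligand, rates, e0=1.0, t_end=100.0, n_steps=2000,
                         weights=(0.0, 1.0, 1.0)):
    """Reuse the same equation template for one ligand concentration."""
    def rhs(y):
        return induced_fit_rhs(y, ligand, *rates)
    states = rk4(rhs, [e0, 0.0, 0.0], t_end, n_steps)
    # Observable: linear combination of species, weighted by their assumed signal effect.
    signal = [sum(w * s for w, s in zip(weights, state)) for state in states]
    return states, signal

rates = (1e-2, 1e-2, 5e-2, 1e-2)   # k_p1, k_m1, k_p2, k_m2 (illustrative values)
concentrations = [1.0, 3.0, 10.0]  # illustrative ligand concentrations
traces = {c: simulate_association(c, rates) for c in concentrations}
```

Here the free species E is assumed to contribute nothing to the signal, while both bound species contribute equally. Because the three derivatives sum to zero, the total amount of E, EL and E*L stays constant, which gives a quick sanity check on the integration.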

Documentation

The commonly used functions are fully documented in the code, including descriptions of their input parameters and outputs.
