Link molecular structures of chemicals (in the form of topological fingerprints) with multiple targets.
The DFPL package requires a particular Python environment to work properly.
It consists of a recent Python interpreter and packages for data-science and neural networks.
The exact dependencies can be found in this environment.yml file, which is the basis for creating both a conda environment and a Singularity container.
You have several ways to provide the correct environment to run code from the DFPL package.
- Use the automatically built Singularity container
- Build your own Singularity container following the steps here
- Set up a conda environment
Below, you find details for options 1 and 3.
You need Singularity installed on your machine; for installation details, please read the first section of this document. You can download the latest version of the container with
```shell
singularity pull --arch amd64 library://mai00fti/default/dfpl.sif:latest
```

After that, you can run DFPL with

```shell
singularity run --nv dfpl.sif "python -m dfpl convert -f \"data\""
```

or you can start a shell script (look at run-all-publication-cases.sh for an example)

```shell
singularity run --nv dfpl.sif ". ./scripts/run-all-cases.sh"
```

It's also possible to get an interactive shell into the container

```shell
singularity shell --nv dfpl.sif
```

Note: The Singularity container is intended for use on HPC clusters where your ability to install software might be limited. For local testing or development, setting up the conda environment is preferable.
To use this tool outside the Singularity container, first create the respective conda environment:
- Create the conda env from scratch. From within the `deepFPlearn` directory, you can create the conda environment with the provided yaml file that contains all information and necessary packages:

  ```shell
  conda env create -f singularity_container/environment.yml
  ```

- Activate the `dfpl_env` environment with

  ```shell
  conda activate dfpl_env
  ```

- Install the local `dfpl` package by calling

  ```shell
  conda develop dfpl
  ```
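As a quick check that the setup worked, you can try importing the package from Python. This is just a sanity check, assuming the `dfpl_env` environment is active:

```python
# Run inside the activated dfpl_env environment
import dfpl
import dfpl.fingerprint as fp

print(dfpl.__file__)  # should point into your local deepFPlearn checkout
```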
DFPL can calculate fingerprints of chemical structures from SMILES or INCHI representations. Therefore, e.g. CSV input files need to contain a "smiles" or "inchi" column, which is then used to calculate the fingerprints. There is an example CSV file in the tests/data directory, and when you're training with the DFPL package, it will load the input files and add fingerprints. You can test the conversion:
```python
import dfpl.fingerprint as fp

fp.importDataFile("tests/data/smiles.csv")
```

If your data is in CSV format, has a header row, and contains a "smiles" or an "inchi" column, you can use it as input for training as it is. However, if your data is in a different format, you can use the functions in the fingerprint module to import it correctly.
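Since importDataFile returns a Pandas DataFrame, a quick way to check the result is to inspect it with standard Pandas tools; the exact names of the added fingerprint columns depend on the DFPL version:

```python
import dfpl.fingerprint as fp

data = fp.importDataFile("tests/data/smiles.csv")
print(data.columns)  # the original columns plus the added fingerprint column(s)
print(data.head())
```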
The tests/data/inchi.tsv file contains data in TSV format without a header row, which makes it impossible to detect automatically how to import it. You can use the import_function argument to tell importDataFile how to turn your data into a Pandas DataFrame that contains, e.g., an "inchi" column. After that, DFPL can calculate and add the fingerprints to the DataFrame:
```python
import pandas as pd
import dfpl.fingerprint as fp

data = fp.importDataFile(
    "tests/data/inchi.tsv",
    import_function=(lambda f: pd.read_table(f, names=["toxid", "inchi", "key"]))
)
```
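The same mechanism works for any format that Pandas can read. As a purely hypothetical example, if your CSV file named its structure column "SMILES" instead of "smiles", the import function could rename it on the fly (the file name here is illustrative, not part of the repository):

```python
import pandas as pd
import dfpl.fingerprint as fp

# Hypothetical input file with a non-standard column name
data = fp.importDataFile(
    "my_compounds.csv",
    import_function=lambda f: pd.read_csv(f).rename(columns={"SMILES": "smiles"})
)
```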
You can store the converted data as a "pickle" file, which is a binary representation of the Pandas DataFrame and can be used directly as an input file for the DFPL program. The advantage is that the fingerprint calculation needs to be done only once, and loading these files is fast.

```python
data.to_pickle("output/path/file.pkl")
```

Note that the file extension needs to be "pkl" to be identified correctly by DFPL. Also, you might want to look at the convert_all function in the fingerprint module, which we use to convert different data files all at once.
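Reading such a pickle file back is plain Pandas, e.g. to reuse the DataFrame stored above:

```python
import pandas as pd

# Reload the stored DataFrame, including the previously calculated fingerprints
data = pd.read_pickle("output/path/file.pkl")
print(data.shape)
```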
You have several options to work with the DFPL package. The package can be started as a program on the command line, and you can provide all necessary information as command-line parameters. Check
```shell
python -m dfpl --help
python -m dfpl train --help
python -m dfpl predict --help
```

However, using JSON files that contain all train/predict options is an easy way to preserve what was run, and you can use them instead of providing multiple command-line arguments.

```shell
python -m dfpl train -f path/to/file.json
```

See, e.g., the JSON files under validation/case_XX for examples. Also, you can use the following to create template JSON files for training or prediction:
```python
import dfpl.options as opts

train_opts = opts.TrainOptions()
train_opts.saveToFile("train.json")

predict_opts = opts.PredictOptions()
predict_opts.saveToFile("predict_bestER03.json")
```

You can also work with the DFPL package from within an interactive Python session. Load the dfpl package in your Python console and start the training/prediction functions yourself by providing instances of dfpl.options.TrainOptions or dfpl.options.PredictOptions. You can also load options from JSON files. Example:
```python
import dfpl.__main__ as main
import dfpl.options as opts

o = opts.TrainOptions.fromJson("/path/to/train.json")
main.train(o)
```
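Prediction works analogously. The following is only a sketch, assuming a predict function in dfpl.__main__ as the counterpart to main.train, and a fromJson constructor on PredictOptions by analogy with the training example above:

```python
import dfpl.__main__ as main
import dfpl.options as opts

# Assumed counterparts of TrainOptions.fromJson and main.train shown above
p = opts.PredictOptions.fromJson("/path/to/predict.json")
main.predict(p)
```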
Finally, if you'd like to work in PyCharm, you can also create a run configuration there using the dfpl/__main__.py script and provide the command-line arguments in the configuration.
This work is still in progress. So, if you fork or use it, please contact me first: jana.schor@ufz.de