cd biosynthesis/multistep
conda env create -f environment.yml
conda activate biosynthesis
git clone https://github.com/OpenNMT/OpenNMT-py.git
cd OpenNMT-py
pip install -e .
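The data and code for single-step reactions are in singlestep/:
biogenisis_reaction.txt, reactions.txt: the raw training data; the former is the bio data, and from the latter mainly the np_like data is used;
word_preprocess.py: data-processing code that splits the data 9:1 into train and valid; the valid set also serves as the test set;
retro_en_de.yaml: the config file for vocabulary building and model training;
mol_trans/all_train, mol_trans/bio_train, mol_trans/nplike_train: the pre-split training data; all_train mixes the bio and np_like data;
mol_trans/model: the trained models; the 20 models are checkpoints saved between steps 110,000 and 300,000 of training on all_train;
mol_trans/run: stores the vocabulary;
score_predictions.py: the evaluation code.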
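The 9:1 split described for word_preprocess.py can be illustrated with a minimal sketch. This is not the script itself: the function name `split_train_valid`, the shuffling, and the fixed seed are assumptions made for illustration, and the toy pairs stand in for real reaction data.

```python
import random


def split_train_valid(pairs, valid_fraction=0.1, seed=0):
    """Shuffle (source, target) pairs and split them into train and valid lists.

    Hypothetical helper: whether the real script shuffles, and with which
    seed, is an assumption.
    """
    if not 0.0 < valid_fraction < 1.0:
        raise ValueError("valid_fraction must be between 0 and 1")
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_valid = round(len(shuffled) * valid_fraction)
    # The first n_valid shuffled pairs become valid, the rest become train
    return shuffled[n_valid:], shuffled[:n_valid]


# Toy (product, reactants) pairs standing in for the real data
pairs = [(f"P{i}", f"R{i}") for i in range(100)]
train, valid = split_train_valid(pairs)
print(len(train), len(valid))
```

Fixing the seed keeps the split reproducible, so the same valid set can be reused as the test set across runs.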
After editing the config file, train the model with:
onmt_train -config retro_en_de.yaml
To run prediction, the commonly used parameters are as follows:
onmt_translate -model mol_trans/model/new_all_step_140000.pt \
    -src mol_trans/bio_train/new-src-val.txt \
    -output mol_trans/pred.txt \
    -batch_size 64 \
    -max_length 200 \
    -beam_size 10 \
    -n_best 10 \
    -gpu 0 \
    -replace_unk
For the other parameters, see https://opennmt.net/OpenNMT-py/options/translate.html
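Since mol_trans/model holds 20 checkpoints, it can be convenient to assemble the same command once per checkpoint. The sketch below only builds the argument lists and does not run anything. It assumes the checkpoints are spaced 10,000 steps apart and follow the `new_all_step_<step>.pt` naming of the example command; the per-step output file name `pred_step_<step>.txt` and the function name `translate_command` are hypothetical.

```python
import shlex


def translate_command(step, src="mol_trans/bio_train/new-src-val.txt"):
    """Build the onmt_translate argument list for one checkpoint."""
    return [
        "onmt_translate",
        "-model", f"mol_trans/model/new_all_step_{step}.pt",
        "-src", src,
        # Hypothetical output name, one prediction file per checkpoint
        "-output", f"mol_trans/pred_step_{step}.txt",
        "-batch_size", "64",
        "-max_length", "200",
        "-beam_size", "10",
        "-n_best", "10",
        "-gpu", "0",
        "-replace_unk",
    ]


# Assumed checkpoint spacing: every 10,000 steps from 110,000 to 300,000
steps = range(110000, 300001, 10000)
commands = [translate_command(step) for step in steps]
print(len(commands))
print(shlex.join(commands[0]))
```

Each list can be passed to `subprocess.run(cmd, check=True)` to execute it.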
python score_predictions.py -predictions xxx -targets xxx
-predictions is the prediction file; -targets is the ground truth file.
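To make the evaluation concrete, here is a minimal sketch of a top-k exact-match accuracy. It is not the code of score_predictions.py. It assumes the prediction file holds `n_best` consecutive lines per target line and that tokens are separated by spaces; the function names are hypothetical. A real evaluation would likely also canonicalize the SMILES before comparing, which this sketch omits.

```python
def normalize(line):
    """Remove all whitespace so space-separated tokens compare as one string."""
    return "".join(line.split())


def top_k_accuracy(predictions, targets, n_best, k):
    """Fraction of targets matched by one of the first k predictions of their block."""
    if not 1 <= k <= n_best:
        raise ValueError("k must be between 1 and n_best")
    if len(predictions) != len(targets) * n_best:
        raise ValueError("expected exactly n_best prediction lines per target")
    if not targets:
        return 0.0
    hits = 0
    for i, target in enumerate(targets):
        # Predictions for target i occupy lines [i * n_best, (i + 1) * n_best)
        block = predictions[i * n_best : i * n_best + k]
        if normalize(target) in {normalize(p) for p in block}:
            hits += 1
    return hits / len(targets)


# Toy data: two targets with n_best=2 predictions each
targets = ["C C O", "c 1 c c c c c 1"]
predictions = ["C C O", "C O", "C C", "c 1 c c c c c 1"]
top1 = top_k_accuracy(predictions, targets, n_best=2, k=1)
top2 = top_k_accuracy(predictions, targets, n_best=2, k=2)
print(top1, top2)
```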
cd multistep
pip install -e packages/mlp_retrosyn
pip install -e packages/rdchiral
pip install -e .
pip install -e onmt
The interface for predicting a molecule is in interface.py.
The core code is as follows:
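import os
import shutil

import torch
from rdkit import Chem
# Assumption: RSPlanner is importable from retro_star.api, as in the upstream retro_star package
from retro_star.api import RSPlanner


def run(mol, top_k, building_block_pth='building_block.csv'):
    # Expose GPUs 0-7 to this process (device list written without spaces)
    os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3,4,5,6,7'
    assert torch.cuda.is_available()

    # Replace the planner's building block file with the one supplied by the caller
    dataset_pth = 'retro_star/dataset/building_block.csv'
    if os.path.exists(dataset_pth):
        os.remove(dataset_pth)
    shutil.copyfile(building_block_pth, dataset_pth)

    # Parse the input string and re-emit it as an RDKit SMILES string
    mol = Chem.MolToSmiles(Chem.MolFromSmarts(mol))
    planner = RSPlanner(
        gpu=1,
        use_value_fn=True,
        iterations=100,
        expansion_topk=30,
        top_k=top_k,
        viz=False
    )
    result = planner.plan(mol)
    if result is None:
        return None

    # Map each result's index to its route and score
    mol_dict = {}
    for i, ele in enumerate(result):
        mol_dict[i] = {'routes': ele[0], 'routes_score': ele[1]}
    return mol_dict

Input: mol is the molecule to predict; top_k is the number of highest-probability results to return (depending on iterations, the final number of results may be <= k); building_block_pth is the path to the building block file.
Output: a dictionary {0: {'routes': xxx, 'routes_score': xxx}, 1: {'routes': xxx, 'routes_score': xxx}, ...}, or None if planning returns no result.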
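A caller typically needs to handle the None case and order the returned routes. The sketch below works on a hand-written dictionary with the same 'routes' and 'routes_score' keys that run() builds, so it runs without the planner. The helper name `rank_routes`, the placeholder route strings, and the assumption that a higher score is better are all illustrative and not confirmed by interface.py.

```python
def rank_routes(mol_dict, higher_is_better=True):
    """Return (index, entry) pairs ordered by 'routes_score'.

    Hypothetical helper: whether a higher score means a better route is an
    assumption, so the direction is left configurable.
    """
    if not mol_dict:
        # run() returns None when the planner finds no route
        return []
    return sorted(
        mol_dict.items(),
        key=lambda item: item[1]["routes_score"],
        reverse=higher_is_better,
    )


# Hand-written stand-in for the output of run(); the routes are placeholder strings
example = {
    0: {"routes": "route-a", "routes_score": 0.2},
    1: {"routes": "route-b", "routes_score": 0.7},
}
ranked = rank_routes(example)
for index, entry in ranked:
    print(index, entry["routes_score"], entry["routes"])
```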
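The evaluation code is in eval.py. It reads the prediction results directly from the run logs and evaluates them; the details are documented in the code comments.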