A Two-stage Hierarchical Transformer-based Representation Learning for Prediction and Classification of Biosynthetic Gene Clusters and their Functional Genomic Insights
This project implements a hybrid deep learning and machine learning pipeline that identifies Biosynthetic Gene Clusters (BGCs) and classifies their types. Transformer-based feature extraction feeds an XGBoost classifier, and SHAP-based interpretation highlights the k-mers that drive the predictions, pointing to biologically meaningful genomic patterns.
- 6-mer Feature Extraction
- Dataset Preparation (BGC vs Non-BGC)
- Grid Search Hyperparameter Optimization
- General Transformer Training (Binary Classification)
- Class-Specific Transformer Training (BGC Types)
- Feature Extraction & Fusion
- XGBoost Training + Cross Validation
- Test Evaluation
- SHAP-based Interpretation
- DNA sequences converted into k-mer frequency vectors
- Example: k = 6 → 4^6 = 4096 features
- Output: numerical feature matrix
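
The conversion described above can be sketched as follows. This is a minimal illustration and not the project's `kmer_generation.py`: the function names, the stride-1 sliding window, the normalization by the number of counted windows, and the skipping of windows that contain ambiguous bases are all assumptions. For k = 6 over the alphabet A, C, G, T the vocabulary has 4^6 = 4096 entries.

```python
from itertools import product


def kmer_vocabulary(k):
    """Return all 4**k k-mers over A, C, G, T in lexicographic order."""
    return ["".join(p) for p in product("ACGT", repeat=k)]


def kmer_frequencies(sequence, k=6):
    """Convert one DNA sequence into a k-mer frequency vector.

    Assumptions (not taken from the project): overlapping windows with
    stride 1, windows containing non-ACGT characters are skipped, and
    counts are divided by the number of counted windows.
    """
    vocab = kmer_vocabulary(k)
    index = {kmer: i for i, kmer in enumerate(vocab)}
    counts = [0] * len(vocab)
    seq = sequence.upper()
    total = 0
    for start in range(len(seq) - k + 1):
        position = index.get(seq[start:start + k])
        if position is not None:  # skip windows with N or other symbols
            counts[position] += 1
            total += 1
    if total == 0:
        return [0.0] * len(vocab)
    return [c / total for c in counts]


if __name__ == "__main__":
    vector = kmer_frequencies("ACGTACGTAC", k=6)
    print(len(vector))
```

Stacking one such vector per sequence gives the numerical feature matrix that the later stages consume.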
Input CSV:
- Feature columns → k-mer features
- Last column → Label
- Class 0 → Non-BGC
- Class 1..N → BGC types
- Non-BGC → 0
- BGC → 1
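
As a rough illustration of the layout and label mapping above, the sketch below reads a CSV whose last column is the label and collapses the multi-class labels into the binary BGC vs Non-BGC target. The function names, the presence of a header row, and integer-coded labels are assumptions made for the example.

```python
import csv
import io


def load_features_and_labels(csv_text):
    """Parse CSV text: feature columns first, label in the last column.

    Assumes a header row and integer class labels (hypothetical layout
    details that the project description does not spell out).
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    features, labels = [], []
    for row in reader:
        if not row:  # tolerate blank lines
            continue
        features.append([float(value) for value in row[:-1]])
        labels.append(int(row[-1]))
    return header[:-1], features, labels


def to_binary_labels(labels):
    """Map class 0 (Non-BGC) to 0 and every BGC type (1..N) to 1."""
    return [0 if label == 0 else 1 for label in labels]


if __name__ == "__main__":
    sample = "AAAAAA,AAAAAC,label\n0.1,0.0,0\n0.0,0.2,3\n0.3,0.1,1\n"
    names, X, y = load_features_and_labels(sample)
    print(names, y, to_binary_labels(y))
```

The original multi-class labels are kept alongside the binary ones, since the class-specific stage still needs them.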
- Learn global genomic patterns
- Separate BGC vs Non-BGC
- One transformer per BGC class
- Learn class-specific patterns
- General transformer features
- Class-specific transformer features
Final Features = [General Features + All Class Features], where "+" denotes concatenation along the feature axis
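
The fusion step can be pictured as a column-wise concatenation. The sketch below assumes each transformer yields a 2-D array with one row per sample; the function name and the array shapes are illustrative and not taken from the project code.

```python
import numpy as np


def fuse_features(general_features, class_features):
    """Concatenate general and class-specific features column-wise.

    general_features: array of shape (n_samples, d_general)
    class_features:   list of arrays, each (n_samples, d_class_i)
    Both the name and the shapes are assumptions for illustration.
    """
    blocks = [np.asarray(general_features)]
    blocks += [np.asarray(block) for block in class_features]
    n_samples = blocks[0].shape[0]
    for block in blocks:
        if block.ndim != 2 or block.shape[0] != n_samples:
            raise ValueError("every block must be 2-D with the same row count")
    return np.concatenate(blocks, axis=1)


if __name__ == "__main__":
    general = np.zeros((4, 8))
    per_class = [np.ones((4, 5)) for _ in range(3)]
    print(fuse_features(general, per_class).shape)
```

Keeping the general block first gives a fixed column order, which makes it easier to trace a fused column back to the transformer that produced it.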
Captures:
- Global genomic patterns
- Class-specific signatures
- Combined transformer features
- n_estimators = 400
- max_depth = 6
- learning_rate = 0.05
- subsample = 0.8
- colsample_bytree = 0.8
- objective = multi:softprob
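
The settings listed above map onto the scikit-learn style `XGBClassifier` constructor. A minimal sketch, assuming the `xgboost` Python package is installed; the names `XGB_PARAMS` and `build_classifier` are hypothetical, and the import is deferred so the parameter dictionary can be inspected without the library.

```python
# Hyperparameters exactly as listed in this README.
XGB_PARAMS = {
    "n_estimators": 400,
    "max_depth": 6,
    "learning_rate": 0.05,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "objective": "multi:softprob",
}


def build_classifier(params=None):
    """Create an XGBClassifier from the listed settings.

    The import happens here so that this module still loads on machines
    without xgboost; the helper itself is illustrative only.
    """
    from xgboost import XGBClassifier

    return XGBClassifier(**(params or XGB_PARAMS))


if __name__ == "__main__":
    for name, value in XGB_PARAMS.items():
        print(f"{name} = {value}")
```

Calling `build_classifier().fit(X, y)` on the fused features and integer class labels would then train the multi-class model.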
- Stratified 5-Fold Cross Validation
- Accuracy
- Precision (weighted)
- Recall (weighted)
- F1-score
- AUROC
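
To make the "weighted" averaging concrete, the sketch below computes accuracy and support-weighted precision, recall, and F1 in plain Python. It is illustrative only: scikit-learn's `precision_score`, `recall_score`, and `f1_score` with `average="weighted"` compute the same quantities, and applying the weighting to F1 is an assumption, since the list above does not say how F1 is averaged.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall, and F1.

    Each class contributes its per-class score multiplied by the
    fraction of true samples that belong to that class.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    support = Counter(y_true)      # true samples per class
    predicted = Counter(y_pred)    # predicted samples per class
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    n = len(y_true)
    precision = recall = f1 = 0.0
    for label, count in support.items():
        tp = true_pos[label]
        p = tp / predicted[label] if predicted[label] else 0.0
        r = tp / count
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = count / n
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return {
        "accuracy": sum(true_pos.values()) / n,
        "precision_weighted": precision,
        "recall_weighted": recall,
        "f1_weighted": f1,
    }


if __name__ == "__main__":
    print(weighted_metrics([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0]))
```

Weighted recall always equals accuracy, which is a handy sanity check when comparing fold-level results.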
- Tune Transformer and XGBoost hyperparameters
- Best hyperparameters
- Grid search results
- Optimized model
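
An exhaustive grid search of this kind can be sketched with the standard library alone. The grid values below are placeholders and not the ones used in `grid_search.py`, and `score_fn` is a hypothetical callback standing in for "train with these settings and return a validation score".

```python
from itertools import product


def iter_grid(param_grid):
    """Yield every combination of the listed values as a dict."""
    keys = sorted(param_grid)
    for values in product(*(param_grid[key] for key in keys)):
        yield dict(zip(keys, values))


def grid_search(param_grid, score_fn):
    """Score every combination; return the best one and the full log.

    score_fn(params) -> float, where higher is better (assumed).
    """
    results = [(score_fn(params), params) for params in iter_grid(param_grid)]
    best_score, best_params = max(results, key=lambda item: item[0])
    return best_params, best_score, results


if __name__ == "__main__":
    # Placeholder grid for illustration only.
    grid = {"learning_rate": [0.01, 0.05, 0.1], "max_depth": [4, 6, 8]}

    def toy_score(params):
        # Stand-in for a real validation metric.
        return -abs(params["learning_rate"] - 0.05) - abs(params["max_depth"] - 6)

    best, score, log = grid_search(grid, toy_score)
    print(best, score, len(log))
```

The returned `results` list corresponds to the "grid search results" output, and `best_params` to the "best hyperparameters".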
- Final model trained on Train + Validation
- Evaluated on independent Test set
- Test metrics
- Final trained model
- Interpret model predictions
- Identify important k-mers
- Detect enriched sequence patterns
- Link k-mers to functional regions
- Understand genomic signatures of BGCs
- SHAP values
- Feature importance plots
- Top contributing k-mers
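
Once SHAP values are available, ranking features by mean absolute SHAP value is a common way to obtain a "top contributing" list. The sketch below assumes a 2-D matrix of shape (samples, features) whose columns line up with named k-mer features; the function name and that alignment are assumptions, and the SHAP computation itself is not shown.

```python
import numpy as np


def top_features(shap_values, feature_names, top_n=10):
    """Rank features by mean absolute SHAP value, largest first.

    shap_values:   array-like of shape (n_samples, n_features)
    feature_names: one name per column (assumed to be k-mers here)
    """
    values = np.asarray(shap_values, dtype=float)
    if values.ndim != 2 or values.shape[1] != len(feature_names):
        raise ValueError("expected (n_samples, n_features) matching feature_names")
    mean_abs = np.abs(values).mean(axis=0)
    order = np.argsort(mean_abs)[::-1][:top_n]
    return [(feature_names[i], float(mean_abs[i])) for i in order]


if __name__ == "__main__":
    demo = [[0.1, -0.5, 0.0], [-0.2, 0.4, 0.05]]
    for name, importance in top_features(demo, ["AAAAAA", "ACGTAC", "GGGCCC"], top_n=2):
        print(name, round(importance, 3))
```

Taking the absolute value before averaging keeps features that push predictions strongly in both directions from cancelling out.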
project/
│── data/
│ ├── raw_sequences.fasta
│ ├── labels.csv
│ ├── final_output.csv
│
│── models/
│ ├── PRETRAINED_TRANSFORMERS_BGC/
│ │ ├── bgc_general_transformer.keras
│ │ ├── best_bgc_general.keras
│ │ ├── bgc_class_transformer_<class>.keras
│ │ ├── best_bgc_class_<class>.keras
│ │ ├── scaler.pkl
│ │ ├── label_encoder.pkl
│ │
│ ├── xgboost_final_model.pkl
│
│
│── scripts/
│ ├── kmer_generation.py
│ ├── non_bgc_extraction.py
│ ├── grid_search.py
│ ├── bgc_classification.py
│ ├── shap_analysis.py
│
│── results/
│ ├── TRAINING_OUTPUT/
│ │ ├── X_train_combined.npy
│ │ ├── X_val_combined.npy
│ │ ├── X_test_combined.npy
│ │ ├── y_train.npy
│ │ ├── y_val.npy
│ │ ├── y_test.npy
│ │ ├── train_gen_features.npy
│ │ ├── test_gen_features.npy
│ │ ├── training_metadata.csv
│ │ ├── early_stopping_stats.csv
│ │ ├── class_info.csv
│ │ ├── general_transformer_history.csv
│ │ ├── class_<class>_history.csv
│ │
│ │
│ ├── shap_results/
│ │ ├── shap_values.npy
│ │ ├── shap_summary_plot.png
│ │ ├── shap_feature_importance.csv
│ │ ├── top_kmers.txt
- Hybrid Transformer + XGBoost pipeline
- Memory-optimized architecture
- Early stopping for stability
- Class-specific learning
- Feature fusion strategy
- Grid search optimization
- SHAP-based biological interpretation