A template-based modeling approach for predicting RNA 3D structures from sequence data. This project was developed as part of the Stanford RNA 3D Folding Competition.
I built this RNA structure prediction pipeline using template-based modeling (TBM), a method that transfers 3D coordinates from known RNA structures (templates) to a new sequence based on sequence similarity. The approach is fast, interpretable, and achieves competitive results on sequences with good template coverage.
On the competition test set, the system achieves:
- Mean TM-score: 0.834 (with V3 improvements: approximately 0.855)
- Success rate: 100% (12/12 sequences with valid predictions)
- Runtime: Under 2 minutes for all predictions
- 10 out of 12 test sequences have near-perfect templates (100% identity)
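For readers unfamiliar with the metric, the TM-score of Zhang & Skolnick averages 1 / (1 + (d_i / d0)^2) over paired residues, where d_i is the distance between residue i in the prediction and in the reference, and the final score is the maximum over superpositions. The sketch below computes only that average for coordinates that are already superimposed and paired. The function name `tm_score_aligned` is hypothetical, and `d0` is left as a parameter because the length-dependent scale used for RNA in the competition is not stated in this README.

```python
import math


def tm_score_aligned(pred, ref, d0):
    """Average TM-score term for already-superimposed, residue-paired coordinates.

    pred, ref: equal-length sequences of (x, y, z) tuples.
    d0: distance scale in the same units as the coordinates (assumed input).
    """
    if len(pred) != len(ref) or len(ref) == 0:
        raise ValueError("pred and ref must be non-empty and of equal length")
    total = 0.0
    for p, r in zip(pred, ref):
        d = math.dist(p, r)  # Euclidean distance between paired residues
        total += 1.0 / (1.0 + (d / d0) ** 2)
    return total / len(ref)


ref = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
shifted = [(1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]  # every residue off by exactly d0
print(tm_score_aligned(ref, ref, d0=1.0))      # 1.0
print(tm_score_aligned(shifted, ref, d0=1.0))  # 0.5
```

A full implementation would also search over rigid-body superpositions, which this sketch deliberately leaves out.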
The system works best when good templates are available, which is the case for most sequences in the test set. For novel sequences without templates, I implemented an extended chain fallback that provides physically reasonable structures instead of failing completely.
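As a rough illustration of the extended chain fallback, the sketch below places one point per residue along a straight line. It is an assumption about what such a fallback could look like, not the code in this repository: the function name `extended_chain` is hypothetical and the 6.0 Å spacing is a placeholder value.

```python
def extended_chain(n_residues, spacing=6.0):
    """Return one (x, y, z) point per residue, spaced evenly along the x-axis.

    spacing is an illustrative placeholder (in angstroms), not a value taken
    from the pipeline.
    """
    if n_residues < 0:
        raise ValueError("n_residues must be non-negative")
    return [(i * spacing, 0.0, 0.0) for i in range(n_residues)]


coords = extended_chain(4)
print(coords[-1])  # (18.0, 0.0, 0.0)
```

The point of a fallback like this is that every query still gets a complete, clash-free set of coordinates even when the template search returns nothing.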
The pipeline follows a straightforward approach:
- Template Search: Uses k-mer indexing to quickly find similar structures in a database of 3,156 known RNA structures
- Sequence Alignment: Aligns the query sequence to template sequences using BioPython
- Coordinate Transfer: Maps 3D coordinates from templates to the query based on the alignment
- Gap Filling: Uses linear interpolation for regions without template coverage
- Ensemble Methods: Combines multiple templates when beneficial (with smart thresholding)
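The template search step can be sketched as an inverted index from k-mers to template IDs. This is a minimal illustration under assumptions: the function names, the choice of k = 4, and the ranking by shared distinct k-mers are hypothetical and may differ from what `similarity.py` actually does.

```python
from collections import Counter, defaultdict


def build_kmer_index(templates, k=4):
    """Map each k-mer to the set of template IDs whose sequence contains it."""
    index = defaultdict(set)
    for template_id, seq in templates.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(template_id)
    return index


def search_templates(query, index, k=4, top_n=5):
    """Rank templates by how many distinct query k-mers they share."""
    query_kmers = {query[i:i + k] for i in range(len(query) - k + 1)}
    hits = Counter()
    for kmer in query_kmers:
        for template_id in index.get(kmer, ()):
            hits[template_id] += 1
    return hits.most_common(top_n)


# Toy template library (made-up sequences, for illustration only)
templates = {"T1": "GGGAAACCC", "T2": "AUGCAUGCA", "T3": "UUUUUUUUU"}
index = build_kmer_index(templates)
print(search_templates("AUGCAUGC", index))  # [('T2', 4)]
```

Because the index is built once, each query only touches templates that share at least one k-mer with it, which is what keeps the search fast over a few thousand structures. The top hits would then go on to the alignment step.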
One thing I learned is that ensemble methods don't always help. When you have a perfect template (99.9% or higher identity), averaging it with lower-quality templates actually makes things worse. So I implemented adaptive thresholding that uses single templates directly when they're excellent, and only activates ensemble methods when they might help.
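The thresholding logic can be sketched as follows. The 99.9% identity cutoff comes from the description above; the function name `choose_templates` and the `max_ensemble` parameter are hypothetical and only serve the illustration.

```python
def choose_templates(candidates, perfect_identity=0.999, max_ensemble=3):
    """Pick the templates to use for one query.

    candidates: list of (template_id, identity) pairs with identity in [0, 1].
    Returns a single ID when the best template is near-perfect, otherwise
    the top max_ensemble IDs to be combined.
    """
    if not candidates:
        return []
    ranked = sorted(candidates, key=lambda item: item[1], reverse=True)
    best_id, best_identity = ranked[0]
    if best_identity >= perfect_identity:
        # An excellent template is used directly; averaging would dilute it.
        return [best_id]
    return [template_id for template_id, _ in ranked[:max_ensemble]]


print(choose_templates([("A", 1.0), ("B", 0.8)]))  # ['A']
print(choose_templates([("A", 0.9), ("B", 0.8), ("C", 0.7), ("D", 0.6)]))  # ['A', 'B', 'C']
```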
- Python 3.8 or higher
- NumPy, SciPy, Pandas
- BioPython
# Clone the repository
git clone https://github.com/AKHIL-149/RNA.git
cd RNA
# Install dependencies
pip install -r requirements.txt

The main submission notebook is rna_structure_prediction_methodology_akhil.ipynb, which contains the complete pipeline with all improvements. You can run it locally or upload it to Kaggle.
For Kaggle:
- Upload rna_structure_prediction_methodology_akhil.ipynb to Kaggle
- Attach your rna-predictions dataset
- Set Internet to OFF, Accelerator to None
- Click "Run All"
import pickle

from src.tbm import TBMPipeline

# Load training data
with open('data/train_coords_dict.pkl', 'rb') as f:
    train_coords = pickle.load(f)
with open('data/train_sequences_dict.pkl', 'rb') as f:
    train_sequences = pickle.load(f)

# Initialize pipeline
pipeline = TBMPipeline(train_coords, train_sequences)

# Generate predictions
query_seq = "AUGCAUGCAUGC..."
predictions = pipeline.predict(query_seq)

RNA/
├── rna_structure_prediction_methodology_akhil.ipynb  # Main submission notebook (V3 with improvements)
├── src/
│   ├── tbm/                       # Template-based modeling implementation
│   │   ├── pipeline.py            # Main TBM pipeline
│   │   ├── similarity.py          # Template search with k-mer indexing
│   │   ├── adaptation.py          # Coordinate transfer and gap filling
│   │   ├── ensemble.py            # Multi-template ensemble methods
│   │   └── fragment_assembly.py   # Fragment-based assembly (experimental)
│   └── evaluation/                # Evaluation metrics
│       └── rmsd_calculator.py
├── notebooks/                     # Research and investigation notebooks
├── data/                          # Training data (not included in repo)
├── DRfold2/                       # Reference deep learning implementation
├── experiments/                   # Evaluation scripts
└── docs/                          # Documentation files
- V3_QUICK_START.md: Quick start guide for using the V3 improved notebook
- V3_IMPROVEMENTS.md: Detailed explanation of V3 improvements and changes
- NOVEL_APPROACHES_RESEARCH.md: Future research directions for improving the system
- GETTING_STARTED_RESEARCH.md: Guide for implementing research improvements
- WINNER_CODE_ANALYSIS.md: Analysis of competition winner's approach
For sequences with good templates, simple coordinate transfer works surprisingly well. The key is having a large, diverse template library and using smart ensemble strategies.
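To make the idea concrete, here is a minimal sketch of coordinate transfer with linear-interpolation gap filling. It assumes the alignment has already been reduced to a per-residue mapping from query index to template index; the function name `transfer_coordinates`, that input format, and the choice to copy the nearest known point for gaps at either end are all assumptions for illustration, not the behavior of `adaptation.py`.

```python
def transfer_coordinates(mapping, template_coords):
    """Copy template coordinates onto the query and interpolate across gaps.

    mapping[i] is the template index aligned to query residue i, or None
    when residue i has no template coverage.
    """
    coords = [template_coords[j] if j is not None else None for j in mapping]
    known = [i for i, c in enumerate(coords) if c is not None]
    if not known:
        raise ValueError("no aligned residues to transfer")
    for i in range(len(coords)):
        if coords[i] is not None:
            continue
        # Nearest residues with transferred coordinates on each side
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            coords[i] = coords[right]   # leading gap: copy nearest point
        elif right is None:
            coords[i] = coords[left]    # trailing gap: copy nearest point
        else:
            t = (i - left) / (right - left)
            coords[i] = tuple(
                a + t * (b - a) for a, b in zip(coords[left], coords[right])
            )
    return coords


template = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
filled = transfer_coordinates([0, None, 1], template)
print(filled[1])  # (1.0, 0.0, 0.0)
```

Interpolated residues lie on a straight segment between their neighbors, which is why long uncovered regions are where this approach loses the most accuracy.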
The biggest challenge is sequences without good templates (like R1117v2 in the test set). Template-based methods struggle here. For these cases, I added an extended chain fallback, but the real solution would be integrating deep learning methods like DRfold2.
This particular test set is favorable for template-based methods: 83% of sequences (10 of 12) have near-perfect templates. In real applications with more novel sequences, performance would be lower (likely in the 0.45-0.55 TM-score range).
When templates are excellent, using them directly outperforms complex ensemble methods. The smart threshold that detects this case was one of the most impactful improvements.
If I continued this project, here's what I would focus on:
- Integrate Deep Learning: Add DRfold2 or similar for novel sequences without templates
- Better Gap Filling: Replace linear interpolation with physics-informed methods
- MSA Integration: Use multiple sequence alignments to improve template selection
- Active Learning: Identify low-confidence predictions for experimental validation
Improvements over the original submission:
- Extended chain fallback for sequences without templates
- Comprehensive error handling throughout the pipeline
- Confidence scoring for all predictions
- Better validation checks
- Offline biopython installation for Kaggle
Expected improvement: +0.02 mean TM-score (0.834 to approximately 0.855)
- Template-based modeling pipeline with k-mer indexing
- Multi-template ensemble methods with smart thresholding
- Five different prediction strategies per sequence
- Mean TM-score: 0.834
This project was developed for the Stanford RNA 3D Folding Competition (February - September 2025). The competition ran for seven months with a $75,000 prize pool. My submission was made after the deadline as a learning exercise.
The competition winner achieved a mean TM-score of 0.578 using a hybrid approach that combines template-based modeling with deep learning (DRfold2). My pure TBM approach is competitive on sequences with good templates but has a gap on novel sequences, which integrating deep learning components would address.
- Zhang & Skolnick (2004). Scoring function for automated assessment of protein structure template quality. Proteins, 57(4), 702-710.
- Stanford RNA 3D Structure Dataset: www.kaggle.com/competitions/stanford-rna-3d-folding/overview/citation
MIT License
Thanks to Stanford DasLab for the RNA 3D folding dataset and competition, and to the BioPython team for their structural biology tools.