RNA 3D Structure Prediction

A template-based modeling approach for predicting RNA 3D structures from sequence data. This project was developed as part of the Stanford RNA 3D Folding Competition.

About This Project

I built this RNA structure prediction pipeline using template-based modeling (TBM), which predicts a new structure by transferring 3D coordinates from known RNA structures (templates) that have similar sequences. The approach is fast and interpretable, and it achieves competitive results on sequences with good template coverage.

Results

On the competition test set, the system achieves:

Mean TM-score: 0.834 (expected with V3 improvements: approximately 0.855)
Success rate: 100% (12/12 sequences with valid predictions)
Runtime: Under 2 minutes for all predictions
10 out of 12 test sequences have near-perfect templates (100% identity)
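
For context, TM-score averages a per-residue term 1 / (1 + (d_i / d0)^2) over the reference length, where d_i is the distance between matched residues after superposition and d0 is a length-dependent scale. The sketch below evaluates that sum for one pair of structures that is already superimposed. The full metric also maximizes over superpositions, so this value is only a lower bound for the same residue pairing. The function name and the choice to pass d0 in explicitly are assumptions for illustration and are not part of this repository.

```python
import math


def tm_score_fixed_superposition(pred, ref, d0):
    """Score one fixed superposition of matched residues (hypothetical helper).

    pred, ref: equal-length sequences of (x, y, z) points, paired by index.
    d0: distance scale chosen by the caller (assumption: supplied externally).
    """
    if len(pred) != len(ref) or len(ref) == 0:
        raise ValueError("pred and ref must be non-empty and the same length")
    if d0 <= 0:
        raise ValueError("d0 must be positive")
    total = 0.0
    for p, r in zip(pred, ref):
        d = math.dist(p, r)  # Euclidean distance, available from Python 3.8
        total += 1.0 / (1.0 + (d / d0) ** 2)
    return total / len(ref)


reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
print(tm_score_fixed_superposition(reference, reference, d0=2.0))  # 1.0
```

A residue displaced by exactly d0 contributes 0.5, which is why small deviations barely lower the score while large ones are strongly penalized.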

The system works best when good templates are available, which is the case for most sequences in the test set. For novel sequences without templates, I implemented an extended chain fallback that provides physically reasonable structures instead of failing completely.

How It Works

The pipeline follows a straightforward approach:

Template Search: Uses k-mer indexing to quickly find similar structures in a database of 3,156 known RNA structures
Sequence Alignment: Aligns the query sequence to template sequences using BioPython
Coordinate Transfer: Maps 3D coordinates from templates to the query based on the alignment
Gap Filling: Uses linear interpolation for regions without template coverage
Ensemble Methods: Combines multiple templates when beneficial (with smart thresholding)
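
As a rough illustration of the template search step, the sketch below builds an inverted index from k-mers to template IDs and ranks templates by how many distinct query k-mers they share. The function names, k=4, and the shared k-mer count as a score are assumptions made for this example. The actual logic lives in src/tbm/similarity.py and may differ.

```python
from collections import defaultdict


def build_kmer_index(sequences, k=4):
    """Map every k-mer to the set of template IDs that contain it."""
    index = defaultdict(set)
    for template_id, seq in sequences.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(template_id)
    return index


def rank_templates(query, index, k=4, top_n=5):
    """Return up to top_n (template_id, shared_kmer_count) pairs, best first."""
    query_kmers = {query[i:i + k] for i in range(len(query) - k + 1)}
    counts = defaultdict(int)
    for kmer in query_kmers:
        for template_id in index.get(kmer, ()):
            counts[template_id] += 1
    # Sort by descending count, then by ID so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


# Toy template library (made-up sequences, not from the training data).
toy_templates = {"t1": "AUGCAUGC", "t2": "GGGGCCCC", "t3": "AUGCAAAA"}
toy_index = build_kmer_index(toy_templates, k=4)
print(rank_templates("AUGCAUGC", toy_index, k=4))  # [('t1', 4), ('t3', 2)]
```

The index is built once, so each query only touches templates that share at least one k-mer instead of being aligned against the whole library.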

Key Innovation: Smart Thresholding

One thing I learned is that ensemble methods don't always help. When a near-perfect template is available (99.9% or higher identity), averaging it with lower-quality templates makes the prediction worse. So I implemented adaptive thresholding that uses a single template directly when it is excellent and only activates ensemble methods when they might help.
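
The thresholding rule can be sketched as follows. The 0.999 cutoff comes from the description above. The function name, the identity-weighted average used for the ensemble branch, and the assumption that all candidate coordinates are already in a common frame are illustrative choices and are not taken from src/tbm/ensemble.py.

```python
def combine_templates(candidates, identity_threshold=0.999):
    """Pick one template or average several (hypothetical helper).

    candidates: list of (identity, coords) pairs, where identity is in [0, 1]
    and coords is a list of (x, y, z) points of equal length across candidates.
    """
    if not candidates:
        raise ValueError("at least one candidate template is required")
    best_identity, best_coords = max(candidates, key=lambda c: c[0])
    if best_identity >= identity_threshold:
        # Near-perfect template: use it directly, skip the ensemble.
        return [tuple(point) for point in best_coords]
    total_weight = sum(identity for identity, _ in candidates)
    if total_weight <= 0:
        raise ValueError("identities must sum to a positive value")
    averaged = []
    for i in range(len(best_coords)):
        # Identity-weighted mean of residue i across all candidates.
        averaged.append(tuple(
            sum(identity * coords[i][axis] for identity, coords in candidates)
            / total_weight
            for axis in range(3)
        ))
    return averaged
```

With candidates at identities 1.0 and 0.5, the first branch returns the 1.0 template unchanged. With 0.75 and 0.25, neither clears the cutoff and the result is the weighted mean.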

Installation

Requirements

Python 3.8 or higher
NumPy, SciPy, Pandas
BioPython

Setup

# Clone the repository
git clone https://github.com/AKHIL-149/RNA.git
cd RNA

# Install dependencies
pip install -r requirements.txt

Usage

Running Predictions

The main submission notebook is rna_structure_prediction_methodology_akhil.ipynb, which contains the complete pipeline with all improvements. You can run it locally or upload it to Kaggle.

For Kaggle:

Upload rna_structure_prediction_methodology_akhil.ipynb to Kaggle
Attach your rna-predictions dataset
Set Internet to OFF, Accelerator to None
Click "Run All"

Using the Pipeline in Code

import pickle
from src.tbm import TBMPipeline

# Load training data
with open('data/train_coords_dict.pkl', 'rb') as f:
    train_coords = pickle.load(f)
with open('data/train_sequences_dict.pkl', 'rb') as f:
    train_sequences = pickle.load(f)

# Initialize pipeline
pipeline = TBMPipeline(train_coords, train_sequences)

# Generate predictions
query_seq = "AUGCAUGCAUGC..."
predictions = pipeline.predict(query_seq)

Project Structure

RNA/
├── rna_structure_prediction_methodology_akhil.ipynb  # Main submission notebook (V3 with improvements)
├── src/
│   ├── tbm/                 # Template-based modeling implementation
│   │   ├── pipeline.py      # Main TBM pipeline
│   │   ├── similarity.py    # Template search with k-mer indexing
│   │   ├── adaptation.py    # Coordinate transfer and gap filling
│   │   ├── ensemble.py      # Multi-template ensemble methods
│   │   └── fragment_assembly.py  # Fragment-based assembly (experimental)
│   └── evaluation/          # Evaluation metrics
│       └── rmsd_calculator.py
├── notebooks/               # Research and investigation notebooks
├── data/                    # Training data (not included in repo)
├── DRfold2/                 # Reference deep learning implementation
├── experiments/             # Evaluation scripts
└── docs/                    # Documentation files

Documentation

V3_QUICK_START.md: Quick start guide for using the V3 improved notebook
V3_IMPROVEMENTS.md: Detailed explanation of V3 improvements and changes
NOVEL_APPROACHES_RESEARCH.md: Future research directions for improving the system
GETTING_STARTED_RESEARCH.md: Guide for implementing research improvements
WINNER_CODE_ANALYSIS.md: Analysis of competition winner's approach

What I Learned

Template-Based Modeling Works Well

For sequences with good templates, simple coordinate transfer works surprisingly well. The key is having a large, diverse template library and using smart ensemble strategies.
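
A minimal sketch of coordinate transfer followed by linear gap filling is shown below. It assumes the alignment is already available as (query_index, template_index) pairs and represents coordinates as plain tuples to stay dependency-free. Both function names are hypothetical and this is not the code in src/tbm/adaptation.py.

```python
def transfer_coordinates(query_len, aligned_pairs, template_coords):
    """Copy template coordinates onto aligned query positions; gaps stay None."""
    coords = [None] * query_len
    for query_idx, template_idx in aligned_pairs:
        coords[query_idx] = tuple(float(v) for v in template_coords[template_idx])
    return coords


def fill_gaps_linear(coords):
    """Fill None entries by interpolating between the nearest known neighbors."""
    known = [i for i, point in enumerate(coords) if point is not None]
    if not known:
        raise ValueError("no aligned positions to interpolate from")
    filled = list(coords)
    # Terminal gaps have only one neighbor, so copy it (a crude choice).
    for i in range(known[0]):
        filled[i] = coords[known[0]]
    for i in range(known[-1] + 1, len(coords)):
        filled[i] = coords[known[-1]]
    # Interior gaps: walk each pair of consecutive known positions.
    for left, right in zip(known, known[1:]):
        for i in range(left + 1, right):
            frac = (i - left) / (right - left)
            filled[i] = tuple(
                a + frac * (b - a) for a, b in zip(coords[left], coords[right])
            )
    return filled


partial = transfer_coordinates(5, [(0, 0), (4, 1)], [(0, 0, 0), (4, 8, 12)])
print(fill_gaps_linear(partial)[2])  # (2.0, 4.0, 6.0)
```

Copying the nearest coordinate into terminal gaps stacks residues on one point, which is one reason a fallback or a better gap-filling method is worth having for poorly covered sequences.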

Novel Sequences Are Hard

The biggest challenge is sequences without good templates (like R1117v2 in the test set). Template-based methods struggle here. For these cases, I added an extended chain fallback, but the real solution would be integrating deep learning methods like DRfold2.
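
One simple way to build an extended chain is to place residues at even intervals along a straight line, as sketched below. The function name and the default spacing of 6.0 (in the same units as the template coordinates) are placeholder assumptions for this example and are not taken from the notebook.

```python
def extended_chain(n_residues, spacing=6.0):
    """Return n_residues points evenly spaced along the x-axis.

    spacing is a placeholder inter-residue distance (assumption), so tune it
    to the atom type being predicted.
    """
    if n_residues < 0:
        raise ValueError("n_residues must be non-negative")
    if spacing <= 0:
        raise ValueError("spacing must be positive")
    return [(i * spacing, 0.0, 0.0) for i in range(n_residues)]


print(extended_chain(3, spacing=2.0))
# [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
```

A straight chain will not score well, but it guarantees a finite, non-overlapping structure for every query so the pipeline never returns an empty prediction.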

Dataset Characteristics Matter

This particular test set is favorable for template-based methods: 83% of sequences (10 of 12) have near-perfect templates. In real applications with more novel sequences, I would expect performance to be lower, likely in the 0.45-0.55 TM-score range.

Simple Can Be Better

When templates are excellent, using them directly outperforms complex ensemble methods. The smart threshold that detects this case was one of the most impactful improvements.

Future Work

If I continued this project, here's what I would focus on:

Integrate Deep Learning: Add DRfold2 or similar for novel sequences without templates
Better Gap Filling: Replace linear interpolation with physics-informed methods
MSA Integration: Use multiple sequence alignments to improve template selection
Active Learning: Identify low-confidence predictions for experimental validation

Version History

Version 3 (November 2025)

Improvements over the original submission:

Extended chain fallback for sequences without templates
Comprehensive error handling throughout the pipeline
Confidence scoring for all predictions
Better validation checks
Offline biopython installation for Kaggle

Expected improvement: +0.02 mean TM-score (0.834 to approximately 0.855)

Version 2 (Original Submission)

Template-based modeling pipeline with k-mer indexing
Multi-template ensemble methods with smart thresholding
Five different prediction strategies per sequence
Mean TM-score: 0.834

Competition Context

This project was developed for the Stanford RNA 3D Folding Competition (February - September 2025). The competition ran for seven months with a $75,000 prize pool. My submission was made after the deadline as a learning exercise.

The competition winner achieved a mean TM-score of 0.578 using a hybrid approach that combines template-based modeling with deep learning (DRfold2). My pure TBM approach is competitive when good templates exist but has a gap on novel sequences, which integrating deep learning components would address.

References

Zhang, Y., & Skolnick, J. (2004). Scoring function for automated assessment of protein structure template quality. Proteins, 57(4), 702-710.
Stanford RNA 3D Structure Dataset: www.kaggle.com/competitions/stanford-rna-3d-folding/overview/citation

License

MIT License

Acknowledgments

Thanks to Stanford DasLab for the RNA 3D folding dataset and competition, and to the BioPython team for their structural biology tools.
