shehrozashoaib/LLM_Hackathon_Materials

RAG + LLMs for Perovskite Stability (T80) Discovery

Generate novel ABX₃ perovskite candidates with longer shelf-life, pre-rank them with a T80 surrogate model, then rerank the Top-25 with RAG-assisted LLMs (Qwen, Llama, Flan) using evidence from the literature.

Why this? Fine-tuning LLMs directly on raw tabular data led to hallucinations. Using LLMs for what they do best (reading and synthesizing papers) yields consistent, defensible rerankings when paired with a transparent surrogate.

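Pipeline

Train the T80 surrogate, generate and score ABX₃ candidates, then rerank the Top-25 with RAG + LLMs (schematic: Untitled Diagram.jpg).
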
Repository

├─ train_surrogate_predictor.py   # train T80 surrogate & feature space
├─ generate_and_score.py          # sample/validate ABX3, predict T80, rank & export
├─ Qwen_Notebook.ipynb            # RAG + JSON scoring for Qwen
├─ lLama_Notebook.ipynb           # RAG + JSON scoring for Llama
├─ Flan_Notebook.ipynb            # RAG + JSON scoring for Flan
└─ Untitled Diagram.jpg           # schematic

Setup

# Python 3.10+ recommended
pip install -U numpy pandas scikit-learn xgboost joblib tqdm
# Notebooks may also need: transformers sentence-transformers chromadb

GPU is optional; it helps for embeddings/LLM steps in notebooks.


Data

  • rows_with_T80.csv — cleaned historical dataset with T80
  • A folder of papers (PDF/HTML) for retrieval (used by notebooks)

Quickstart

1) Train the surrogate

python train_surrogate_predictor.py \
  --data rows_with_T80.csv \
  --out t80_surrogate_xgb.joblib \
  --feature-space feature_space.json

Outputs

  • t80_surrogate_xgb.joblib — trained model pipeline
  • feature_space.json — allowed ions/tokens + radii map (for chemistry checks)

2) Generate & score candidates (ABX₃)

python generate_and_score.py \
  --goal 1000 \
  --allow-pb 1 \
  --n 8000 \
  --workers 6 \
  --model t80_surrogate_xgb.joblib \
  --feature-space feature_space.json \
  --existing rows_with_T80.csv

What happens

  • Samples A/B/X within feature space, enforces charge neutrality & tolerance factors (t, μ)
  • Predicts log10(T80) via surrogate; computes novelty vs historical tokens
  • Ranking: meets_goal, then pred_T80_h, then novelty
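The chemistry checks named above can be sketched as follows. This is a minimal illustration, not the code in generate_and_score.py: the function names, the example radii, and the t and μ windows are assumptions chosen for the example, and the real windows live in the script's own settings. It uses the standard Goldschmidt tolerance factor t = (r_A + r_X) / (√2 (r_B + r_X)) and octahedral factor μ = r_B / r_X.

```python
import math

def charge_neutral(q_a: int, q_b: int, q_x: int) -> bool:
    """ABX3 is charge neutral when q_A + q_B + 3*q_X == 0."""
    return q_a + q_b + 3 * q_x == 0

def tolerance_factor(r_a: float, r_b: float, r_x: float) -> float:
    """Goldschmidt tolerance factor t = (r_A + r_X) / (sqrt(2) * (r_B + r_X))."""
    return (r_a + r_x) / (math.sqrt(2) * (r_b + r_x))

def octahedral_factor(r_b: float, r_x: float) -> float:
    """Octahedral factor mu = r_B / r_X."""
    return r_b / r_x

def passes_checks(r_a, r_b, r_x, q_a, q_b, q_x,
                  t_window=(0.80, 1.00), mu_window=(0.44, 0.90)) -> bool:
    # The windows are illustrative defaults (assumption), not the repo's values.
    t = tolerance_factor(r_a, r_b, r_x)
    mu = octahedral_factor(r_b, r_x)
    return (charge_neutral(q_a, q_b, q_x)
            and t_window[0] <= t <= t_window[1]
            and mu_window[0] <= mu <= mu_window[1])

# Assumed example radii in angstroms for an A+/B2+/X- composition.
print(passes_checks(2.17, 1.19, 2.20, +1, +2, -1))  # prints True
```

With these example radii, t is about 0.91 and μ about 0.54, so the candidate sits inside both assumed windows.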

Output

  • top_candidates.csv (up to top 500)
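As a rough sketch of how such a ranked export could be produced: the snippet below converts a log10(T80) prediction to hours, flags candidates that reach the goal, and sorts by meets_goal, then pred_T80_h, then novelty. The field names pred_log10_T80 and id, the descending sort order, and the reading of --goal as hours are assumptions made for the example; the script itself may differ.

```python
GOAL_HOURS = 1000.0  # mirrors --goal 1000; the unit (hours) is assumed from pred_T80_h

def rank_candidates(cands, goal_h=GOAL_HOURS, top_k=500):
    """Sort by meets_goal, then pred_T80_h, then novelty (descending order is an assumption)."""
    scored = []
    for c in cands:
        pred_h = 10.0 ** c["pred_log10_T80"]  # the surrogate predicts log10(T80)
        scored.append({**c, "pred_T80_h": pred_h, "meets_goal": pred_h >= goal_h})
    scored.sort(key=lambda c: (c["meets_goal"], c["pred_T80_h"], c["novelty"]),
                reverse=True)
    return scored[:top_k]

# Toy candidates with made-up values, only to show the ordering.
toy = [
    {"id": "c1", "pred_log10_T80": 2.5, "novelty": 0.9},
    {"id": "c2", "pred_log10_T80": 3.2, "novelty": 0.1},
    {"id": "c3", "pred_log10_T80": 3.1, "novelty": 0.5},
]
print([c["id"] for c in rank_candidates(toy)])  # prints ['c2', 'c3', 'c1']
```

Sorting on a tuple keeps the priority explicit: novelty only breaks ties among candidates with the same goal flag and predicted T80.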

3) RAG + LLM reranking (Top-25)

Open and run these notebooks (set top_candidates.csv path at the top):

  • Qwen_Notebook.ipynb
  • lLama_Notebook.ipynb
  • Flan_Notebook.ipynb

Each notebook:

  1. Builds/uses a small RAG index of papers (Nature, Joule, NREL, …).
  2. For Top-25 candidates, retrieves evidence and prompts the model to return JSON:
    viability_score, consistency_score, risks, notes, cites.
  3. Saves per-model CSVs and an optional HTML leaderboard (side-by-side model Top-10 + sortable Top-25).
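Model replies often wrap the requested JSON in extra text, so a tolerant parser helps before scores are written to CSV. The sketch below is one possible way to do this, not the notebooks' actual code: parse_llm_json is a hypothetical helper, the sample reply is made up, and the score scale is not specified in this README. Only the five field names come from the step list above.

```python
import json

REQUIRED = ("viability_score", "consistency_score", "risks", "notes", "cites")

def parse_llm_json(text: str) -> dict:
    """Extract the first JSON object from a model reply and check the expected fields."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and ignores trailing text.
            obj, _ = decoder.raw_decode(text, start)
            break
        except json.JSONDecodeError:
            start = text.find("{", start + 1)
    else:
        raise ValueError("no JSON object found in reply")
    missing = [k for k in REQUIRED if k not in obj]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return obj

# Made-up reply, for illustration only.
reply = ('Here is my assessment: {"viability_score": 0.7, "consistency_score": 0.8, '
         '"risks": ["example risk"], "notes": "example note", "cites": ["doc_1"]} Done.')
parsed = parse_llm_json(reply)
print(parsed["viability_score"], parsed["cites"])
```

Raising on missing fields makes a malformed reply visible immediately, so it can be retried or skipped rather than silently scored.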

Typical outputs

  • reranked_qwen.csv, reranked_llama.csv, reranked_flan.csv
  • leaderboard_demo_side_by_side.html

(Optional) Combine per-model results into one Top-25 leaderboard

import json
import pandas as pd

def _keyify(df):
    """Add a canonical cand_key (A|B|X|additives) unless the column already exists."""
    def canon(x):
        # A/B/X may be stored as JSON strings; re-dump with sorted keys so equal compositions match.
        if isinstance(x, str):
            try:
                x = json.loads(x)
            except json.JSONDecodeError:
                pass  # plain, non-JSON string: keep it as-is
        return json.dumps(x, sort_keys=True)
    def k(r):
        return "|".join([canon(r["A"]), canon(r["B"]), canon(r["X"]), str(r.get("additives", ""))])
    if "cand_key" not in df.columns:
        df = df.copy()
        df["cand_key"] = df.apply(k, axis=1)
    return df

# Top-25 by surrogate prediction, plus each model's reranked scores.
base  = _keyify(pd.read_csv("top_candidates.csv")).sort_values("pred_T80_h", ascending=False).head(25)
qwen  = _keyify(pd.read_csv("reranked_qwen.csv"))
llama = _keyify(pd.read_csv("reranked_llama.csv"))
flan  = _keyify(pd.read_csv("reranked_flan.csv"))

# Left joins keep every surrogate Top-25 row even if a model has no score for it.
out = (base[["cand_key","A","B","X","additives","pred_T80_h"]]
       .merge(qwen[["cand_key","final_qwen","llm_viability"]].rename(columns={"llm_viability":"llm_viability_qwen"}),  on="cand_key", how="left")
       .merge(llama[["cand_key","final_llama","llm_viability"]].rename(columns={"llm_viability":"llm_viability_llama"}), on="cand_key", how="left")
       .merge(flan[["cand_key","final_flan","llm_viability"]].rename(columns={"llm_viability":"llm_viability_flan"}),   on="cand_key", how="left"))
out.to_csv("leaderboard_top25_merged.csv", index=False)
print("saved leaderboard_top25_merged.csv")

Notes

  • Tune chemistry windows in generate_and_score.py (within_bounds for t, μ) and --novelty-floor.
  • Large models can be loaded 4/8-bit in notebooks to fit consumer GPUs.

License

MIT LICENSE.


Summary

Building on our T80 surrogate and cleaned MaterialsZone data, we added literature-grounded RAG + LLM reranking. This repo ships constrained Top-25 generation, JSON scoring and notes, and a sortable dashboard with side-by-side model ranks.
