This repository contains code for a machine learning project that evaluates the quality of scientific hypotheses. A two-stage neural network predicts hypothesis quality from paper abstracts posted to bioRxiv and medRxiv.
The project uses data from:
- bioRxiv and medRxiv: Paper titles and abstracts.
- PubMed Relative Citation Ratio (RCR): Used as a proxy label for hypothesis quality (a fetch sketch follows this list).
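As an illustration of how these proxy labels can be retrieved, below is a minimal sketch that pulls RCR values from the NIH iCite API (https://icite.od.nih.gov/api); the repository's own querying code may batch requests and handle errors differently.

```python
import requests

def fetch_rcr(pmids):
    """Fetch Relative Citation Ratio (RCR) values for a list of PMIDs
    from the NIH iCite API."""
    resp = requests.get(
        "https://icite.od.nih.gov/api/pubs",
        params={"pmids": ",".join(str(p) for p in pmids)},
        timeout=30,
    )
    resp.raise_for_status()
    # Each record exposes a "relative_citation_ratio" field
    # (None for papers too recent to have one).
    return {rec["pmid"]: rec.get("relative_citation_ratio")
            for rec in resp.json()["data"]}
```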
The data processing pipeline includes:
- PubMed Querying: Fetching relevant metadata.
- Abstract Processing: Cleaning and preparing abstract text for embedding generation (sketch below).
- Hypothesis Distillation (using Llama): Extracting each abstract's core hypothesis as a question and summarizing its background (sketch below).
- Embedding Generation: Creating text embeddings with both BioBERT and Llama models (sketch below).
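For the abstract-processing step, a minimal cleaning sketch is shown below; the exact rules used in this repository may differ.

```python
import re

def clean_abstract(text: str) -> str:
    """Prepare abstract text for embedding: strip HTML remnants,
    drop common section labels, and normalize whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)  # remove HTML tags
    text = re.sub(r"\b(Abstract|Background|Methods|Results|Conclusions?)[:.]\s*",
                  "", text)               # drop boilerplate section labels
    return re.sub(r"\s+", " ", text).strip()
```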
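For hypothesis distillation, the sketch below shows the general shape of the prompt; the model checkpoint and exact wording are illustrative assumptions, not the repository's configuration.

```python
from transformers import pipeline

# Illustrative checkpoint; the repository may use a different Llama variant.
# Chat-style inputs require a recent version of transformers.
generator = pipeline("text-generation", model="meta-llama/Llama-3.1-8B-Instruct")

def distill(abstract: str) -> str:
    """Ask the model for the core hypothesis (as a question) plus a
    short background summary."""
    messages = [
        {"role": "user", "content": (
            "From the abstract below, (1) state its core hypothesis as a "
            "single question and (2) summarize the background in two or "
            "three sentences.\n\n" + abstract)},
    ]
    out = generator(messages, max_new_tokens=256)
    return out[0]["generated_text"][-1]["content"]  # the assistant reply
```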
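And for embedding generation, here is a sketch using the standard `dmis-lab/biobert-v1.1` checkpoint; mean pooling over tokens is an assumption (the repository may pool differently).

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-v1.1")
model = AutoModel.from_pretrained("dmis-lab/biobert-v1.1")

@torch.no_grad()
def embed(text: str) -> torch.Tensor:
    """Mean-pooled BioBERT embedding for one text (shape: 1 x 768)."""
    inputs = tokenizer(text, truncation=True, max_length=512,
                       return_tensors="pt")
    hidden = model(**inputs).last_hidden_state        # (1, seq_len, 768)
    mask = inputs["attention_mask"].unsqueeze(-1)     # ignore padding tokens
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)
```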
- Run `gordon_training.py` to train the two-stage scoring model, including hyperparameter tuning with Optuna (a sketch of the tuning loop follows this list):

  ```
  python gordon_training.py
  ```

  Trained model weights for Stage 1 and Stage 2 will be saved as `.pth` files (e.g., `gordonramsay_stage1_BioBERT.pth`, `gordonramsay_stage2_BioBERT.pth`).
- Run `gordon.py` to compare two hypotheses (a comparison sketch also follows this list). The script loads the trained two-stage scoring model (Llama or BioBERT, depending on configuration), prompts you to enter two hypotheses and their background texts, and reports how the two compare:

  ```
  python gordon.py
  ```
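To give a feel for the training script, below is a minimal Optuna sketch with a synthetic stand-in dataset; the actual model classes, search space, and training loop live in `gordon_training.py`.

```python
import optuna
import torch
import torch.nn as nn

# Synthetic stand-ins for (embedding, label) pairs; the real pipeline feeds
# BioBERT/Llama embeddings with RCR-derived proxy labels instead.
X_train, y_train = torch.randn(256, 768), torch.rand(256, 1)
X_val, y_val = torch.randn(64, 768), torch.rand(64, 1)

def objective(trial):
    # Hypothetical search space; the actual ranges are defined in the script.
    lr = trial.suggest_float("lr", 1e-5, 1e-2, log=True)
    hidden = trial.suggest_int("hidden_dim", 64, 512)
    model = nn.Sequential(nn.Linear(768, hidden), nn.ReLU(),
                          nn.Linear(hidden, 1))
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(20):                     # short training loop per trial
        opt.zero_grad()
        loss = loss_fn(model(X_train), y_train)
        loss.backward()
        opt.step()
    with torch.no_grad():                   # return validation loss to Optuna
        return loss_fn(model(X_val), y_val).item()

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=20)
print(study.best_params)
```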
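The comparison step in `gordon.py` might look roughly like the sketch below. The stage architectures and the way hypothesis and background features combine are assumptions, and `embed()` refers to the BioBERT pooling sketch shown earlier.

```python
import torch
import torch.nn as nn

# Hypothetical stage definitions; the real architectures are defined in the
# repository's training code and must match the saved .pth weights.
stage1 = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 64))
stage2 = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))
stage1.load_state_dict(torch.load("gordonramsay_stage1_BioBERT.pth"))
stage2.load_state_dict(torch.load("gordonramsay_stage2_BioBERT.pth"))

@torch.no_grad()
def score(hypothesis: str, background: str) -> float:
    """Stage 1 encodes hypothesis and background separately; Stage 2
    scores their concatenated features (an assumed combination)."""
    feats = torch.cat([stage1(embed(hypothesis)),
                       stage1(embed(background))], dim=-1)
    return stage2(feats).item()

a = score("Hypothesis A ...", "Background A ...")
b = score("Hypothesis B ...", "Background B ...")
print("A is stronger" if a > b else "B is stronger")
```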
The project supports two types of text embeddings:
- BioBERT Embeddings: Domain-specific biomedical embeddings that are efficient to compute (see the pooling sketch above).
- Llama Embeddings: General-purpose, high-quality embeddings from a large language model, capturing broader semantic nuance (a sketch follows).
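A Llama embedding can be extracted in much the same way, as in the hedged sketch below; the checkpoint and last-token pooling are assumptions, since Llama has no [CLS] token and the repository's pooling strategy may differ.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative (gated) checkpoint; requires Hugging Face access approval.
name = "meta-llama/Llama-3.1-8B"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, torch_dtype=torch.float16)

@torch.no_grad()
def llama_embed(text: str) -> torch.Tensor:
    """Last-token hidden state as a sentence embedding (a common choice
    for decoder-only models)."""
    inputs = tokenizer(text, return_tensors="pt")
    hidden = model(**inputs).last_hidden_state   # (1, seq_len, hidden_dim)
    return hidden[:, -1, :]                      # final non-padding token
```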