GPerturb: Additive, multivariate, sparse distributional regression model for perturbation effect estimation
This repository hosts the implementation of GPerturb (link), a Bayesian model to identify and estimate interpretable, sparse gene-level perturbation effects.
The package is developed under Python>=3.8.0 and requires the following packages:
matplotlib>=3.7.0
numpy==1.26.4
pandas>=2.0.0
torch==2.2.2
The GPerturb package is tested on Windows, Mac OS and Ubuntu 16.04 systems.
GPerturb can be installed using
pip install git+https://github.com/hwxing3259/GPerturb.git
from GPerturb import *
Alternatively, one can download the Python file directly into the current working directory and call
from GPerturb_model import *
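After installation, it can be useful to confirm which versions of the required packages are present in the environment. The following is a minimal sketch using only the standard library (importlib.metadata, available from Python 3.8). The helper name installed_versions is hypothetical and not part of GPerturb; it only reports versions and does not compare them against the requirements listed above.

```python
from importlib import metadata

# Package names as listed in the requirements above.
REQUIRED = ["matplotlib", "numpy", "pandas", "torch"]


def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(REQUIRED).items():
        print(f"{name}: {version or 'not installed'}")
```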
Here we use the SciPlex2 dataset from Lotfollahi et al. (2023) as an example to demonstrate the Gaussian GPerturb pipeline:
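import numpy as np
import pandas as pd
import scanpy as sc  # assumed: sc refers to scanpy
import torch
from GPerturb import *

# use the GPU when available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

adata = sc.read('SciPlex2_new.h5ad')
torch.manual_seed(3141592)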
# load data:
my_conditioner = pd.read_csv("SciPlex2_perturbation.csv", index_col=0)
my_conditioner = my_conditioner.drop('Vehicle', axis=1) # TODO: or retaining it
cond_name = list(my_conditioner.columns)
my_conditioner = torch.tensor(my_conditioner.to_numpy() * 1.0, dtype=torch.float)
my_conditioner = torch.pow(my_conditioner, 0.2) # a power transformation of dosages
my_observation = pd.read_csv("SciPlex2.csv", index_col=0)
print(my_observation.shape)
my_observation = torch.tensor(my_observation.to_numpy() * 1.0, dtype=torch.float)
gene_name = list(pd.read_csv('SciPlex2_gene_name.csv').to_numpy()[:, 0])
my_cell_info = pd.read_csv("SciPlex2_cell_info.csv", index_col=0)
my_cell_info.n_genes = my_cell_info.n_genes/my_cell_info.n_counts
my_cell_info.n_counts = np.log(my_cell_info.n_counts)
cell_info_names = list(my_cell_info.columns)
my_cell_info = torch.tensor(my_cell_info.to_numpy() * 1.0, dtype=torch.float)
output_dim = my_observation.shape[1]
sample_size = my_observation.shape[0]
hidden_node = 700
hidden_layer = 4
conditioner_dim = my_conditioner.shape[1]
cell_info_dim = my_cell_info.shape[1]
lr_parametric = 1e-3
tau = torch.tensor(1.).to(device)
parametric_model = GPerturb_Gaussian(conditioner_dim=conditioner_dim, output_dim=output_dim, base_dim=cell_info_dim,
                                     data_size=sample_size, hidden_node=hidden_node, hidden_layer_1=hidden_layer,
                                     hidden_layer_2=hidden_layer, tau=tau)
parametric_model.test_id = testing_idx = list(np.random.choice(a=range(my_observation.shape[0]), size=my_observation.shape[0] // 8, replace=False))
parametric_model = parametric_model.to(device)
# train the model from scratch
parametric_model.GPerturb_train(epoch=250, observation=my_observation, cell_info=my_cell_info, perturbation=my_conditioner,
                                lr=lr_parametric, device=device)
# compute model estimates for the held-out test cells
fitted_vals = Gaussian_estimates(model=parametric_model, obs=my_observation[parametric_model.test_id],
                                 cond=my_conditioner[parametric_model.test_id], cell_info=my_cell_info[parametric_model.test_id])
The code above takes roughly 1.5 hours to run on our desktop computer with 16 GB RAM, an AMD Ryzen 7 5700X processor and an Nvidia RTX 2060 GPU.
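To make the preprocessing in the example more concrete, here is a minimal sketch on made-up numbers using only the standard library. It mirrors three steps from the code above: the power transformation of dosages (exponent 0.2), the two cell-level covariates (n_genes divided by n_counts, and the log of n_counts), and holding out one eighth of the cells as a test set. All values are hypothetical and are not taken from SciPlex2. The random module stands in for np.random.choice purely for illustration.

```python
import math
import random

# Hypothetical toy values, not taken from SciPlex2.
doses = [0.0, 1.0, 10.0, 100.0, 1000.0]
n_counts = [1200, 5400, 800]
n_genes = [600, 1800, 500]

# Power transformation of dosages, as in torch.pow(my_conditioner, 0.2):
# zero stays zero and large doses are compressed.
transformed_doses = [d ** 0.2 for d in doses]

# Cell-level covariates: genes detected per count, and log total counts.
genes_per_count = [g / c for g, c in zip(n_genes, n_counts)]
log_counts = [math.log(c) for c in n_counts]

# Hold out one eighth of the cells without replacement, mirroring
# np.random.choice(..., size=sample_size // 8, replace=False).
rng = random.Random(0)
sample_size = 80
test_id = rng.sample(range(sample_size), sample_size // 8)
train_id = sorted(set(range(sample_size)) - set(test_id))

print(transformed_doses)
print(genes_per_count, log_counts)
print(len(test_id), len(train_id))
```

Note that the example above seeds only torch; if a reproducible test split is needed, the NumPy random state would have to be seeded as well.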
Users need to provide three data matrices: A
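In the SciPlex2 example, the three arrays passed to GPerturb_train (observation, perturbation and cell_info) are all indexed by cell along their first dimension, and their second dimensions give output_dim, conditioner_dim and cell_info_dim respectively. The sketch below checks that layout on toy inputs. The helper check_inputs is hypothetical and not part of the GPerturb package, and the numbers are made up.

```python
def check_inputs(observation, perturbation, cell_info):
    """Check that the three matrices have one row per cell and report dims.

    Relies only on len(), so it accepts nested lists, NumPy arrays,
    or torch tensors.
    """
    n_cells = len(observation)
    if len(perturbation) != n_cells or len(cell_info) != n_cells:
        raise ValueError("all three matrices need one row per cell")
    return {
        "sample_size": n_cells,
        "output_dim": len(observation[0]),
        "conditioner_dim": len(perturbation[0]),
        "cell_info_dim": len(cell_info[0]),
    }


# Hypothetical toy inputs: 3 cells, 4 genes, 2 perturbations, 2 covariates.
observation = [[0.1, 0.0, 2.3, 1.1], [0.0, 0.4, 1.9, 0.7], [1.2, 0.0, 0.0, 0.3]]
perturbation = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
cell_info = [[0.5, 7.1], [0.3, 8.6], [0.6, 6.7]]

dims = check_inputs(observation, perturbation, cell_info)
print(dims)
```

Running such a check before training can surface misaligned rows early, before the much longer training run starts.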
Code for reproducing the LUHMES example: Link
Code for reproducing the TCells example: Link
Code for reproducing the SciPlex2 example: Link
Code for reproducing the Replogle et al. (2022) example: Link
Pre-trained models and datasets can be downloaded from: Link
