Here, we present a novel NMF approach that applies Graph Regularization Across Contextual KnowLedgE (GRACKLE) to learn latent representations. GRACKLE constrains the unsupervised NMF model using both sample similarity and gene similarity matrices. The model is flexible to incorporate multiple forms of sample metadata, such as phenotype or subtype labels, as well as molecular relationships, such as gene-gene interactions or pathway annotation.
A schematic of GRACKLE is shown below.
Before matrix decomposition, the sample metadata is transformed into the sample similarity matrix
where
The model is iteratively trained until a stopping criterion of relative change in
- R: A statistical computing language and environment.
- Download and install the latest version of R from {Link: CRAN https://cran.r-project.org/}
Clone the GRACKLE Repository: To get started, clone the GRACKLE repository from GitHub:
git clone https://github.com/lagillenwater/GRACKLE.git
cd GRACKLETo run GRACKLE, ensure you have the following R packages installed. Follow these steps:
-
Install Required Packages: Open R and run the following commands to install the necessary packages:
install.packages(c("tidyverse", "igraph", "parallel", "optparse", "devtools", "reticulate", "tensorflow"))
-
Install
sgnesRfrom GitHub: ThesgnesRpackage is required for simulating gene expression data. Install it using the following command:devtools::install_github("shaileshtripathi/sgnesR")
-
Set Up Python Environment: GRACKLE uses Python via the
reticulatepackage. Ensure you have a virtual environment set up and activate it:library(reticulate) use_virtualenv("/path/to/env")
-
Additional Steps To Install TensorFlow: After installing the
tensorflowR package, run the following command in R to install TensorFlow:library(tensorflow) install_tensorflow() -
Verify Installation: To ensure TensorFlow is installed correctly, run:
library(tensorflow) tf$constant("Hello, TensorFlow!")
-
GPU Support (Optional): If you want to enable GPU support, ensure you have the necessary CUDA and cuDNN libraries installed. Then, install TensorFlow with GPU support:
install_tensorflow(version = "gpu")
For more details, refer to the official TensorFlow for R documentation: TensorFlow for R.
To establish a performance benchmark, we compared GRACKLE to other NMF models using simulated gene expression generated data. The simulated profiles were based on transcription factor (TF)-gene regulatory relationships inferred from breast tissue from GTEx using the Passing Attributes between Networks for Data Assimilation (PANDA) algorithm (Glass et al. 2013; Lonsdale et al. 2013). Gene expression profiles were then simulated using the inferred gene regulatory network as input to the Stochastic Gene Network Simulator (SGNSim) (Ribeiro and Lloyd-Price 2007; Tripathi et al. 2017). In addition, since the aim of GRACKLE is to identify subgroup-specific gene regulatory patterns, we isolated 5 network modules and systematically upregulated genes in these 5 modules in a set of samples to simulate subgroup-specific gene regulatory activation.
We assessed the ability of GRACKLE and the other algorithms to decompose the gene expression data into latent variables that matched the simulated subgroups (i.e., gene module differential gene regulation) (Fig. 2). We randomly stratified the simulated profiles into a 70/30 training/testing split, then calculated the decomposed W and H matrices from the training data. We next projected the testing gene expression data into W using nonnegative least squares based on the trained H matrix. We calculated the accuracy of alignment to the subgroups based on whether the corresponding latent variable in the decomposed sample and gene matrices have the highest loadings for samples and genes of an assigned subgroup. For example, we consider a result to be accurate if the samples in subgroup A and the corresponding upregulated genes in module A have the highest loadings in LV1 of the decomposed sample and gene matrices W and H. We reported the average accuracy over all subgroups. The overall process is outlined in the schematic below:
In the simulation studies, we evaluated performance over λ_1 (penalization for sample similarity, S_S) and λ_2 (penalization for gene similarity, S_G) values [0, 1] at an interval of 0.1. We tested GRACKLE over varying levels of background gene expression noise, decreased network modularity, and increased network transitivity (i.e., the percent of graph nodes involved in triangles). We benchmarked GRACKLE against three comparable algorithms: NMF, GNMF, and a prior informed graph regularized NMF inspired by netNMF-sc, which we call pr-GNMF. The NMF model served as a baseline and corresponds to λ_1 =0 and λ_2 =0. For GNMF, the affinity matrix for the gene regularization was calculated with k-nearest neighbors and the λ parameter, which affects the degree of graph regularization, was optimized using the parameters defined by Cai et al. (number of nearest neighbors = 5, λ = [1,10, 10^2, 10^3, 10^4] (Cai et al. 2011) ). For pr-GNMF, the same GRN used for GRACKLE was used for regularization of the gene similarity matrix. For both GNMF and pr-GNMF, graph regularization was performed for the maximum value of λ_2 tested. To avoid overfitting, we calculated the average performance over 100 iterations.