Analysis of Reactome pathway activities in PBMCs with VEGA

In this tutorial, we will demonstrate a use case for VEGA on single-cell PBMCs from Kang et al. The dataset is composed of 2 conditions (cells stimulated with interferon-beta and control cells). We will analyze the GMVs (Gene Module Variables) activities of the two populations.

[1]:

# Load VEGA and packages
import sys
import vega
import pandas as pd
import scanpy as sc
# Ignore warnings in this tutorial
import warnings
warnings.filterwarnings('ignore')

Loading and preparing data

We will load the PBMC data. The data were already preprocessed. As shown, the data is comprised of 16893 cells with 6998 highly variable genes.

[2]:

adata = vega.data.pbmc()
print(adata)

AnnData object with n_obs × n_vars = 16893 × 6998
    obs: 'condition', 'n_counts', 'n_genes', 'mt_frac', 'cell_type'
    var: 'gene_symbol', 'n_cells'
    uns: 'cell_type_colors', 'condition_colors', 'neighbors'
    obsm: 'X_pca', 'X_tsne', 'X_umap'
    obsp: 'distances', 'connectivities'

VEGA is built around the scvi-tools ecosystem. As such, we need to run our own version of setup_anndata, which in addition to locating covariates and important additional information also initializes attributes in the Anndata object that will be used by VEGA.

[3]:

vega.utils.setup_anndata(adata)

Running VEGA and SCVI setup...
INFO     No batch_key inputted, assuming all cells are same batch
INFO     No label_key inputted, assuming all cells have same label
INFO     Using data from adata.X
INFO     Computing library size prior per batch
INFO     Successfully registered anndata object containing 16893 cells, 6998 vars, 1 batches,
         1 labels, and 0 proteins. Also registered 0 extra categorical covariates and 0 extra
         continuous covariates.
INFO     Please do not further modify adata until model is trained.

Creating and training the VEGA model

Similarly to scvi-tools, it is very straightforward to create and train VEGA models. In addition to specifying the input Anndata object, we also specify the path to .gmt files that will be used to initialize VEGA’s latent space, as well as the number of unnannotated variables to model with the add_nodes argument. Additionally, we force the weights of VEGA’s linear decoder to be positive to aid interpretability when running differential activity tests later.

Note 1: The model is initialized with a masked decoder as described in the original VEGA paper, but other regularization strategies are available (l1, ElasticNet, GelNet…)*

Note 2: Change the gmt_paths variable to match the location of your .gmt file. REACTOME file can be downloaded here

[4]:

model = vega.VEGA(adata,
                  gmt_paths='reactomes.gmt',
                  add_nodes=1,
                  positive_decoder=True)
print(model)

Using masked decoder
Constraining decoder to positive weights
VEGA model with the following parameters:
n_GMVs: 675, dropout_rate:0.1, z_dropout:0.3, beta:5e-05, positive_decoder:True
Model is trained: False

[5]:

model.train_vega(n_epochs=50)

[Epoch 1] | loss: 331.439 |
[Epoch 2] | loss: 236.868 |
[Epoch 3] | loss: 210.888 |
[Epoch 4] | loss: 203.683 |
[Epoch 5] | loss: 200.081 |
[Epoch 6] | loss: 197.597 |
[Epoch 7] | loss: 195.700 |
[Epoch 8] | loss: 194.076 |
[Epoch 9] | loss: 192.903 |
[Epoch 10] | loss: 191.478 |
[Epoch 11] | loss: 190.245 |
[Epoch 12] | loss: 189.227 |
[Epoch 13] | loss: 188.143 |
[Epoch 14] | loss: 187.366 |
[Epoch 15] | loss: 186.178 |
[Epoch 16] | loss: 185.538 |
[Epoch 17] | loss: 184.696 |
[Epoch 18] | loss: 183.901 |
[Epoch 19] | loss: 183.318 |
[Epoch 20] | loss: 182.713 |
[Epoch 21] | loss: 182.158 |
[Epoch 22] | loss: 181.271 |
[Epoch 23] | loss: 180.924 |
[Epoch 24] | loss: 180.391 |
[Epoch 25] | loss: 179.938 |
[Epoch 26] | loss: 179.593 |
[Epoch 27] | loss: 179.010 |
[Epoch 28] | loss: 178.464 |
[Epoch 29] | loss: 178.168 |
[Epoch 30] | loss: 177.891 |
[Epoch 31] | loss: 177.364 |
[Epoch 32] | loss: 177.091 |
[Epoch 33] | loss: 176.875 |
[Epoch 34] | loss: 176.007 |
[Epoch 35] | loss: 175.734 |
[Epoch 36] | loss: 175.610 |
[Epoch 37] | loss: 175.454 |
[Epoch 38] | loss: 174.753 |
[Epoch 39] | loss: 174.666 |
[Epoch 40] | loss: 174.361 |
[Epoch 41] | loss: 174.108 |
[Epoch 42] | loss: 173.987 |
[Epoch 43] | loss: 173.325 |
[Epoch 44] | loss: 173.272 |
[Epoch 45] | loss: 173.043 |
[Epoch 46] | loss: 172.780 |
[Epoch 47] | loss: 172.768 |
[Epoch 48] | loss: 172.313 |
[Epoch 49] | loss: 172.292 |
[Epoch 50] | loss: 171.890 |

Saving and reloading

To save and reload a trained VEGA model, simply use the following syntax. To verify that a model is trained, print the model variable.

[6]:

#model.save('./vega_pbmc/', save_adata=True, save_history=True)

[6]:

#model = vega.VEGA.load('.vega_pbmc/')
print(model)

VEGA model with the following parameters:
n_GMVs: 675, dropout_rate:0.1, z_dropout:0.3, beta:5e-05, positive_decoder:True
Model is trained: True

Analyzing model output

One primary feature of VEGA is to project the transcriptomes into an interpretable latent space representing the gene sets used to initalize the model.

[7]:

model.adata.obsm['X_vega'] = model.to_latent()

UMAP with VEGA latent space

We can now use the interoperability with Scanpy to get a UMAP representation of the data based on VEGA’s latent space.

[8]:

sc.pp.neighbors(model.adata, use_rep='X_vega', n_neighbors=15)
sc.tl.umap(model.adata, min_dist=0.5, random_state=42)

[9]:

sc.pl.umap(model.adata, color=["condition", "cell_type"], frameon=False)

../_images/tutorials_vega_tutorial_16_0.png

Analyzing GMV activities

We can now project cells into individual pathway components for biological processes of interest. Here, we are interested to see if we can use the interferon pathway activity to separate the 2 conditions. We expect interferon pathways to be activated in the ‘stimulated’ condition. Also, we use the BCR signaling pathway to segregate B-cells from the rest of the dataset.

[11]:

vega.plotting.gmv_plot(model.adata,
                       x='REACTOME_INTERFERON_ALPHA_BETA_SIGNALING',
                       y='REACTOME_SIGNALING_BY_THE_B_CELL_RECEPTOR_BCR',
                       color='condition')
vega.plotting.gmv_plot(model.adata,
                       x='REACTOME_INTERFERON_ALPHA_BETA_SIGNALING',
                       y='REACTOME_SIGNALING_BY_THE_B_CELL_RECEPTOR_BCR',
                       color='cell_type')

../_images/tutorials_vega_tutorial_18_0.png

../_images/tutorials_vega_tutorial_18_1.png

Differential activity testing between groups

Similarly to differential expression analysis in scRNA-Seq, we can use VEGA’s latent space for testing the difference in pathway activities between 2 groups of cells. This provides an interesting alternative to enrichment tests such as performed by GSEA.

[12]:

da_df = model.differential_activity(groupby='condition', fdr_target=0.1)
da_df.head()

Using VEGA's adata attribute for differential analysis
No reference group: running 1-vs-rest analysis for .obs[condition]

[12]:

	p_da	p_not_da	bayes_factor	is_da_alpha_0.66	differential_metric	delta	is_da_fdr_0.1	comparison	group1	group2
REACTOME_INTERFERON_SIGNALING	0.9852	0.0148	4.198217	True	7.322754	2.0	True	stimulated vs. rest	stimulated	rest
REACTOME_INTERFERON_ALPHA_BETA_SIGNALING	0.9838	0.0162	4.106411	True	7.976952	2.0	True	stimulated vs. rest	stimulated	rest
REACTOME_CYTOKINE_SIGNALING_IN_IMMUNE_SYSTEM	0.9824	0.0176	4.022100	True	6.879438	2.0	True	stimulated vs. rest	stimulated	rest
REACTOME_ANTIVIRAL_MECHANISM_BY_IFN_STIMULATED_GENES	0.9554	0.0446	3.064396	True	5.902566	2.0	True	stimulated vs. rest	stimulated	rest
REACTOME_RIG_I_MDA5_MEDIATED_INDUCTION_OF_IFN_ALPHA_BETA_PATHWAYS	0.9018	0.0982	2.217387	True	5.115499	2.0	True	stimulated vs. rest	stimulated	rest

As expected, most of the top hits in the stimulated popultation are related to the activation of interferon pathways and immune signaling. We can display the results in an analog of a volcano plot using bayes factors to encode significance.

[13]:

vega.plotting.volcano(model.adata, group1='stimulated', group2='rest', title='Stimulated vs. Rest')

../_images/tutorials_vega_tutorial_22_0.png

Additionally, we can visualize the weights attributed to each gene in a given pathway to understand the main drivers of variability in the dataset.

[14]:

vega.plotting.rank_gene_weights(model, gmv_list=['REACTOME_INTERFERON_ALPHA_BETA_SIGNALING'], n_genes=20, color_in_set=False)

../_images/tutorials_vega_tutorial_24_0.png

Exporting results

We can use pandas to export the differential activity results to look at it later.

[15]:

#da_df.to_csv('./da_results.tsv', sep='\t')