Facile Principal Components Analysis

Under Construction

Performs PCA
Easily explore PCs
Extract the highly loaded genes (and genesets) from each PC

Easy to run a PCA over the rnaseq data, or whatever other assy you got there (CNV, proteomics, take your pick)

library(FacileData)
library(FacileAnalysis)
library(dplyr)

pca.crc <- exampleFacileDataSet() %>% 
  filter_samples(indication == "CRC", sample_type == "tumor") %>% 
  fpca(assay_name = "rnaseq")

Vizualize the results:

viz(pca.crc, color_aes = "subtype_crc_cms", shape_aes = "sex")

Need to fix up report()

Not sure how to compare() two PCA results …

Exploring Feature Loadings on Principal Components

The principal components are a new basis of orthogonal axes that are linear combinations of the original axes, which in this examples are genes. By exploring which genes (and gene sets) are most highly loaded along each principal component, analysts can begin to better understand what biological processes might be driving the variability in their data.

The ranks(fpca()) result extracts the feature (gene) loadings onto the principal components. signature(fpca()) extracts the top N ntop features from each principal component.

The code below returns the top 10 features over the first three PCs. The score column corresponds to the loading:

pca.sig <- signature(pca.crc, dims = 1:3, ntop = 10)
pca.sig %>% 
  tidy() %>% 
  select(feature_id, symbol, dimension, score)

#> # A tibble: 60 × 4
#>    feature_id symbol  dimension  score
#>    <chr>      <chr>   <chr>      <dbl>
#>  1 340146     SLC35D3 PC1       0.0760
#>  2 58529      MYOZ1   PC1       0.0729
#>  3 7123       CLEC3B  PC1       0.0714
#>  4 259282     BOD1L   PC1       0.0684
#>  5 29988      SLC2A8  PC1       0.0683
#>  6 124912     SPACA3  PC1       0.0683
#>  7 4931       NVL     PC1       0.0680
#>  8 10564      ARFGEF2 PC1       0.0665
#>  9 3158       HMGCS2  PC1       0.0659
#> 10 2745       GLRX    PC1       0.0650
#> # … with 50 more rows

When we map the expression values of the genes to the PCA viz, we can see that the expression of the genes with high abs(score) values should track with its position on its principal component.

# We need to hack the expression values for these genes into the pca result for
# now, but we are working to upgrade the data retrieval abilities around
# FacileAnalysisResult objects so that this can be made effortless.
pcgenes <- c(IL8 = "3576", ARFGEF2 = "10564", BRIP1 = "83990")
pca.crc$result <- pca.crc$result %>% 
  with_assay_data(pcgenes)

# The following issue tracks our intent to enable easier query/retrieval of
# data from different places to make interactive exploration more facile:
# 
# https://github.com/facileverse/FacileData/issues/8
# 
# Instead of hacking the gene expression data back into the pca.crc$result
# tibble, we should be able to do something like this:
# 
#   viz(pca.crc, color_aes = "feature:IL8|name")

For example, we can see that IL8 is highly loaded on PC1:

viz(pca.crc, color_aes = "IL8", title = "IL8")

ARFGEF2 is also highly loaded on PC1, but witha flipped sign:

viz(pca.crc, color_aes = "ARFGEF2", title = "ARFGEF2")

and BRIP1 is a highly loaded gene on PC2

viz(pca.crc, color_aes = "BRIP1", title = "BRIP1")

Steve Lianoglou

5/17/2019

Exploring Feature Loadings on Principal Components