Runs a principal components analysis, the facile way. — compare.FacilePcaAnalysisResult • FacileAnalysis

Performs a principal components analysis over a specified assay from the (subset of) samples in a FacileDataStore.

# S3 method for FacilePcaAnalysisResult
compare(x, y, run_all = TRUE, rerun = TRUE, ...)

fpca(
  x,
  assay_name = NULL,
  dims = 5,
  features = NULL,
  filter = "variance",
  ntop = 1000,
  row_covariates = NULL,
  col_covariates = NULL,
  batch = NULL,
  main = NULL,
  ...
)

# S3 method for facile_frame
fpca(
  x,
  assay_name = NULL,
  dims = min(5, nrow(collect(x, n = Inf)) - 1L),
  features = NULL,
  filter = "variance",
  ntop = 1000,
  row_covariates = NULL,
  col_covariates = NULL,
  batch = NULL,
  main = NULL,
  custom_key = Sys.getenv("USER"),
  ...
)

# S3 method for matrix
fpca(
  x,
  dims = min(5, ncol(x) - 1L),
  features = NULL,
  filter = "default",
  ntop = 1000,
  row_covariates = NULL,
  col_covariates = NULL,
  batch = NULL,
  main = NULL,
  use_irlba = dims < 7,
  center = TRUE,
  scale. = FALSE,
  ...
)

Arguments

x: a facile data container (FacileDataSet), or a facile_frame (refer to the FacileDataStore (facile_frame) section).
rerun: when rerun = TRUE (default), fpca(x) and fpca(y) are rerun over the union of the features in x and y.
assay_name: the name of the assay to extract data from to perform the PCA. If not specified, default assays are taken for each type of assay container (i.e. default_assay(facile container), "counts" for a DGEList, assayNames(SummarizedExperiment)[1L], etc.)
dims: the number of PCs to calculate (minimum is 3).
features: A feature descriptor of the features to use for the analysis. If NULL (default), then the specified filter strategy is used.
filter: A strategy used to identify which features to use for the dimensionality reduction. The current (and only) choice is "default", which takes the top ntop features, sorted by decreasing variance.
ntop: the number of features (genes) to include in the PCA. Genes are ranked by decreasing variance across the samples in x.
row_covariates, col_covariates: data.frames that provide meta information for the features (rows) and samples (columns). The default is to get these values from "the obvious places" given x ($genes and $samples for a DGEList, or the sample and feature-level covariate database tables from a FacileDataSet, for example).
batch, main: specify the covariates to use for batch effect removal. Refer to the FacileData::remove_batch_effect() help for more information.

Value

An fpca result (a FacilePcaAnalysisResult).

Details

The FacilePcaAnalysisResult produced here can be used in "the usual" ways, i.e. it can be visualized with viz(). shine() is about one-quarter implemented, and report() has not been worked on yet.

Importantly / interestingly, you can shoot this result into ffsea() to perform gene set enrichment analysis over a specified dimension to identify functional categories loaded onto different PCs.

Comparing PCA Results

We can compare two PCA results. Currently this just means we compare the loadings of the features along each PC from fpca results x and y.
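To make the idea of comparing loadings concrete, here is a minimal base R sketch. It is not the implementation behind compare(): it assumes two plain feature-by-sample matrices (a and b, both simulated here), picks the top-variance features in each, reruns prcomp() over the union of those features, and correlates the loadings PC by PC. The helper top_var() and every object name are hypothetical.

```r
set.seed(7)
genes <- paste0("gene", 1:60)
# Two simulated feature x sample matrices that share the same feature ids
a <- matrix(rnorm(60 * 10), nrow = 60, dimnames = list(genes, NULL))
b <- matrix(rnorm(60 * 8), nrow = 60, dimnames = list(genes, NULL))

# Hypothetical helper: ids of the `ntop` most variable features (rows)
top_var <- function(x, ntop) {
  v <- apply(x, 1, var)
  names(sort(v, decreasing = TRUE))[seq_len(ntop)]
}

# Rerun both PCAs over the union of the features selected in each data set,
# so that the two loading matrices describe the same features.
feats <- union(top_var(a, 30), top_var(b, 30))
pa <- prcomp(t(a[feats, ]))  # prcomp() wants samples in rows
pb <- prcomp(t(b[feats, ]))

# Correlate loadings between the first three PCs of each result. The sign of
# a PC is arbitrary, so we look at the absolute correlation.
loading_cor <- abs(cor(pa$rotation[, 1:3], pb$rotation[, 1:3]))
dimnames(loading_cor) <- list(paste0("x.PC", 1:3), paste0("y.PC", 1:3))
round(loading_cor, 2)
```

With random input like this the correlations should be small; on real data, a large value in row i and column j would suggest that PC i of x and PC j of y are driven by similar features.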

Batch Correction

Because we assume that PCA is performed on normalized data, we leverage the batch correction facilities provided by the batch and main parameters in the FacileData::fetch_assay_data() pipeline. If your samples have a "sex" covariate defined, for example, you can perform a PCA with sex-corrected expression values like so: fpca(samples, batch = "sex")
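As a rough illustration of what regressing a covariate out before PCA does, the base R sketch below removes a simulated "sex" effect by keeping linear-model residuals. This is an assumption-laden stand-in, not the code path used by fpca(): the actual correction is delegated to FacileData::remove_batch_effect(), which may behave differently (for example when main is also supplied). All data and object names here are made up.

```r
set.seed(1)
n <- 20
sex <- factor(rep(c("f", "m"), each = n / 2))

# Simulated 100 features x 20 samples; the first 20 features carry a sex effect
x <- matrix(rnorm(100 * n), nrow = 100)
x[1:20, sex == "m"] <- x[1:20, sex == "m"] + 3

# Regress sex out of every feature and keep the residuals. The feature means
# are added back so the corrected values stay on the original scale.
design <- model.matrix(~ sex)
fit <- lm.fit(design, t(x))              # lm.fit() wants samples in rows
corrected <- t(fit$residuals) + rowMeans(x)

# How strongly does PC1 track sex before and after the correction?
pc1_before <- prcomp(t(x))$x[, 1]
pc1_after <- prcomp(t(corrected))$x[, 1]
cor_before <- abs(cor(pc1_before, as.numeric(sex)))
cor_after <- abs(cor(pc1_after, as.numeric(sex)))
c(before = cor_before, after = cor_after)
```

Because the residuals are orthogonal to the sex indicator, PC1 of the corrected matrix should no longer correlate with sex, whereas PC1 of the uncorrected matrix is expected to track it closely.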

Features Used for PCA

By default, fpca() ranks all of the features (genes) by variance and keeps the top ntop ones for the PCA. This behavior is determined by the following three parameters:

filter determines the method by which features are selected for analysis. Currently you can only choose "variance" (the default) or "none".
features determines the universe of features that are available for the analysis. When NULL (default), all features for the given assay will be loaded and filtered using the specification of the filter parameter. If a feature descriptor is provided and filter is not specified, then we assume that these are the exact features to run the analysis on, and filter defaults to "none". You may, however, intend for features to define the universe of features to use prior to filtering, perhaps to perform a PCA on only a certain set of genes (protein coding), but then filter those further by variance. In this case, you will need to pass in the feature descriptor for the universe of features you want to consider, then explicitly set filter = "variance".
ntop sets the default "top" number of features to take when filtering by variance.
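The interplay of these three parameters can be sketched in base R. This is only an illustration under simplifying assumptions (a simulated matrix, a hypothetical helper top_variance_features(), and a made-up feature universe), not the selection code used inside fpca():

```r
set.seed(42)
# Simulated log-expression matrix: 200 features (rows) x 12 samples (columns)
x <- matrix(rnorm(200 * 12), nrow = 200,
            dimnames = list(paste0("gene", 1:200), paste0("sample", 1:12)))

# Hypothetical helper: keep the `ntop` rows with the highest variance,
# ordered by decreasing variance.
top_variance_features <- function(x, ntop = 1000) {
  v <- apply(x, 1, var)
  keep <- order(v, decreasing = TRUE)[seq_len(min(ntop, nrow(x)))]
  x[keep, , drop = FALSE]
}

# Analogous to `features`: restrict the universe first (here, an arbitrary
# set of 100 features standing in for "protein coding" genes) ...
universe <- rownames(x)[1:100]
# ... then, analogous to filter = "variance" with ntop = 50, filter within it.
xtop <- top_variance_features(x[universe, ], ntop = 50)

# prcomp() expects samples in rows, so transpose before running the PCA.
pca <- prcomp(t(xtop), center = TRUE, scale. = FALSE)
pct_var <- pca$sdev^2 / sum(pca$sdev^2)
round(pct_var[1:3], 3)
```

Skipping the call to top_variance_features() and running prcomp() on x[universe, ] directly would correspond to treating the supplied features as the exact set to analyze, i.e. filter = "none".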

Development Notes

Follow progress on implementation of shine() and report() below:

Implement report()

Note that there are methods defined for other assay containers, like an edgeR::DGEList, limma::EList, and SummarizedExperiment. If these are called directly, their downstream use within the facile ecosystem isn't yet fully supported. Development of the FacileBioc package will address this.

Random Things to elaborate on

The code here is largely inspired by DESeq2's plotPCA.

You should look at FactoMineR:

http://factominer.free.fr/factomethods/index.html
http://factominer.free.fr/graphs/factoshiny.html

Teaching and Tutorials

This looks like a useful tutorial to use when explaining the utility of PCA analysis: http://alexhwilliams.info/itsneuronalblog/2016/03/27/pca/

High-Dimensional Data Analysis course by Rafa Irizarry and Michael Love https://online-learning.harvard.edu/course/data-analysis-life-sciences-4-high-dimensional-data-analysis?category[]=84&sort_by=date_added&cost[]=free

FacileDataStore (facile_frame)

We enable the user to supply extra sample covariates that are not found in the FacileDataStore associated with these samples x by adding them as extra columns to x.

If manually provided col_covariates have the same name as internal sample covariates, then the manually provided ones will supersede the internals.
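The precedence rule can be pictured with two small data.frames. The sketch below is purely illustrative: the column names, the sample_id key, and the merge strategy are assumptions, and the package's internal bookkeeping is likely different.

```r
# Covariates as they might be stored with the samples (hypothetical values)
internal <- data.frame(
  sample_id = c("s1", "s2", "s3"),
  sex = c("f", "m", "f"),
  stage = c("I", "II", "III")
)

# Manually supplied covariates: `sex` clashes with an internal covariate,
# `treated` is new.
manual <- data.frame(
  sample_id = c("s1", "s2", "s3"),
  sex = c("m", "m", "f"),
  treated = c(TRUE, FALSE, TRUE)
)

# Drop internal columns that are also supplied manually, then join on the
# sample key so that the manual values win.
clashing <- setdiff(intersect(names(internal), names(manual)), "sample_id")
merged <- merge(
  internal[, setdiff(names(internal), clashing), drop = FALSE],
  manual,
  by = "sample_id"
)
merged
```

In the merged table, sex holds the manually supplied values, stage is carried over from the internal covariates, and treated is added as a new column.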

Examples

efds <- FacileData::exampleFacileDataSet()
p1 <- efds %>%
  FacileData::filter_samples(indication == "CRC") %>%
  fpca()
p2 <- efds %>%
  FacileData::filter_samples(indication == "BLCA") %>%
  fpca()
pcmp <- compare(p1, p2)

# A subset of samples ------------------------------------------------------
pca.crc <- efds %>%
  FacileData::filter_samples(indication == "CRC") %>%
  fpca()
if (interactive()) {
  # report(pca.crc, color_aes = "sample_type")
  shine(pca.crc)
  viz(pca.crc, color_aes = "sex")
}

# Regress "sex" out from expression data
pca.crcs <- FacileData::samples(pca.crc) %>%
  fpca(batch = "sex")
if (interactive()) {
  viz(pca.crcs, color_aes = "sex")
}

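# Perform PCA on only the protein coding genes: `features` defines the
# universe, and filter = "variance" then keeps the most variable of those
genes.pc <- FacileData::features(efds) %>% subset(meta == "protein_coding")
pca.crc.pc <- FacileData::samples(pca.crc) %>%
  fpca(features = genes.pc, filter = "variance")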

pca.gdb <- pca.crc %>%
  signature(dims = 1:3) %>%
  result() %>%
  sparrow::GeneSetDb()

# All samples --------------------------------------------------------------
pca.all <- fpca(efds)
if (interactive()) {
  viz(pca.all, color_aes = "indication", shape_aes = "sample_type")
  # report(pca.all, color_aes = "indication", shape_aes = "sample_type")
}