FacileDataSet is a reference implementation of a multi-assay datastore that implements the FacileData API. It uses an SQLite database to store feature- and sample-level metadata, and an HDF5 file to store any number of assays over its samples. Both of these technologies allow quick and efficient access to arbitrary subsets of data stored on disk without having to load all of the data into RAM, which makes this datastore well suited to housing large collections such as the entirety of the data from The Cancer Genome Atlas.
This document first shows you how to create a FacileDataSet, and later gets into the design and implementation details of the FacileDataSet. These details are helpful for understanding its performance characteristics (which are awesome), and how you might go about tweaking/improving them.
Warning: Once you start to use a FacileDataSet, you will find it hard to go back to “the normal way” of doing this!
The easiest way to create a FacileDataSet is to pass a list of well-groomed SummarizedExperiments (or ExpressionSets) to the as.FacileDataSet function.
A FacileDataSet is stored on disk as a well-structured directory. You can look at the FacileDataSet Architecture section for the full details, but the important information to know is that:
- The assays of each SummarizedExperiment are stored in an HDF5 file.
- Feature- and sample-level metadata (fData and pData, respectively) are stored in an SQLite database.
- The colData data.frames of each SummarizedExperiment are stored in an SQLite table in entity-attribute-value (EAV) format, in order to support covariates that are common to all SummarizedExperiments as well as covariates that are unique to only some of them.
- A meta.yaml file stores important information about the dataset itself. Perhaps most important is the sample_covariates section, which dictates how each covariate stored in the EAV table is decoded back into its native R representation. More details are provided in the Entity-Attribute-Value Table section.
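To make the EAV idea concrete, here is a minimal base-R sketch of pivoting two wide sample tables that share only some columns into one long table. The to_eav helper, the toy data, and the column names (dataset, sample_id, variable, value) are assumptions made for illustration and are not necessarily the names used in the actual SQLite table.

```r
# Two hypothetical sample-annotation tables that share only the "sex" column
blca <- data.frame(sample_id = c("s1", "s2"), sex = c("f", "m"),
                   stage = c("II", "III"), stringsAsFactors = FALSE)
brca <- data.frame(sample_id = "s3", sex = "f",
                   subtype = "LumA", stringsAsFactors = FALSE)

# Pivot one wide table into long rows: one row per (sample, covariate) pair.
# Every value is stored as character, so a decoding step is needed later.
to_eav <- function(df, dataset, id_col = "sample_id") {
  vars <- setdiff(colnames(df), id_col)
  rows <- lapply(vars, function(v) {
    data.frame(dataset = dataset, sample_id = df[[id_col]], variable = v,
               value = as.character(df[[v]]), stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# Covariates that exist in only one table simply contribute fewer rows
eav <- rbind(to_eav(blca, "BLCA"), to_eav(brca, "BRCA"))
print(eav)
```

Because each covariate is a row rather than a column, adding a dataset with a new covariate does not require changing the table layout.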
A FacileDataSet generally contains two or more facets, or related data collections. For example, facets may be the various tissue datasets from the TCGA, or various clinical trials for the same drug. A FacileDataSet should be prepared as a named list of SummarizedExperiments or a named list of ExpressionSets. It is expected that these objects will share some, but not necessarily all, of their sample annotation columns (colData or pData). Columns with the same names should also have the same encoding, factor levels, etc.
Annotations on the SummarizedExperiments themselves and on the colData columns are also used to fill out the meta.yaml file described above.
pData data.frames should have an attribute named ‘label’, which will be a named character vector with a description for each column. In the case of a SummarizedExperiment, the colData should have a named list in the metadata slot with a character description of each column.
ExpressionSets should have a short textual description of the facet/dataset in the annotation slot. Similarly, SummarizedExperiments should have a list in the metadata slot with a description for the facet/dataset.
The fData for the ExpressionSet or the mcols for the SummarizedExperiment, respectively, should have source columns. See the manpage for as.FacileDataSet for more details.
Given a list of SummarizedExperiments, we collect the individual colData data.frames. When colnames() match across these data.frames, we assume that the covariates mean the same thing and are encoded using the same scheme (i.e. factor levels match). Columns whose colnames() differ between data.frames, even by a single letter or only by case, are treated as being different.
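Since matching is by exact column name, it can help to check a list of sample tables for near misses before building the dataset. The following is a small base-R sketch of such a check; the toy tables and variable names are hypothetical and this is not part of the FacileData API.

```r
# Hypothetical sample tables; note "stage" versus "Stage"
pdata_list <- list(
  BLCA = data.frame(sex = c("f", "m"), stage = c("II", "III")),
  BRCA = data.frame(sex = "f", Stage = "I")
)

all_names <- lapply(pdata_list, colnames)
shared    <- Reduce(intersect, all_names)  # present in every table
all_covs  <- Reduce(union, all_names)      # present in at least one table

# Names that differ only by case would be treated as separate covariates
lowered     <- tolower(all_covs)
near_misses <- all_covs[lowered %in% lowered[duplicated(lowered)]]

print(shared)
print(near_misses)
```

Any name reported in near_misses is probably worth harmonizing by hand before the tables are combined.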
A FacileDataSet stores the following data:
- A data.sqlite SQLite database that stores feature- and sample-level metadata.
- A data.h5 HDF5 file that stores a multitude of dense assay matrices that are generated from the assays performed on the samples in the FacileDataSet.
- A meta.yaml file that contains information about the FacileDataSet. To better understand the structure and contents of this file, you can refer to the following:
  - The testdata/expected-meta.yaml file, which is an exemplar file for the testdata/TestFacileTcgaDataSet, which consists of data extracted from two datasets (BLCA and BRCA) from the TCGA.
  - The help for the eav_metadata_create function, which describes in greater detail how we track a dataset’s sample-level covariates (aka “pData” in the Bioconductor world).
  In the meantime, a short description of the entries found in the meta.yaml file is provided here:
name: the name of the dataset (i.e.
"Mus musculus", etc.
default_assay: the name of the assay to use by default if none is specified in calls to [fetch_assay_data()], [with_assay_data()], etc. (kind of like how "exprs" is the default assay used when working with a [Biobase::ExpressionSet])
datasets: a section that enumerates the datasets included internally. The datasets are further enumerated.
sample_covariates: a section that enumerates the covariates that are tracked over the samples inside the FacileDataSet (i.e. a mapping of the pData for the samples). Reference ?create_eav_metadata for more information.
- A custom-annotation directory, which stores custom sample_covariate (aka “pData”) information that analysts can identify and describe during the course of an analysis, or even add from external sources. Although this directory is required in the directory structure of a valid FacileDataSet, the FacileDataSet() constructor can be called with a custom anno.dir parameter so that custom annotations are stored elsewhere.
Sample covariates (aka pData) are encoded in an [entity-attribute-value (EAV) table][EAV]. Metadata about these covariates are stored in a meta.yaml file in the FacileDataSet directory, which enables the FacileDataSet to cast the value stored in the EAV table to its native R type. This function generates the list-of-list structure to represent the sample_covariates section of the meta.yaml file.
For simple pData covariates, each column is treated independently from the rest. Some types of covariates require multiple columns for proper encoding, such as survival information, which requires a pair of values that indicate the “time to event” and the status of the event (death or censored). In these cases, the caller needs to provide an entry in the covariate_def list that describes which pData columns (varname) go into the single facile covariate value.
Please refer to the Encoding Survival Covariates section for a more detailed description of how to encode survival information into the EAV table using the covariate_def parameter. Further examples of how to encode other complex attributes will be added as they are required, but you can reference the Encoding Arbitrarily Complex Covariates section for some more information.
UPDATE: Survival covariates can now be encoded simply as a survival::Surv object and provided as a column in the pData data.frame. The following describes the original, and still supported, method.
Survival data in R is typically encoded by two vectors: one that indicates the “time to event” (tte), and a second that indicates whether the denoted tte is an “event” (1) or “censored” (0).
Normally these vectors appear as two columns in an experiment’s pData, and therefore need to be encoded into the FacileDataSet’s EAV table. To do so, the pair of vectors is turned into a single signed numeric value. The absolute value of the numeric indicates the “time to event” and the sign of the value indicates its censoring status.
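The signed-value idea can be sketched in a few lines of base R. The helper names below are hypothetical stand-ins (they are not the package's own encode/decode functions), and the sign convention used here, negative for censored and positive for an event, is an assumption, since the text only says that the sign carries the censoring status.

```r
# Hypothetical helper: fold (time, event) into one signed number.
# ASSUMPTION: negative = censored, positive = event.
encode_right_censored <- function(time, event) {
  stopifnot(length(time) == length(event),
            all(time > 0),            # a time of 0 could not carry a sign
            all(event %in% c(0, 1)))
  ifelse(event == 1, time, -time)
}

# Hypothetical helper: recover both vectors from the signed value
decode_right_censored <- function(x) {
  data.frame(time = abs(x), event = as.integer(x > 0))
}

tte_OS   <- c(120, 365, 42)
event_OS <- c(1, 0, 1)

encoded <- encode_right_censored(tte_OS, event_OS)
decoded <- decode_right_censored(encoded)
print(encoded)
print(decoded)
```

One limitation of any sign-based scheme is visible in the input check: a time of exactly zero has no sign, so it would need separate handling.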
Let’s assume we have tte_OS and event_OS columns that are used to encode a patient’s overall survival (time and censor status). To store this as an “OS” covariate in the EAV table, a covariate_def list-of-list definition that captures this encoding would look like this:
    covariate_def <- list(
      OS = list(
        class = "right_censored",
        arguments = c(time = "tte_OS", event = "event_OS"),
        label = "Overall Survival",
        type = "clinical",
        description = "Overall survival in days"))
Note how the name of the list-entry in covariate_def defines the name of the covariate in the EAV table. The class entry for the OS definition indicates the type of variable this is. The varname entry lists the columns in the pData that are combined to make this value. The names(varnames) correspond to the parameters in the [eav_encode_right_censored()] function. The analogous meta.yaml entry in the sample_covariates section for the covariate_def entry looks like so:
    sample_covariates:
      OS:
        class: right_censored
        label: "Overall Survival"
        type: "clinical"
        description: "Overall survival in days"
        colnames: ["tte_OS", "event_OS"]
        argnames: ["time", "event"]
To encode a new type of complex covariate from a wide pData data.frame, we need to:
- Choose a class name (like "right_censored") for use within a covariate_def entry.
- Define an eav_encode_<class>(arg1, arg2, ...) function which takes the R data vectors (arg1, arg2) and converts them into a single value for the EAV table.
- Define an eav_decode_<class>(x, attrname, def, ...) function which takes the single value in the EAV table and casts it back into the R-native data vector(s), where:
  - x is the vector of (character) values from the EAV table
  - attrname is the name of the covariate in the EAV table
  - def is the definition-list for this covariate
  - ... allows each decode function to be further customized.

[EAV]: https://en.wikipedia.org/wiki/Entity-attribute-value_model
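As an illustration of the naming pattern described above, here is a sketch of an encode/decode pair for an invented "range" class that packs two numeric columns into a single "low:high" string. The class, the helper decode_covariate, the column names, and the way the decoder is looked up from def$class are all assumptions made for this example. They follow the eav_encode_<class> / eav_decode_<class> signatures given in the text but are not taken from the package itself.

```r
# Hypothetical "range" class: two numeric columns -> one "low:high" string
eav_encode_range <- function(low, high) {
  paste(low, high, sep = ":")
}

# Decoder with the (x, attrname, def, ...) signature described in the text
eav_decode_range <- function(x, attrname, def, ...) {
  parts <- strsplit(x, ":", fixed = TRUE)
  out <- data.frame(
    low  = as.numeric(vapply(parts, `[`, character(1), 1)),
    high = as.numeric(vapply(parts, `[`, character(1), 2))
  )
  colnames(out) <- def$colnames  # restore the original column names
  out
}

# ASSUMPTION: the decoder is found by pasting the class onto a fixed prefix
decode_covariate <- function(x, attrname, def, ...) {
  decoder <- get(paste0("eav_decode_", def$class), mode = "function")
  decoder(x, attrname, def, ...)
}

def    <- list(class = "range", colnames = c("dose_min", "dose_max"))
pdata  <- data.frame(dose_min = c(1, 2.5), dose_max = c(10, 20))
stored <- eav_encode_range(pdata$dose_min, pdata$dose_max)
restored <- decode_covariate(stored, "dose", def)
print(stored)
print(restored)
```

Round-tripping through a string like this keeps about 15 significant digits, which is usually fine for clinical covariates but worth keeping in mind for high-precision values.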
The HDF5 file has one directory (an HDF5 group) per assay. Each of these directories holds one matrix per dataset for the given assay.
For instance, the FacileTCGADataSet HDF5 file has this structure:
    data.h5
    ├── rnaseq
    │   ├── ACC
    │   ├── BLCA
    │   ├── BRCA
    │   ├── CESC
    │   ├── ...
    ├── cnv_score
    │   ├── ACC
    │   ├── BLCA
    │   ├── BRCA
    │   ├── CESC
    │   ├── ...
    ├── mirnaseq
    │   ├── ACC
    │   ├── BLCA
    │   ├── BRCA
    │   ├── CESC
    │   ├── ...
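The following sketch mimics this layout on a small scale with the Bioconductor rhdf5 package, and shows how a block of one matrix can be read without loading the rest. It assumes that rhdf5 is installed, and that rows are features and columns are samples. The file, the toy matrix, and the variable names are invented for the example, and this is not necessarily how FacileDataSet itself reads its data.

```r
# Runs only if the rhdf5 package (Bioconductor) is available
if (requireNamespace("rhdf5", quietly = TRUE)) {
  h5 <- tempfile(fileext = ".h5")
  rhdf5::h5createFile(h5)

  # One group per assay, one matrix per dataset inside it
  rhdf5::h5createGroup(h5, "rnaseq")
  blca <- matrix(as.numeric(1:12), nrow = 4)  # toy data: 4 features x 3 samples
  rhdf5::h5write(blca, h5, "rnaseq/BLCA")

  # List the groups and datasets stored in the file
  contents <- rhdf5::h5ls(h5)
  print(contents)

  # Read only features 2 and 4 for samples 1 and 3
  block <- rhdf5::h5read(h5, "rnaseq/BLCA", index = list(c(2, 4), c(1, 3)))
  print(block)

  rhdf5::h5closeAll()
}
```

The index argument is what keeps memory use proportional to the requested block rather than to the full assay matrix.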