R/as.FacileDataSet.R
This function assumes you are extracting only one assay from the assay
container and creating a FacileDataSet from it. If the Bioconductor container
you are using can hold more than one assay, specify which one to extract
using the source_assay parameter; otherwise the first assay in the
container will be taken. The source_assay you use to initialize the
FacileDataSet here will become its default_assay when it is used later.
The feature_info table is populated by the particular fData (mcols,
y$genes, etc.) from the assay container. The pData from the assay
container will be ingested as well, so do your best to ensure that this
represents the most complete version of the pData for the FacileDataSet
you will be creating.
as.FacileDataSet(
  x,
  path,
  assay_name,
  assay_type,
  source_assay,
  assay_description = paste("Description for ", assay_name),
  dataset_name = "DEFAULT_NAME",
  dataset_meta = NULL,
  organism = "unspecified",
  page_size = 2^12,
  cache_size = 2e+05,
  chunk_rows = 5000,
  chunk_cols = "ncol",
  chunk_compression = 5,
  covariate_def = NULL,
  ...
)

# S3 method for list
as.FacileDataSet(
  x,
  path,
  assay_name,
  assay_type,
  source_assay = NULL,
  assay_description = paste("Description for ", assay_name),
  dataset_name = "DEFAULT_NAME",
  dataset_meta = list(),
  organism = "unspecified",
  prune_dataset_meta = TRUE,
  page_size = 2^12,
  cache_size = 2e+05,
  chunk_rows = 5000,
  chunk_cols = "ncol",
  chunk_compression = 5,
  covariate_def = NULL,
  ...
)
| Argument | Description |
|---|---|
| x | The Bioconductor assay container to extract data from. |
| path | The directory to create the FacileDataSet in. Will create a default directory in the current working directory if not specified. This directory should not yet exist. |
| assay_name | The name to use when storing the assay matrix from |
| assay_type | What type of assay is this? rnaseq, microarray, nanostring, isoseq (isoform expression), etc. |
| source_assay | The name of the assay element in |
| dataset_name | the |
| dataset_meta | A named list (by names(x)) with metadata about the datasets that appear in the list of datasets. |
| organism | This is used to fetch the appropriate gene sets when this dataset is used with the FacileExplorer. Put the species name here, i.e. |
| page_size | Parameter to tweak SQLite. |
| cache_size | Parameter to tweak SQLite. |
| chunk_rows | Parameter to tweak HDF5. |
| chunk_cols | Parameter to tweak HDF5. |
| chunk_compression | Parameter to tweak HDF5. |
| ... | More arguments. |
A FacileDataSet can be created from a number of different Bioconductor
containers, such as a Biobase::ExpressionSet,
SummarizedExperiment::SummarizedExperiment, or an edgeR::DGEList. A
FacileDataSet can also span multiple Bioc containers. For example, you may
have one ExpressionSet per indication in the TCGA. You can make a
FacileDataSet that encompasses the data from all of these indications by
providing a list of ExpressionSets. The list should have its names()
set to the TCGA indication ("BLCA", "BRCA", etc.) that each element's data
came from.
The pData data.frame object will be picked off from all of the containers
provided in the list of datasets you are using to create the FacileDataSet.
dataset and sample_id columns will be forcibly added (or modified) as
columns to all of the individual pData data.frames.
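The effect of this step can be sketched in base R. This is only an illustration, not the package's internal code: the data.frames below are made-up stand-ins for the pData of two containers, and taking sample_id from the row names is an assumption.

```r
# Made-up stand-ins for the pData data.frames of two containers,
# keyed by dataset name as in the named list passed to as.FacileDataSet().
pdata_list <- list(
  BLCA = data.frame(sex = c("f", "m"), row.names = c("s1", "s2")),
  BRCA = data.frame(sex = c("f", "f"), stage = c("I", "II"),
                    row.names = c("s3", "s4"))
)

# Add (or overwrite) the `dataset` and `sample_id` columns in every data.frame.
# Assumption: sample_id is derived from the row names of each pData.
pdata_list <- Map(function(pd, ds) {
  pd$dataset <- ds
  pd$sample_id <- rownames(pd)
  pd
}, pdata_list, names(pdata_list))

print(pdata_list$BRCA)
```

Any dataset or sample_id column already present in a pData would be overwritten by this sketch, which matches the "forcibly added (or modified)" behavior described above.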
In order to insert the entirety of the pData elements into the internal
sample_covariate table, we rely on the dplyr::bind_rows function to
create an uber data.frame which will be converted into an
entity-attribute-value table. Note that when row-binding, columns are matched
by name, and any missing columns will be filled with NA.
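To make the row-binding and reshaping concrete, here is a minimal base R sketch. It imitates the name-matching and NA-filling behavior of dplyr::bind_rows with a hypothetical helper (bind_fill) so that it runs without extra packages. The variable and value column names, and the choice to drop NA rows, are assumptions and not necessarily the schema of the internal sample_covariate table.

```r
# Made-up pData for two datasets; `stage` exists only in the second one.
blca <- data.frame(dataset = "BLCA", sample_id = c("s1", "s2"),
                   sex = c("f", "m"), stringsAsFactors = FALSE)
brca <- data.frame(dataset = "BRCA", sample_id = c("s3", "s4"),
                   sex = c("f", "f"), stage = c("I", "II"),
                   stringsAsFactors = FALSE)

# Hypothetical helper: row-bind by column name, filling absent columns with NA.
bind_fill <- function(...) {
  dfs <- list(...)
  all_cols <- unique(unlist(lapply(dfs, names)))
  filled <- lapply(dfs, function(df) {
    for (col in setdiff(all_cols, names(df))) df[[col]] <- NA
    df[all_cols]
  })
  do.call(rbind, filled)
}

wide <- bind_fill(blca, brca)

# Reshape to an entity-attribute-value layout: one row per
# (dataset, sample_id, variable) with the value stored as text.
covariates <- setdiff(names(wide), c("dataset", "sample_id"))
eav <- do.call(rbind, lapply(covariates, function(v) {
  data.frame(dataset = wide$dataset, sample_id = wide$sample_id,
             variable = v, value = as.character(wide[[v]]),
             stringsAsFactors = FALSE)
}))
eav <- eav[!is.na(eav$value), ]  # assumption: NA values are not stored
```

Here the BLCA samples get NA for stage in the wide table, which is why harmonizing column names across the pData data.frames beforehand matters: two differently named columns for the same covariate would end up as two separate variables.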
ExpressionSet pData data.frames should have an attribute called 'label', which
will be a named character vector with a description for each column. In the case of
a SummarizedExperiment, the colData should have a named list in the metadata
slot with a character description of each column.
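For the ExpressionSet case, a 'label' attribute of this shape could be attached as follows. This is a sketch on a plain data.frame standing in for pData; the column names and descriptions are made up for illustration.

```r
# A made-up pData-like data.frame.
pd <- data.frame(sex = c("f", "m"), age = c(61, 58),
                 row.names = c("s1", "s2"))

# Named character vector: one description per column, named by column.
attr(pd, "label") <- c(
  sex = "Sex of the patient",
  age = "Age at diagnosis (years)"
)

# Sanity check: every column should have a description.
unlabeled <- setdiff(names(pd), names(attr(pd, "label")))
stopifnot(length(unlabeled) == 0)
```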
ExpressionSets should have a short textual description of the facet/dataset in
the annotation slot. Similarly, SummarizedExperiments should have a list
in the metadata slot with url and description for the facet/dataset.
Please ensure that the covariates across the pData data.frames have already
been harmonized!
The feature information (aka "fData") is stored in an internal
feature_info SQLite table within the FacileDataSet. The information to
populate this table will be retrieved from the corresponding fData-like
data.frame from the first given bioc-container in the list.
This data.frame must define the following columns:
"feature_type": string, one of: "entrez", "ensgid", "enstid",
"genomic", "custom".
"feature_id": string
"name": string
"meta": string
"effective_length": integer
"source": string
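A feature data.frame with these columns can be built and checked before calling as.FacileDataSet. The sketch below uses base R only; check_feature_info is a hypothetical helper and not part of the package, and the effective_length, meta, and source values are made up.

```r
required_cols <- c("feature_type", "feature_id", "name", "meta",
                   "effective_length", "source")
valid_types <- c("entrez", "ensgid", "enstid", "genomic", "custom")

# Example feature table; lengths, meta, and source are illustrative values.
features <- data.frame(
  feature_type = "ensgid",
  feature_id = c("ENSG00000141510", "ENSG00000012048"),
  name = c("TP53", "BRCA1"),
  meta = c("protein_coding", "protein_coding"),
  effective_length = c(2500L, 7000L),
  source = "made-up example",
  stringsAsFactors = FALSE
)

# Hypothetical helper: verify required columns, their types, and feature_type.
check_feature_info <- function(fi) {
  missing_cols <- setdiff(required_cols, names(fi))
  if (length(missing_cols) > 0) {
    stop("missing columns: ", paste(missing_cols, collapse = ", "))
  }
  stopifnot(
    all(fi$feature_type %in% valid_types),
    is.character(fi$feature_id),
    is.character(fi$name),
    is.character(fi$meta),
    is.integer(fi$effective_length),
    is.character(fi$source)
  )
  invisible(TRUE)
}

check_feature_info(features)
```

Note the integer suffix (L) on effective_length: plain numeric literals in R are doubles, which would fail the integer check in this sketch.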