R/as.FacileDataSet.R
as.FacileDataSet.Rd
This function assumes you are extracting only one assay from the assay container and creating a FacileDataSet from it. If the Bioconductor container you are using supports more than one assay, specify which one to extract using the source_assay parameter; otherwise the first assay in the container will be taken. The source_assay you use to initialize the FacileDataSet here will become its default_assay when it is used later.
The feature_info table is populated from the container's fData equivalent (mcols, y$genes, etc.). The pData from the assay container will be ingested as well, so do your best to ensure that it represents the most complete version of the pData for the FacileDataSet you will be creating.
as.FacileDataSet(
  x,
  path,
  assay_name,
  assay_type,
  source_assay,
  assay_description = paste("Description for ", assay_name),
  dataset_name = "DEFAULT_NAME",
  dataset_meta = NULL,
  organism = "unspecified",
  page_size = 2^12,
  cache_size = 2e+05,
  chunk_rows = 5000,
  chunk_cols = "ncol",
  chunk_compression = 5,
  covariate_def = NULL,
  ...
)

# S3 method for list
as.FacileDataSet(
  x,
  path,
  assay_name,
  assay_type,
  source_assay = NULL,
  assay_description = paste("Description for ", assay_name),
  dataset_name = "DEFAULT_NAME",
  dataset_meta = list(),
  organism = "unspecified",
  prune_dataset_meta = TRUE,
  page_size = 2^12,
  cache_size = 2e+05,
  chunk_rows = 5000,
  chunk_cols = "ncol",
  chunk_compression = 5,
  covariate_def = NULL,
  ...
)
Argument | Description |
---|---|
x | The Bioconductor assay container to extract data from |
path | The directory in which to create the FacileDataSet. A default directory is created in the current working directory if not specified. This directory should not yet exist. |
assay_name | The name to use when storing the assay matrix from |
assay_type | What type of assay is this? rnaseq, microarray, nanostring, isoseq (isoform expression), etc. |
source_assay | The name of the assay element in |
dataset_name | the |
dataset_meta | A named list (named by names(x)) with metadata about the datasets that appear in the list of datasets |
organism | This is used to fetch the appropriate genesets when this dataset is used with the facileexplorer. Put the species name here, i.e. |
page_size | Parameter to tweak SQLite |
cache_size | Parameter to tweak SQLite |
chunk_rows | Parameter to tweak HDF5 |
chunk_cols | Parameter to tweak HDF5 |
chunk_compression | Parameter to tweak HDF5 |
... | More args |
A FacileDataSet can be created from a number of different Bioconductor containers, such as a Biobase::ExpressionSet, SummarizedExperiment::SummarizedExperiment, or an edgeR::DGEList. A FacileDataSet can also span multiple Bioc containers; for example, you may have one ExpressionSet per indication in the TCGA. You can make a FacileDataSet that encompasses the data from all of these indications by providing a list of ExpressionSets. The list should have its names() set to each of the TCGA indications ("BLCA", "BRCA", etc.) the data came from.
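As a rough sketch of the list input described above, the list names carry the dataset labels. The placeholder elements, the `check_dataset_names` helper, and the argument values in the commented-out call are assumptions for illustration; only the argument names are taken from the usage section.

```r
# Placeholder objects standing in for one ExpressionSet per TCGA indication.
# In real use each element would be a Bioconductor container.
blca <- list(note = "stand-in for the BLCA ExpressionSet")
brca <- list(note = "stand-in for the BRCA ExpressionSet")

# The list is named by indication.
esets <- list(BLCA = blca, BRCA = brca)

# Hypothetical helper (not part of FacileData): verify the names up front.
check_dataset_names <- function(x) {
  nms <- names(x)
  ok <- !is.null(nms) && all(!is.na(nms)) && all(nzchar(nms)) &&
    !anyDuplicated(nms)
  if (!ok) stop("every element of the list needs a unique, non-empty name")
  invisible(nms)
}

dataset_names <- check_dataset_names(esets)
print(dataset_names)

# With real containers, the call might then look like this (values assumed):
# fds <- as.FacileDataSet(esets, path = "tcga-fds", assay_name = "rnaseq",
#                         assay_type = "rnaseq", organism = "Homo sapiens")
```

The call is left commented out because it needs the package, real containers, and an output directory that does not yet exist.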
The pData data.frame will be picked off from each of the containers provided in the list of datasets you are using to create the FacileDataSet. The dataset and sample_id columns will be forcibly added to (or overwritten in) each of the individual pData data.frames.
In order to insert the entirety of the pData elements into the internal sample_covariate table, we rely on the dplyr::bind_rows function to create an uber data.frame, which will be converted into an entity-attribute-value table. Note that when row-binding, columns are matched by name, and any missing columns will be filled with NA.
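To make the row-binding behavior concrete, here is a minimal base R sketch of the same idea: columns are matched by name, missing ones are padded with NA, and the result is reshaped into entity-attribute-value form. The helper names (`bind_by_name`, `to_eav`), the toy covariates, and the `variable`/`value` column names are assumptions for illustration; the package itself relies on dplyr::bind_rows, and its internal table layout may differ.

```r
# Toy pData tables for two datasets; 'stage' is missing from the second.
pd_blca <- data.frame(sample_id = c("s1", "s2"), sex = c("f", "m"),
                      stage = c("II", "III"), stringsAsFactors = FALSE)
pd_brca <- data.frame(sample_id = c("s1", "s2"), sex = c("f", "f"),
                      stringsAsFactors = FALSE)
pdats <- list(BLCA = pd_blca, BRCA = pd_brca)

# Base R stand-in for dplyr::bind_rows(): match columns by name, pad with NA,
# and record which dataset each row came from.
bind_by_name <- function(dfs) {
  all_cols <- unique(unlist(lapply(dfs, names)))
  padded <- lapply(names(dfs), function(ds) {
    df <- dfs[[ds]]
    for (col in setdiff(all_cols, names(df))) df[[col]] <- NA
    df <- df[, all_cols, drop = FALSE]
    data.frame(dataset = ds, df, stringsAsFactors = FALSE)
  })
  do.call(rbind, padded)
}

# Reshape the wide table into one row per (dataset, sample_id, variable).
to_eav <- function(wide, id_cols = c("dataset", "sample_id")) {
  vars <- setdiff(names(wide), id_cols)
  rows <- lapply(vars, function(v) {
    data.frame(wide[, id_cols, drop = FALSE], variable = v,
               value = as.character(wide[[v]]), stringsAsFactors = FALSE)
  })
  out <- do.call(rbind, rows)
  rownames(out) <- NULL
  out
}

wide <- bind_by_name(pdats)
eav <- to_eav(wide)
print(eav)
```

Here the BRCA rows end up with NA for `stage`, which is why harmonizing covariate names across datasets beforehand matters: a misspelled column would silently become a separate, mostly-NA covariate.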
ExpressionSet pData data.frames should have an attribute called 'label', which will be a named character vector with a description for each column. In the case of a SummarizedExperiment, the colData should have a named list in the metadata slot with a character description of each column.
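A minimal sketch of attaching such column descriptions to a pData-style data.frame, assuming a plain base R attribute is what is expected; the covariate names and descriptions are made up for illustration.

```r
# Toy pData with two covariates (hypothetical column names).
pd <- data.frame(sex = c("f", "m"), stage = c("II", "III"),
                 row.names = c("s1", "s2"), stringsAsFactors = FALSE)

# Named character vector: one description per pData column.
attr(pd, "label") <- c(
  sex = "Sex of the patient",
  stage = "Tumor stage at diagnosis"
)

# Sanity check: every column has a description.
labels <- attr(pd, "label")
stopifnot(is.character(labels), setequal(names(labels), colnames(pd)))
```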
ExpressionSets should have a short textual description of the facet/dataset in the annotation slot. Similarly, SummarizedExperiments should have a list in the metadata slot with url and description for the facet/dataset.
Please ensure that the covariates across the pData data.frames have already been harmonized!
The feature information (aka "fData") is stored in an internal feature_info SQLite table within the FacileDataSet. The information to populate this table will be retrieved from the corresponding fData-like data.frame from the first given bioc-container in the list.
This data.frame must define the following columns:
- "feature_type": string, one of: "entrez", "ensgid", "enstid", "genomic", "custom".
- "feature_id": string
- "name": string
- "meta": string
- "effective_length": integer
- "source": string
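The column requirements above can be checked before conversion. The following is a sketch, assuming "string" and "integer" map onto R's character and integer vectors; `check_feature_info` is a hypothetical helper, not part of the package, and the example row uses illustrative values.

```r
# Required columns and the R storage type each is assumed to map to.
required <- c(feature_type = "character", feature_id = "character",
              name = "character", meta = "character",
              effective_length = "integer", source = "character")
allowed_types <- c("entrez", "ensgid", "enstid", "genomic", "custom")

# Hypothetical helper: returns a character vector of problems (empty if none).
check_feature_info <- function(finfo) {
  problems <- character()
  missing_cols <- setdiff(names(required), names(finfo))
  if (length(missing_cols) > 0) {
    problems <- c(problems, paste("missing column:", missing_cols))
  }
  for (col in intersect(names(required), names(finfo))) {
    if (typeof(finfo[[col]]) != required[[col]]) {
      problems <- c(problems,
                    paste0("column '", col, "' should be ", required[[col]]))
    }
  }
  if ("feature_type" %in% names(finfo)) {
    bad <- setdiff(unique(finfo$feature_type), allowed_types)
    if (length(bad) > 0) {
      problems <- c(problems, paste("unknown feature_type:", bad))
    }
  }
  problems
}

# One-row example; the meta, length, and source values are illustrative only.
finfo <- data.frame(
  feature_type = "entrez", feature_id = "7157", name = "TP53",
  meta = "protein_coding", effective_length = 2512L, source = "example",
  stringsAsFactors = FALSE
)
print(check_feature_info(finfo))
```

Because the table is populated from the first container in the list, running such a check on that container's feature data is the one that matters.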