This function assumes you are extracting a single assay from the assay container and creating a FacileDataSet from it. If the Bioconductor containers you are using support more than one assay, specify which one to extract with the source_assay parameter; otherwise the first assay in the container is taken. The source_assay used to initialize the FacileDataSet here becomes its default_assay when it is used later.

The feature_info table is populated from the fData (mcols, y$genes, etc.) of the assay container. The pData from the assay container is ingested as well, so do your best to ensure that it is the most complete version of the pData for the FacileDataSet you are creating.

as.FacileDataSet(
  x,
  path,
  assay_name,
  assay_type,
  source_assay,
  assay_description = paste("Description for ", assay_name),
  dataset_name = "DEFAULT_NAME",
  dataset_meta = NULL,
  organism = "unspecified",
  page_size = 2^12,
  cache_size = 2e+05,
  chunk_rows = 5000,
  chunk_cols = "ncol",
  chunk_compression = 5,
  covariate_def = NULL,
  ...
)

# S3 method for list
as.FacileDataSet(
  x,
  path,
  assay_name,
  assay_type,
  source_assay = NULL,
  assay_description = paste("Description for ", assay_name),
  dataset_name = "DEFAULT_NAME",
  dataset_meta = list(),
  organism = "unspecified",
  prune_dataset_meta = TRUE,
  page_size = 2^12,
  cache_size = 2e+05,
  chunk_rows = 5000,
  chunk_cols = "ncol",
  chunk_compression = 5,
  covariate_def = NULL,
  ...
)

Arguments

x

The Bioconductor assay container to extract data from.

path

The directory in which to create the FacileDataSet. If not specified, a default directory is created in the current working directory. This directory should not yet exist.

assay_name

The name to use when storing the assay matrix from x in the FacileDataSet.

assay_type

The type of assay: rnaseq, microarray, nanostring, isoseq (isoform expression), etc.

source_assay

The name of the assay element in x to extract.

dataset_name

the name attribute of the FacileDataSet meta.yaml file.

dataset_meta

A list, named by names(x), with metadata about the datasets that appear in the list of datasets x. The element for each dataset should minimally include a description and a url string.

organism

Used to fetch the appropriate genesets when this dataset is used with the facileexplorer. Put the species name here, e.g. "Homo sapiens", "Mus musculus", etc. The default, "unspecified", isn't really helpful.

page_size

parameter to tweak SQLite

cache_size

parameter to tweak SQLite

chunk_rows

parameter to tweak HDF5

chunk_cols

parameter to tweak HDF5

chunk_compression

parameter to tweak HDF5

...

Additional arguments.

Value

a FacileDataSet()

Details

A FacileDataSet can be created from a number of different Bioconductor containers, such as a Biobase::ExpressionSet, a SummarizedExperiment::SummarizedExperiment, or an edgeR::DGEList. A FacileDataSet can also span multiple Bioconductor containers: for example, you may have one ExpressionSet per indication in the TCGA. You can make a FacileDataSet that encompasses the data from all of these indications by providing a list of ExpressionSets. The list should have its names() set to the TCGA indications ("BLCA", "BRCA", etc.) the data came from.
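As a rough sketch of the naming convention for a multi-container input, the base R snippet below builds a dataset_meta list keyed by the same names the list of containers would carry. The variable dataset_names is a hypothetical stand-in for names(x), and the URLs are placeholders, not real resources.

```r
# Stand-in for names(x): one entry per TCGA indication (hypothetical subset).
dataset_names <- c("BLCA", "BRCA")

# One element per dataset, each with at least a description and a url.
dataset_meta <- list(
  BLCA = list(description = "Bladder urothelial carcinoma",
              url = "https://example.org/blca"),  # placeholder URL
  BRCA = list(description = "Breast invasive carcinoma",
              url = "https://example.org/brca")   # placeholder URL
)

# Check that every dataset has an entry and every entry has the minimal fields.
has_fields <- vapply(
  dataset_meta,
  function(m) all(c("description", "url") %in% names(m)),
  logical(1)
)
stopifnot(setequal(names(dataset_meta), dataset_names), all(has_fields))
```

A check like this is cheap to run before calling as.FacileDataSet on the real list, since a mismatch between the two sets of names is easy to miss by eye.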

Sample Covariates

The pData data.frame is taken from each of the containers in the list of datasets used to create the FacileDataSet. dataset and sample_id columns are forcibly added to (or overwritten in) each of the individual pData data.frames.

In order to insert the entirety of the pData elements into the internal sample_covariate table, we rely on the dplyr::bind_rows function to create a single combined data.frame, which is then converted into an entity-attribute-value table. Note that when row-binding, columns are matched by name, and any missing columns will be filled with NA.
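The row-binding and reshaping described above can be illustrated with a small base R sketch. It is a stand-in for dplyr::bind_rows, not the package's internal code: the toy pData tables, the variable and value column names, and the choice to drop NA values from the long table are all assumptions made for illustration.

```r
# Two toy pData tables with partially overlapping covariates (made-up values).
pdata <- list(
  BLCA = data.frame(sample_id = c("s1", "s2"), sex = c("f", "m"),
                    stage = c("II", "III"), stringsAsFactors = FALSE),
  BRCA = data.frame(sample_id = c("s1", "s2"), sex = c("f", "f"),
                    age = c(51, 64), stringsAsFactors = FALSE)
)

# Row-bind by column name; columns a table lacks are filled with NA.
all_cols <- unique(unlist(lapply(pdata, names)))
wide <- do.call(rbind, lapply(names(pdata), function(ds) {
  df <- pdata[[ds]]
  df[setdiff(all_cols, names(df))] <- NA
  data.frame(dataset = ds, df[all_cols], stringsAsFactors = FALSE)
}))

# Reshape to an entity-attribute-value layout: one row per
# (dataset, sample_id, variable) with the value stored as character.
covariates <- setdiff(all_cols, "sample_id")
eav <- do.call(rbind, lapply(covariates, function(v) {
  data.frame(dataset = wide$dataset, sample_id = wide$sample_id,
             variable = v, value = as.character(wide[[v]]),
             stringsAsFactors = FALSE)
}))
eav <- eav[!is.na(eav$value), ]  # assumption: missing values are not stored
```

In this toy input the BLCA samples have no age and the BRCA samples have no stage, so those cells are NA in the wide table and absent from the long one. It also shows why harmonized covariate names matter: a column called gender in one table and sex in another would become two separate variables.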

ExpressionSet pData data.frames should have an attribute called 'label', which is a named character vector with a description for each column. In the case of a SummarizedExperiment, the colData should have a named list in the metadata slot with a character description of each column.
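For the ExpressionSet case, a minimal sketch of attaching such a 'label' attribute to a plain data.frame is shown below. The columns and descriptions are invented for illustration, and assigning the table back into an ExpressionSet is left out.

```r
# Toy pData table (made-up samples and covariates).
pd <- data.frame(
  sex = c("f", "m"),
  age = c(51, 64),
  row.names = c("s1", "s2"),
  stringsAsFactors = FALSE
)

# Named character vector: one description per pData column.
attr(pd, "label") <- c(
  sex = "Sex of the patient",
  age = "Age at diagnosis, in years"
)

# Columns that still lack a description.
undocumented <- setdiff(names(pd), names(attr(pd, "label")))
stopifnot(length(undocumented) == 0)
```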

ExpressionSets should have a short textual description of the facet/dataset in the annotation slot. Similarly, SummarizedExperiments should have a list in the metadata slot with url and description for the facet/dataset.

Please ensure that the covariates across the pData data.frames have already been harmonized!

Feature meta-information

The feature information (aka "fData") is stored in an internal feature_info SQLite table within the FacileDataSet. The information to populate this table is retrieved from the corresponding fData-like data.frame of the first bioc-container in the list. This data.frame must define the following columns:

  • "feature_type": string, one of: "entrez", "ensgid", "enstid", "genomic", "custom".

  • "feature_id": string

  • "name": string

  • "meta": string

  • "effective_length": integer

  • "source": string
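A feature table with these columns might look like the following base R sketch. The meta, effective_length, and source values are made up for illustration, and the validation shown is only one plausible pre-flight check, not what the function itself does.

```r
required_cols <- c("feature_type", "feature_id", "name",
                   "meta", "effective_length", "source")
valid_types <- c("entrez", "ensgid", "enstid", "genomic", "custom")

# Toy fData-like table keyed by Entrez gene identifiers.
finfo <- data.frame(
  feature_type = "entrez",
  feature_id = c("7157", "672"),
  name = c("TP53", "BRCA1"),
  meta = c("protein_coding", "protein_coding"),  # made-up annotation
  effective_length = c(2512L, 7088L),            # made-up lengths
  source = "example",                            # made-up source label
  stringsAsFactors = FALSE
)

# Check column presence, allowed feature types, and the integer length column.
missing_cols <- setdiff(required_cols, names(finfo))
stopifnot(
  length(missing_cols) == 0,
  all(finfo$feature_type %in% valid_types),
  is.character(finfo$feature_id),
  is.integer(finfo$effective_length)
)
```

Note that feature_id is kept as a string even when the identifiers look numeric, matching the column types listed above.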