Sample covariates (aka pData) are encoded in an entity-attribute-value (EAV) table. Metadata about these covariates is stored in a meta.yaml file in the FacileDataSet directory, which enables the FacileDataSet to cast each value stored in the EAV table back to its native R type. This function generates the list-of-lists structure that represents the sample_covariates section of the meta.yaml file.

eav_metadata_create(
  x,
  ignore = c("dataset", "sample_id"),
  covariate_def = list()
)

Arguments

x

a pData data.frame

ignore

the columns in x for which no covariate definitions should be created. This defaults to c("dataset", "sample_id"), the columns the facileverse uses to identify samples rather than to describe them.

covariate_def

a named list of covariate definitions. The names of this list are the names the covariates will be called in the target FacileDataSet. The values of the list are:

  • varname: a character() of the column name(s) in x that this sample covariate is derived from. If more than one column is used for the facile covariate conversion (e.g. when encoding survival), provide a character vector of length() > 1 with the names of the columns in x that are used for the encoding. For survival, this might be c("time", "event"), in that order.

  • label: a human readable label to use for this covariate in user facing scenarios in the facileverse.

  • class: the "facile class" of the covariate. This can be one of categorical, real, or right_censored (for survival).

  • levels: (optional) the factor levels to use, if you want a categorical to be treated as a factor when it isn't already encoded as such in the pData itself, or if you want to rearrange the factor levels.

  • type: (optional) this is used as a "grouping" level, particularly in the FacileExplorer.
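As a rough illustration of the fields listed above, a covariate_def for two simple covariates might be sketched as follows. The pData columns (stage, age) and the covariate names (tumor_stage, age_at_dx) are hypothetical, and the call to eav_metadata_create() is left commented out because it requires FacileData to be installed.

```r
# Hypothetical pData with one row per sample; the column names are invented
# for illustration only.
pdat <- data.frame(
  dataset = c("study1", "study1", "study1"),
  sample_id = c("s1", "s2", "s3"),
  stage = c("II", "I", "III"),
  age = c(61, 48, 55),
  stringsAsFactors = FALSE)

# One entry per covariate, named as it should appear in the FacileDataSet.
cdef <- list(
  tumor_stage = list(
    varname = "stage",
    label = "Tumor Stage",
    class = "categorical",
    levels = c("I", "II", "III"),  # explicit ordering of the factor levels
    type = "clinical"),
  age_at_dx = list(
    varname = "age",
    label = "Age at Diagnosis",
    class = "real",
    type = "clinical"))

# Sanity check: every column referenced by a definition exists in the pData.
all_vars <- unlist(lapply(cdef, `[[`, "varname"))
missing_vars <- setdiff(all_vars, colnames(pdat))

# With FacileData installed, the definitions would be passed along like so:
# meta <- eav_metadata_create(pdat, covariate_def = cdef)
```

Checking the referenced columns up front is only a convenience in this sketch; it is not something the documentation above says the function does for you.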

Value

a list-of-lists that encodes the sample_covariates section of the meta.yaml file for a FacileDataSet. Each list element will have the following elements:

  1. arguments: the name(s) of the columns from x used in this covariate description.

  2. class: "real", "categorical", (survival needs a bit of work)

  3. description: a string with minimal description

  4. type: this isn't really used in the dataset, but another application might want to group covariates by type.

Details

For simple pData covariates, each column is treated independently from the rest. There are some types of covariates which require multiple columns for proper encoding, such as survival information, which requires a pair of values that indicate the "time to event" and the status of the event (death or censored). In these cases, the caller needs to provide an entry in the covariate_def list that describes which pData columns (varname) go into the single facile covariate value.

Please refer to the Encoding Survival Covariates section for a more detailed description of how to encode survival information into the EAV table using the covariate_def parameter. Further examples of how to encode other complex attributes will be added as they are required, but you can reference the Encoding Arbitrarily Complex Covariates section for some more information.

Encoding Survival Covariates

UPDATE: FacileData can now use survival data encoded as a survival::Surv object stored as a pData column. Read on for the original encoding strategy, which is still implemented.

Survival data in R is typically encoded by two vectors: one that indicates the "time to event" (tte), and a second that indicates whether the denoted tte is an "event" (1) or "censored" (0).

Normally these vectors appear as two columns in an experiment's pData, and therefore need to be encoded into the FacileDataSet's EAV table. To do so, the pair of vectors is turned into a single signed numeric value. The absolute value of the numeric indicates the "time to event" and the sign of the value indicates its censoring status.
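The signed encoding described above can be sketched in a few lines of base R. This is not the package's implementation: the function names are hypothetical, and the direction of the sign (positive for an event, negative for censored) is an assumption, since the text only says that the sign carries the censoring status.

```r
# ASSUMPTION: positive values denote an observed event and negative values
# denote censoring. Confirm the direction against eav_encode_right_censored()
# before relying on it.
encode_signed_survival <- function(time, event) {
  stopifnot(
    is.numeric(time),
    length(time) == length(event),
    all(time > 0, na.rm = TRUE),    # a time of 0 cannot carry a sign
    all(event %in% c(0, 1, NA)))
  ifelse(event == 1, time, -time)
}

# Recover the (time, event) pair from the signed value.
decode_signed_survival <- function(x) {
  data.frame(time = abs(x), event = as.integer(x > 0))
}

tte_OS <- c(120, 365, 42)
event_OS <- c(1, 0, 1)
encoded <- encode_signed_survival(tte_OS, event_OS)
decoded <- decode_signed_survival(encoded)
```

Under that assumed convention, the second patient (censored at 365 days) is stored as a negative number, and decoding restores both original vectors.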

Let's assume we have tte_OS and event_OS columns that are used to encode a patient's overall survival (time and censor status). To store this as an "OS" covariate in the EAV table, a covariate_def list-of-lists definition that captures this encoding would look like this:

covariate_def <- list(
  OS=list(
    class="right_censored",
    arguments=list(time="tte_OS", event="event_OS"),
    label="Overall Survival",
    type="clinical",
    description="Overall survival in days"))

Note how the name of the list-entry in covariate_def defines the name of the covariate in the FacileDataSet. The class entry for the OS definition indicates the type of variable this is. The arguments section is only used when encoding a wide pData into the EAV value column. names(arguments) correspond to the parameters in the eav_encode_right_censored() function, and their values are the columns in the target pData that populate the respective parameters in the function call. The analogous meta.yaml entry in the sample_covariates section for the "OS" covariate_def entry looks like so:

sample_covariates:
  OS:
    class: right_censored
    arguments:
      time: tte_OS
      event: event_OS
    label: "Overall Survival"
    type: "clinical"
    description: "Overall survival in days"

Encoding Arbitrarily Complex Covariates

To encode a new type of complex covariate from a wide pData data.frame, we need to:

  1. Specify a new class (like "right_censored") for use within a FacileDataSet.

  2. Define an eav_encode_<class>(arg1, arg2, ...) function which takes the R data vectors (arg1, arg2) and converts them into a single value for the EAV table.

  3. Define an eav_decode_<class>(x, attrname, def, ...) function which takes the single value in the EAV table and casts it back into the R-native data vector(s).

    • x is the vector of (character) values from the EAV table

    • attrname is the name of the covariate in the EAV table

    • def is the definition-list for this covariate.

    • ... allows each decode function to be further customized.
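As a sketch of steps 2 and 3, the pair of functions below follows the naming pattern and signatures given above for a hypothetical "interval" class that stores a (lower, upper) pair as a single "lower;upper" string. The class name, the separator, and the shape of the decoded result (a two-column data.frame) are all assumptions made for illustration; the documentation above does not specify what a decode function must return for multi-column covariates.

```r
# Hypothetical "interval" class: two numeric vectors collapsed into one
# character value per sample for storage in the EAV table.
eav_encode_interval <- function(lower, upper, ...) {
  stopifnot(
    is.numeric(lower), is.numeric(upper),
    length(lower) == length(upper))
  paste(lower, upper, sep = ";")
}

# x: (character) values from the EAV table
# attrname: name of the covariate in the EAV table
# def: definition-list for this covariate (unused in this sketch)
eav_decode_interval <- function(x, attrname = "interval", def = list(), ...) {
  parts <- strsplit(as.character(x), ";", fixed = TRUE)
  out <- data.frame(
    lower = as.numeric(vapply(parts, `[`, character(1), 1L)),
    upper = as.numeric(vapply(parts, `[`, character(1), 2L)))
  # Prefix the decoded columns with the covariate name.
  colnames(out) <- paste(attrname, colnames(out), sep = "_")
  out
}

encoded <- eav_encode_interval(c(1.5, 10), c(2.5, 20))
decoded <- eav_decode_interval(encoded, attrname = "dose_range")
```

A covariate_def entry for such a class would presumably set class = "interval" and map arguments to the lower and upper pData columns, mirroring the "OS" example.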

Examples

# covariate_def definition to take tte_OS and event_OS columns and turn
# into a facile "OS" right_censored survival covariate
cc <- list(
  OS=list(
    arguments=list(time="tte_OS", event="event_OS"),
    label="Overall Survival",
    class="right_censored",
    type="clinical",
    description="Overall survival in days"))