Curate data in bakRData object for statistical modeling

cBprocess creates the data structures necessary to analyze nucleotide recoding RNA-seq data with any of the statistical model implementations in bakRFit. The input to cBprocess must be an object of class bakRData.

Usage

cBprocess(
  obj,
  high_p = 0.2,
  totcut = 50,
  totcut_all = 10,
  Ucut = 0.25,
  AvgU = 4,
  Stan = TRUE,
  Fast = TRUE,
  FOI = c(),
  concat = TRUE
)

Arguments

obj: An object of class bakRData
high_p: Numeric; any transcript with a mutation rate (number of mutations / number of Ts in reads) higher than this in any -s4U control sample is filtered out
totcut: Numeric; any transcript with fewer than this number of sequencing reads in at least one replicate of every experimental condition is filtered out
totcut_all: Numeric; any transcript with fewer than this number of sequencing reads in any sample is filtered out
Ucut: Numeric; the fraction of a transcript's reads with 2 or fewer Us must be less than this cutoff in all samples
AvgU: Numeric; the average number of Us in a transcript's reads must be greater than this cutoff in all samples
Stan: Boolean; if TRUE, then data_list that can be passed to 'Stan' is curated
Fast: Boolean; if TRUE, then dataframe that can be passed to fast_analysis() is curated
FOI: Features of interest; character vector containing names of features to analyze. If FOI is non-null and concat is TRUE, then all minimally reliable FOIs will be combined with reliable features passing all set filters (high_p, totcut, totcut_all, Ucut, and AvgU). If concat is FALSE, only the minimally reliable FOIs will be kept. A minimally reliable FOI is one that passes filtering with minimally stringent parameters.
concat: Boolean; If TRUE, FOI is concatenated with output of reliableFeatures
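
The interaction between FOI and concat described above amounts to a pair of set operations. The following base R sketch illustrates that behavior; it is not bakR's internal code, and select_features, reliable, and min_reliable are hypothetical names introduced only for this example.

```r
# Hypothetical helper illustrating how FOI and concat combine.
# reliable:     features passing all user-set filters
# min_reliable: features passing the minimally stringent filters
select_features <- function(reliable, min_reliable, FOI = c(), concat = TRUE) {
  if (is.null(FOI)) {
    # No features of interest requested: keep the reliable set as-is
    return(reliable)
  }
  # Only FOIs that are at least minimally reliable are retained
  kept_FOI <- intersect(FOI, min_reliable)
  if (concat) {
    union(reliable, kept_FOI)
  } else {
    kept_FOI
  }
}

reliable     <- c("geneA", "geneB")
min_reliable <- c("geneA", "geneB", "geneC")

# geneD is dropped in both cases because it is not minimally reliable
with_concat    <- select_features(reliable, min_reliable,
                                  FOI = c("geneC", "geneD"), concat = TRUE)
without_concat <- select_features(reliable, min_reliable,
                                  FOI = c("geneC", "geneD"), concat = FALSE)
```

With concat = TRUE the result contains geneA, geneB, and geneC; with concat = FALSE it contains only geneC.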

Value

Returns a list of objects that can be passed to TL_stan and/or fast_analysis. Those objects are:

Stan_data; list that can be passed to TL_stan with Hybrid_Fit = FALSE. Consists of metadata as well as data that 'Stan' will analyze. Data to be analyzed consists of equal length vectors. The contents of Stan_data are:
- NE; Number of datapoints for 'Stan' to analyze (NE = Number of Elements)
- NF; Number of features in dataset
- TP; Numerical indicator of s4U feed (0 = no s4U feed, 1 = s4U fed)
- FE; Numerical indicator of feature
- num_mut; Number of U-to-C mutations observed in a particular set of reads
- MT; Numerical indicator of experimental condition (Exp_ID from metadf)
- nMT; Number of experimental conditions
- R; Numerical indicator of replicate
- nrep; Number of replicates (analysis requires same number of replicates of all conditions)
- num_obs; Number of reads with identical data (number of mutations, feature of origin, and sample of origin)
- tl; Vector of label times for each experimental condition
- U_cont; Log2-fold-difference in U-content for a feature in a sample relative to average U-content for that sample
- Avg_Reads; Standardized log10(average read counts) for a particular feature in a particular condition, averaged over replicates
- Avg_Reads_natural; Unstandardized average read counts for a particular feature in a particular condition, averaged over replicates. Used for plotMA
- sdf; Dataframe that maps numerical feature ID to original feature name. Also has read depth information
- sample_lookup; Lookup table relating MT and R to the original sample name
Fast_df; A data frame that can be passed to fast_analysis. The contents of Fast_df are:
- sample; Original sample name
- XF; Original feature name
- TC; Number of T to C mutations
- nT; Number of Ts in read
- n; Number of identical observations
- fnum; Numerical indicator of feature
- type; Numerical indicator of s4U feed (0 = no s4U feed, 1 = s4U fed)
- mut; Numerical indicator of experimental condition (Exp_ID from metadf)
- reps; Numerical indicator of replicate
Count_Matrix; A matrix with read count information. Each column represents a sample and each row represents a feature. Each entry is the raw number of read counts mapping to a particular feature in a particular sample. Column names are the corresponding sample names and row names are the corresponding feature names.
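
To make the layout of Count_Matrix concrete, here is a small base R sketch that builds a feature-by-sample matrix from a toy table shaped like Fast_df. The toy data and the object names are assumptions made for illustration; cBprocess may construct its matrix differently.

```r
# Toy table with the sample, XF, TC, nT and n columns described above
cB_toy <- data.frame(
  sample = c("WT_1", "WT_1", "WT_2", "WT_2", "WT_2"),
  XF     = c("geneA", "geneB", "geneA", "geneA", "geneB"),
  TC     = c(0, 1, 0, 2, 0),
  nT     = c(20, 25, 20, 22, 25),
  n      = c(10, 5, 8, 2, 7),
  stringsAsFactors = FALSE
)

# Sum the number of identical observations (n) for each feature/sample pair.
# Rows are features, columns are samples, entries are raw read counts.
Count_Matrix_toy <- tapply(cB_toy$n, list(cB_toy$XF, cB_toy$sample), sum)
```

Here geneA in WT_2 has 8 + 2 = 10 reads, because its two rows differ only in mutation and T counts.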

Details

The 1st step executed by cBprocess is to find the names of features which are deemed "reliable". A reliable feature is one with sufficient read coverage in every single sample (i.e., > totcut_all reads in all samples), sufficient read coverage in all replicates of at least one experimental condition (i.e., > totcut reads in all replicates for one or more experimental conditions), and limited mutation content in all -s4U control samples (i.e., < high_p mutation rate in all samples lacking s4U feeds). In addition, when analyzing short-read sequencing data, two further criteria become pertinent: the maximum fraction of reads in each sample that may have 2 or fewer Us (Ucut) and the minimum average number of Us in a feature's reads in each sample (AvgU). This is done with a call to reliableFeatures.
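
The three coverage and mutation-rate criteria can be sketched in base R on toy data. This is a simplified illustration and not the reliableFeatures implementation: the toy tables, the assumption that -s4U samples are those with tl equal to 0, the choice to evaluate totcut on s4U-fed samples only, and the omission of the Ucut and AvgU criteria are all choices made for this example.

```r
# Toy cB-like table: one row per group of n identical reads
cB_toy <- data.frame(
  sample = rep(c("ctl", "WT_1", "WT_2"), each = 2),
  XF     = rep(c("geneA", "geneB"), times = 3),
  TC     = c(0, 5, 1, 1, 1, 1),
  nT     = c(20, 20, 20, 20, 20, 20),
  n      = c(60, 60, 80, 30, 70, 90),
  stringsAsFactors = FALSE
)
# Toy metadata (assumed layout): label time tl (0 = no s4U) and condition ID
meta_toy <- data.frame(
  sample = c("ctl", "WT_1", "WT_2"),
  tl     = c(0, 2, 2),
  Exp_ID = c(1, 1, 1),
  stringsAsFactors = FALSE
)

high_p <- 0.2; totcut <- 50; totcut_all <- 10

# Per feature and sample: total reads, total mutations, total Ts
cB_toy$reads <- cB_toy$n
cB_toy$muts  <- cB_toy$TC * cB_toy$n
cB_toy$Ts    <- cB_toy$nT * cB_toy$n
per_sample <- aggregate(cbind(reads, muts, Ts) ~ XF + sample,
                        data = cB_toy, FUN = sum)
per_sample <- merge(per_sample, meta_toy, by = "sample")
per_sample$mut_rate <- per_sample$muts / per_sample$Ts

features <- sort(unique(as.character(per_sample$XF)))
is_reliable <- vapply(features, function(f) {
  d <- per_sample[per_sample$XF == f, ]
  # (1) more than totcut_all reads in every sample
  cover_all <- all(d$reads > totcut_all)
  # (2) more than totcut reads in all replicates of at least one condition
  fed <- d[d$tl > 0, ]
  cover_cond <- any(tapply(fed$reads > totcut, fed$Exp_ID, all))
  # (3) mutation rate below high_p in every -s4U control sample
  low_bg <- all(d$mut_rate[d$tl == 0] < high_p)
  cover_all && cover_cond && low_bg
}, logical(1))

reliable <- features[is_reliable]
```

In this toy data only geneA is reliable. geneB fails criterion (2) because WT_1 has just 30 reads, and criterion (3) because its control mutation rate is 5/20 = 0.25.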

The 2nd step is to extract only the reliable features from the cB dataframe in the bakRData object. During this process, a numerical ID is given to each reliable feature, with the IDs corresponding to the order of the features when arranged using dplyr::arrange.
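
A minimal sketch of this filtering and numbering step, using base R sort and merge as stand-ins for the dplyr calls. The toy data and the names sdf_toy and cB_filtered are hypothetical; the fnum column name follows the Fast_df description above.

```r
# Names of features assumed to have passed the reliability filters
reliable <- c("geneC", "geneA", "geneB")

# Sort the feature names, then number them in that order
sdf_toy <- data.frame(XF = sort(reliable, method = "radix"),
                      stringsAsFactors = FALSE)
sdf_toy$fnum <- seq_len(nrow(sdf_toy))

# Toy cB-like table that also contains an unreliable feature (geneD)
cB_toy <- data.frame(
  XF = c("geneB", "geneD", "geneA", "geneC"),
  n  = c(3, 9, 5, 1),
  stringsAsFactors = FALSE
)

# The inner join keeps only reliable features and attaches their numerical IDs
cB_filtered <- merge(cB_toy, sdf_toy, by = "XF")
```

After the join, geneA, geneB, and geneC carry IDs 1, 2, and 3, and geneD is gone.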

The 3rd step is to prepare a dataframe where each row corresponds to a set of n identical reads (that is, reads that come from the same sample and have the same number of mutations and Us). Part of this process involves assigning an arbitrary numerical ID to each replicate in each experimental condition. The numerical ID will correspond to the order in which the sample appears in metadf. The outcome of this step is multiple dataframes with variable information content. These include a dataframe with information about read counts in each sample, one which logs the U-contents of each feature, one which is compatible with fast_analysis and thus groups reads by their number of mutations as well as their number of Us, and one which is compatible with TL_stan with StanFit == TRUE and thus groups reads by only their number of mutations. At the end of this step, two other smaller data structures are created. One is an average count matrix (a count matrix where the ith row and jth column corresponds to the average number of reads mapping to feature i in experimental condition j, averaged over all replicates). The other is a sample lookup table that relates the numerical experimental and replicate IDs to the original sample name.
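
The two groupings, the replicate numbering, and the average count matrix can be sketched in base R as follows. All data and object names here are toy assumptions chosen for illustration (column names mut and reps follow the Fast_df description); this is not the code cBprocess runs.

```r
# Toy table with one row per read (n = 1)
reads_toy <- data.frame(
  sample = c("WT_1", "WT_1", "WT_1", "WT_2", "WT_2"),
  XF     = "geneA",
  TC     = c(1, 1, 1, 0, 0),
  nT     = c(20, 20, 22, 20, 20),
  n      = 1,
  stringsAsFactors = FALSE
)

# Grouping by mutations and Us (the shape described for fast_analysis)
by_TC_nT <- aggregate(n ~ sample + XF + TC + nT, data = reads_toy, FUN = sum)
# Grouping by mutations only (the shape described for TL_stan)
by_TC <- aggregate(n ~ sample + XF + TC, data = reads_toy, FUN = sum)

# Sample lookup: replicate IDs follow the order of samples within a condition
lookup <- data.frame(
  sample = c("WT_1", "WT_2", "KO_1", "KO_2"),
  mut    = c(1, 1, 2, 2),
  stringsAsFactors = FALSE
)
lookup$reps <- ave(seq_along(lookup$mut), lookup$mut, FUN = seq_along)

# Toy count matrix: rows are features, columns are samples
counts <- matrix(
  c(10, 4, 14, 6, 30, 8, 50, 12),
  nrow = 2,
  dimnames = list(c("geneA", "geneB"), lookup$sample)
)

# Average over replicates within each condition: one column per condition
avg_counts <- sapply(
  split(seq_len(ncol(counts)), lookup$mut),
  function(cols) rowMeans(counts[, cols, drop = FALSE])
)
```

Grouping by both TC and nT yields three rows for these five reads, while grouping by TC alone collapses them to two, which is why the mutation-only table is the more compact of the two.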

Examples

# \donttest{

# Load cB
data("cB_small")

# Load metadf
data("metadf")

# Create bakRData
bakRData <- bakRData(cB_small, metadf)

# Preprocess data
data_for_bakR <- cBprocess(obj = bakRData)
#> Finding reliable Features
#> Filtering out unwanted or unreliable features
#> Processing data...
# }