cBprocess creates the data structures necessary to analyze nucleotide recoding RNA-seq data with any of the statistical model implementations in bakRFit. The input to cBprocess must be an object of class bakRData.
Usage
cBprocess(
obj,
high_p = 0.2,
totcut = 50,
totcut_all = 10,
Ucut = 0.25,
AvgU = 4,
Stan = TRUE,
Fast = TRUE,
FOI = c(),
concat = TRUE
)
Arguments
- obj
An object of class bakRData
- high_p
Numeric; Any transcripts with a mutation rate (number of mutations / number of Ts in reads) higher than this in any no-s4U control sample are filtered out
- totcut
Numeric; Any transcripts with fewer than this number of sequencing reads in at least one replicate of every experimental condition are filtered out
- totcut_all
Numeric; Any transcripts with fewer than this number of sequencing reads in any sample are filtered out
- Ucut
Numeric; All transcripts must have a fraction of reads with 2 or fewer Us below this cutoff in all samples
- AvgU
Numeric; All transcripts must have an average number of Us greater than this cutoff in all samples
- Stan
Boolean; if TRUE, then a data list that can be passed to 'Stan' is curated
- Fast
Boolean; if TRUE, then a dataframe that can be passed to fast_analysis() is curated
- FOI
Features of interest; character vector containing names of features to analyze. If FOI is non-null and concat is TRUE, then all minimally reliable FOIs will be combined with reliable features passing all set filters (high_p, totcut, totcut_all, Ucut, and AvgU). If concat is FALSE, only the minimally reliable FOIs will be kept. A minimally reliable FOI is one that passes filtering with minimally stringent parameters.
- concat
Boolean; If TRUE, FOI is concatenated with the output of reliableFeatures
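As a usage illustration, the sketch below assumes the standard bakR workflow in which a bakRData object is built from a cB data frame and a metadf data frame; cB and metadf are hypothetical placeholders, and the chosen cutoff value is arbitrary.

library(bakR)

# cB: data frame of mutation data (columns sample, XF, TC, nT, n)
# metadf: data frame describing each sample (tl and Exp_ID columns, sample
# names as row names). Both are hypothetical placeholders here.
bdo <- bakRData(cB, metadf)

# Curate both the 'Stan' data list and the fast_analysis() data frame,
# relaxing the per-condition read count cutoff slightly
proc <- cBprocess(obj = bdo, totcut = 25, Stan = TRUE, Fast = TRUE)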
Value
returns a list of objects that can be passed to TL_stan and/or fast_analysis. Those objects are:
Stan_data; list that can be passed to TL_stan with Hybrid_Fit = FALSE. Consists of metadata as well as data that 'Stan' will analyze. Data to be analyzed consists of equal-length vectors. The contents of Stan_data are:
NE; Number of datapoints for 'Stan' to analyze (NE = Number of Elements)
NF; Number of features in dataset
TP; Numerical indicator of s4U feed (0 = no s4U feed, 1 = s4U fed)
FE; Numerical indicator of feature
num_mut; Number of U-to-C mutations observed in a particular set of reads
MT; Numerical indicator of experimental condition (Exp_ID from metadf)
nMT; Number of experimental conditions
R; Numerical indicator of replicate
nrep; Number of replicates (analysis requires same number of replicates of all conditions)
num_obs; Number of reads with identical data (number of mutations, feature of origin, and sample of origin)
tl; Vector of label times for each experimental condition
U_cont; Log2-fold-difference in U-content for a feature in a sample relative to average U-content for that sample
Avg_Reads; Standardized log10(average read counts) for a particular feature in a particular condition, averaged over replicates
Avg_Reads_natural; Unstandardized average read counts for a particular feature in a particular condition, averaged over replicates. Used for plotMA
sdf; Dataframe that maps numerical feature ID to original feature name. Also has read depth information
sample_lookup; Lookup table relating MT and R to the original sample name
Fast_df; A data frame that can be passed to fast_analysis. The contents of Fast_df are:
sample; Original sample name
XF; Original feature name
TC; Number of T to C mutations
nT; Number of Ts in read
n; Number of identical observations
fnum; Numerical indicator of feature
type; Numerical indicator of s4U feed (0 = no s4U feed, 1 = s4U fed)
mut; Numerical indicator of experimental condition (Exp_ID from metadf)
reps; Numerical indicator of replicate
Count_Matrix; A matrix with read count information. Each column represents a sample and each row represents a feature. Each entry is the raw number of read counts mapping to a particular feature in a particular sample. Column names are the corresponding sample names and row names are the corresponding feature names.
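To illustrate navigating this return value, the element names below (Stan_data, Fast_df, Count_Matrix) are those described above; proc is a hypothetical result of a cBprocess() call with Stan = TRUE and Fast = TRUE.

# Top-level structure of the returned list
str(proc, max.level = 1)

# Number of experimental conditions and the label time for each
proc$Stan_data$nMT
proc$Stan_data$tl

# First rows of the data frame destined for fast_analysis()
head(proc$Fast_df)

# Raw read counts: features in rows, samples in columns
dim(proc$Count_Matrix)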
Details
The 1st step executed by cBprocess is to find the names of features which are deemed "reliable". A reliable feature is one with sufficient read coverage in every single sample (i.e., > totcut_all reads in all samples), sufficient read coverage in all replicates of at least one experimental condition (i.e., > totcut reads in all replicates for one or more experimental conditions), and limited mutation content in all -s4U control samples (i.e., < high_p mutation rate in all samples lacking s4U feeds). In addition, if analyzing short read sequencing data, two additional definitions of reliable features become pertinent: the maximum fraction of reads with 2 or fewer Us allowed in each sample (Ucut) and the minimum average number of Us for a feature's reads in each sample (AvgU). This is done with a call to reliableFeatures.
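The read-coverage and background-mutation criteria can be pictured with the dplyr sketch below. This is a simplified conceptual illustration, not the reliableFeatures implementation: cB, metadf, totcut_all, and high_p are assumed to exist, -s4U control samples are assumed to have tl == 0 in metadf, and the per-condition totcut and U-content criteria are omitted for brevity.

library(dplyr)

# Per-feature, per-sample read counts and mutation rates from a cB-style
# data frame (columns sample, XF, TC, nT, n)
per_sample <- cB %>%
  group_by(XF, sample) %>%
  summarise(
    reads    = sum(n),
    mut_rate = sum(TC * n) / sum(nT * n),
    .groups  = "drop"
  ) %>%
  left_join(tibble(sample = rownames(metadf), tl = metadf$tl), by = "sample")

# Keep features with > totcut_all reads in every sample and a mutation rate
# < high_p in every -s4U control sample (assumed here to be tl == 0)
reliable <- per_sample %>%
  group_by(XF) %>%
  summarise(
    enough_reads = all(reads > totcut_all),
    clean_ctl    = all(mut_rate[tl == 0] < high_p),
    .groups      = "drop"
  ) %>%
  filter(enough_reads, clean_ctl) %>%
  pull(XF)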
The 2nd step is to extract only reliable features from the cB dataframe in the bakRData object. During this process, a numerical ID is given to each reliable feature, with the numerical ID corresponding to its order when arranged using dplyr::arrange.
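One way to picture this ID assignment is the sketch below; reliable is a character vector of retained feature names (e.g., from the previous sketch), cB is the hypothetical mutation data frame, and the actual internal bookkeeping in cBprocess may differ.

library(dplyr)

# Keep only reliable features from the cB data frame
cB_sub <- cB %>% filter(XF %in% reliable)

# fnum follows the order of feature names after dplyr::arrange()
feature_ids <- cB_sub %>%
  distinct(XF) %>%
  arrange(XF) %>%
  mutate(fnum = row_number())

cB_sub <- cB_sub %>% left_join(feature_ids, by = "XF")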
The 3rd step is to prepare a dataframe where each row corresponds to a set of n identical reads (that is, reads that come from the same sample and have the same number of mutations and Us). Part of this process involves assigning an arbitrary numerical ID to each replicate in each experimental condition. The numerical ID will correspond to the order in which the sample appears in metadf. The outcome of this step is multiple dataframes with variable information content. These include a dataframe with information about read counts in each sample, one which logs the U-contents of each feature, one which is compatible with fast_analysis and thus groups reads by their number of mutations as well as their number of Us, and one which is compatible with TL_stan with StanFit == TRUE and thus groups reads by only their number of mutations. At the end of this step, two other smaller data structures are created: one is an average count matrix (a count matrix where the ith row and jth column corresponds to the average number of reads mapping to feature i in experimental condition j, averaged over all replicates) and the other is a sample lookup table that relates the numerical experimental and replicate IDs to the original sample name.
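The grouping and count-matrix construction described above can be sketched with dplyr and tidyr as follows; cB_sub is assumed from the earlier sketches, and the real cBprocess output may organize these objects differently.

library(dplyr)
library(tidyr)

# Collapse identical reads: one row per (sample, feature, mutation count,
# U count), with n recording how many reads share that combination
fast_like <- cB_sub %>%
  group_by(sample, XF, fnum, TC, nT) %>%
  summarise(n = sum(n), .groups = "drop")

# Raw count matrix: features in rows, samples in columns
count_mat <- cB_sub %>%
  group_by(XF, sample) %>%
  summarise(reads = sum(n), .groups = "drop") %>%
  pivot_wider(names_from = sample, values_from = reads, values_fill = 0) %>%
  tibble::column_to_rownames("XF") %>%
  as.matrix()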