Estimating kinetic parameters from nucleotide recoding RNA-seq data

bakRFit analyzes nucleotide recoding RNA-seq data to estimate kinetic parameters relating to RNA stability and changes in RNA stability induced by experimental perturbations. Several statistical models of varying efficiency and accuracy can be used to fit data.

Usage

bakRFit(
  obj,
  StanFit = FALSE,
  HybridFit = FALSE,
  high_p = 0.2,
  totcut = 50,
  totcut_all = 10,
  Ucut = 0.25,
  AvgU = 4,
  FastRerun = FALSE,
  FOI = c(),
  concat = TRUE,
  StanRateEst = FALSE,
  RateEst_size = 30,
  low_reads = 100,
  high_reads = 5e+05,
  chains = 1,
  NSS = FALSE,
  Chase = FALSE,
  BDA_model = FALSE,
  multi_pold = FALSE,
  Long = FALSE,
  kmeans = FALSE,
  ztest = FALSE,
  Fisher = TRUE,
  ...
)

Arguments

obj: bakRData object produced by bakRData, bakRFit object produced by bakRFit bakRFnData object produced by bakRFnData, or bakRFnFit object produced by bakRFit.
StanFit: Logical; if TRUE, then the MCMC implementation is run. Will only be used if obj is a bakRFit object
HybridFit: Logical; if TRUE, then the Hybrid implementation is run. Will only be used if obj is a bakRFit object
high_p: Numeric; Any features with a mutation rate (number of mutations / number of Ts in reads) higher than this in any -s4U control samples (i.e., samples that were not treated with s4U) are filtered out
totcut: Numeric; Any features with less than this number of sequencing reads in any replicate of all experimental conditions are filtered out
totcut_all: Numeric; Any features with less than this number of sequencing reads in any sample are filtered out
Ucut: Numeric; All features must have a fraction of reads with 2 or less Us less than this cutoff in all samples
AvgU: Numeric; All features must have an average number of Us greater than this cutoff in all samples
FastRerun: Logical; only matters if a bakRFit object is passed to bakRFit. If TRUE, then the Stan-free model implemented in fast_analysis is rerun on data, foregoing fitting of either of the 'Stan' models.
FOI: Features of interest; character vector containing names of features to analyze
concat: Logical; If TRUE, FOI is concatenated with output of reliableFeatures
StanRateEst: Logical; if TRUE, a simple 'Stan' model is used to estimate mutation rates for fast_analysis; this may add a couple minutes to the runtime of the analysis.
RateEst_size: Numeric; if StanRateEst is TRUE, then data from RateEst_size genes are used for mutation rate estimation. This can be as low as 1 and should be kept low to ensure maximum efficiency
low_reads: Numeric; if StanRateEst is TRUE, then only features with more than low_reads reads in all samples will be used for mutation rate estimation
high_reads: Numeric; if StanRateEst is TRUE, then only features with less than high_read reads in all samples will be used for mutation rate estimation. A high read count cutoff is as important as a low read count cutoff in this case because you don't want the fraction labeled of chosen features to be extreme (e.g., close to 0 or 1), and high read count features are likely low fraction new features.
chains: Number of Markov chains to sample from. 1 should suffice since these are validated models. Running more chains is generally preferable, but memory constraints can make this unfeasible.
NSS: Logical; if TRUE, logit(fn)s are directly compared to avoid assuming steady-state
Chase: Logical; Set to TRUE if analyzing a pulse-chase experiment. If TRUE, kdeg = -ln(fn)/tl where fn is the fraction of reads that are s4U (more properly referred to as the fraction old in the context of a pulse-chase experiment).
BDA_model: Logical; if TRUE, variance is regularized with scaled inverse chi-squared model. Otherwise a log-normal model is used.
multi_pold: Logical; if TRUE, pold is estimated for each sample rather than use a global pold estimate.
Long: Logical; if TRUE, long read optimized fraction new estimation strategy is used.
kmeans: Logical; if TRUE, kmeans clustering on read-specific mutation rates is used to estimate pnews and pold.
ztest: Logical; if TRUE and the MLE implementation is being used, then a z-test will be used for p-value calculation rather than the more conservative moderated t-test.
Fisher: Logical; if TRUE, Fisher information is used to estimate logit(fn) uncertainty. Else, a less conservative binomial model is used, which can be preferable in instances where the Fisher information strategy often drastically overestimates uncertainty (i.e., low coverage or low pnew).
...: Arguments passed to either fast_analysis (if a bakRData object) or TL_Stan and Hybrid_fit (if a bakRFit object)

Value

bakRFit object with results from statistical modeling and data processing. Objects possibly included are:

Fast_Fit; Always will be present. Output of fast_analysis
Hybrid_Fit; Only present if HybridFit = TRUE. Output of TL_stan
Stan_Fit; Only present if StanFit = TRUE. Output of TL_stan
Data_lists; Always will be present. Output of cBprocess with Fast and Stan == TRUE

Details

If bakRFit is run on a bakRData object, cBprocess and then fast_analysis will always be called. The former will generate the processed data that can be passed to the model fitting functions (fast_analysis and TL_Stan). The call to fast_analysis will generate a list of dataframes containing information regarding the fast_analysis fit. fast_analysis is always called because its output is required for both Hybrid_fit and TL_Stan.

If bakRFit is run on a bakRFit object, cBprocess will not be called again, as the output of cBprocess will already be contained in the bakRFit object. Similarly, fast_analysis will not be called again unless bakRFit is rerun on the bakRData object. or if FastRerun is set to TRUE. If you want to generate model fits using different parameters for cBprocess, you will have to rerun bakRFit on the bakRData object.

If bakRFit is run on a bakRFnData object, fn_process and then avg_and_regularize will always be called. The former will generate the processed data that can be passed to the model fitting functions (avg_and_regularize and TL_Stan, the latter only with HybridFit = TRUE).

If bakRFit is run on a bakRFnFit object. fn_process will not be called again, as the output of fn_process will already be contained in the bakRFnFit object. Similary, avg_and_regularize will not be called unless bakRFit is rerun on the bakRData object, or if FastRerun is set to TRUE. If you want to generate model fits using different parameters for fn_process, you will have to rerun bakRFit on the bakRData object.

See the documentation for the individual fitting functions for details regarding how they analyze nucleotide recoding data. What follows is a brief overview of how each works

fast_analysis (referred to as the MLE implementation in the bakR paper) either estimates mutation rates from + and (if available) - s4U samples or uses mutation rate estimates provided by the user to perform maximum likelihood estimation (MLE) of the fraction of RNA that is labeled for each replicate of nucleotide recoding data provided. Uncertainties for each replicate's estimate are approximated using asymptotic results involving the Fisher Information and assuming known mutation rates. Replicate data is pooled using an approximation to hierarchical modeling that relies on analytic solutions to simple Bayesian models. Linear regression is used to estimate the relationship between read depths and replicate variability for uncertainty estimation regularization, again performed using analytic solutions to Bayesian models.

TL_Stan with Hybrid_Fit set to TRUE (referred to as the Hybrid implementation in the bakR paper) takes as input estimates of the logit(fraction new) and uncertainty provided by fast_analysis. It then uses 'Stan' on the backend to implement a hierarchical model that pools data across replicates and the dataset to estimate effect sizes (L2FC(kdeg)) and uncertainties. Replicate variability information is pooled across each experimental condition to regularize variance estimates using a hierarchical linear regression model.

The default behavior of TL_Stan (referred to as the MCMC implementation in the bakR paper) is to use 'Stan' on the back end to implement a U-content exposure adjusted Poisson mixture model to estimate fraction news from the mutational data. Partial pooling of replicate variability estimates is performed as with the Hybrid implementation.

Examples

# \donttest{
# Simulate data for 1000 genes, 2 replicates, 2 conditions
simdata <- Simulate_bakRData(1000, nreps = 2)

# You always must fit fast implementation before any others
Fit <- bakRFit(simdata$bakRData)
#> Finding reliable Features
#> Filtering out unwanted or unreliable features
#> Processing data...
#> Estimating pnew with likelihood maximization
#> Estimating unlabeled mutation rate with -s4U data
#> Estimated pnews and polds for each sample are:
#> # A tibble: 4 × 4
#> # Groups:   mut [2]
#>     mut  reps   pnew     pold
#>   <int> <dbl>  <dbl>    <dbl>
#> 1     1     1 0.0501 0.000999
#> 2     1     2 0.0498 0.000999
#> 3     2     1 0.0500 0.000999
#> 4     2     2 0.0501 0.000999
#> Estimating fraction labeled
#> Estimating per replicate uncertainties
#> Estimating read count-variance relationship
#> Averaging replicate data and regularizing estimates
#> Assessing statistical significance
#> All done! Run QC_checks() on your bakRFit object to assess the 
#>             quality of your data and get recommendations for next steps.

# }