# Estimating kinetic parameters from nucleotide recoding RNA-seq data

Source:`R/Fitting_Models.R`

`bakRFit.Rd`

`bakRFit`

analyzes nucleotide recoding RNA-seq data to estimate
kinetic parameters relating to RNA stability and changes in RNA
stability induced by experimental perturbations. Several statistical
models of varying efficiency and accuracy can be used to fit data.

## Usage

```
bakRFit(
obj,
StanFit = FALSE,
HybridFit = FALSE,
high_p = 0.2,
totcut = 50,
totcut_all = 10,
Ucut = 0.25,
AvgU = 4,
FastRerun = FALSE,
FOI = c(),
concat = TRUE,
StanRateEst = FALSE,
RateEst_size = 30,
low_reads = 100,
high_reads = 5e+05,
chains = 1,
NSS = FALSE,
Chase = FALSE,
BDA_model = FALSE,
multi_pold = FALSE,
Long = FALSE,
kmeans = FALSE,
ztest = FALSE,
Fisher = TRUE,
...
)
```

## Arguments

- obj
`bakRData`

object produced by`bakRData`

,`bakRFit`

object produced by`bakRFit`

`bakRFnData`

object produced by`bakRFnData`

, or`bakRFnFit`

object produced by`bakRFit`

.- StanFit
Logical; if TRUE, then the MCMC implementation is run. Will only be used if

`obj`

is a`bakRFit`

object- HybridFit
Logical; if TRUE, then the Hybrid implementation is run. Will only be used if

`obj`

is a`bakRFit`

object- high_p
Numeric; Any features with a mutation rate (number of mutations / number of Ts in reads) higher than this in any -s4U control samples (i.e., samples that were not treated with s4U) are filtered out

- totcut
Numeric; Any features with less than this number of sequencing reads in any replicate of all experimental conditions are filtered out

- totcut_all
Numeric; Any features with less than this number of sequencing reads in any sample are filtered out

- Ucut
Numeric; All features must have a fraction of reads with 2 or less Us less than this cutoff in all samples

- AvgU
Numeric; All features must have an average number of Us greater than this cutoff in all samples

- FastRerun
Logical; only matters if a bakRFit object is passed to

`bakRFit`

. If TRUE, then the Stan-free model implemented in`fast_analysis`

is rerun on data, foregoing fitting of either of the 'Stan' models.- FOI
Features of interest; character vector containing names of features to analyze

- concat
Logical; If TRUE, FOI is concatenated with output of reliableFeatures

- StanRateEst
Logical; if TRUE, a simple 'Stan' model is used to estimate mutation rates for fast_analysis; this may add a couple minutes to the runtime of the analysis.

- RateEst_size
Numeric; if StanRateEst is TRUE, then data from RateEst_size genes are used for mutation rate estimation. This can be as low as 1 and should be kept low to ensure maximum efficiency

- low_reads
Numeric; if StanRateEst is TRUE, then only features with more than low_reads reads in all samples will be used for mutation rate estimation

- high_reads
Numeric; if StanRateEst is TRUE, then only features with less than high_read reads in all samples will be used for mutation rate estimation. A high read count cutoff is as important as a low read count cutoff in this case because you don't want the fraction labeled of chosen features to be extreme (e.g., close to 0 or 1), and high read count features are likely low fraction new features.

- chains
Number of Markov chains to sample from. 1 should suffice since these are validated models. Running more chains is generally preferable, but memory constraints can make this unfeasible.

- NSS
Logical; if TRUE, logit(fn)s are directly compared to avoid assuming steady-state

- Chase
Logical; Set to TRUE if analyzing a pulse-chase experiment. If TRUE, kdeg = -ln(fn)/tl where fn is the fraction of reads that are s4U (more properly referred to as the fraction old in the context of a pulse-chase experiment).

- BDA_model
Logical; if TRUE, variance is regularized with scaled inverse chi-squared model. Otherwise a log-normal model is used.

- multi_pold
Logical; if TRUE, pold is estimated for each sample rather than use a global pold estimate.

- Long
Logical; if TRUE, long read optimized fraction new estimation strategy is used.

- kmeans
Logical; if TRUE, kmeans clustering on read-specific mutation rates is used to estimate pnews and pold.

- ztest
Logical; if TRUE and the MLE implementation is being used, then a z-test will be used for p-value calculation rather than the more conservative moderated t-test.

- Fisher
Logical; if TRUE, Fisher information is used to estimate logit(fn) uncertainty. Else, a less conservative binomial model is used, which can be preferable in instances where the Fisher information strategy often drastically overestimates uncertainty (i.e., low coverage or low pnew).

- ...
Arguments passed to either

`fast_analysis`

(if a bakRData object) or`TL_Stan`

and`Hybrid_fit`

(if a bakRFit object)

## Value

bakRFit object with results from statistical modeling and data processing. Objects possibly included are:

Fast_Fit; Always will be present. Output of

`fast_analysis`

Hybrid_Fit; Only present if HybridFit = TRUE. Output of

`TL_stan`

Stan_Fit; Only present if StanFit = TRUE. Output of

`TL_stan`

Data_lists; Always will be present. Output of

`cBprocess`

with Fast and Stan == TRUE

## Details

If `bakRFit`

is run on a bakRData object, `cBprocess`

and then `fast_analysis`

will always be called. The former will generate the processed
data that can be passed to the model fitting functions (`fast_analysis`

and `TL_Stan`

). The call to `fast_analysis`

will generate a list of dataframes
containing information regarding the `fast_analysis`

fit. `fast_analysis`

is always
called because its output is required for both `Hybrid_fit`

and `TL_Stan`

.

If `bakRFit`

is run on a bakRFit object, `cBprocess`

will not be called again,
as the output of `cBprocess`

will already be contained in the bakRFit object. Similarly,
`fast_analysis`

will not be called again unless bakRFit is rerun on the bakRData object.
or if `FastRerun`

is set to TRUE. If you want to generate model fits using different parameters for cBprocess,
you will have to rerun `bakRFit`

on the bakRData object.

If `bakRFit`

is run on a bakRFnData object, `fn_process`

and then `avg_and_regularize`

will always be called. The former will generate the processed data that can be passed to the model
fitting functions (`avg_and_regularize`

and `TL_Stan`

, the latter only with HybridFit = TRUE).

If `bakRFit`

is run on a bakRFnFit object. `fn_process`

will not be called again, as
the output of `fn_process`

will already be contained in the bakRFnFit object. Similary,
`avg_and_regularize`

will not be called unless `bakRFit`

is rerun on the bakRData object,
or if `FastRerun`

is set to TRUE. If you want to generate model fits using different
parameters for `fn_process`

, you will have to rerun `bakRFit`

on the bakRData object.

See the documentation for the individual fitting functions for details regarding how they analyze nucleotide recoding data. What follows is a brief overview of how each works

`fast_analysis`

(referred to as the MLE implementation in the bakR paper)
either estimates mutation rates from + and (if available) - s4U samples or uses mutation rate estimates
provided by the user to perform maximum likelihood estimation (MLE) of the fraction of RNA that is labeled for each
replicate of nucleotide recoding data provided. Uncertainties for each replicate's estimate are approximated using
asymptotic results involving the Fisher Information and assuming known mutation rates. Replicate data
is pooled using an approximation to hierarchical modeling that relies on analytic solutions to simple Bayesian models.
Linear regression is used to estimate the relationship between read depths and replicate variability for uncertainty
estimation regularization, again performed using analytic solutions to Bayesian models.

`TL_Stan`

with Hybrid_Fit set to TRUE (referred to as the Hybrid implementation in the bakR paper)
takes as input estimates of the logit(fraction new) and uncertainty provided by `fast_analysis`

.
It then uses 'Stan' on the backend to implement a hierarchical model that pools data across replicates and the dataset
to estimate effect sizes (L2FC(kdeg)) and uncertainties. Replicate variability information is pooled across each experimental
condition to regularize variance estimates using a hierarchical linear regression model.

The default behavior of `TL_Stan`

(referred to as the MCMC implementation in the bakR paper)
is to use 'Stan' on the back end to implement a U-content exposure adjusted Poisson mixture model
to estimate fraction news from the mutational data. Partial pooling of replicate variability estimates
is performed as with the Hybrid implementation.

## Examples

```
# \donttest{
# Simulate data for 1000 genes, 2 replicates, 2 conditions
simdata <- Simulate_bakRData(1000, nreps = 2)
# You always must fit fast implementation before any others
Fit <- bakRFit(simdata$bakRData)
#> Finding reliable Features
#> Filtering out unwanted or unreliable features
#> Processing data...
#> Estimating pnew with likelihood maximization
#> Estimating unlabeled mutation rate with -s4U data
#> Estimated pnews and polds for each sample are:
#> # A tibble: 4 × 4
#> # Groups: mut [2]
#> mut reps pnew pold
#> <int> <dbl> <dbl> <dbl>
#> 1 1 1 0.0499 0.00100
#> 2 1 2 0.0500 0.00100
#> 3 2 1 0.0499 0.00100
#> 4 2 2 0.0500 0.00100
#> Estimating fraction labeled
#> Estimating per replicate uncertainties
#> Estimating read count-variance relationship
#> Averaging replicate data and regularizing estimates
#> Assessing statistical significance
#> All done! Run QC_checks() on your bakRFit object to assess the
#> quality of your data and get recommendations for next steps.
# }
```