Estimating kinetic parameters from nucleotide recoding RNA-seq data
Source:R/Fitting_Models.R
bakRFit.Rd
bakRFit
analyzes nucleotide recoding RNA-seq data to estimate
kinetic parameters relating to RNA stability and changes in RNA
stability induced by experimental perturbations. Several statistical
models of varying efficiency and accuracy can be used to fit data.
Usage
bakRFit(
obj,
StanFit = FALSE,
HybridFit = FALSE,
high_p = 0.2,
totcut = 50,
totcut_all = 10,
Ucut = 0.25,
AvgU = 4,
FastRerun = FALSE,
FOI = c(),
concat = TRUE,
StanRateEst = FALSE,
RateEst_size = 30,
low_reads = 100,
high_reads = 5e+05,
chains = 1,
NSS = FALSE,
Chase = FALSE,
BDA_model = FALSE,
multi_pold = FALSE,
Long = FALSE,
kmeans = FALSE,
ztest = FALSE,
Fisher = TRUE,
...
)
Arguments
- obj
bakRData
object produced bybakRData
,bakRFit
object produced bybakRFit
bakRFnData
object produced bybakRFnData
, orbakRFnFit
object produced bybakRFit
.- StanFit
Logical; if TRUE, then the MCMC implementation is run. Will only be used if
obj
is abakRFit
object- HybridFit
Logical; if TRUE, then the Hybrid implementation is run. Will only be used if
obj
is abakRFit
object- high_p
Numeric; Any features with a mutation rate (number of mutations / number of Ts in reads) higher than this in any -s4U control samples (i.e., samples that were not treated with s4U) are filtered out
- totcut
Numeric; Any features with less than this number of sequencing reads in any replicate of all experimental conditions are filtered out
- totcut_all
Numeric; Any features with less than this number of sequencing reads in any sample are filtered out
- Ucut
Numeric; All features must have a fraction of reads with 2 or less Us less than this cutoff in all samples
- AvgU
Numeric; All features must have an average number of Us greater than this cutoff in all samples
- FastRerun
Logical; only matters if a bakRFit object is passed to
bakRFit
. If TRUE, then the Stan-free model implemented infast_analysis
is rerun on data, foregoing fitting of either of the 'Stan' models.- FOI
Features of interest; character vector containing names of features to analyze
- concat
Logical; If TRUE, FOI is concatenated with output of reliableFeatures
- StanRateEst
Logical; if TRUE, a simple 'Stan' model is used to estimate mutation rates for fast_analysis; this may add a couple minutes to the runtime of the analysis.
- RateEst_size
Numeric; if StanRateEst is TRUE, then data from RateEst_size genes are used for mutation rate estimation. This can be as low as 1 and should be kept low to ensure maximum efficiency
- low_reads
Numeric; if StanRateEst is TRUE, then only features with more than low_reads reads in all samples will be used for mutation rate estimation
- high_reads
Numeric; if StanRateEst is TRUE, then only features with less than high_read reads in all samples will be used for mutation rate estimation. A high read count cutoff is as important as a low read count cutoff in this case because you don't want the fraction labeled of chosen features to be extreme (e.g., close to 0 or 1), and high read count features are likely low fraction new features.
- chains
Number of Markov chains to sample from. 1 should suffice since these are validated models. Running more chains is generally preferable, but memory constraints can make this unfeasible.
- NSS
Logical; if TRUE, logit(fn)s are directly compared to avoid assuming steady-state
- Chase
Logical; Set to TRUE if analyzing a pulse-chase experiment. If TRUE, kdeg = -ln(fn)/tl where fn is the fraction of reads that are s4U (more properly referred to as the fraction old in the context of a pulse-chase experiment).
- BDA_model
Logical; if TRUE, variance is regularized with scaled inverse chi-squared model. Otherwise a log-normal model is used.
- multi_pold
Logical; if TRUE, pold is estimated for each sample rather than use a global pold estimate.
- Long
Logical; if TRUE, long read optimized fraction new estimation strategy is used.
- kmeans
Logical; if TRUE, kmeans clustering on read-specific mutation rates is used to estimate pnews and pold.
- ztest
Logical; if TRUE and the MLE implementation is being used, then a z-test will be used for p-value calculation rather than the more conservative moderated t-test.
- Fisher
Logical; if TRUE, Fisher information is used to estimate logit(fn) uncertainty. Else, a less conservative binomial model is used, which can be preferable in instances where the Fisher information strategy often drastically overestimates uncertainty (i.e., low coverage or low pnew).
- ...
Arguments passed to either
fast_analysis
(if a bakRData object) orTL_Stan
andHybrid_fit
(if a bakRFit object)
Value
bakRFit object with results from statistical modeling and data processing. Objects possibly included are:
Fast_Fit; Always will be present. Output of
fast_analysis
Hybrid_Fit; Only present if HybridFit = TRUE. Output of
TL_stan
Stan_Fit; Only present if StanFit = TRUE. Output of
TL_stan
Data_lists; Always will be present. Output of
cBprocess
with Fast and Stan == TRUE
Details
If bakRFit
is run on a bakRData object, cBprocess
and then fast_analysis
will always be called. The former will generate the processed
data that can be passed to the model fitting functions (fast_analysis
and TL_Stan
). The call to fast_analysis
will generate a list of dataframes
containing information regarding the fast_analysis
fit. fast_analysis
is always
called because its output is required for both Hybrid_fit
and TL_Stan
.
If bakRFit
is run on a bakRFit object, cBprocess
will not be called again,
as the output of cBprocess
will already be contained in the bakRFit object. Similarly,
fast_analysis
will not be called again unless bakRFit is rerun on the bakRData object.
or if FastRerun
is set to TRUE. If you want to generate model fits using different parameters for cBprocess,
you will have to rerun bakRFit
on the bakRData object.
If bakRFit
is run on a bakRFnData object, fn_process
and then avg_and_regularize
will always be called. The former will generate the processed data that can be passed to the model
fitting functions (avg_and_regularize
and TL_Stan
, the latter only with HybridFit = TRUE).
If bakRFit
is run on a bakRFnFit object. fn_process
will not be called again, as
the output of fn_process
will already be contained in the bakRFnFit object. Similary,
avg_and_regularize
will not be called unless bakRFit
is rerun on the bakRData object,
or if FastRerun
is set to TRUE. If you want to generate model fits using different
parameters for fn_process
, you will have to rerun bakRFit
on the bakRData object.
See the documentation for the individual fitting functions for details regarding how they analyze nucleotide recoding data. What follows is a brief overview of how each works
fast_analysis
(referred to as the MLE implementation in the bakR paper)
either estimates mutation rates from + and (if available) - s4U samples or uses mutation rate estimates
provided by the user to perform maximum likelihood estimation (MLE) of the fraction of RNA that is labeled for each
replicate of nucleotide recoding data provided. Uncertainties for each replicate's estimate are approximated using
asymptotic results involving the Fisher Information and assuming known mutation rates. Replicate data
is pooled using an approximation to hierarchical modeling that relies on analytic solutions to simple Bayesian models.
Linear regression is used to estimate the relationship between read depths and replicate variability for uncertainty
estimation regularization, again performed using analytic solutions to Bayesian models.
TL_Stan
with Hybrid_Fit set to TRUE (referred to as the Hybrid implementation in the bakR paper)
takes as input estimates of the logit(fraction new) and uncertainty provided by fast_analysis
.
It then uses 'Stan' on the backend to implement a hierarchical model that pools data across replicates and the dataset
to estimate effect sizes (L2FC(kdeg)) and uncertainties. Replicate variability information is pooled across each experimental
condition to regularize variance estimates using a hierarchical linear regression model.
The default behavior of TL_Stan
(referred to as the MCMC implementation in the bakR paper)
is to use 'Stan' on the back end to implement a U-content exposure adjusted Poisson mixture model
to estimate fraction news from the mutational data. Partial pooling of replicate variability estimates
is performed as with the Hybrid implementation.
Examples
# \donttest{
# Simulate data for 1000 genes, 2 replicates, 2 conditions
simdata <- Simulate_bakRData(1000, nreps = 2)
# You always must fit fast implementation before any others
Fit <- bakRFit(simdata$bakRData)
#> Finding reliable Features
#> Filtering out unwanted or unreliable features
#> Processing data...
#> Estimating pnew with likelihood maximization
#> Estimating unlabeled mutation rate with -s4U data
#> Estimated pnews and polds for each sample are:
#> # A tibble: 4 × 4
#> # Groups: mut [2]
#> mut reps pnew pold
#> <int> <dbl> <dbl> <dbl>
#> 1 1 1 0.0501 0.000999
#> 2 1 2 0.0498 0.000999
#> 3 2 1 0.0500 0.000999
#> 4 2 2 0.0501 0.000999
#> Estimating fraction labeled
#> Estimating per replicate uncertainties
#> Estimating read count-variance relationship
#> Averaging replicate data and regularizing estimates
#> Assessing statistical significance
#> All done! Run QC_checks() on your bakRFit object to assess the
#> quality of your data and get recommendations for next steps.
# }