fast_analysis
fast_analysis analyzes nucleotide recoding data with maximum likelihood estimation (MLE), implemented via stats::optim, combined with analytic solutions to simple Bayesian models to perform approximate partial pooling. Output includes kinetic parameter estimates in each replicate, kinetic parameter estimates averaged across replicates, and log2 fold changes in the degradation rate constant (L2FC(kdeg)). Averaging takes into account uncertainties estimated using the Fisher Information, and estimates are regularized using analytic solutions of fully Bayesian models. The result is that kdegs are shrunk towards population means and uncertainties are shrunk towards a mean-variance trend estimated as part of the analysis.
Usage
fast_analysis(
df,
pnew = NULL,
pold = NULL,
no_ctl = FALSE,
read_cut = 50,
features_cut = 50,
nbin = NULL,
prior_weight = 2,
MLE = TRUE,
ztest = FALSE,
lower = -7,
upper = 7,
se_max = 2.5,
mut_reg = 0.1,
p_mean = 0,
p_sd = 1,
StanRate = FALSE,
Stan_data = NULL,
null_cutoff = 0,
NSS = FALSE,
Chase = FALSE,
BDA_model = FALSE,
multi_pold = FALSE,
Long = FALSE,
kmeans = FALSE,
Fisher = TRUE
)
Arguments
- df
Dataframe in form provided by cB_to_Fast
- pnew
Labeled read mutation rate; default of NULL means that the model estimates the rate from s4U fed data. If pnew is provided by the user, it must be a vector with length equal to the number of s4U fed samples. The 1st element corresponds to the s4U induced mutation rate estimate for the 1st replicate of the 1st experimental condition, the 2nd element corresponds to the s4U induced mutation rate estimate for the 2nd replicate of the 1st experimental condition, etc. (see the sketch after this argument list)
- pold
Unlabeled read mutation rate; default of NULL means that the model estimates the rate from no-s4U fed data
- no_ctl
Logical; if TRUE, then -s4U control is not used for background mutation rate estimation
- read_cut
Minimum number of reads for a given feature-sample combination to be used for mutation rate estimation
- features_cut
Number of features used to estimate each sample-specific mutation rate
- nbin
Number of bins for mean-variance relationship estimation. If NULL, max of 10 or (number of logit(fn) estimates)/100 is used
- prior_weight
Determines extent to which logit(fn) variance is regularized to the mean-variance regression line
- MLE
Logical; if TRUE then replicate logit(fn) is estimated using maximum likelihood; if FALSE more conservative Bayesian hypothesis testing is used
- ztest
Logical; if TRUE, then a z-test is used for p-value calculation rather than the more conservative moderated t-test.
- lower
Lower bound for MLE with L-BFGS-B algorithm
- upper
Upper bound for MLE with L-BFGS-B algorithm
- se_max
Uncertainty assigned to transcripts with estimates at the upper or lower bounds. This prevents downstream errors due to abnormally high standard errors for transcripts with extreme kinetics
- mut_reg
If the MLE is unstable, the empirical mutation rate is used to estimate the fraction new, with pnew multiplied by 1 + mut_reg and pold multiplied by 1 - mut_reg to regularize the fraction new estimate
- p_mean
Mean of normal distribution used as prior penalty in MLE of logit(fn)
- p_sd
Standard deviation of normal distribution used as prior penalty in MLE of logit(fn)
- StanRate
Logical; if TRUE, a simple 'Stan' model is used to estimate mutation rates for fast_analysis; this may add a couple minutes to the runtime of the analysis.
- Stan_data
List; if StanRate is TRUE, then this is the data passed to the 'Stan' model to estimate mutation rates. If using the bakRFit wrapper of fast_analysis, then this is created automatically.
- null_cutoff
bakR will test the null hypothesis of |effect size| < |null_cutoff|
- NSS
Logical; if TRUE, logit(fn)s are compared rather than log(kdeg) so as to avoid steady-state assumption.
- Chase
Logical; Set to TRUE if analyzing a pulse-chase experiment. If TRUE, kdeg = -ln(fn)/tl where fn is the fraction of reads that are s4U (more properly referred to as the fraction old in the context of a pulse-chase experiment)
- BDA_model
Logical; if TRUE, variance is regularized with scaled inverse chi-squared model. Otherwise a log-normal model is used.
- multi_pold
Logical; if TRUE, pold is estimated for each sample rather than using a global pold estimate.
- Long
Logical; if TRUE, long read optimized fraction new estimation strategy is used.
- kmeans
Logical; if TRUE, kmeans clustering on read-specific mutation rates is used to estimate pnews and pold.
- Fisher
Logical; if TRUE, Fisher information is used to estimate logit(fn) uncertainty. Otherwise, a less conservative binomial model is used, which can be preferable in instances where the Fisher information strategy drastically overestimates uncertainty (e.g., low coverage or low pnew).
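As an illustration of the pnew ordering described above (replicates vary fastest within each experimental condition), a hypothetical pnew vector for 2 experimental conditions with 2 +s4U replicates each might look like the following (rate values made up for illustration):

# Hypothetical pnew vector: 2 conditions x 2 +s4U replicates = 4 elements,
# ordered condition 1 rep 1, condition 1 rep 2, condition 2 rep 1, condition 2 rep 2
pnew_vect <- c(0.05, 0.048,   # condition 1, replicates 1 and 2
               0.052, 0.051)  # condition 2, replicates 1 and 2
# Fit <- fast_analysis(df, pnew = pnew_vect)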
Value
List with dataframes providing information about replicate-specific and pooled analysis results. The output includes:
Fn_Estimates; dataframe with estimates for the fraction new and fraction new uncertainty for each feature in each replicate. The columns of this dataframe are:
Feature_ID; Numerical ID of feature
Exp_ID; Numerical ID for experimental condition (Exp_ID from metadf)
Replicate; Numerical ID for replicate
logit_fn; logit(fraction new) estimate, unregularized
logit_fn_se; logit(fraction new) uncertainty, unregularized and obtained from Fisher Information
nreads; Number of reads mapping to the feature in the sample for which the estimates were obtained
log_kdeg; log of degradation rate constant (kdeg) estimate, unregularized
kdeg; degradation rate constant (kdeg) estimate
log_kd_se; log(kdeg) uncertainty, unregularized and obtained from Fisher Information
sample; Sample name
XF; Original feature name
Regularized_ests; dataframe with average fraction new and kdeg estimates, averaged across the replicates and regularized using priors informed by the entire dataset. The columns of this dataframe are:
Feature_ID; Numerical ID of feature
Exp_ID; Numerical ID for experimental condition (Exp_ID from metadf)
avg_log_kdeg; Weighted average of log(kdeg) from each replicate, weighted by sample and feature-specific read depth
sd_log_kdeg; Standard deviation of the log(kdeg) estimates
nreads; Total number of reads mapping to the feature in that condition
sdp; Prior standard deviation for fraction new estimate regularization
theta_o; Prior mean for fraction new estimate regularization
sd_post; Posterior uncertainty
log_kdeg_post; Posterior mean for log(kdeg) estimate
kdeg; exp(log_kdeg_post)
kdeg_sd; kdeg uncertainty
XF; Original feature name
Effects_df; dataframe with estimates of the effect size (change in logit(fn)) comparing each experimental condition to the reference sample for each feature. This dataframe also includes p-values obtained from a moderated t-test. The columns of this dataframe are:
Feature_ID; Numerical ID of feature
Exp_ID; Numerical ID for experimental condition (Exp_ID from metadf)
L2FC(kdeg); Log2 fold change (L2FC) of the kdeg estimate, or the change in logit(fn) if NSS is TRUE
effect; Effect size (difference in log(kdeg) between the experimental and reference conditions)
se; Uncertainty in the effect size
pval; P-value obtained from effect, se, and a moderated t-test (or a z-test if ztest is TRUE)
padj; pval adjusted for multiple testing using Benjamini-Hochberg procedure
XF; Original feature name
Mut_rates; list of two elements. The 1st element is a dataframe of s4U induced mutation rate estimates, where the mut column gives the experimental ID and the reps column gives the replicate ID. The 2nd element is the single background mutation rate estimate used
Hyper_Parameters; vector of two elements, named a and b. These are the hyperparameters estimated from the uncertainties for each feature, and represent the two parameters of a Scaled Inverse Chi-Square distribution. Importantly, a is the number of additional degrees of freedom provided by the sharing of uncertainty information across the dataset, to be used in the moderated t-test.
Mean_Variance_lms; linear model objects obtained from the uncertainty vs. read count regression model. One model is run for each Exp_ID
Details
Unless the user supplies estimates for pnew and pold, the first step of fast_analysis is to estimate the background (pold) and metabolic label induced (pnew) mutation rates (the label will be referred to as s4U for simplicity, though bakR is compatible with other metabolic labels such as s6G). Several pnew and pold estimation strategies are implemented in bakR. For pnew estimation, the two strategies are likelihood maximization of a binomial mixture model (the default) and sampling from the full posterior of a U-content adjusted Poisson mixture model with HMC (when StanRateEst is set to TRUE in bakRFit).
The default pnew estimation strategy involves combining the mutational data for all features into sample-wide mutational data vectors (vectors of the # of T-to-C conversions, the # of Ts, and the # of reads with each such conversion and T-count combination). These vectors are then downsampled (to prevent float overflow) and used to maximize the likelihood of a two-component binomial mixture model. The two components correspond to reads from old and new RNA, and the three estimated parameters are the fraction of all reads that are new (a nuisance parameter in this case) and the old and new read mutation rates.
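A minimal sketch of this kind of two-component binomial mixture fit (a simplified, per-read stand-in for bakR's internal implementation; all names here are illustrative):

# muts = T-to-C conversions per read; Ts = number of Ts per read
mixture_nll <- function(par, muts, Ts) {
  fn <- plogis(par[1])    # fraction of all reads that are new (nuisance)
  pnew <- plogis(par[2])  # new read mutation rate
  pold <- plogis(par[3])  # old read mutation rate
  # negative log-likelihood of the two-component binomial mixture
  -sum(log(fn * dbinom(muts, Ts, pnew) +
           (1 - fn) * dbinom(muts, Ts, pold)))
}
# est <- optim(c(0, -3, -7), mixture_nll, muts = muts, Ts = Ts)
# plogis(est$par[2:3])    # pnew and pold estimates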
The alternative strategy involves running a fully Bayesian implementation of a similar mixture model using Stan, a
probabilistic programming language that bakR makes use of in several functions. This strategy can yield more accurate
mutation rate estimates when the label times are much shorter or longer than the average half-lives of the sequenced RNA
(i.e., the fraction news are mostly close to 0 or 1, respectively). To improve the efficiency of this approach, only a small
subset of RNA features are analyzed, the number of which is set by the RateEst_size
parameter in bakRFit
. By default,
this number is set to 30, which on the average NR-seq dataset yields a several minute runtime for mutation rate estimation.
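For example, a hedged sketch of requesting this strategy through the bakRFit wrapper (assuming bdo is a bakRData object; the name is illustrative):

# Stan-based mutation rate estimation on a 50-feature subset
# (RateEst_size defaults to 30)
# Fit <- bakRFit(bdo, StanRateEst = TRUE, RateEst_size = 50)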
Estimation of pold can be performed with three strategies: the same two strategies discussed for pnew estimation, and a third
strategy that relies on the presence of -s4U control data. If -s4U control data is present, the default pold estimation
strategy is to use the average mutation rate in reads from all -s4U control datasets as the global pold estimate. Thus,
a single pold estimate is used for all samples. The likelihood maximization strategy can be used by
setting no_ctl
to TRUE, and this strategy becomes the default strategy if no -s4U data is present.
In addition, as of version 1.0.0 of bakR (released late June of 2023), users can decide to estimate
pold for each +s4U sample independently by setting multi_pold
to TRUE. In this case, independent -s4U datasets can no longer be used for
mutation rate estimation purposes, and thus the strategies for pold estimation are identical to the
set of pnew estimation strategies.
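Sketches of the calls selecting each pold strategy (df as produced by cB_to_Fast):

# Default: single global pold estimated from -s4U control data (if present)
# Fit <- fast_analysis(df)
# Likelihood maximization instead of -s4U controls
# Fit <- fast_analysis(df, no_ctl = TRUE)
# Sample-specific pold estimates (version 1.0.0 and later)
# Fit <- fast_analysis(df, multi_pold = TRUE)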
Once mutation rates are estimated, fraction news for each feature in each sample are estimated. The approach utilized is MLE
using the L-BFGS-B algorithm implemented in stats::optim
. The assumed likelihood function is derived from a Poisson mixture
model with rates adjusted according to each feature's empirical U-content (the average number of Us present in sequencing reads mapping
to that feature in a particular sample). Fraction new estimates are then converted to degradation rate constant estimates using
a solution to a simple ordinary differential equation model of RNA metabolism.
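The fraction new to kdeg conversion implied by this ODE solution (along with the pulse-chase variant noted in the Chase argument description) can be written as:

# Standard steady-state solution: fn = 1 - exp(-kdeg * tl), so
kdeg_from_fn <- function(fn, tl) -log(1 - fn) / tl
# Pulse-chase (Chase = TRUE): kdeg = -log(fn) / tl, fn being the fraction old
kdeg_from_fn_chase <- function(fn, tl) -log(fn) / tl
kdeg_from_fn(fn = 0.5, tl = 2)  # half the RNA turned over during a 2 hr label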
Once fraction new and kdegs are estimated, the uncertainty in these parameters is estimated using the Fisher Information. In the limit of
large datasets, the variance of the MLE is inversely proportional to the Fisher Information evaluated at the MLE. Mixture models are
typically singular, meaning that the Fisher information matrix is not positive definite and asymptotic results for the variance
do not necessarily hold. Because the mutation rates are estimated a priori and fixed to be > 0, these problems are eliminated. In addition, when assessing the uncertainty of replicate fraction new estimates, the size of the dataset is the raw number of sequencing reads that map to a particular feature. This number is often large (> 100), which increases the validity of invoking asymptotics.
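A generic sketch of this calculation, using optim's numerically differentiated Hessian and a toy single-feature likelihood with pnew and pold fixed (a binomial stand-in for bakR's U-content adjusted Poisson likelihood; all values illustrative):

# Toy negative log-likelihood for logit(fn), mutation rates fixed
nll <- function(lfn, muts, Ts, pnew = 0.05, pold = 0.001) {
  fn <- plogis(lfn)
  -sum(log(fn * dbinom(muts, Ts, pnew) + (1 - fn) * dbinom(muts, Ts, pold)))
}
muts <- c(0, 2, 0, 1, 3, 0)  # toy T-to-C counts per read
Ts <- rep(40, 6)             # toy number of Ts per read
fit <- optim(0, nll, muts = muts, Ts = Ts, method = "L-BFGS-B",
             lower = -7, upper = 7, hessian = TRUE)
sqrt(1 / fit$hessian[1, 1])  # asymptotic SE of logit(fn) at the MLE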
As of version 1.0.0, users can opt for an alternative uncertainty estimation strategy by setting Fisher to FALSE. This strategy makes use of the standard error of the estimator for a binomial random variable's success probability. If we can uniquely identify new and old reads, then the variance of our estimate for the fraction of reads that is new is fn*(1-fn)/n. This uncertainty estimate will typically underestimate fraction new replicate uncertainties. We showed in the bakR paper, though, that the Fisher information strategy often significantly overestimates the uncertainties of low coverage or extreme fraction new features. Therefore, this more bullish, anti-conservative uncertainty quantification can be useful on datasets with low mutation rates, extreme label times, or low sequencing depth. We have found that false discovery rates are still well controlled when using this alternative uncertainty quantification strategy.
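A sketch of this binomial uncertainty calculation, with a delta-method step to move to the logit scale (toy numbers; whether bakR applies exactly this transformation internally is not specified here):

fn <- 0.3; n <- 200                     # toy fraction new and read count
se_fn <- sqrt(fn * (1 - fn) / n)        # SE of the binomial success rate estimator
se_logit_fn <- se_fn / (fn * (1 - fn))  # delta-method SE on the logit scale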
With kdegs and their uncertainties estimated, replicate estimates are pooled and regularized. There are two key steps in this
downstream analysis. 1st, the uncertainty for each feature is used to fit a linear ln(uncertainty) vs. log10(read depth) trend,
and uncertainties for individual features are shrunk towards the regression line. The uncertainty for each feature is a combination of the
Fisher Information asymptotic uncertainty as well as the amount of variability seen between estimates. Regularization of uncertainty
estimates is performed using the analytic results of a Normal distribution likelihood with known mean and unknown variance and conjugate
priors. The prior parameters are estimated from the regression and amount of variability about the regression line. The strength of
regularization can be tuned by adjusting the prior_weight
parameter, with larger numbers yielding stronger shrinkage towards
the regression line. The 2nd step is to regularize the average kdeg estimates. This is done using the analytic results of a
Normal distribution likelihood model with unknown mean and known variance and conjugate priors. The prior parameters are estimated from the
population wide kdeg distribution (using its mean and standard deviation as the mean and standard deviation of the normal prior).
In the 1st step, the known mean is assumed to be the average kdeg, averaged across replicates and weighted by the number of reads
mapping to the feature in each replicate. In the 2nd step, the known variance is assumed to be that obtained following regularization
of the uncertainty estimates.
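A toy illustration of the 2nd step's conjugate update (the textbook Normal-Normal posterior with known variance, not bakR's exact internal code; all values made up):

prior_mean <- -1.0; prior_sd <- 1.5  # population-wide log(kdeg) mean and sd
obs_mean <- 0.2; obs_sd <- 0.5       # replicate-averaged log(kdeg) and regularized SE
post_prec <- 1 / prior_sd^2 + 1 / obs_sd^2
post_mean <- (prior_mean / prior_sd^2 + obs_mean / obs_sd^2) / post_prec
post_sd <- sqrt(1 / post_prec)
post_mean  # shrunk towards the population mean, weighted by precision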
Effect sizes (changes in kdeg) are obtained as the difference in log(kdeg) means between the reference and experimental sample(s); the log(kdeg)s are assumed to be independent, so the variance of the effect size is the sum of the log(kdeg) variances. P-values assessing the significance of the effect size are obtained using a moderated t-test, with the number of degrees of freedom determined from the uncertainty regression hyperparameters, and are adjusted for multiple testing using the Benjamini-Hochberg procedure to control false discovery rates (FDRs).
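The arithmetic of this comparison can be sketched as follows (toy numbers; the degrees of freedom here are assumed rather than derived from the hyperparameters):

log_kdeg_ref <- -1.2; se_ref <- 0.30   # reference condition
log_kdeg_exp <- -0.4; se_exp <- 0.25   # experimental condition
effect <- log_kdeg_exp - log_kdeg_ref  # difference in log(kdeg)
se <- sqrt(se_ref^2 + se_exp^2)        # independence => variances add
L2FC_kdeg <- effect / log(2)           # natural log difference -> log2 fold change
df_t <- 10                             # assumed degrees of freedom
pval <- 2 * pt(-abs(effect / se), df = df_t)
# padj <- p.adjust(pvals, method = "BH")  # across all features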
In some cases, the assumed ODE model of RNA metabolism will not accurately model the dynamics of a biological system being analyzed.
In these cases, it is best to compare logit(fraction new)s directly rather than converting fraction new to log(kdeg).
This analysis strategy is implemented when NSS is set to TRUE. Comparing logit(fraction new)s is only valid if a single metabolic label time has been used for all samples. For example, if a label time of 1 hour was used for NR-seq data from WT cells and a 2 hour label time was used in KO cells, this comparison is no longer valid, as differences in logit(fraction new) could stem from differences in kinetics or label times.
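For example:

# Compare logit(fraction new)s directly, avoiding the steady-state assumption
# (only valid if all samples share a single label time)
# Fit <- fast_analysis(df, NSS = TRUE)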
Examples
# \donttest{
# Simulate small dataset
sim <- Simulate_bakRData(300, nreps = 2)
# Fit fast model to get fast_df
Fit <- bakRFit(sim$bakRData)
#> Finding reliable Features
#> Filtering out unwanted or unreliable features
#> Processing data...
#> Estimating pnew with likelihood maximization
#> Estimating unlabeled mutation rate with -s4U data
#> Estimated pnews and polds for each sample are:
#> # A tibble: 4 × 4
#> # Groups: mut [2]
#> mut reps pnew pold
#> <int> <dbl> <dbl> <dbl>
#> 1 1 1 0.0504 0.000989
#> 2 1 2 0.0500 0.000989
#> 3 2 1 0.0502 0.000989
#> 4 2 2 0.0503 0.000989
#> Estimating fraction labeled
#> Estimating per replicate uncertainties
#> Estimating read count-variance relationship
#> Averaging replicate data and regularizing estimates
#> Assessing statistical significance
#> All done! Run QC_checks() on your bakRFit object to assess the
#> quality of your data and get recommendations for next steps.
# Fit fast model with fast_analysis
Fast_Fit <- fast_analysis(Fit$Data_lists$Fast_df)
#> Estimating pnew with likelihood maximization
#> Estimating unlabeled mutation rate with -s4U data
#> Estimated pnews and polds for each sample are:
#> # A tibble: 4 × 4
#> # Groups: mut [2]
#> mut reps pnew pold
#> <int> <dbl> <dbl> <dbl>
#> 1 1 1 0.0504 0.000989
#> 2 1 2 0.0500 0.000989
#> 3 2 1 0.0502 0.000989
#> 4 2 2 0.0503 0.000989
#> Estimating fraction labeled
#> Estimating per replicate uncertainties
#> Estimating read count-variance relationship
#> Averaging replicate data and regularizing estimates
#> Assessing statistical significance
#> All done! Run QC_checks() on your bakRFit object to assess the
#> quality of your data and get recommendations for next steps.
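# Inspect the per-replicate and pooled results described under "Value"
head(Fast_Fit$Fn_Estimates)
head(Fast_Fit$Effects_df)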
# }