fast_analysis
fast_analysis analyzes nucleotide recoding data with maximum likelihood estimation (MLE), implemented via stats::optim, combined with analytic solutions to simple Bayesian models to perform approximate partial pooling. Output includes kinetic parameter estimates in each replicate, kinetic parameter estimates averaged across replicates, and log2 fold changes in the degradation rate constant (L2FC(kdeg)). Averaging takes into account uncertainties estimated using the Fisher Information, and estimates are regularized using analytic solutions of fully Bayesian models. The result is that kdegs are shrunk towards population means and uncertainties are shrunk towards a mean-variance trend estimated as part of the analysis.
Usage
fast_analysis(
df,
pnew = NULL,
pold = NULL,
no_ctl = FALSE,
read_cut = 50,
features_cut = 50,
nbin = NULL,
prior_weight = 2,
MLE = TRUE,
ztest = FALSE,
lower = -7,
upper = 7,
se_max = 2.5,
mut_reg = 0.1,
p_mean = 0,
p_sd = 1,
StanRate = FALSE,
Stan_data = NULL,
null_cutoff = 0,
NSS = FALSE,
Chase = FALSE,
BDA_model = FALSE,
multi_pold = FALSE,
Long = FALSE,
kmeans = FALSE,
Fisher = TRUE
)
Arguments
- df
Dataframe in form provided by cB_to_Fast
- pnew
Labeled read mutation rate; default of NULL means that the model estimates the rate from s4U fed data. If pnew is provided by the user, it must be a vector with length equal to the number of s4U fed samples. The 1st element corresponds to the s4U induced mutation rate estimate for the 1st replicate of the 1st experimental condition, the 2nd element corresponds to the s4U induced mutation rate estimate for the 2nd replicate of the 1st experimental condition, etc. (see the sketch after this argument list)
- pold
Unlabeled read mutation rate; default of NULL means that the model estimates the rate from no-s4U fed data
- no_ctl
Logical; if TRUE, then -s4U control is not used for background mutation rate estimation
- read_cut
Minimum number of reads for a given feature-sample combination to be used for mutation rate estimation
- features_cut
Number of features used to estimate each sample-specific mutation rate
- nbin
Number of bins for mean-variance relationship estimation. If NULL, max of 10 or (number of logit(fn) estimates)/100 is used
- prior_weight
Determines extent to which logit(fn) variance is regularized to the mean-variance regression line
- MLE
Logical; if TRUE then replicate logit(fn) is estimated using maximum likelihood; if FALSE more conservative Bayesian hypothesis testing is used
- ztest
Logical; if TRUE, then a z-test is used for p-value calculation rather than the more conservative moderated t-test.
- lower
Lower bound for MLE with L-BFGS-B algorithm
- upper
Upper bound for MLE with L-BFGS-B algorithm
- se_max
Uncertainty assigned to transcripts with estimates at the upper or lower bounds. This prevents downstream errors due to abnormally high standard errors for transcripts with extreme kinetics
- mut_reg
If the MLE is unstable, the empirical mutation rate is used to estimate the fraction new, with pnew multiplied by 1 + mut_reg and pold multiplied by 1 - mut_reg to regularize the fraction new estimate
- p_mean
Mean of normal distribution used as prior penalty in MLE of logit(fn)
- p_sd
Standard deviation of normal distribution used as prior penalty in MLE of logit(fn)
- StanRate
Logical; if TRUE, a simple 'Stan' model is used to estimate mutation rates for fast_analysis; this may add a couple minutes to the runtime of the analysis.
- Stan_data
List; if StanRate is TRUE, then this is the data passed to the 'Stan' model to estimate mutation rates. If using the bakRFit wrapper of fast_analysis, then this is created automatically.
- null_cutoff
bakR will test the null hypothesis of |effect size| < |null_cutoff|
- NSS
Logical; if TRUE, logit(fn)s are compared rather than log(kdeg) so as to avoid steady-state assumption.
- Chase
Logical; Set to TRUE if analyzing a pulse-chase experiment. If TRUE, kdeg = -ln(fn)/tl where fn is the fraction of reads that are s4U (more properly referred to as the fraction old in the context of a pulse-chase experiment)
- BDA_model
Logical; if TRUE, variance is regularized with scaled inverse chi-squared model. Otherwise a log-normal model is used.
- multi_pold
Logical; if TRUE, pold is estimated for each sample rather than using a global pold estimate.
- Long
Logical; if TRUE, long read optimized fraction new estimation strategy is used.
- kmeans
Logical; if TRUE, kmeans clustering on read-specific mutation rates is used to estimate pnews and pold.
- Fisher
Logical; if TRUE, Fisher information is used to estimate logit(fn) uncertainty. Otherwise, a less conservative binomial model is used, which can be preferable in instances where the Fisher information strategy drastically overestimates uncertainty (e.g., low coverage or low pnew).
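As an illustration of the pnew ordering described above (replicates vary fastest within each experimental condition), a hypothetical pnew vector for 2 experimental conditions with 2 +s4U replicates each might look like the following (rate values made up for illustration):

# Hypothetical pnew vector: 2 conditions x 2 +s4U replicates = 4 elements,
# ordered condition 1 rep 1, condition 1 rep 2, condition 2 rep 1, condition 2 rep 2
pnew_vect <- c(0.05, 0.048,   # condition 1, replicates 1 and 2
               0.052, 0.051)  # condition 2, replicates 1 and 2
# Fit <- fast_analysis(df, pnew = pnew_vect)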
Value
List with dataframes providing information about replicate-specific and pooled analysis results. The output includes:
Fn_Estimates; dataframe with estimates for the fraction new and fraction new uncertainty for each feature in each replicate. The columns of this dataframe are:
Feature_ID; Numerical ID of feature
Exp_ID; Numerical ID for experimental condition (Exp_ID from metadf)
Replicate; Numerical ID for replicate
logit_fn; logit(fraction new) estimate, unregularized
logit_fn_se; logit(fraction new) uncertainty, unregularized and obtained from Fisher Information
nreads; Number of reads mapping to the feature in the sample for which the estimates were obtained
log_kdeg; log of degradation rate constant (kdeg) estimate, unregularized
kdeg; degradation rate constant (kdeg) estimate
log_kd_se; log(kdeg) uncertainty, unregularized and obtained from Fisher Information
sample; Sample name
XF; Original feature name
Regularized_ests; dataframe with average fraction new and kdeg estimates, averaged across the replicates and regularized using priors informed by the entire dataset. The columns of this dataframe are:
Feature_ID; Numerical ID of feature
Exp_ID; Numerical ID for experimental condition (Exp_ID from metadf)
avg_log_kdeg; Weighted average of log(kdeg) from each replicate, weighted by sample and feature-specific read depth
sd_log_kdeg; Standard deviation of the log(kdeg) estimates
nreads; Total number of reads mapping to the feature in that condition
sdp; Prior standard deviation for fraction new estimate regularization
theta_o; Prior mean for fraction new estimate regularization
sd_post; Posterior uncertainty
log_kdeg_post; Posterior mean for log(kdeg) estimate
kdeg; exp(log_kdeg_post)
kdeg_sd; kdeg uncertainty
XF; Original feature name
Effects_df; dataframe with estimates of the effect size (change in logit(fn)) comparing each experimental condition to the reference sample for each feature. This dataframe also includes p-values obtained from a moderated t-test. The columns of this dataframe are:
Feature_ID; Numerical ID of feature
Exp_ID; Numerical ID for experimental condition (Exp_ID from metadf)
L2FC(kdeg); Log2 fold change (L2FC) of the kdeg estimate, or the change in logit(fn) if NSS is TRUE
effect; Effect size (difference in log(kdeg) between the experimental and reference conditions)
se; Uncertainty in the effect size
pval; P-value obtained from effect, se, and a moderated t-test (or a z-test if ztest is TRUE)
padj; pval adjusted for multiple testing using Benjamini-Hochberg procedure
XF; Original feature name
Mut_rates; list of two elements. The 1st element is a dataframe of s4U induced mutation rate estimates, where the mut column gives the experimental ID and the reps column gives the replicate ID. The 2nd element is the single background mutation rate estimate used
Hyper_Parameters; vector of two elements, named a and b. These are the hyperparameters estimated from the uncertainties for each feature, and represent the two parameters of a Scaled Inverse Chi-Square distribution. Importantly, a is the number of additional degrees of freedom provided by the sharing of uncertainty information across the dataset, to be used in the moderated t-test.
Mean_Variance_lms; linear model objects obtained from the uncertainty vs. read count regression model. One model is run for each Exp_ID
Details
Unless the user supplies estimates for pnew and pold, the first step of fast_analysis is to estimate the background (pold) and metabolic label induced (pnew) mutation rates (the label will be referred to as s4U for simplicity, though bakR is compatible with other metabolic labels such as s6G). Several pnew and pold estimation strategies are implemented in bakR. For pnew estimation, the two strategies are likelihood maximization of a binomial mixture model (the default) and sampling from the full posterior of a U-content adjusted Poisson mixture model with HMC (when StanRateEst is set to TRUE in bakRFit).
The default pnew estimation strategy involves combining the mutational data for all features into sample-wide mutational data vectors (vectors of the # of T-to-C conversions, the # of Ts, and the # of reads with each such conversion and T-count combination). These vectors are then downsampled (to prevent float overflow) and used to maximize the likelihood of a two-component binomial mixture model. The two components correspond to reads from old and new RNA, and the three estimated parameters are the fraction of all reads that are new (a nuisance parameter in this case) and the old and new read mutation rates.
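A minimal sketch of this kind of two-component binomial mixture fit (a simplified, per-read stand-in for bakR's internal implementation; all names here are illustrative):

# muts = T-to-C conversions per read; Ts = number of Ts per read
mixture_nll <- function(par, muts, Ts) {
  fn <- plogis(par[1])    # fraction of all reads that are new (nuisance)
  pnew <- plogis(par[2])  # new read mutation rate
  pold <- plogis(par[3])  # old read mutation rate
  # negative log-likelihood of the two-component binomial mixture
  -sum(log(fn * dbinom(muts, Ts, pnew) +
           (1 - fn) * dbinom(muts, Ts, pold)))
}
# est <- optim(c(0, -3, -7), mixture_nll, muts = muts, Ts = Ts)
# plogis(est$par[2:3])    # pnew and pold estimates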
The alternative strategy involves running a fully Bayesian implementation of a similar mixture model using Stan, a
probabilistic programming language that bakR makes use of in several functions. This strategy can yield more accurate
mutation rate estimates when the label times are much shorter or longer than the average half-lives of the sequenced RNA
(i.e., the fraction news are mostly close to 0 or 1, respectively). To improve the efficiency of this approach, only a small
subset of RNA features are analyzed, the number of which is set by the RateEst_size
parameter in bakRFit
. By default,
this number is set to 30, which on the average NR-seq dataset yields a several minute runtime for mutation rate estimation.
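For example, a hedged sketch of requesting this strategy through the bakRFit wrapper (assuming bdo is a bakRData object; the name is illustrative):

# Stan-based mutation rate estimation on a 50-feature subset
# (RateEst_size defaults to 30)
# Fit <- bakRFit(bdo, StanRateEst = TRUE, RateEst_size = 50)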
Estimation of pold can be performed with three strategies: the same two strategies discussed for pnew estimation, and a third
strategy that relies on the presence of -s4U control data. If -s4U control data is present, the default pold estimation
strategy is to use the average mutation rate in reads from all -s4U control datasets as the global pold estimate. Thus,
a single pold estimate is used for all samples. The likelihood maximization strategy can be used by
setting no_ctl
to TRUE, and this strategy becomes the default strategy if no -s4U data is present.
In addition, as of version 1.0.0 of bakR (released late June of 2023), users can decide to estimate
pold for each +s4U sample independently by setting multi_pold
to TRUE. In this case, independent -s4U datasets can no longer be used for
mutation rate estimation purposes, and thus the strategies for pold estimation are identical to the
set of pnew estimation strategies.
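Sketches of the calls selecting each pold strategy (df as produced by cB_to_Fast):

# Default: single global pold estimated from -s4U control data (if present)
# Fit <- fast_analysis(df)
# Likelihood maximization instead of -s4U controls
# Fit <- fast_analysis(df, no_ctl = TRUE)
# Sample-specific pold estimates (version 1.0.0 and later)
# Fit <- fast_analysis(df, multi_pold = TRUE)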
Once mutation rates are estimated, fraction news for each feature in each sample are estimated. The approach utilized is MLE
using the L-BFGS-B algorithm implemented in stats::optim
. The assumed likelihood function is derived from a Poisson mixture
model with rates adjusted according to each feature's empirical U-content (the average number of Us present in sequencing reads mapping
to that feature in a particular sample). Fraction new estimates are then converted to degradation rate constant estimates using
a solution to a simple ordinary differential equation model of RNA metabolism.
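The fraction new to kdeg conversion implied by this ODE solution (along with the pulse-chase variant noted in the Chase argument description) can be written as:

# Standard steady-state solution: fn = 1 - exp(-kdeg * tl), so
kdeg_from_fn <- function(fn, tl) -log(1 - fn) / tl
# Pulse-chase (Chase = TRUE): kdeg = -log(fn) / tl, fn being the fraction old
kdeg_from_fn_chase <- function(fn, tl) -log(fn) / tl
kdeg_from_fn(fn = 0.5, tl = 2)  # half the RNA turned over during a 2 hr label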
Once fraction new and kdegs are estimated, the uncertainty in these parameters is estimated using the Fisher Information. In the limit of
large datasets, the variance of the MLE is inversely proportional to the Fisher Information evaluated at the MLE. Mixture models are
typically singular, meaning that the Fisher information matrix is not positive definite and asymptotic results for the variance
do not necessarily hold. Because the mutation rates are estimated a priori and fixed to be > 0, these problems are eliminated. In addition, when assessing the uncertainty of replicate fraction new estimates, the size of the dataset is the raw number of sequencing reads that map to a particular feature. This number is often large (> 100), which increases the validity of invoking asymptotics.
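A generic sketch of this calculation, using optim's numerically differentiated Hessian and a toy single-feature likelihood with pnew and pold fixed (a binomial stand-in for bakR's U-content adjusted Poisson likelihood; all values illustrative):

# Toy negative log-likelihood for logit(fn), mutation rates fixed
nll <- function(lfn, muts, Ts, pnew = 0.05, pold = 0.001) {
  fn <- plogis(lfn)
  -sum(log(fn * dbinom(muts, Ts, pnew) + (1 - fn) * dbinom(muts, Ts, pold)))
}
muts <- c(0, 2, 0, 1, 3, 0)  # toy T-to-C counts per read
Ts <- rep(40, 6)             # toy number of Ts per read
fit <- optim(0, nll, muts = muts, Ts = Ts, method = "L-BFGS-B",
             lower = -7, upper = 7, hessian = TRUE)
sqrt(1 / fit$hessian[1, 1])  # asymptotic SE of logit(fn) at the MLE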
As of version 1.0.0, users can opt for an alternative uncertainty estimation strategy by setting Fisher to FALSE. This strategy makes use of the standard error of the estimator for a binomial random variable's success probability. If we can uniquely identify new and old reads, then the variance of our estimate for the fraction of reads that is new is fn*(1-fn)/n. This uncertainty estimate will typically underestimate fraction new replicate uncertainties. We showed in the bakR paper, though, that the Fisher information strategy often significantly overestimates the uncertainties of low coverage or extreme fraction new features. Therefore, this more bullish, anti-conservative uncertainty quantification can be useful on datasets with low mutation rates, extreme label times, or low sequencing depth. We have found that false discovery rates are still well controlled when using this alternative uncertainty quantification strategy.
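A sketch of this binomial uncertainty calculation, with a delta-method step to move to the logit scale (toy numbers; whether bakR applies exactly this transformation internally is not specified here):

fn <- 0.3; n <- 200                     # toy fraction new and read count
se_fn <- sqrt(fn * (1 - fn) / n)        # SE of the binomial success rate estimator
se_logit_fn <- se_fn / (fn * (1 - fn))  # delta-method SE on the logit scale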
With kdegs and their uncertainties estimated, replicate estimates are pooled and regularized. There are two key steps in this
downstream analysis. 1st, the uncertainty for each feature is used to fit a linear ln(uncertainty) vs. log10(read depth) trend,
and uncertainties for individual features are shrunk towards the regression line. The uncertainty for each feature is a combination of the
Fisher Information asymptotic uncertainty as well as the amount of variability seen between estimates. Regularization of uncertainty
estimates is performed using the analytic results of a Normal distribution likelihood with known mean and unknown variance and conjugate
priors. The prior parameters are estimated from the regression and amount of variability about the regression line. The strength of
regularization can be tuned by adjusting the prior_weight
parameter, with larger numbers yielding stronger shrinkage towards
the regression line. The 2nd step is to regularize the average kdeg estimates. This is done using the analytic results of a
Normal distribution likelihood model with unknown mean and known variance and conjugate priors. The prior parameters are estimated from the
population wide kdeg distribution (using its mean and standard deviation as the mean and standard deviation of the normal prior).
In the 1st step, the known mean is assumed to be the average kdeg, averaged across replicates and weighted by the number of reads
mapping to the feature in each replicate. In the 2nd step, the known variance is assumed to be that obtained following regularization
of the uncertainty estimates.
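A toy illustration of the 2nd step's conjugate update (the textbook Normal-Normal posterior with known variance, not bakR's exact internal code; all values made up):

prior_mean <- -1.0; prior_sd <- 1.5  # population-wide log(kdeg) mean and sd
obs_mean <- 0.2; obs_sd <- 0.5       # replicate-averaged log(kdeg) and regularized SE
post_prec <- 1 / prior_sd^2 + 1 / obs_sd^2
post_mean <- (prior_mean / prior_sd^2 + obs_mean / obs_sd^2) / post_prec
post_sd <- sqrt(1 / post_prec)
post_mean  # shrunk towards the population mean, weighted by precision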
Effect sizes (changes in kdeg) are obtained as the difference in log(kdeg) means between the reference and experimental sample(s); the log(kdeg)s are assumed to be independent, so the variance of the effect size is the sum of the log(kdeg) variances. P-values assessing the significance of the effect size are obtained using a moderated t-test, with the number of degrees of freedom determined from the uncertainty regression hyperparameters, and are adjusted for multiple testing using the Benjamini-Hochberg procedure to control false discovery rates (FDRs).
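The arithmetic of this comparison can be sketched as follows (toy numbers; the degrees of freedom here are assumed rather than derived from the hyperparameters):

log_kdeg_ref <- -1.2; se_ref <- 0.30   # reference condition
log_kdeg_exp <- -0.4; se_exp <- 0.25   # experimental condition
effect <- log_kdeg_exp - log_kdeg_ref  # difference in log(kdeg)
se <- sqrt(se_ref^2 + se_exp^2)        # independence => variances add
L2FC_kdeg <- effect / log(2)           # natural log difference -> log2 fold change
df_t <- 10                             # assumed degrees of freedom
pval <- 2 * pt(-abs(effect / se), df = df_t)
# padj <- p.adjust(pvals, method = "BH")  # across all features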
In some cases, the assumed ODE model of RNA metabolism will not accurately model the dynamics of a biological system being analyzed.
In these cases, it is best to compare logit(fraction new)s directly rather than converting fraction new to log(kdeg).
This analysis strategy is implemented when NSS is set to TRUE. Comparing logit(fraction new)s is only valid if a single metabolic label time has been used for all samples. For example, if a label time of 1 hour was used for NR-seq data from WT cells and a 2 hour label time was used in KO cells, this comparison is no longer valid, as differences in logit(fraction new) could stem from differences in kinetics or label times.
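For example:

# Compare logit(fraction new)s directly, avoiding the steady-state assumption
# (only valid if all samples share a single label time)
# Fit <- fast_analysis(df, NSS = TRUE)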
Examples
# \donttest{
# Simulate small dataset
sim <- Simulate_bakRData(300, nreps = 2)
# Fit fast model to get fast_df
Fit <- bakRFit(sim$bakRData)
#> Finding reliable Features
#> Filtering out unwanted or unreliable features
#> Processing data...
#> Estimating pnew with likelihood maximization
#> Estimating unlabeled mutation rate with -s4U data
#> Estimated pnews and polds for each sample are:
#> # A tibble: 4 × 4
#> # Groups: mut [2]
#> mut reps pnew pold
#> <int> <dbl> <dbl> <dbl>
#> 1 1 1 0.0504 0.000989
#> 2 1 2 0.0500 0.000989
#> 3 2 1 0.0502 0.000989
#> 4 2 2 0.0503 0.000989
#> Estimating fraction labeled
#> Estimating per replicate uncertainties
#> Estimating read count-variance relationship
#> Averaging replicate data and regularizing estimates
#> Assessing statistical significance
#> All done! Run QC_checks() on your bakRFit object to assess the
#> quality of your data and get recommendations for next steps.
# Fit fast model with fast_analysis
Fast_Fit <- fast_analysis(Fit$Data_lists$Fast_df)
#> Estimating pnew with likelihood maximization
#> Estimating unlabeled mutation rate with -s4U data
#> Estimated pnews and polds for each sample are:
#> # A tibble: 4 × 4
#> # Groups: mut [2]
#> mut reps pnew pold
#> <int> <dbl> <dbl> <dbl>
#> 1 1 1 0.0504 0.000989
#> 2 1 2 0.0500 0.000989
#> 3 2 1 0.0502 0.000989
#> 4 2 2 0.0503 0.000989
#> Estimating fraction labeled
#> Estimating per replicate uncertainties
#> Estimating read count-variance relationship
#> Averaging replicate data and regularizing estimates
#> Assessing statistical significance
#> All done! Run QC_checks() on your bakRFit object to assess the
#> quality of your data and get recommendations for next steps.
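# Inspect the per-replicate and pooled results described under "Value"
head(Fast_Fit$Fn_Estimates)
head(Fast_Fit$Effects_df)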
# }