TL_stan is an internal function to analyze nucleotide recoding RNA-seq data with a fully
Bayesian hierarchical model implemented in the probabilistic programming language (PPL) Stan.
TL_stan estimates kinetic parameters and differences in kinetic parameters between experimental
conditions. When assessing differences, a single reference sample is compared to
each collection of experimental samples provided.
Arguments
- data_list: List to pass to 'Stan', of the form given by cBprocess.
- Hybrid_Fit: Logical; if TRUE, the Hybrid 'Stan' model, which takes as data the output of fast_analysis, is run.
- keep_fit: Logical; if TRUE, the 'Stan' fit object is included in the output. This object is typically very large, so the default is FALSE.
- NSS: Logical; if TRUE, models that directly compare logit(fn)s are used, so as to avoid the steady-state assumption.
- chains: Number of Markov chains to sample from. The default is to run a single chain. Typical NR-seq datasets yield very memory-intensive analyses, and running a single chain decreases this burden. For reference, running the MCMC implementation (Hybrid_Fit = FALSE) with 3 chains on an NR-seq dataset with 3 replicates of 2 experimental conditions and around 20 million raw (unmapped) reads per sample requires over 100 GB of RAM; with a single chain, this burden drops to around 20 GB. These memory demands, combined with long runtimes (the MCMC implementation will likely take around 1-2 days), mean that these models should usually be run on a specialized High Performance Computing (HPC) system.
- ...: Arguments passed to rstan::sampling (e.g., iter, warmup, etc.). A usage sketch follows this list.
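Below is a minimal usage sketch of how these arguments fit together. TL_stan is internal and not exported, so the triple-colon access shown here, and the exact cBprocess call used to build data_list, are assumptions for illustration rather than a documented workflow:

# Hypothetical sketch; TL_stan is internal, so this is illustrative only
data_list <- cBprocess(bakRData)    # bakRData: a previously constructed bakR data object
fit <- bakR:::TL_stan(
  data_list  = data_list,
  Hybrid_Fit = FALSE,               # run the full MCMC implementation
  keep_fit   = FALSE,               # keep only the summary to limit output size
  NSS        = FALSE,               # retain the steady-state assumption
  chains     = 1,                   # a single chain to limit memory usage
  iter = 2000, warmup = 1000        # passed through to rstan::sampling
)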
Value
A list of objects:
Effects_df; dataframe with estimates of the effect size (change in logit(fn)) comparing each experimental condition to the reference sample for each feature. This dataframe also includes p-values obtained from a z-test on the effect size and its standard error. The columns of this dataframe are:
Feature_ID; Numerical ID of feature
Exp_ID; Numerical ID for experimental condition (Exp_ID from metadf)
L2FC_kdeg; L2FC(kdeg) posterior mean
L2FC_kd_sd; L2FC(kdeg) posterior sd
effect; identical to L2FC_kdeg (kept for symmetry with MLE fit output)
se; identical to L2FC_kd_sd (kept for symmetry with MLE fit output)
XF; Feature name
pval; p-value obtained from a z-test using effect and se (a sketch of this computation follows the list)
padj; p-value adjusted for multiple testing using the Benjamini-Hochberg procedure
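As a hedged illustration (using the columns documented above; bakR's internal computation may differ in detail), pval and padj can be approximately reproduced from effect and se as follows:

# Hypothetical sketch: two-sided z-test on the effect size, then BH adjustment
eff <- fit$Effects_df                         # fit: the list returned by TL_stan
pval <- 2 * pnorm(-abs(eff$effect / eff$se))  # two-sided z-test p-value
padj <- p.adjust(pval, method = "BH")         # Benjamini-Hochberg adjustment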
Kdeg_df; dataframe with estimates of the kdeg (RNA degradation rate constant) for each feature, averaged across replicate data. The columns of this dataframe are:
Feature_ID; Numerical ID of feature
Exp_ID; Numerical ID for experimental condition
kdeg; Degradation rate constant posterior mean
kdeg_sd; Degradation rate constant posterior standard deviation
log_kdeg; Posterior mean of the log of the degradation rate constant (as of version 1.0.0)
log_kdeg_sd; Posterior standard deviation of the log of the degradation rate constant (as of version 1.0.0)
XF; Original feature name
Fn_Estimates; dataframe with estimates of the logit(fraction new) for each feature in each replicate (a sketch relating logit(fn), fn, and kdeg follows this list). The columns of this dataframe are:
Feature_ID; Numerical ID for feature
Exp_ID; Numerical ID for experimental condition (Exp_ID from metadf)
Replicate; Numerical ID for replicate
logit_fn; Logit(fraction new) posterior mean
logit_fn_se; Logit(fraction new) posterior standard deviation
sample; Sample name
XF; Original feature name
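Under the steady-state assumption (see the NSS argument), the fraction new and the degradation rate constant are linked by fn = 1 - exp(-kdeg * tl), where tl is the s4U label time. A minimal conversion sketch, assuming a hypothetical label time of 2 hours:

# Hypothetical sketch: move from logit(fn) to fn to kdeg; tl is an assumption
fn_est <- fit$Fn_Estimates            # fit: the list returned by TL_stan
tl <- 2                               # assumed label time (hours)
fn <- plogis(fn_est$logit_fn)         # invert the logit transform
kdeg <- -log(1 - fn) / tl             # steady-state relationship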
Fit_Summary; only returned if keep_fit == FALSE. Summary of the 'Stan' fit object, with each row corresponding to a particular parameter. All posterior point descriptions are descriptions of the marginal posterior distribution for the parameter in that row. For example, the posterior mean is the average value of the parameter when averaging over all other parameter values. The columns of this dataframe are:
mean; Posterior mean for the parameter given by the row name
se_mean; Standard error of the posterior mean; essentially, how confident the model is that its estimate of the posterior mean matches the true posterior mean. This will depend on the number of chains run and the number of iterations each chain is run for.
sd; Posterior standard deviation
2.5%; 2.5th percentile of the posterior distribution. 2.5% of the posterior mass is below this point
25%; 25th percentile of the posterior distribution
50%; 50th percentile of the posterior distribution
75%; 75th percentile of the posterior distribution
97.5%; 97.5th percentile of the posterior distribution
n_eff; Effective sample size. The larger this is the better; preferably it is close to the total number of iterations (iter x chains). Small values may indicate poor model convergence.
Rhat; Describes how well the separate Markov chains mixed. This should be as close to 1 as possible; values much greater than 1 may indicate poor model convergence (a screening sketch follows this list).
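For example, a quick convergence screen on the summary could look like this hedged sketch (the thresholds are common rules of thumb, not bakR recommendations):

# Hypothetical sketch: flag parameters with questionable convergence diagnostics
summ <- fit$Fit_Summary                       # fit: the list returned by TL_stan
flagged <- summ[summ$Rhat > 1.05 | summ$n_eff < 400, ]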
Stan_Fit; only returned if keep_fit == TRUE. This is the full 'Stan' fit object, an S4 object of class stanfit.
Mutation_Rates; data frame with information about mutation rate estimates. It has the same columns as Fit_Summary. Each row corresponds to either a background mutation rate (log_lambda_o) or an s4U-induced mutation rate (log_lambda_n), denoted in the parameter column. The bracketed portion of the parameter name contains two numbers: the first corresponds to the Exp_ID and the second to the Replicate_ID. For example, if the parameter name is log_lambda_o[1,2], then that row corresponds to the background mutation rate in the second replicate of experimental condition one. Note that the estimates are on a log(avg. number of mutations) scale, so a log_lambda_n of 1 means that reads from new RNA (i.e., RNA synthesized during s4U labeling) are estimated to contain exp(1), or about 2.72, mutations on average. A back-transformation sketch follows.
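As a hedged sketch of that back-transformation (column names as documented above):

# Hypothetical sketch: convert log-scale s4U-induced mutation rates to the
# average number of mutations per read from new RNA
mr <- fit$Mutation_Rates                      # fit: the list returned by TL_stan
new_rates <- mr[grepl("^log_lambda_n", mr$parameter), ]
new_rates$avg_muts <- exp(new_rates$mean)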
Details
Two implementations of a similar model can be fit with TL_stan: a complete nucleotide recoding RNA-seq
analysis and a hybrid analysis that takes as input results from fast_analysis.
In the complete analysis (referred to in the bakR publication as the MCMC implementation),
U-to-C mutations are modeled as coming from a Poisson distribution
with rate parameter adjusted by the empirical U-content of each feature analyzed. Features
represent whatever the user defined them to be when constructing the bakR data object.
Typical feature categories are genes, exons, etc. Hierarchical modeling is used to pool data
across replicates and across features. More specifically, replicate data for the
same feature are partially pooled to estimate feature-specific mean fraction news and uncertainties.
Feature means are partially pooled to estimate dataset-wide mean fraction news and standard deviations.
The replicate variability for each feature is also partially pooled to determine a condition-wide
heteroskedastic relationship between read depths and replicate variability. Partial pooling
reduces subjectivity when determining priors by allowing the model to determine what priors make sense
given the data. Partial pooling also regularizes estimates, reducing estimate variability and thus increasing
estimate accuracy. This is particularly important for replicate variability estimates, which often rely
on only a few replicates of data per feature and thus are typically highly unstable.
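As a toy illustration of the mutational model described above (a simulation sketch with made-up rates, not bakR's actual Stan code), each read's U-to-C mutation count can be drawn from a Poisson distribution whose rate scales with the read's U-content and is higher for reads from new (s4U-labeled) RNA:

# Hypothetical simulation sketch; all rates and sizes are assumptions
set.seed(1)
n_reads <- 1000
U_content <- rpois(n_reads, 25)            # Us per read (assumed mean of 25)
is_new <- rbinom(n_reads, 1, 0.4)          # assumed fraction new of 0.4
rate <- ifelse(is_new == 1, 0.05, 0.002)   # per-U mutation rate: new vs. background
mutations <- rpois(n_reads, U_content * rate)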
The hybrid analysis (referred to in the bakR publication as the Hybrid implementation)
inherits the hierarchical modeling structure of the complete analysis, but reduces computational
burden by foregoing per-replicate-and-feature fraction new estimation and uncertainty quantification. Instead,
the hybrid analysis takes as data the fraction new estimates and approximate uncertainties from fast_analysis.
Runtimes of the hybrid analysis are thus often an order of magnitude shorter than those of the complete analysis,
but the hybrid analysis loses some accuracy by relying on point estimates and uncertainty quantification that are only valid in the
limit of large dataset sizes (where the dataset size for a per-replicate-and-feature fraction new estimate is the raw number
of sequencing reads mapping to that feature in that replicate).
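To give intuition for the large-sample caveat (generic large-sample statistics, not bakR's actual uncertainty formula): the uncertainty of a per-replicate-and-feature estimate shrinks roughly with the square root of the read count, so low-coverage features have the least reliable approximate uncertainties:

# Hypothetical sketch: normal-approximation standard error of a proportion
n_reads <- c(10, 100, 1000, 10000)        # reads mapping to a feature (assumed)
fn <- 0.4                                 # assumed true fraction new
se_fn <- sqrt(fn * (1 - fn) / n_reads)    # shrinks as 1/sqrt(n_reads)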
Users also have the option to save or discard the Stan fit object. Fit objects can be exceedingly large (> 10 GB) for most
nucleotide recoding RNA-seq datasets. Therefore, if you don't want to store such a large object, a summary object will be saved instead,
which greatly reduces the size of the output (~ 10-50 MB) while still retaining much of the important information. In addition,
the output of TL_stan provides the estimates and uncertainties for key parameters (L2FC(kdeg), kdeg, and fraction new)
that will likely be of most interest. That being said, there are some analyses that are only possible if the original fit object
is saved. For example, the fit object contains all of the samples from the posterior collected during model fitting. Thus,
new parameters (e.g., L2FC(kdeg)s comparing two experimental samples) not naturally generated by the model can be estimated
post-hoc. Still, there are often approximate estimates that can be obtained for such parameters that don't rely on the full
fit object. One analysis that is impossible without the original fit object is generating model diagnostic plots. These include
trace plots (to show mixing and efficient parameter space exploration of the Markov chains), pairs plots (to show correlations
between parameters and where any divergences occurred), and other visualizations that can help users assess how well a model
ran. Because the models implemented by TL_stan have been extensively validated, it is less likely that such diagnostics will be helpful;
however, anomalies in your data can lead to poor model convergence, in which case assessing model diagnostics can help you
identify the source of problems in your data. Summary statistics describing how well the model was able to estimate each parameter
(n_eff and Rhat) are provided in the fit summaries, but these can obscure some of the nuanced details of model fitting.
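As a hedged sketch of the post-hoc analysis and diagnostics described above (only possible with keep_fit = TRUE; aside from lp__, which exists in every stanfit, parameter names are model-internal and would need to be looked up):

# Hypothetical sketch; requires the full stanfit object (keep_fit = TRUE)
library(rstan)
sf <- fit$Stan_Fit                    # the stanfit object documented above
draws <- rstan::extract(sf)           # posterior draws, for post-hoc estimates
rstan::traceplot(sf, pars = "lp__")   # trace plot; substitute model parameter names

Pairs plots can be generated similarly by calling pairs() on the stanfit object.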