Correcting for metabolic labeling induced RNA dropout
Source:R/Dropout_correction.R
CorrectDropout.Rd
Dropout is the name given to a phenomenon originally identified by our lab and
further detailed in two independent publications (Zimmer et al. (2023)
and Berg et al. (2023)).
Dropout is the under-representation of reads from RNA containing metabolic label
(most commonly 4-thiouridine or 6-thioguanosine). Loss of 4-thiouridine (s4U)
containing RNA on plastic surfaces and reverse transcription (RT) dropoff caused by
modifications that recoding chemistry introduces on s4U have been proposed as the likely
causes of this phenomenon. While protocols can be altered to drastically reduce
dropout, you may still want to use bakR to analyze datasets that were collected
with suboptimal handling. That is where CorrectDropout comes in.
Arguments
- obj
bakRFit or bakRFnFit object
- scale_init
Numeric; initial estimate for the -s4U/+s4U scale factor. This is the factor difference in RPM-normalized read counts for completely unlabeled transcripts (i.e., highly stable transcripts) between the +s4U and -s4U samples.
- pdo_init
Numeric; initial estimate for the dropout rate. This is the probability that an s4U labeled RNA molecule is lost during library preparation.
- recalc_uncertainty
Logical; if TRUE, the fraction new uncertainty is recalculated using the adjusted fn and a simple binomial model of estimate uncertainty. This slightly underestimates the fn uncertainty, but is far less biased for low-coverage features and for samples with low pnews.
- ...
Additional (optional) parameters to be passed to stats::nls()
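To illustrate the role of scale_init and pdo_init, here is a minimal sketch of a nonlinear least squares fit with stats::nls() on toy data. The model formula, the direction of the ratio (+s4U RPM divided by -s4U RPM), and the names toy, fn_obs and ratio are assumptions made for illustration; this is not necessarily the model that CorrectDropout fits internally. The starting values in `start` play the part that scale_init and pdo_init play as initial estimates.

```r
# Toy illustration only: the model form and all names are assumptions
set.seed(42)
n <- 500
true_scale <- 1.1   # assumed ratio for completely unlabeled transcripts
true_pdo   <- 0.4   # assumed probability of losing a labeled molecule

# Observed (uncorrected) fraction new for each toy feature
fn_obs <- runif(n, 0.02, 0.98)

# Assumed relationship between the read count ratio and fraction new,
# with a little Gaussian noise so that nls() has residuals to minimize
expected_ratio <- true_scale * (1 - true_pdo) / ((1 - true_pdo) + fn_obs * true_pdo)
toy <- data.frame(fn_obs = fn_obs,
                  ratio  = expected_ratio + rnorm(n, sd = 0.02))

# Starting values stand in for scale_init and pdo_init
fit <- stats::nls(
  ratio ~ scale * (1 - pdo) / ((1 - pdo) + fn_obs * pdo),
  data  = toy,
  start = list(scale = 1.05, pdo = 0.3)
)

est <- coef(fit)
print(est)
```

If the fit fails to converge on real data, supplying initial values closer to the expected scale factor and dropout rate is the usual first thing to try with stats::nls().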
Value
A bakRFit or bakRFnFit object (same type as was passed in). Fraction new estimates and read counts
in Fast_Fit$Fn_Estimates and (in the case of a bakRFnFit input) Data_lists$Fn_Est are dropout corrected.
A count matrix with corrected read counts (Data_lists$Count_Matrix_corrected) is also output, along with a
data frame with information about the dropout rate estimated for each sample (Data_lists$Dropout_df).
Details
CorrectDropout estimates the fraction of 4-thiouridine containing RNA
that was lost during library preparation (pdo). It then uses this estimate of pdo
to correct fraction new estimates and read counts. Both corrections are analytically
derived from a rigorous generative model of NR-seq data. Importantly, the read count
correction preserves the total library size to avoid artificially inflating read counts.
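The idea behind the two corrections can be sketched under a simple assumed model in which every labeled molecule is lost independently with probability pdo and unlabeled molecules are never lost. Under that assumption, a feature with true fraction new fn is observed with fraction new fn(1 - pdo) / (1 - fn * pdo), which can be inverted. The helper names correct_fn and correct_counts are hypothetical, and the rescaling step is one plausible way to preserve library size; neither is taken from bakR's source.

```r
# Hypothetical helpers illustrating the assumed dropout model; not bakR code

# Invert fn_obs = fn * (1 - pdo) / (1 - fn * pdo) for the true fraction new
correct_fn <- function(fn_obs, pdo) {
  fn_obs / ((1 - pdo) + fn_obs * pdo)
}

# Undo the per-feature loss, then rescale so the library size is unchanged
correct_counts <- function(counts, fn_obs, pdo) {
  fn_true  <- correct_fn(fn_obs, pdo)
  # Under the assumed model a feature keeps a fraction (1 - fn_true * pdo)
  # of its molecules, so dividing by that fraction undoes the loss
  inflated <- counts / (1 - fn_true * pdo)
  inflated * sum(counts) / sum(inflated)
}

# Toy example: simulate the effect of 40% dropout, then undo it
fn_true <- c(0.1, 0.5, 0.9)
pdo     <- 0.4
fn_obs  <- fn_true * (1 - pdo) / (1 - fn_true * pdo)

fn_corrected     <- correct_fn(fn_obs, pdo)
counts           <- c(1000, 800, 300)
counts_corrected <- correct_counts(counts, fn_obs, pdo)
```

In this sketch, features with a high fraction new gain reads and highly stable features lose reads after correction, while the total number of reads stays the same.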
Examples
# \donttest{
# Simulate data for 500 genes and 2 replicates with 40% dropout
sim <- Simulate_relative_bakRData(500, 100000, nreps = 2, p_do = 0.4)
# Fit data with fast implementation
Fit <- bakRFit(sim$bakRData)
#> Finding reliable Features
#> Filtering out unwanted or unreliable features
#> Processing data...
#> Estimating pnew with likelihood maximization
#> Estimating unlabeled mutation rate with -s4U data
#> Estimated pnews and polds for each sample are:
#> # A tibble: 4 × 4
#> # Groups:   mut [2]
#>     mut  reps   pnew    pold
#>   <int> <dbl>  <dbl>   <dbl>
#> 1     1     1 0.0501 0.00102
#> 2     1     2 0.0505 0.00102
#> 3     2     1 0.0498 0.00102
#> 4     2     2 0.0500 0.00102
#> Estimating fraction labeled
#> Estimating per replicate uncertainties
#> Estimating read count-variance relationship
#> Averaging replicate data and regularizing estimates
#> Assessing statistical significance
#> All done! Run QC_checks() on your bakRFit object to assess the
#> quality of your data and get recommendations for next steps.
# Correct for dropout
Fit <- CorrectDropout(Fit)
#> Estimated rates of dropout are:
#>   Exp_ID Replicate       pdo
#> 1      1         1 0.4169686
#> 2      1         2 0.2494506
#> 3      2         1 0.0000000
#> 4      2         2 0.0000000
#> Mapping sample name to sample characteristics
#> Filtering out low coverage features
#> Processing data...
#> Estimating read count-variance relationship
#> Averaging replicate data and regularizing estimates
#> Assessing statistical significance
#> All done! Run QC_checks() on your bakRFit object to assess the
#> quality of your data and get recommendations for next steps.
# }