`Simulate_bakRData`

simulates a `bakRData`

object. It's output also includes the simulated
values of all kinetic parameters of interest. Only the number of genes (`ngene`

) has to be set by the
user, but an extensive list of additional parameters can be adjusted.

## Usage

```
Simulate_bakRData(
ngene,
num_conds = 2L,
nreps = 3L,
eff_sd = 0.75,
eff_mean = 0,
fn_mean = 0,
fn_sd = 1,
kslog_c = 0.8,
kslog_sd = 0.95,
tl = 60,
p_new = 0.05,
p_old = 0.001,
read_lengths = 200L,
p_do = 0,
noise_deg_a = -0.3,
noise_deg_b = -1.5,
noise_synth = 0.1,
sd_rep = 0.05,
low_L2FC_ks = -1,
high_L2FC_ks = 1,
num_kd_DE = c(0L, as.integer(rep(round(as.integer(ngene)/2), times =
as.integer(num_conds) - 1))),
num_ks_DE = rep(0L, times = as.integer(num_conds)),
scale_factor = 150,
sim_read_counts = TRUE,
a1 = 5,
a0 = 0.01,
nreads = 50L,
alpha = 25,
beta = 75,
STL = FALSE,
STL_len = 40,
lprob_U_sd = 0,
lp_sd = 0
)
```

## Arguments

- ngene
Number of genes to simulate data for

- num_conds
Number of experimental conditions (including the reference condition) to simulate

- nreps
Number of replicates to simulate

- eff_sd
Effect size; more specifically, the standard deviation of the normal distribution from which non-zero changes in logit(fraction new) are pulled from.

- eff_mean
Effect size mean; mean of normal distribution from which non-zero changes in logit(fraction new) are pulled from. Note, setting this to 0 does not mean that some of the significant effect sizes will be 0, as any exact integer is impossible to draw from a continuous random number generator. Setting this to 0 just means that there is symmetric stabilization and destabilization

- fn_mean
Mean of fraction news of simulated transcripts in reference condition. The logit(fraction) of RNA from each transcript that is metabolically labeled (new) is drawn from a normal distribution with this mean

- fn_sd
Standard deviation of fraction news of simulated transcripts in reference condition. The logit(fraction) of RNA from each transcript that is metabolically labeled (new) is drawn from a normal distribution with this sd

- kslog_c
Synthesis rate constants will be drawn from a lognormal distribution with meanlog =

`kslog_c`

- mean(log(kd_mean)) where kd_mean is determined from the fraction new simulated for each gene as well as the label time (`tl`

).- kslog_sd
Synthesis rate lognormal standard deviation; see kslog_c documentation for details

- tl
metabolic label feed time

- p_new
metabolic label (e.g., s4U) induced mutation rate. Can be a vector of length num_conds

- p_old
background mutation rate

- read_lengths
Total read length for each sequencing read (e.g., PE100 reads correspond to read_lengths = 200)

- p_do
Rate at which metabolic label containing reads are lost due to dropout; must be between 0 and 1

- noise_deg_a
Slope of trend relating log10(standardized read counts) to log(replicate variability)

- noise_deg_b
Intercept of trend relating log10(standardized read counts) to log(replicate variability)

- noise_synth
Homoskedastic variability of L2FC(ksyn)

- sd_rep
Variance of lognormal distribution from which replicate variability is drawn

- low_L2FC_ks
Most negative L2FC(ksyn) that can be simulated

- high_L2FC_ks
Most positive L2FC(ksyn) that can be simulated

- num_kd_DE
Vector where each element represents the number of genes that show a significant change in stability relative to the reference. 1st entry must be 0 by definition (since relative to the reference the reference sample is unchanged)

- num_ks_DE
Same as num_kd_DE but for significant changes in synthesis rates.

- scale_factor
Factor relating RNA concentration (in arbitrary units) to average number of read counts

- sim_read_counts
Logical; if TRUE, read counts are simulated as coming from a heterodisperse negative binomial distribution

- a1
Heterodispersion 1/reads dependence parameter

- a0
High read depth limit of negative binomial dispersion parameter

- nreads
Number of reads simulated if sim_read_counts is FALSE

- alpha
shape1 parameter of the beta distribution from which U-contents (probability that a nucleotide in a read from a transcript is a U) are drawn for each gene.

- beta
shape2 parameter of the beta distribution from which U-contents (probability that a nucleotide in a read from a transcript is a U) are drawn for each gene.

- STL
logical; if TRUE, simulation is of STL-seq rather than a standard TL-seq experiment. The two big changes are that a short read length is required (< 60 nt) and that every read for a particular feature will have the same number of Us. Only one read length is simulated for simplicity.

- STL_len
Average length of simulated STL-seq length. Since Pol II typically pauses about 20-60 bases from the promoter, this should be around 40

- lprob_U_sd
Standard deviation of the logit(probability nt is a U) for each sequencing read. The number of Us in a sequencing read are drawn from a binomial distribution with prob drawn from a logit-Normal distribution with this logit-sd.

- lp_sd
Standard deviation of logit(probability a U is mutated) for each U. The number of mutations in a given read is the sum of nU Bernoulli random variables, where nU is the number of Us, and p is drawn from a logit-normal distribution with lp_sd standard deviation on logit scale.

## Value

A list containing a simulated `bakRData`

object as well as a list of simulated kinetic parameters of interest.
The contents of the latter list are:

Effect_sim; Dataframe meant to mimic formatting of Effect_df that are part of

`bakRFit(StanFit = TRUE)`

,`bakRFit(HybridFit = TRUE)`

and`bakRFit(bakRData object)`

output.Fn_mean_sim; Dataframe meant to mimic formatting of Regularized_ests that is part of

`bakRFit(bakRData object)`

output. Contains information about the true fraction new simulated in each condition (the mean of the normal distribution from which replicate fraction news are simulated)Fn_rep_sim; Dataframe meant to mimic formatting of Fn_Estimates that is part of

`bakRFit(bakRData object)`

output. Contains information about the fraction new simulated for each feature in each replicate of each condition.L2FC_ks_mean; The true L2FC(ksyn) for each feature in each experimental condition. The i-th column corresponds to the L2FC(ksyn) when comparing the i-th condition to the reference condition (defined as the 1st condition) so the 1st column is always all 0s

RNA_conc; The average number of normalized read counts expected for each feature in each sample.

## Details

`Simulate_bakRData`

simulates a `bakRData`

object using a realistic generative model with many
adjustable parameters. Average RNA kinetic parameters are drawn from biologically inspired
distributions. Replicate variability is simulated by drawing a feature's
fraction new in a given replicate from a logit-Normal distribution with a heteroskedastic
variance term with average magnitude given by the chosen read count vs. variance relationship.
For each replicate, a feature's ksyn is drawn from a homoskedastic lognormal distribution. Read counts
can either be set to the same value for all simulated features or can be simulated according to
a heterodisperse negative binomial distribution. The latter is the default

The number of Us in each sequencing read is drawn from a binomial distribution with number of trials
equal to the read length and probability of each nucleotide being a U drawn from a beta distribution. Each read is assigned to the
new or old population according to a Bernoulli distribution with p = fraction new. The number of
mutations in each read are then drawn from one of two binomial distributions; if the read is assigned to the
population of new RNA, the number of mutations are drawn from a binomial distribution with number of trials equal
to the number of Us and probability of mutation = `p_new`

; if the read is assigned to the population of old RNA,
the number of mutations is instead drawn from a binomial distribution with the same number of trials but with the probability
of mutation = `p_old`

. `p_new`

must be greater than `p_old`

because mutations in new RNA
arise from both background mutations that occur with probability `p_old`

as well as metabolic label induced mutations

Simulated read counts should be treated as if they are spike-in and RPKM normalized, so the same scale factor of 1 can be applied to each sample when comparing the sequencing reads (e.g., if you are performing differential expression analysis).

Function to simulate a `bakRData`

object according to a realistic generative model

## Examples

```
# \donttest{
# 2 replicate, 2 experimental condition, 1000 gene simulation
sim_2reps <- Simulate_bakRData(ngene = 1000, nreps = 2)
# 3 replicate, 2 experimental condition, 1000 gene simulation
# with 100 instances of differential degradation kinetics
sim_3reps <- Simulate_bakRData(ngene = 1000, num_kd_DE = c(0, 100))
# 2 replicates, 3 experimental condition, 1000 gene simulation
# with 100 instances of differential degradation kinetics in the 1st
# condition and no instances of differential degradation kinetics in the
# 2nd condition
sim_3es <- Simulate_bakRData(ngene = 1000,
nreps = 2,
num_conds = 3,
num_kd_DE = c(0, 100, 0))
# }
```