Quick summary about SlamSeq:

Important terms:

Simulation and Evaluation

So far I have simulated data sets for 1000 randomly chosen 3’ UTRs from the mouse annotation set that we are using for the real data sets.

I simulated data sets for all combinations of the following parameters:

For all data sets I compared the simulated labeling rates to the recovered labeling rates by computing:
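The sections further down refer to three metrics: relative, absolute and log2 error. As a minimal sketch of how such metrics are commonly defined (an assumption; the exact formulas used for the evaluation files may differ), with hypothetical vectors `simulated` and `recovered`:

```r
# Hypothetical labeling rates; these values and variable names are illustrative
# and are not taken from the evaluation files.
simulated = c(0.10, 0.20, 0.40)
recovered = c(0.11, 0.18, 0.40)

# Assumed definitions (conventional, not confirmed by the evaluation output):
abs_error  = abs(recovered - simulated)       # absolute error
rel_error  = abs_error / simulated * 100      # relative error in percent
log2_error = log2(recovered / simulated)      # log2 ratio; 0 means perfect recovery

print(data.frame(simulated, recovered, abs_error, rel_error, log2_error))
```

Under these assumed definitions, the relative error is undefined for a simulated rate of 0, so such entries would need to be filtered or handled separately.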

Based on this, I tried to visualize the effect of read length, coverage, T->C rate and labeling rate on the performance of SlamSeq/SlamDunk.

Reading data from file

library(dplyr)
library(tidyr)
library(ggplot2)
# Simulation parameters
tcs = c(0.024, 0.07)               # T->C conversion rates
covs = c(25, 50, 75, 100, 150, 200) # coverages
rls = c(38, 88, 138)               # read lengths [bp]
reps = c(1)                        # replicates
# Read data
data = read.table("~/Dropbox/Slamdunk/Data/mesc_1k_random_expressed_eval.tsv", sep = '\t')
dataUnique = read.table("~/Dropbox/Slamdunk/Data/mesc_1k_random_expressed_eval_unique.tsv", sep = '\t')
#data = read.table("~/Dropbox/Slamdunk/Data/golden_list.tsv", sep = '\t')
#dataUnique = read.table("~/Dropbox/Slamdunk/Data/golden_list_unique.tsv", sep = '\t')
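The parameter vectors defined above span the simulation grid. A small sketch, assuming every combination of T->C rate, coverage, read length and replicate corresponds to one simulated data set, enumerates that grid with `expand.grid` (the column names `tc`, `cov`, `rl` and `rep` are chosen here for illustration):

```r
# Parameter vectors repeated here so the snippet runs on its own
tcs = c(0.024, 0.07)
covs = c(25, 50, 75, 100, 150, 200)
rls = c(38, 88, 138)
reps = c(1)

# One row per parameter combination
grid = expand.grid(tc = tcs, cov = covs, rl = rls, rep = reps)
print(nrow(grid))
print(head(grid))
```

Such a grid can be handy for checking that the evaluation files contain an entry for every combination.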

Probability of seeing at least one T->C mutation in a “labeled” read of length readLength, given a per-position T->C conversion rate of tcRatePerPosition (assuming a quarter of the read positions are Ts)

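readLength = 38
tcRatePerPosition = 0.024
# Assume about a quarter of the read positions are Ts;
# P(at least one T->C) = 1 - P(no T->C among those Ts)
1 - dbinom(0, round(readLength / 4), tcRatePerPosition)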
[1] 0.2156712
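The same calculation can be repeated for every read length and T->C rate used in the simulation. The sketch below keeps the assumption from the code above that about a quarter of the read positions are Ts; the helper name `pAtLeastOneTC` is introduced here for illustration only.

```r
tcs = c(0.024, 0.07)
rls = c(38, 88, 138)

# Probability of at least one T->C conversion in a labeled read.
# Assumption: round(readLength / 4) positions of the read are Ts.
pAtLeastOneTC = function(readLength, tcRate) {
  nT = round(readLength / 4)
  1 - (1 - tcRate)^nT   # equivalent to 1 - dbinom(0, nT, tcRate)
}

# Rows: read lengths, columns: T->C rates
probs = outer(rls, tcs, pAtLeastOneTC)
dimnames(probs) = list(paste0(rls, " bp"), paste0("tc=", tcs))
print(round(probs, 3))
```

Longer reads and higher conversion rates both increase the chance that a labeled read carries at least one detectable conversion.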

Relative error for all combinations of read length and coverage

Absolute error for all combinations of read length and coverage

Log2 error for all combinations of read length and coverage

Relative error for labeling rate bins

Scatter plots: simulated vs recovered labeling rates

Median relative error

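# One plot per read length: median relative error at each coverage
for(r in rls) {
  p = data %>% filter(rl==r) %>% ggplot(., aes(x=cov, y=rel_error)) +
    # 'fun' replaces the deprecated 'fun.y' argument (ggplot2 >= 3.3.0)
    stat_summary(fun = median, geom = "point") +
    coord_cartesian(ylim = c(0,20)) +
    ylab("relative error [%]") + xlab("Coverage") + ggtitle(paste0(r, " bp reads"))
  print(p)
}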
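The medians shown in these plots can also be tabulated directly. The following sketch uses base R `aggregate` rather than the dplyr pipeline and runs on a small made-up data frame `toy`. Only the column names `rl`, `cov` and `rel_error` match the evaluation data; the values are invented for illustration.

```r
# Toy data: invented values, not results from the simulation
toy = data.frame(
  rl = rep(c(38, 88), each = 4),
  cov = rep(c(25, 25, 50, 50), times = 2),
  rel_error = c(12, 8, 6, 4, 9, 7, 3, 1)
)

# Median relative error for each read length / coverage combination
medians = aggregate(rel_error ~ rl + cov, data = toy, FUN = median)
print(medians)
```

Replacing `toy` with the evaluation data frame would give the numbers behind each plotted point.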

