Quick summary about SlamSeq:
- Only 3’ UTRs are sequenced
- A modified Uracial called 4-Thiouridine (4SU) is added to the sample. If 4SU is present during transcription there is a certain change (e.g. 2%) that a 4SU is incorporated into the mRNA Instead of a unmodified Uracil.
- 4SU alone has no effect and acts like an unmodified Uracil
- However, when the samples are treated with Iodoacetamide (IAA), which is done after extraction, 4SUs are alkylated (4SU*)
- Alkylated 4SUs are converted to Cytosin druing reverse transcription which is required for sequencing.
- In summary, if a mRNA molecule was transcribed while 4SU was present, reads from this transcript will have a certain rate of T->C mutations (e.g. 2%), if 4SU was not present, no T->C mutations will be visible in the mapped data.
Important terms:
- T->C conversion rate: incorporation rate of 4SU during transcription
- Labeling rate: Looking at all the mRNA molecules for a specific gene that are present at a specific time point, how many of those contain 4SUs (T->C conversions)
Simulation and Evaluation
So far I have simulated data sets for a set of 1000 randomly chosen 3’ UTRs from the mouse annotation set we are using for the real data-sets.
I simulated data sets for all combinations of the following parameters:
- Read length: 38, 88, 138
- Coverage: 25, 50, 75, 100, 150, 200
- T->C conversion rate: 0.024, 0.07
- Labeling rates: between 0.0 and 1.0 in steps of 0.05
For all data sets I compared the simulated labeling rate to the recovered labeling rates by computing:
- Absolut error: simulated - recovered
- Relative error: (simulated - recovered) / simulated
- Log2 error: log2(simulated / recovered)
Based on this I tried visualizing the effect of read length, coverage, T->C rate and labeling rates on the performance of SlamSeq/SlamDunk
Reading data from file
library(dplyr)
library(tidyr)
library(ggplot2)
tcs = c(0.024, 0.07)
covs = c(25, 50, 75, 100, 150, 200)
rls = c(38, 88, 138)
reps = c(1)
# Read data
data = read.table("~/Dropbox/Slamdunk/Data/mesc_1k_random_expressed_eval.tsv", sep = '\t')
dataUnique = read.table("~/Dropbox/Slamdunk/Data/mesc_1k_random_expressed_eval_unique.tsv", sep = '\t')
#data = read.table("~/Dropbox/Slamdunk/Data/golden_list.tsv", sep = '\t')
#dataUnique = read.table("~/Dropbox/Slamdunk/Data/golden_list_unique.tsv", sep = '\t')
Probability to see a T->C mutation in a “labeled” read of length R, given a T->C conversion rate of tcRatePerPosition
readLength = 38
tcRatePerPosition = 0.024
1 - dbinom(0, round(readLength / 4), tcRatePerPosition)
[1] 0.2156712
Relative error for all combinations of read length and coverage
Absolut error for all combinations of read lengths and coverage
Log2 error for all combinations of read lengths and coverage
Relative error for labeling rate bins
Scatter plots: simulated vs recovered labeling rates
---
title: "SlamDunk paper simulation notebook"
output:
  html_document: default
  html_notebook: default
  pdf_document: default
---


###Quick summary about SlamSeq:

* Only 3' UTRs are sequenced
* A modified Uracial called 4-Thiouridine (4SU) is added to the sample. If 4SU is present during transcription there is a certain change (e.g. 2%) that a 4SU is incorporated into the mRNA Instead of a unmodified Uracil.
* 4SU alone has no effect and acts like an unmodified Uracil
* However, when the samples are treated with Iodoacetamide (IAA), which is done after extraction, 4SUs are alkylated (4SU*)
* Alkylated 4SUs are converted to Cytosin druing reverse transcription which is required for sequencing.
* In summary, if a mRNA molecule was transcribed while 4SU was present, reads from this transcript will have a certain rate of T->C mutations (e.g. 2%), if 4SU was not present, no T->C mutations will be visible in the mapped data.


###Important terms:

* T->C conversion rate: incorporation rate of 4SU during transcription
* Labeling rate: Looking at all the mRNA molecules for a specific gene that are present at a specific time point, how many of those contain 4SUs (T->C conversions)

### Simulation and Evaluation
So far I have simulated data sets for a set of 1000 randomly chosen 3' UTRs from the mouse annotation set we are using for the real data-sets.

I simulated data sets for all combinations of the following parameters:

* Read length: 38, 88, 138
* Coverage: 25, 50, 75, 100, 150, 200
* T->C conversion rate: 0.024, 0.07
* Labeling rates: between 0.0 and 1.0 in steps of 0.05

For all data sets I compared the simulated labeling rate to the recovered labeling rates by computing:

* Absolut error: simulated - recovered
* Relative error: (simulated - recovered) / simulated
* Log2 error: log2(simulated / recovered)

Based on this I tried visualizing the effect of read length, coverage, T->C rate and labeling rates on the performance of SlamSeq/SlamDunk



### Reading data from file
```{r, message=FALSE, warning=FALSE}
library(dplyr)
library(tidyr)
library(ggplot2)

tcs = c(0.024, 0.07)
covs = c(25, 50, 75, 100, 150, 200)
rls = c(38, 88, 138)
reps = c(1)

# Read data
data = read.table("~/Dropbox/Slamdunk/Data/mesc_1k_random_expressed_eval.tsv", sep = '\t')
dataUnique = read.table("~/Dropbox/Slamdunk/Data/mesc_1k_random_expressed_eval_unique.tsv", sep = '\t')

#data = read.table("~/Dropbox/Slamdunk/Data/golden_list.tsv", sep = '\t')
#dataUnique = read.table("~/Dropbox/Slamdunk/Data/golden_list_unique.tsv", sep = '\t')

```


#### Probability to see a T->C mutation in a "labeled" read of length R, given a T->C conversion rate of tcRatePerPosition

```{r}
readLength = 38
tcRatePerPosition = 0.024
1 - dbinom(0, round(readLength / 4), tcRatePerPosition)

```


#### Relative error for all combinations of read length and coverage
```{r, echo=FALSE}
# Relative error 
p1 = data %>% filter(rl==38) %>% ggplot(., aes(factor(cov), rel_error)) + geom_boxplot(outlier.shape = NA) + coord_cartesian(ylim = c(-100,100)) + ylab("relative error [%]") + xlab("Coverage") + geom_hline(yintercept = 0, linetype="dashed") + ggtitle("38 bp reads")
print(p1)
p2 = data %>% filter(rl==88) %>% ggplot(., aes(factor(cov), rel_error)) + geom_boxplot(outlier.shape = NA) + coord_cartesian(ylim = c(-100,100)) + ylab("relative error [%]") + xlab("Coverage") + geom_hline(yintercept = 0, linetype="dashed") + ggtitle("88 bp reads")
print(p2)
p3 = data %>% filter(rl==138) %>% ggplot(., aes(factor(cov), rel_error)) + geom_boxplot(outlier.shape = NA) + coord_cartesian(ylim = c(-100,100)) + ylab("relative error [%]") + xlab("Coverage") + geom_hline(yintercept = 0, linetype="dashed") + ggtitle("138 bp reads")
print(p3)
#multiplot(p1, p2, p3, cols=2)
```

#### Absolut error for all combinations of read lengths and coverage
```{r, echo=FALSE}
p1 = data %>% filter(rl==38) %>% ggplot(., aes(factor(cov), abs_error)) + geom_boxplot(outlier.shape = NA) + coord_cartesian(ylim = c(-0.4,0.4)) + ylab("absolute error [%]") + xlab("Coverage") + geom_hline(yintercept = 0, linetype="dashed") + ggtitle("38 bp reads")
print(p1)
p2 = data %>% filter(rl==88) %>% ggplot(., aes(factor(cov), abs_error)) + geom_boxplot(outlier.shape = NA) + coord_cartesian(ylim = c(-0.4,0.4)) + ylab("absolute error [%]") + xlab("Coverage") + geom_hline(yintercept = 0, linetype="dashed") + ggtitle("88 bp reads")
print(p2)
p3 = data %>% filter(rl==138) %>% ggplot(., aes(factor(cov), abs_error)) + geom_boxplot(outlier.shape = NA) + coord_cartesian(ylim = c(-0.4,0.4)) + ylab("absolute error [%]") + xlab("Coverage") + geom_hline(yintercept = 0, linetype="dashed") + ggtitle("138 bp reads")
print(p3)

#multiplot(p1, p2, p3, cols=2)
```

#### Log2 error for all combinations of read lengths and coverage
```{r, echo=FALSE}
# Log2 error
p1 = data %>% filter(rl==38) %>% ggplot(., aes(factor(cov), log2_error)) + geom_boxplot(outlier.shape = NA) + coord_cartesian(ylim = c(-1,1)) + ylab("relative error [%]") + xlab("Coverage") + geom_hline(yintercept = 0, linetype="dashed") + ggtitle("38 bp reads")
print(p1)
p2 = data %>% filter(rl==88) %>% ggplot(., aes(factor(cov), log2_error)) + geom_boxplot(outlier.shape = NA) + coord_cartesian(ylim = c(-1,1)) + ylab("relative error [%]") + xlab("Coverage") + geom_hline(yintercept = 0, linetype="dashed") + ggtitle("88 bp reads")
print(p2)
p3 = data %>% filter(rl==138) %>% ggplot(., aes(factor(cov), log2_error)) + geom_boxplot(outlier.shape = NA) + coord_cartesian(ylim = c(-1,1)) + ylab("relative error [%]") + xlab("Coverage") + geom_hline(yintercept = 0, linetype="dashed") + ggtitle("138 bp reads")
print(p3)

#multiplot(p1, p2, p3, cols=2)
```

### Relative error for labeling rate bins
```{r, echo=FALSE}
outliers = NA
#plotList = list()
for(c in covs) {
  for(r in rls) {
    p = data %>% filter(cov==c, rl==r) %>% ggplot(., aes(factor(name), rel_error)) + geom_boxplot(outlier.shape = outliers) + coord_cartesian(ylim = c(-100,100)) +       geom_hline(yintercept = 0, linetype="dashed") + ylab("relative error [%]") + xlab("Coverage") + ggtitle(paste0(r, " bp reads, ", c, "X coverage"))
    print(p)
  #plotList[[length(plotList)+1]] = p
  }
}
#multiplot(plotlist = plotList, cols=2)
```

#### Scatter plots: simulated vs recovered labeling rates

```{r, echo=FALSE}
outliers = NA
#plotList = list()
for(r in c(38, 88)) {
  for(c in c(25, 100)) {
  
    p = data %>% filter(cov==c, rl==r) %>% ggplot(., aes(x=simulated, y=slamdunk))  + 
      geom_point(shape=1) + 
      geom_abline(intercept = 0, slope = 1,  linetype="dashed") + 
      coord_cartesian(ylim = c(0,1), xlim=c(0,1)) +
      ylab("Slamdunk") + xlab("Simulated") + ggtitle(paste0(r, " bp reads, ", c, "X coverage"))
    print(p)
    #plotList[[length(plotList)+1]] = p
  }
}
#multiplot(plotlist = plotList, cols=2)
```

#### Median relative error
```{r}

for(r in rls) {
  p = data %>% filter(rl==r) %>% ggplot(., aes(x=cov, y=rel_error)) + 
    stat_summary(fun.y = median, geom = "point") +
    coord_cartesian(ylim = c(0,20)) +
    ylab("relative error [%]") + xlab("Coverage") + ggtitle(paste0(r, " bp reads"))
  print(p)
}

```

```{r, eval=FALSE, include=FALSE}
coverage = read.table("/project/ngs/philipp/slamdunk-analysis/paper/paper_runs_coverage/34359_An312_wt-2n_mRNA-slamseq-autoquant_1h-R3_trimmed.fq_slamdunk_mapped_filtered_multicov.bed")

coverage$cov = (coverage$V7 * 38) / (coverage$V3 - coverage$V2)
hist(coverage[coverage$cov < 100 & coverage$V7 > 10,]$cov, breaks=seq(0,101,by=5))
```


```{r}

data %>% group_by(simulated, cov, rl) %>% count(rel_error < 20)

data %>% filter(rl == 88, cov == 100) %>% group_by(simulated, cov, rl) %>% count(abs_error < 0.05)

data %>% filter(rl == 88, cov == 100) %>% count(abs_error < 0.05)
dataUnique %>% filter(rl == 88, cov == 100) %>% count(abs_error < 0.05)

for(rl in c(38, 88, 138)) {
  for(cov in c(50, 100, 200)) {

    data %>% filter(rl == rl, cov == cov) %>% ggplot(., aes(abs_error)) + xlim(-1, 1) + geom_histogram(binwidth = 0.01)
    
  }
}

# A histogram of bill sizes
data %>% filter(rl, cov == 50 | cov == 100 | cov == 200) %>% ggplot(., aes(x=abs_error)) + xlim(-0.25, 0.25) + geom_histogram(binwidth = 0.01) + facet_grid(rl ~ cov)

data %>% filter(rl, cov == 50 | cov == 100 | cov == 200) %>% ggplot(., aes(x=rel_error)) + xlim(-25, 25) + geom_histogram(binwidth = 1) + facet_grid(rl ~ cov)


```

```{r}



```


```{r}

stats = read.table("/project/ngs/philipp/slamdunk-analysis/paper/paper_runs/slamdunk_mESC_full_timecourse_1/summary.txt", header=T)

stats$mapped_perc = stats$Mapped / stats$Sequenced

stats[stats$SampleType == "chase",]$SampleTime = 1440 - stats[stats$SampleType == "chase",]$SampleTime

plot(stats$mapped_perc ~ stats$SampleTime)
cor(stats$mapped_perc, stats$SampleTime)

```

