R --vanilla < alphabet_chrom.R
--f <input file> --k <number_of_epi_letter> --d <dictionary_file> --r <0|1> --o <multiple_epigenome_filename>
Example: R --vanilla < alphabet_chrom.R --f example.txt --k 3 --d epi_letter.dict --r 0 --o example.epi
Create a text file containing the acronyms you want to use for the epi-letters, one letter per row, and pass it with the --d parameter (e.g. epi_letter.dict). The --r parameter controls whether random epi-letter-represented epigenomes are created (0 = no, 1 = yes; default 0).
The script writes the epigenomes of all chromatin marks, in epi-letter representation, to a single file (named by the --o parameter).
It also creates the look-up dictionary (.dict), which lists all tiles with their coordinates, signals and assigned letter_ID, and the epi-letter string file (.dna) for each individual mark. The coordinate file (.coor) is created for use in the next step.
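As a minimal sketch of preparing the --d file, the snippet below writes a three-row dictionary and checks that the number of rows equals the value intended for --k. The acronyms (LOW, MID, HIGH) are hypothetical, and the assumption that the row count should equal --k is ours; alphabet_chrom.R may or may not enforce it.

```shell
#!/bin/sh
# Hypothetical acronyms: replace them with your own, one per row.
dict=epi_letter.dict
k=3
printf '%s\n' LOW MID HIGH > "$dict"

# Count non-empty rows and compare with the intended --k value
# (assumption: one dictionary row per epi-letter).
rows=$(grep -c . "$dict")
if [ "$rows" -eq "$k" ]; then
    echo "OK: $rows letters in $dict"
else
    echo "Mismatch: $rows rows in $dict but k=$k" >&2
fi
```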
3.1 Scanning for the epigenetic patterns
perl epimotif_scanning.pl -f <multiple_epigenome_filename> (currently only supports column patterns)
Example: perl epimotif_scanning.pl -f example.epi
It will create a file with the ".cols" extension that lists all column patterns, and a ".cols.freq" file with the frequency of appearance of each pattern.
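To illustrate what a column pattern is, here is a toy sketch in shell and awk. It assumes a simplified input with one row per chromatin mark and one epi-letter per tile, so that a column pattern is the string of letters read down all marks at one tile position. The real .epi, .cols and .cols.freq formats may differ, and the helper name column_patterns is hypothetical; this is not the logic of epimotif_scanning.pl.

```shell
#!/bin/sh
# Toy input (assumed format): three marks (rows), six tiles (columns).
toy=$(mktemp)
printf '%s\n' AABAAB ABBABB CABCAB > "$toy"

# column_patterns FILE: print one pattern per tile position, built by
# reading the letter at that position down all rows.
column_patterns() {
    awk '{
             for (i = 1; i <= length($0); i++) {
                 col[i] = col[i] substr($0, i, 1)
             }
             if (length($0) > n) n = length($0)
         }
         END { for (i = 1; i <= n; i++) print col[i] }' "$1"
}

patterns=$(column_patterns "$toy")
rm -f "$toy"

# Frequency of each distinct pattern (count first, then pattern).
printf '%s\n' "$patterns" | sort | uniq -c
```

In this toy input the six tiles yield three distinct patterns (AAC, ABA and BBB), each occurring twice.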
3.2 Removing repeated patterns
Repeated patterns are removed so that the Hamming distance between patterns can be computed efficiently. Use R to make a unique column file, for example:
write.table(unique(read.table("example.epi.cols.freq")), "example.epi.cols.freq.uniq", sep = "\t", quote=F, row.names=F, col.names=F)
Or use the shell command line as follows:
sort example.epi.cols.freq | uniq > example.epi.cols.freq.uniq
example.epi.cols.freq.uniq is the file of unique column patterns. The original pattern file (example.epi.cols) is still needed to trace patterns back to their locations in the genome.
3.3 Computing Hamming distance matrix for clustering
perl hamming_distance.pl -f <column_pattern_file>
Example: perl hamming_distance.pl -f example.epi.cols.freq.uniq
It will output a .hamming file that can be used for clustering, for example with the k-means method in R in the next step.
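For reference, the Hamming distance between two equal-length patterns is the number of positions at which their letters differ. The sketch below computes a pairwise matrix with awk, assuming the pattern is the first whitespace-separated field of each line and that all patterns have the same length. The function name hamming_matrix and the tab-separated output layout are illustrative assumptions; the actual .hamming format written by hamming_distance.pl may differ.

```shell
#!/bin/sh
# hamming_matrix FILE: print a symmetric, tab-separated matrix of pairwise
# Hamming distances between the first fields of FILE
# (assumed to be equal-length strings; blank lines are skipped).
hamming_matrix() {
    awk 'NF { p[++n] = $1 }
         END {
             for (i = 1; i <= n; i++) {
                 row = ""
                 for (j = 1; j <= n; j++) {
                     d = 0
                     for (k = 1; k <= length(p[i]); k++) {
                         if (substr(p[i], k, 1) != substr(p[j], k, 1)) d++
                     }
                     row = row (j > 1 ? "\t" : "") d
                 }
                 print row
             }
         }' "$1"
}

# Toy input: three unique column patterns.
toy=$(mktemp)
printf '%s\n' AAC ABA BBB > "$toy"
matrix=$(hamming_matrix "$toy")
rm -f "$toy"
printf '%s\n' "$matrix"
```

For the toy input the diagonal is 0, AAC and ABA differ at two positions, and AAC and BBB differ at all three. The double loop is quadratic in the number of patterns, which is why deduplicating the patterns first keeps the matrix small.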
3.4 Clustering
R --vanilla < try_clustering.R
--f <hamming_distance_file> --u <unique_pattern_file> --c <column_pattern_file> --k <number_of_cluster>
Example: R --vanilla < try_clustering.R --f example.epi.cols.freq.uniq.hamming --u example.epi.cols.freq.uniq --c example.epi.cols --k 4
For each cluster it will output one file (named cluster_xx, where xx is the cluster_id) consisting of the patterns, coordinates and cluster_id. It also extracts the patterns (the 2nd column of the .logo file) for the logo representation in the next step.