Clustering evaluation
Latest revision as of 15:34, 8 July 2011
Initial comparison of clustering methods
Nine clustering methods have been compared for various metrics, specifically:
- Length distribution
- Pairwise correlation
- Promoter ratio
A summary of the results is given in File:Cluster comparison 110627.pdf. Details follow below.
Input files
The following files, contributed by collaborators, were used for the analysis:
- Kawaji pooled decomposed smoothing - https://fantom5-collaboration.gsc.riken.jp/webdav/home/kawaji/110428-human-cage-cluster-UPDATE_011/pooled.tc.decompose_smoothing.bed9.gz
- OSC level2 - https://fantom5-collaboration.gsc.riken.jp/webdav/home/WP5/UPDATE_011_Tag_clusters/UPDATE_011_level2_hg19.osc.gz
- OSC level3 - https://fantom5-collaboration.gsc.riken.jp/webdav/home/WP5/UPDATE_011_Tag_clusters/UPDATE_011_level3_hg19.osc.gz
- TSC Balwierz - https://fantom5-collaboration.gsc.riken.jp/webdav/home/balwierz/TranscriptionStartClusters/TSC.hg19.bed
- FBKclust pooled - https://fantom5-collaboration.gsc.riken.jp/webdav/home/FBKclust/FBKclust_pooled_BED6.bed.gz
- Frith pclu - https://fantom5-collaboration.gsc.riken.jp/webdav/home/mcfrith/110425-pclu/hg19-pclu.bed
- Schmeier filtered - https://fantom5-collaboration.gsc.riken.jp/webdav/home/seb/filtered_hCAGE_clusters.bed.gz
- BASE - https://fantom5-collaboration.gsc.riken.jp/webdav/home/FA_Pro/FactorAnalysisResults_UPDATE011_up4M.bed
- Vanja - https://fantom5-collaboration.gsc.riken.jp/webdav/home/vhaberle/clusters/human.pooled/Pooled.all.CTSS.clusters.0.25.bed.gz
The following libraries were used for correlation and promoter ratio analysis:
CD4+
CNhs10853 CNhs11955 CNhs11998
CD14+
CNhs10852 CNhs11954 CNhs11997
Astrocyte-cerebellum
CNhs11321 CNhs12081 CNhs12117
Dendritic cells - monocyte immature derived (technical and donor replicates)
CNhs10855 CNhs11062 (these two are technical replicates for donor 1) CNhs12195 CNhs12000
THP-1 biological reps
CNhs10722 CNhs10723 CNhs10724
For these libraries and clusterings, tag count and TPM tables were created by Jessica Severin:
- https://fantom5-collaboration.gsc.riken.jp/webdav/home/severin/cluster_expression_tagcount/
- https://fantom5-collaboration.gsc.riken.jp/webdav/home/severin/cluster_expression_tpm/
Size distributions
A density plot of log10(cluster_length) is provided in the summary pdf.
Additional plots of size distribution (and basic refgene annotation) for each clustering method can be found at https://fantom5-collaboration.gsc.riken.jp/webdav/home/m.lizio/update11_clusters_size_distribution_plots/
Pairwise correlation
Pearson's correlation of log10(tag count + 1) was computed and plotted for all pairs of samples. Red circles in the scatter plots in the pdf mark pairs of samples from the same cell type. Higher resolution scatter plots are available in File:Png.zip. Tables of pairwise correlations are available in File:Csv.zip. The average correlation within each cell type is reported in the summary pdf.
Pairwise correlation of the expression signal under the clusters for donor replicates and technical replicates is shown from page 3 onwards of https://fantom5-collaboration.gsc.riken.jp/wiki/images/9/9d/Cluster_comparison.pdf.
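The correlation measure described above can be sketched as follows. This is a minimal illustration, not the script used for the analysis: the function name `log_pearson` and the plain-list input are assumptions, and each list is taken to hold the tag counts of one sample over the same ordered set of clusters.

```python
import math


def log_pearson(counts_a, counts_b):
    """Pearson's r between log10(count + 1) of two samples.

    counts_a and counts_b are assumed to list tag counts for the same
    clusters in the same order (hypothetical input layout).
    """
    if len(counts_a) != len(counts_b) or len(counts_a) < 2:
        raise ValueError("need two vectors of equal length >= 2")
    # The +1 pseudocount keeps clusters with zero tags in the comparison.
    xs = [math.log10(c + 1) for c in counts_a]
    ys = [math.log10(c + 1) for c in counts_b]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_x * var_y)
```

Calling this for every pair of libraries and averaging the values for pairs that share a cell type would give a within-cell-type figure comparable in spirit to the one reported in the summary pdf.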
Promoter ratio
The ratio of clusters falling within 500 bp of RefSeq promoters was computed. Additionally, the amount of expression falling within the same promoter regions was computed and averaged over cell type for each clustering method. Lists of missed RefSeq promoters for each clustering method are reported in File:Missed prom.zip.
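A rough sketch of the cluster-level ratio is given below. Several details are assumptions, since the page does not state them: each promoter is reduced to a single TSS coordinate, clusters are BED-style half-open intervals, strand is ignored, and a cluster counts as a hit when any of its bases lies within the window of a TSS. The name `promoter_ratio` is hypothetical.

```python
import bisect
from collections import defaultdict


def promoter_ratio(clusters, tss_sites, window=500):
    """Fraction of clusters lying within `window` bp of any TSS.

    clusters:  iterable of (chrom, start, end), half-open (assumed).
    tss_sites: iterable of (chrom, position) (assumed representation).
    """
    by_chrom = defaultdict(list)
    for chrom, pos in tss_sites:
        by_chrom[chrom].append(pos)
    for positions in by_chrom.values():
        positions.sort()

    clusters = list(clusters)
    if not clusters:
        return 0.0

    hits = 0
    for chrom, start, end in clusters:
        positions = by_chrom.get(chrom, [])
        # First TSS at or to the right of (start - window).
        i = bisect.bisect_left(positions, start - window)
        # It is a hit if that TSS is not beyond (last base + window).
        if i < len(positions) and positions[i] <= (end - 1) + window:
            hits += 1
    return hits / len(clusters)
```

Sorting the TSS positions once per chromosome and using a binary search keeps the check fast even for cluster sets with many entries.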
Gencode transcriptome derived annotation
Cluster input files in OSC format were taken from https://fantom5-collaboration.gsc.riken.jp/webdav/home/severin/cluster_expression_tpm/
Annotation of each cluster file was performed using the approach described in Tag Cluster Annotation, with Gencode (v7) as the basis for the annotations.
Details of the data processing are available from https://fantom5-collaboration.gsc.riken.jp/webdav/home/nbertin/CAGE-Tag-Clustering-Evaluation/readme.txt
Scripts used for the annotation and for generating the plots and summary csv files can be found in https://fantom5-collaboration.gsc.riken.jp/webdav/home/nbertin/CAGE-Tag-Clustering-Evaluation/Script
Resulting annotated clusters in OSC format similar to the input files with the addition of annotation columns can be found in https://fantom5-collaboration.gsc.riken.jp/webdav/home/nbertin/CAGE-Tag-Clustering-Evaluation/Gencode-Annotated-Cluster-OSCfiles/
- Coding, non-coding and "others" transcript annotations were performed independently, and a single annotation was then selected to simplify the general Gencode transcriptome derived annotation comparison of each cluster method. Together this resulted in the addition of the following columns to the original input file:
- ColumnVariable[annotation.gencode_coding.class] = the class of the annotation (e.g. prom,S UTR,AS ...) from the 1st column of gencode_coding
- ColumnVariable[annotation.gencode_coding.names] = comma delimited names of all the corresponding annotation entries (e.g. NM_12345,MN_12346, or ENST00000373448,ENST00000373447,)
- ColumnVariable[annotation.gencode_noncoding.class] = the class of the annotation (e.g. prom,S UTR,AS ...) from the 1st column of gencode_noncoding
- ColumnVariable[annotation.gencode_noncoding.names] = comma delimited names of all the corresponding annotation entries (e.g. NM_12345,MN_12346, or ENST00000373448,ENST00000373447,)
- ColumnVariable[annotation.gencode_other.class] = the class of the annotation (e.g. prom,S UTR,AS ...) from the 1st column of gencode_other
- ColumnVariable[annotation.gencode_other.names] = comma delimited names of all the corresponding annotation entries (e.g. NM_12345,MN_12346, or ENST00000373448,ENST00000373447,)
- ColumnVariable[annotation.gencode.class_summary] = the single annotation class selected amongst annotation.gencode_coding.class, annotation.gencode_noncoding.class and annotation.gencode_other.class. It is set to intergenic if all three are NA, and noncoding intron and exon annotations are prepended with NC_
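The selection of the summary class could look roughly like the sketch below. The page does not say in which order the three annotations are preferred, so the order used here (coding, then noncoding, then other) is an assumption, as are the literal class labels "intron" and "exon" and the function name `class_summary`. Only the intergenic fallback and the NC_ prefix come from the description above.

```python
NA = "NA"


def class_summary(coding, noncoding, other):
    """Pick one annotation class from the three per-category classes.

    Assumed priority: coding > noncoding > other (not stated on the page).
    """
    if coding != NA:
        return coding
    if noncoding != NA:
        # Noncoding intron and exon classes get an NC_ prefix;
        # the exact labels "intron" and "exon" are assumed here.
        if noncoding in ("intron", "exon"):
            return "NC_" + noncoding
        return noncoding
    if other != NA:
        return other
    # No annotation in any category.
    return "intergenic"
```

The readme.txt linked above is the place to check the actual precedence used when the annotated OSC files were produced.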
Two sets of plots and summary tab delimited files were made.
- A first set in which expression levels are not considered (https://fantom5-collaboration.gsc.riken.jp/webdav/home/nbertin/CAGE-Tag-Clustering-Evaluation/gencode_annotation_distribution.method_sorted.expressionless_ratio.pdf)
  - https://fantom5-collaboration.gsc.riken.jp/webdav/home/nbertin/CAGE-Tag-Clustering-Evaluation/gencode_annotation_distribution.expressionless_data.txt
- A second set taking into account the expression level in each of the libraries of the input OSC files
  - https://fantom5-collaboration.gsc.riken.jp/webdav/home/nbertin/CAGE-Tag-Clustering-Evaluation/gencode_annotation_distribution.using_expresion_data.txt
  - The figures derived from this latter set are either
    - organized by alphabetical order of the clustering method name (https://fantom5-collaboration.gsc.riken.jp/webdav/home/nbertin/CAGE-Tag-Clustering-Evaluation/gencode_annotation_distribution.method_sorted.*.pdf files)
    - or sorted by the summed ratio of proximal promoter + 5'UTR exon signal intensities (https://fantom5-collaboration.gsc.riken.jp/webdav/home/nbertin/CAGE-Tag-Clustering-Evaluation/gencode_annotation_distribution.canonical_exp_sorted.*.pdf files)
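The difference between the two sets can be illustrated with a small sketch: the same per-class tally is computed either by counting clusters or by summing their expression in one library. The function name `class_fractions` and the (class, expression) row layout are assumptions made for illustration and are not taken from the linked scripts.

```python
from collections import defaultdict


def class_fractions(rows, weighted=True):
    """Share of each annotation class for one clustering method and library.

    rows: iterable of (annotation_class, expression) pairs (assumed layout).
    weighted=False counts clusters; weighted=True sums expression.
    """
    totals = defaultdict(float)
    for annotation_class, expression in rows:
        totals[annotation_class] += expression if weighted else 1.0
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {cls: value / grand_total for cls, value in totals.items()}
```

Under this sketch, the sort key for the second kind of figure would be the sum of the fractions for the proximal promoter and 5'UTR exon classes, whatever labels those classes carry in the annotated files.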
Below is a quick overview of the alphabetically sorted annotation for each of the 18 libraries: Image:ClusterEval GencodeAnnot.summary.png
Irreproducible Discovery Rate (IDR)
To select signal in high throughput experiments one can use the consistency between replicates, because genuine signals are expected to be reproducible between replicates. IDR is a statistical method that quantitatively measures the consistency between replicates and selects signals taking their reproducibility into account. For a proper description see: http://www.stat.berkeley.edu/tech-reports/790.pdf .
Below are the results from comparing the CD14 donor 2 and donor 3 replicates using different clustering methods: File:CD14 donor23.jpeg
Lists of known loci missed by each method
Lists of known loci missed by each method (https://fantom5-collaboration.gsc.riken.jp/wiki/index.php/File:Missed_prom.zip)
Specific gene loci of interest
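- B4GALT1
  - example https://fantom5-collaboration.gsc.riken.jp/zenbu/gLyphs/#config=m-DqpFrNIlDvYQwONbZosD;loc=hg19::chr9:33096459..33181534
  - zoomed https://fantom5-collaboration.gsc.riken.jp/zenbu/gLyphs/#config=m-DqpFrNIlDvYQwONbZosD;loc=hg19::chr9:33166097..33168728
- ABCA1
  - example https://fantom5-collaboration.gsc.riken.jp/zenbu/gLyphs/#config=m-DqpFrNIlDvYQwONbZosD;loc=hg19::chr9:107506494..107727223
  - zoomed https://fantom5-collaboration.gsc.riken.jp/zenbu/gLyphs/#config=m-DqpFrNIlDvYQwONbZosD;loc=hg19::chr9:107690015..107690795
- MAFB
  - example https://fantom5-collaboration.gsc.riken.jp/zenbu/gLyphs/#config=m-DqpFrNIlDvYQwONbZosD;loc=hg19::chr20:39313674..39318715
  - zoomed https://fantom5-collaboration.gsc.riken.jp/zenbu/gLyphs/#config=m-DqpFrNIlDvYQwONbZosD;loc=hg19::chr20:39317825..39317924
- FN1
  - example https://fantom5-collaboration.gsc.riken.jp/zenbu/gLyphs/#config=m-DqpFrNIlDvYQwONbZosD;loc=hg19::chr2:216206272..216319694
  - zoomed https://fantom5-collaboration.gsc.riken.jp/zenbu/gLyphs/#config=m-DqpFrNIlDvYQwONbZosD;loc=hg19::chr2:216300166..216301803
- DUSP1
  - example https://fantom5-collaboration.gsc.riken.jp/zenbu/gLyphs/#config=m-DqpFrNIlDvYQwONbZosD;loc=hg19::chr5:172194326..172198977
  - zoomed https://fantom5-collaboration.gsc.riken.jp/zenbu/gLyphs/#config=m-DqpFrNIlDvYQwONbZosD;loc=hg19::chr5:172198150..172198249
- CEBPA
  - example https://fantom5-collaboration.gsc.riken.jp/zenbu/gLyphs/#config=m-DqpFrNIlDvYQwONbZosD;loc=hg19::chr19:33790339..33793915
  - zoomed https://fantom5-collaboration.gsc.riken.jp/zenbu/gLyphs/#config=m-DqpFrNIlDvYQwONbZosD;loc=hg19::chr19:33793385..33793484
- HERC1
  - example https://fantom5-collaboration.gsc.riken.jp/zenbu/gLyphs/#config=m-DqpFrNIlDvYQwONbZosD;loc=hg19::chr15:63844483..64182479
  - zoomed https://fantom5-collaboration.gsc.riken.jp/zenbu/gLyphs/#config=m-DqpFrNIlDvYQwONbZosD;loc=hg19::chr15:64125991..64126298
- HMGA1
  - example https://fantom5-collaboration.gsc.riken.jp/zenbu/gLyphs/#config=m-DqpFrNIlDvYQwONbZosD;loc=hg19::chr6:34202218..34216365
  - zoomed https://fantom5-collaboration.gsc.riken.jp/zenbu/gLyphs/#config=m-DqpFrNIlDvYQwONbZosD;loc=hg19::chr6:34204186..34205105
- IGF2
  - example (NOTE: IDR was run only on the CD14 samples, which do not express this gene, so nothing is called as reproducible) https://fantom5-collaboration.gsc.riken.jp/zenbu/gLyphs/#config=m-DqpFrNIlDvYQwONbZosD;loc=hg19::chr11:2145224..2175954
  - zoomed https://fantom5-collaboration.gsc.riken.jp/zenbu/gLyphs/#config=m-DqpFrNIlDvYQwONbZosD;loc=hg19::chr11:2150419..2160905
- IRF8
  - example https://fantom5-collaboration.gsc.riken.jp/zenbu/gLyphs/#config=m-DqpFrNIlDvYQwONbZosD;loc=hg19::chr16:85926913..85962071
  - zoomed https://fantom5-collaboration.gsc.riken.jp/zenbu/gLyphs/#config=m-DqpFrNIlDvYQwONbZosD;loc=hg19::chr16:85932059..85933147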
