Clustering evaluation: Difference between revisions

Revision as of 14:51, 8 July 2011

Initial comparison of clustering methods

Nine clustering methods were compared on the following metrics:

  • Length distribution
  • Pairwise correlation
  • Promoter ratio

A summary of the results is given in File:Cluster comparison 110627.pdf. Details follow below.


Input files

The following files, contributed by collaborators, were used for the analysis:


The following libraries were used for correlation and promoter ratio analysis:

CD4+

CNhs10853
CNhs11955
CNhs11998

CD14+

CNhs10852
CNhs11954
CNhs11997

Astrocyte-cerebellum

CNhs11321
CNhs12081
CNhs12117


Dendritic cells - monocyte immature derived (technical and donor replicates)

CNhs10855 and CNhs11062 (technical replicates for donor 1)
CNhs12195
CNhs12000


THP-1 biological replicates

CNhs10722
CNhs10723
CNhs10724


For these libraries and clusterings, tag count and TPM tables were created by Jessica Severin:


Size distributions

A density plot of log10(cluster_length) is provided in the summary pdf.
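As a rough illustration of what such a plot summarises, the sketch below bins log10(cluster length) and normalises the counts so that the bars integrate to 1. This is a histogram-based approximation chosen for simplicity, not necessarily the density estimator used for the summary pdf; the function name, the bin width and the example lengths are all hypothetical.

```python
import math
from collections import Counter


def log10_length_density(lengths, bin_width=0.25):
    """Return sorted (bin_start, density) pairs for log10(cluster length).

    Histogram-based density: counts per bin divided by
    (number of clusters * bin width), so the area sums to 1.
    """
    logs = [math.log10(n) for n in lengths if n > 0]
    if not logs:
        return []
    # Assign each value to the bin whose left edge is floor(x / width) * width.
    counts = Counter(math.floor(x / bin_width) for x in logs)
    total = len(logs)
    return [(b * bin_width, c / (total * bin_width))
            for b, c in sorted(counts.items())]


# Hypothetical cluster lengths in bp, for illustration only.
example_lengths = [2, 3, 15, 20, 50, 150, 200, 1500]
density = log10_length_density(example_lengths, bin_width=1.0)
```

Each returned pair gives the left edge of a log10 bin and the estimated density in that bin; plotting the pairs for every clustering method on shared axes gives a quick visual comparison of cluster sizes.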

Additional plots of size distribution (and basic refgene annotation) for each clustering method can be found at https://fantom5-collaboration.gsc.riken.jp/webdav/home/m.lizio/update11_clusters_size_distribution_plots/

Pairwise correlation

Pearson's correlation of log10(tag count + 1) was computed and plotted for all pairs of samples. In the scatter plots in the pdf, red circles indicate pairs of samples from the same cell type. Higher-resolution scatter plots are available in File:Png.zip, and tables of pairwise correlations are available in File:Csv.zip. The average correlation within each cell type is reported in the summary pdf.
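The correlation measure described above can be sketched as follows. This is a minimal illustration, assuming each sample is given as a list of tag counts over the same ordered set of clusters; the function name and the example counts are hypothetical and not taken from the actual analysis scripts.

```python
import math


def log_pearson(counts_a, counts_b):
    """Pearson's r between log10(count + 1) values of two samples."""
    if len(counts_a) != len(counts_b):
        raise ValueError("samples must cover the same clusters")
    xs = [math.log10(c + 1) for c in counts_a]
    ys = [math.log10(c + 1) for c in counts_b]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant samples")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical tag counts for two libraries over four clusters.
sample_a = [0, 12, 150, 900]
sample_b = [1, 10, 170, 850]
r = log_pearson(sample_a, sample_b)
```

Adding 1 before taking the logarithm keeps clusters with zero tags in the comparison instead of producing an undefined value.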

Pairwise correlation of the expression signal under the clusters was also examined for donor replicates and technical replicates (see page 3 onwards of https://fantom5-collaboration.gsc.riken.jp/wiki/images/9/9d/Cluster_comparison.pdf).

Promoter ratio

The ratio of clusters falling within 500 bp of RefSeq promoters was computed. Additionally, the amount of expression falling within the same promoter regions was computed and averaged over cell types for each clustering method. Lists of missed RefSeq promoters for each clustering method are reported in File:Missed prom.zip.
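One way to compute such a ratio is sketched below. It rests on several assumptions that the text does not specify: promoters are represented by single transcription start site (TSS) positions, a cluster counts as "within 500 bp" when any part of it lies within 500 bp of a TSS on the same chromosome, and strand is ignored. The function name and all coordinates are hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict


def promoter_ratio(clusters, tss_sites, window=500):
    """Fraction of clusters lying within `window` bp of any TSS.

    clusters:  iterable of (chrom, start, end) tuples
    tss_sites: iterable of (chrom, position) tuples
    Strand is ignored in this sketch (an assumption).
    """
    clusters = list(clusters)
    tss_by_chrom = defaultdict(list)
    for chrom, pos in tss_sites:
        tss_by_chrom[chrom].append(pos)
    for positions in tss_by_chrom.values():
        positions.sort()

    hits = 0
    for chrom, start, end in clusters:
        positions = tss_by_chrom.get(chrom, [])
        # First TSS at or after (start - window); it is a hit if it
        # also lies at or before (end + window).
        i = bisect_left(positions, start - window)
        if i < len(positions) and positions[i] <= end + window:
            hits += 1
    return hits / len(clusters) if clusters else 0.0


# Hypothetical clusters and TSS positions, for illustration only.
example_clusters = [("chr1", 100, 150), ("chr1", 5000, 5050), ("chr2", 900, 950)]
example_tss = [("chr1", 400), ("chr2", 3000)]
ratio = promoter_ratio(example_clusters, example_tss)
```

Sorting the TSS positions once per chromosome and using binary search keeps the lookup fast even for large cluster sets. The same hit test, applied from the promoter side, would give the list of promoters that no cluster reaches.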

Gencode transcriptome derived annotation

Irreproducible Discovery Rate (IDR)

To select signal in high-throughput experiments, one can use consistency between replicates, because genuine signals are expected to be reproducible. IDR is a statistical method that quantitatively measures the consistency between replicates and selects signals taking their reproducibility into account. For a proper description see: http://www.stat.berkeley.edu/tech-reports/790.pdf .

Below are the results from comparing the CD14+ donor 2 and donor 3 replicates using different clustering methods:


[Figure: IDR results for the CD14+ donor 2 and donor 3 replicates, by clustering method]

Lists of known loci missed by each method

The lists are available in File:Missed_prom.zip (https://fantom5-collaboration.gsc.riken.jp/wiki/index.php/File:Missed_prom.zip).

Specific gene loci of interest