Clustering evaluation


Initial comparison of clustering methods

Nine clustering methods were compared on the following metrics:

  • Length distribution
  • Pairwise correlation
  • Promoter ratio

A summary of the results is given in File:Cluster comparison.pdf. Details follow below.


Input files

The following files, contributed by collaborators, were used for the analysis:


The following libraries were used for correlation and promoter ratio analysis:

CD4+

CNhs10853
CNhs11955
CNhs11998

CD14+

CNhs10852
CNhs11954
CNhs11997

Astrocyte-cerebellum

CNhs11321
CNhs12081
CNhs12117


Dendritic cells - monocyte immature derived (technical and donor replicates)

CNhs10855 and CNhs11062 are technical replicates for donor1
CNhs12195
CNhs12000


THP-1 biological replicates

CNhs10722
CNhs10723
CNhs10724


For these libraries and clusterings, tag count and TPM tables were created by Jessica Severin:


Size distributions

A density plot of log10(cluster_length) is provided in the summary pdf.

Additional plots of size distribution (and basic refgene annotation) for each clustering method can be found at https://fantom5-collaboration.gsc.riken.jp/webdav/home/m.lizio/update11_clusters_size_distribution_plots/
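The log10(cluster_length) density mentioned above can be approximated with a simple histogram. The following is a minimal sketch, not the code used to produce the plots: it assumes clusters are given as (start, end) pairs with length end - start, and the function name, bin width, and toy coordinates are all hypothetical.

```python
import math
from collections import Counter

def log10_length_density(clusters, bin_width=0.5):
    """Histogram-based density of log10(cluster length).

    clusters: iterable of (start, end) pairs; length is taken as end - start
    (an assumption about the coordinate convention).
    Returns {left edge of bin: density}, so that sum(density) * bin_width == 1.
    """
    logs = [math.log10(end - start) for start, end in clusters]
    bins = Counter(math.floor(value / bin_width) for value in logs)
    n = len(logs)
    return {b * bin_width: count / (n * bin_width)
            for b, count in sorted(bins.items())}

# Toy coordinates (invented) giving lengths 1, 10, 20 and 100 bp.
toy_clusters = [(100, 101), (200, 210), (300, 320), (400, 500)]
density = log10_length_density(toy_clusters)
for left_edge, value in density.items():
    print(f"log10(length) in [{left_edge}, {left_edge + 0.5}): density {value}")
```

A kernel density estimate would give a smoother curve; the histogram is used here only because it needs no external libraries.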

Pairwise correlation

Pearson's correlation of log10(tag count + 1) was computed and plotted for all pairs of samples. Red circles in the scatter plots in the PDF indicate the same cell type. Higher-resolution scatter plots are available in File:Png.zip. Tables of pairwise correlations are available in File:Csv.zip. The average correlation within each cell type is reported in the summary PDF.
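The computation described above can be sketched as follows. This is an illustration under stated assumptions, not the original analysis script: the sample names, cell type labels, and tag counts are invented toy values, and the function names are hypothetical.

```python
import itertools
import math
from collections import defaultdict

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def pairwise_log_correlation(counts_by_sample):
    """Correlate log10(tag count + 1) for every pair of samples."""
    logged = {sample: [math.log10(c + 1) for c in counts]
              for sample, counts in counts_by_sample.items()}
    return {(a, b): pearson(logged[a], logged[b])
            for a, b in itertools.combinations(logged, 2)}

def mean_within_cell_type(correlations, cell_type_of):
    """Average the correlations of sample pairs that share a cell type."""
    grouped = defaultdict(list)
    for (a, b), r in correlations.items():
        if cell_type_of[a] == cell_type_of[b]:
            grouped[cell_type_of[a]].append(r)
    return {cell_type: sum(rs) / len(rs) for cell_type, rs in grouped.items()}

# Toy tag counts per cluster (invented); one list entry per cluster.
toy_counts = {
    "cd4_rep1": [0, 9, 99, 999],
    "cd4_rep2": [0, 9, 99, 999],
    "cd14_rep1": [999, 99, 9, 0],
}
toy_cell_type = {"cd4_rep1": "CD4+", "cd4_rep2": "CD4+", "cd14_rep1": "CD14+"}

correlations = pairwise_log_correlation(toy_counts)
within = mean_within_cell_type(correlations, toy_cell_type)
```

Adding 1 before taking the logarithm keeps clusters with zero tags in the comparison instead of producing log10(0).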


Promoter ratio

The ratio of clusters falling within 500 bp of RefSeq promoters was computed. Additionally, the amount of expression falling within the same promoter regions was computed and averaged over cell type for each clustering method. Lists of missed RefSeq promoters for each clustering method are reported in File:Missed prom.zip.
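A minimal sketch of these three quantities for a single sample is shown below. It rests on assumptions the text does not spell out: a promoter is represented by one TSS position, a cluster counts as a hit when a TSS on the same chromosome and strand lies within 500 bp of the cluster interval, and expression is a per-cluster TPM value. All names and coordinates are hypothetical toy values.

```python
import bisect
from collections import defaultdict

WINDOW_BP = 500  # distance threshold taken from the text

def build_tss_index(promoters):
    """Group (tss, name) entries by (chromosome, strand), sorted by position."""
    index = defaultdict(list)
    for name, chrom, strand, tss in promoters:
        index[(chrom, strand)].append((tss, name))
    for entries in index.values():
        entries.sort()
    return index

def promoters_near(index, chrom, strand, start, end, window=WINDOW_BP):
    """Names of promoters whose TSS lies in [start - window, end + window]."""
    entries = index.get((chrom, strand), [])
    lo = bisect.bisect_left(entries, (start - window, ""))
    hi = bisect.bisect_right(entries, (end + window, "\U0010ffff"))
    return [name for _, name in entries[lo:hi]]

def promoter_summary(clusters, promoters, window=WINDOW_BP):
    """Fraction of clusters and of expression near promoters, plus missed promoters."""
    index = build_tss_index(promoters)
    hit_promoters = set()
    clusters_hit = 0
    expression_hit = 0.0
    expression_total = 0.0
    for chrom, start, end, strand, tpm in clusters:
        expression_total += tpm
        near = promoters_near(index, chrom, strand, start, end, window)
        if near:
            clusters_hit += 1
            expression_hit += tpm
            hit_promoters.update(near)
    missed = sorted(name for name, *_ in promoters if name not in hit_promoters)
    return {
        "cluster_ratio": clusters_hit / len(clusters),
        "expression_ratio": expression_hit / expression_total,
        "missed": missed,
    }

# Toy promoters: (name, chromosome, strand, TSS position). Invented values.
toy_promoters = [("geneA", "chr1", "+", 1000),
                 ("geneB", "chr1", "-", 5000),
                 ("geneC", "chr2", "+", 300)]
# Toy clusters: (chromosome, start, end, strand, TPM). Invented values.
toy_clusters = [("chr1", 1200, 1250, "+", 70.0),
                ("chr1", 3000, 3020, "+", 20.0),
                ("chr1", 4600, 4650, "-", 5.0),
                ("chr2", 2000, 2010, "+", 5.0)]

summary = promoter_summary(toy_clusters, toy_promoters)
```

To reproduce the per-cell-type figures, the expression ratio would be computed per library and then averaged over the libraries of each cell type.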


Irreproducible Discovery Rate (IDR)

To select signal in high-throughput experiments, one can use the consistency between replicates, because genuine signals are expected to be reproducible between replicates. IDR is a statistical method that quantitatively measures the consistency between replicates and selects signals while taking their reproducibility into account. For a proper description see: http://www.stat.berkeley.edu/tech-reports/790.pdf
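The idea of rank consistency between replicates can be illustrated with a much simpler diagnostic than IDR itself. The sketch below does not implement the IDR model from the report linked above; it only computes, for a given top fraction t, the share of clusters ranked in the top t of both replicates. The function names and toy signal values are hypothetical.

```python
import math

def ranks_descending(values):
    """Rank 0 is the strongest signal; ties keep their input order."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0] * len(values)
    for rank, i in enumerate(order):
        ranks[i] = rank
    return ranks

def top_fraction_overlap(rep1, rep2, fractions):
    """For each fraction t, share of all items in the top t of both replicates."""
    n = len(rep1)
    ranks1 = ranks_descending(rep1)
    ranks2 = ranks_descending(rep2)
    overlaps = []
    for t in fractions:
        k = math.ceil(t * n)  # number of top-ranked items considered
        shared = sum(1 for a, b in zip(ranks1, ranks2) if a < k and b < k)
        overlaps.append(shared / n)
    return overlaps

# Toy signal values per cluster (invented).
toy_rep1 = [50, 40, 30, 20, 10, 5, 4, 3, 2, 1]
toy_rep2 = [2 * v for v in toy_rep1]   # same ranking as toy_rep1
toy_rep3 = toy_rep1[::-1]              # reversed ranking

consistent = top_fraction_overlap(toy_rep1, toy_rep2, [0.2, 0.5, 1.0])
inconsistent = top_fraction_overlap(toy_rep1, toy_rep3, [0.2, 0.5, 1.0])
```

For perfectly reproducible replicates the overlap equals t, while for unrelated rankings it falls well below t. IDR goes further by fitting a statistical model to the ranks and assigning each signal a reproducibility score.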

Specific gene loci of interest