Clustering evaluation

Initial comparison of clustering methods

Nine clustering methods were compared on three metrics:

Length distribution
Pairwise correlation
Promoter ratio

A summary of the results is given in File:Cluster comparison.pdf. Details follow below.

Input files

The following files, contributed by collaborators, were used for the analysis:

The following libraries were used for correlation and promoter ratio analysis:

CD4+

CNhs10853
CNhs11955
CNhs11998

CD14+

CNhs10852
CNhs11954
CNhs11997

Astrocyte-cerebellum

CNhs11321
CNhs12081
CNhs12117

Dendritic cells - monocyte immature derived (technical and donor replicates)

CNhs10855 and CNhs11062 (technical replicates for donor 1)
CNhs12195
CNhs12000

THP-1 (biological replicates)

CNhs10722
CNhs10723
CNhs10724

For these libraries and clusterings, tag count and TPM tables were created by Jessica Severin.
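The relation between a tag count table and a TPM (tags per million) table can be sketched as follows. This is a minimal illustration that assumes plain scaling by library size; the exact normalization used for the tables above is not stated here, and the cluster names are hypothetical.

```python
def counts_to_tpm(counts):
    """Scale one library's raw tag counts to tags per million (TPM).

    Assumption: simple library-size scaling, with no further normalization.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("library has no tags")
    return {cluster: 1e6 * n / total for cluster, n in counts.items()}


# Hypothetical tag counts for three clusters in a single library.
library = {"cluster_a": 30, "cluster_b": 10, "cluster_c": 60}
tpm = counts_to_tpm(library)
```

After scaling, the values of one library sum to one million, which makes libraries of different sequencing depth comparable.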

Size distributions

A density plot of log10(cluster_length) for each clustering method is provided in the summary PDF.
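As a rough sketch of what such a plot summarizes, the following computes a binned density of log10(cluster_length). The bin width and the example lengths are assumptions chosen for illustration; the actual plot may have used a smoothed kernel density instead of bins.

```python
import math


def log10_length_density(lengths, bin_width=0.5):
    """Return {bin_start: density} for log10(cluster length).

    Densities are scaled so that sum(density) * bin_width == 1.
    """
    logs = [math.log10(n) for n in lengths if n > 0]
    if not logs:
        return {}
    counts = {}
    for value in logs:
        # Left edge of the bin that contains this value.
        start = math.floor(value / bin_width) * bin_width
        counts[start] = counts.get(start, 0) + 1
    norm = len(logs) * bin_width
    return {start: c / norm for start, c in sorted(counts.items())}


# Hypothetical cluster lengths in base pairs.
density = log10_length_density([1, 10, 12, 100, 150, 1000])
```

Working on the log10 scale keeps single-base clusters and kilobase-wide clusters visible in the same plot.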

Additional plots of size distribution (and basic refgene annotation) for each clustering method can be found at https://fantom5-collaboration.gsc.riken.jp/webdav/home/m.lizio/update11_clusters_size_distribution_plots/

Pairwise correlation

Pearson's correlation of log10(tag count + 1) was computed and plotted for all pairs of samples. In the summary PDF, red circles in the scatter plots indicate the same cell type. Higher-resolution scatter plots are available in File:Png.zip. Tables of pairwise correlations are available in File:Csv.zip. The average correlation within each cell type is reported in the summary PDF.
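The computation described above can be sketched in a few lines. The sample names and counts are hypothetical, and each sample is assumed to be a list of tag counts over the same clusters in the same order.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


def pairwise_log_correlation(samples):
    """Correlate log10(tag count + 1) for every pair of samples."""
    logged = {name: [math.log10(c + 1) for c in counts]
              for name, counts in samples.items()}
    return {(a, b): pearson(logged[a], logged[b])
            for a, b in combinations(sorted(logged), 2)}


# Hypothetical tag counts over the same three clusters.
samples = {
    "sample_1": [0, 9, 99],
    "sample_2": [0, 99, 9999],
    "sample_3": [99, 9, 0],
}
correlations = pairwise_log_correlation(samples)
```

Adding 1 before taking the logarithm keeps clusters with zero tags in the comparison instead of producing an undefined value.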

Promoter ratio

The ratio of clusters falling within 500 bp of RefSeq promoters was computed. Additionally, the amount of expression falling within the same promoter regions was computed and averaged over cell type for each clustering method. Lists of missed RefSeq promoters for each clustering method are provided in File:Missed prom.zip.
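A minimal sketch of these three quantities follows. It assumes that "within 500 bp" means the cluster interval overlaps a window of 500 bp on either side of a promoter position, and it ignores strand; the original analysis may have defined the window differently. All coordinates and expression values are hypothetical.

```python
def near(cluster, chrom, tss, window=500):
    """True if the cluster overlaps [tss - window, tss + window] on chrom."""
    c_chrom, start, end = cluster[:3]
    return c_chrom == chrom and start <= tss + window and end >= tss - window


def promoter_ratios(clusters, promoters, window=500):
    """Return (ratio of clusters, ratio of expression) near any promoter.

    clusters: list of (chrom, start, end, expression)
    promoters: dict mapping chrom to a list of promoter positions
    """
    hits = [c for c in clusters
            if any(near(c, c[0], tss, window)
                   for tss in promoters.get(c[0], []))]
    total_expression = sum(c[3] for c in clusters)
    if not clusters or total_expression == 0:
        raise ValueError("no clusters or no expression to summarize")
    return (len(hits) / len(clusters),
            sum(c[3] for c in hits) / total_expression)


def missed_promoters(clusters, promoters, window=500):
    """List (chrom, position) of promoters with no cluster nearby."""
    return [(chrom, tss)
            for chrom, sites in promoters.items()
            for tss in sites
            if not any(near(c, chrom, tss, window) for c in clusters)]


# Hypothetical promoter positions and clusters.
promoters = {"chr1": [1000, 5000]}
clusters = [
    ("chr1", 900, 950, 80.0),
    ("chr1", 1600, 1650, 15.0),
    ("chr2", 1000, 1020, 5.0),
]
cluster_ratio, expression_ratio = promoter_ratios(clusters, promoters)
missed = missed_promoters(clusters, promoters)
```

In this toy example one of three clusters lies near a promoter but carries most of the expression, which is why the two ratios are reported separately.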

Irreproducible Discovery Rate (IDR)

To select signal in high-throughput experiments, one can use consistency between replicates, because genuine signals are expected to be reproducible. IDR is a statistical method that quantifies the consistency between replicates and selects signals taking their reproducibility into account. For a full description, see: http://www.stat.berkeley.edu/tech-reports/790.pdf
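The intuition behind using replicate consistency can be illustrated with a simple rank-overlap curve. This is not the IDR procedure itself, which fits a statistical model to the replicate ranks as described in the report linked above; it only shows how agreement among top-ranked signals can be measured. The signal names and values are hypothetical.

```python
def top_overlap_curve(rep1, rep2):
    """For each k, the fraction of all signals ranked in the top k of both replicates.

    rep1, rep2: dicts mapping signal name to signal strength.
    Illustration only; this is a plain rank overlap, not an IDR estimate.
    """
    names = sorted(set(rep1) & set(rep2))
    order1 = sorted(names, key=lambda s: rep1[s], reverse=True)
    order2 = sorted(names, key=lambda s: rep2[s], reverse=True)
    n = len(names)
    curve = []
    for k in range(1, n + 1):
        shared = len(set(order1[:k]) & set(order2[:k]))
        curve.append(shared / n)
    return curve


# Hypothetical signal strengths in two replicates.
replicate_1 = {"a": 100, "b": 50, "c": 5, "d": 1}
replicate_2 = {"a": 90, "b": 60, "c": 1, "d": 4}
curve = top_overlap_curve(replicate_1, replicate_2)
```

For perfectly reproducible replicates the curve rises by 1/n at every step. Here it stalls at k = 3, where the weak signals "c" and "d" swap ranks, which is the kind of inconsistency among low-ranked signals that a reproducibility threshold is meant to catch.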
