Tag Cluster Annotation
Committed names
- Piero Carninci
- Laurens Wilming
- Timo Lassmann
- Richard Baldarelli
- Juha Kere
- Leonard Lipovich (long ncRNA promoters, sense-antisense pair promoters, bidirectional promoters, global human lncRNAome and sense-antisense coordinates)
- Boris Lenhard (enhancers)
- Alison Meynert (Ensembl gene models)
- Sarah Djebali
Annotations & assignments
NB: we will need to be careful about 0-based versus 1-based coordinates.
Given that the majority of people in FANTOM5 seem to be UCSC-focused, the proposal is to use the 0-based UCSC format with the 'chr' prefix. Any Ensembl or other 1-based annotations will need to be adjusted in the output, and a note added to the flat-file headers/README.
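As a rough illustration of the adjustment described above, here is a minimal Python sketch. It assumes the UCSC convention is 0-based, half-open (as in BED files) and the Ensembl convention is 1-based, fully closed. The function names are hypothetical, and only plain chromosome names plus the mitochondrial "MT" to "chrM" case are handled; unplaced scaffolds would need a proper lookup table.

```python
def to_ucsc_chrom(name):
    """Add the 'chr' prefix used by UCSC to an Ensembl-style chromosome name."""
    if name.startswith("chr"):
        return name          # already UCSC-style
    if name == "MT":
        return "chrM"        # Ensembl 'MT' corresponds to UCSC 'chrM'
    return "chr" + name


def to_ucsc_interval(chrom, start, end):
    """Convert a 1-based, closed interval to a 0-based, half-open one.

    Only the start shifts by one; the end keeps its numeric value because
    it changes from inclusive to exclusive.
    """
    if start < 1 or end < start:
        raise ValueError("expected a 1-based closed interval with start <= end")
    return to_ucsc_chrom(chrom), start - 1, end


def to_ucsc_position(chrom, pos):
    """Convert a single 1-based position to its 0-based equivalent."""
    if pos < 1:
        raise ValueError("expected a 1-based position")
    return to_ucsc_chrom(chrom), pos - 1
```

The interval length is preserved by the conversion: `end - (start - 1)` equals the closed-interval length `end - start + 1`.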
Output requirements/formats
Proposal - flat file format
Tab-delimited file in OSCtable format, with separate files for each species. There could also be separate files per annotation class, but it should be possible to concatenate all files for a species into one without re-formatting.
Here is an initial sketch of a possible flat-file format (scroll right to see it all):
##
## Contact name = Ann Otator
## Contact e-mail = ann.otator@institute
## Description = General annotations (or could be just e.g. Ensembl gene models, lncRNAs, enhancers)
## Species = Homo sapiens
## NCBI taxon id = 9606
## Release = FANTOM5 UPDATE_009
##
Tag_cluster_id	Library_id	Annotation_class	Annotation_type	Annotation_id	Distance	Chr	Tag_cluster_pos	Tag_cluster_strand	Annotation_start	Annotation_end	Annotation_strand
TSC000001	CNhs11772	CORE_PROMOTER	ENSEMBL_TRANSCRIPT	ENST00000000001	-30	chr1	1234153	+	1234183	1238888	+
TSC000002	.	3_PRIME_UTR	UCSC_TRANSCRIPT	GENE1	.	chr2	1234124	+	1234124	1289193	+
TSC000003	.	CORE_PROMOTER	LONG_NC_RNA	LEONARD01	-2	chr3	4848484	-	4848400	4848482	-
TSC000004	CNhs11334	LONG_RANGE_REGULATION	VISTA_ENHANCER	VISTA001	.	chr4	3893493	+	3893400	3893499	-
TSC000005	.	EXTENDED_PROMOTER	ENSEMBL_TRANSCRIPT	ENST00000000002	-503	chr5	3485928	+	3482000	3486431	-
If the library id is given, the tag cluster is associated with that specific library; otherwise, it is associated with the aggregate of all libraries. It is possible that some annotations will not be required on a per-library basis.
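To make the proposed layout concrete, the following Python sketch reads such a file. It assumes that each "##" header sits on its own line as "## key = value", that the first non-header line holds the column names, that fields are tab-separated, and that "." marks an unspecified value (for example, a tag cluster associated with the aggregate of all libraries). The function name `read_annotation_table` is hypothetical and not part of any existing OSCtable tooling.

```python
import csv
import io


def read_annotation_table(handle):
    """Return (metadata, rows) parsed from a '##'-headed tab-delimited file."""
    metadata = {}
    data_lines = []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith("##"):
            # Header line: split on the first '=' only; bare '##' is skipped.
            key, sep, value = line[2:].partition("=")
            if sep:
                metadata[key.strip()] = value.strip()
        elif line.strip():
            data_lines.append(line)
    # The first data line supplies the column names.
    reader = csv.DictReader(data_lines, delimiter="\t")
    rows = [
        {column: (None if value == "." else value) for column, value in row.items()}
        for row in reader
    ]
    return metadata, rows


# Small example built from two rows of the sketch, with a shortened header.
columns = ["Tag_cluster_id", "Library_id", "Annotation_class", "Distance"]
example = "\n".join([
    "##",
    "## Species = Homo sapiens",
    "## NCBI taxon id = 9606",
    "##",
    "\t".join(columns),
    "\t".join(["TSC000001", "CNhs11772", "CORE_PROMOTER", "-30"]),
    "\t".join(["TSC000002", ".", "3_PRIME_UTR", "."]),
]) + "\n"

metadata, rows = read_annotation_table(io.StringIO(example))
# Rows with Library_id of None apply to the aggregate of all libraries.
aggregate_rows = [row for row in rows if row["Library_id"] is None]
```

All values are kept as strings here; converting positions and distances to integers is left to the caller.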
For some types of annotation (e.g. core promoter, extended promoter), we will want to include the distance from the tag cluster reference position to the annotation position (e.g. the annotated TSS of a protein-coding gene). For other types (e.g. 3' UTR, exonic), it is enough to know that the tag cluster reference position overlaps the annotation, and the distance can be left unspecified.
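One possible way to compute the distance column is sketched below in Python. The sign convention is an assumption, since none is defined here: the annotation position is taken to be the annotation start on the "+" strand and the annotation end on the "-" strand, and negative values mean the tag cluster lies upstream of it. This reproduces the -30 and -2 values of the TSC000001 and TSC000003 rows in the sketch; the illustrative TSC000005 row would come out as +503, so the convention still needs to be agreed. The function names are hypothetical.

```python
def signed_distance(cluster_pos, ann_start, ann_end, ann_strand):
    """Distance from the annotation TSS to the tag cluster reference position.

    Assumed convention: negative = upstream of the TSS, relative to the
    annotation's own strand. All positions must share one coordinate system.
    """
    if ann_strand == "+":
        return cluster_pos - ann_start   # TSS is the leftmost position
    if ann_strand == "-":
        return ann_end - cluster_pos     # TSS is the rightmost position
    raise ValueError("strand must be '+' or '-'")


def format_distance(distance):
    """Render a distance for the flat file, using '.' when unspecified."""
    return "." if distance is None else str(distance)
```

For overlap-only annotation classes such as 3' UTR, the caller would pass `None` to `format_distance` rather than computing a value.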
Proposal - MySQL database schema
Is this something that people would find useful? Here's a very de-normalized schema - another option is to have one table per species and just follow the flat-file format.
CREATE TABLE library (
library_id VARCHAR(9) NOT NULL,
species VARCHAR(40) NOT NULL,
taxon_id INT(10) UNSIGNED NOT NULL,
source ENUM('primary cell', 'cell line', 'tissue', 'timecourse', 'quality control') NOT NULL,
description VARCHAR(100) NOT NULL,
PRIMARY KEY (library_id)
);
CREATE TABLE annotation (
annotation_id INT(10) UNSIGNED NOT NULL,
taxon_id INT(10) UNSIGNED NOT NULL,
class VARCHAR(40) NOT NULL,
source VARCHAR(40) NOT NULL,
type VARCHAR(40) NOT NULL,
external_id VARCHAR(40) NOT NULL,
chr VARCHAR(40) NOT NULL,
pos INT(10) UNSIGNED NOT NULL,
PRIMARY KEY (annotation_id),
KEY annotation_idx (class, source, type, external_id),
KEY location_idx (chr, pos)
);
CREATE TABLE tag_cluster (
library_id VARCHAR(9) NOT NULL,
tag_cluster_id VARCHAR(40) NOT NULL,
chr VARCHAR(40) NOT NULL,
pos INT(10) UNSIGNED NOT NULL,
PRIMARY KEY (tag_cluster_id),
KEY location_idx (chr, pos)
);
CREATE TABLE tag_cluster_annotation (
annotation_id INT(10) UNSIGNED NOT NULL,
tag_cluster_id VARCHAR(40) NOT NULL,
KEY annotation_idx (annotation_id),
KEY tag_cluster_idx (tag_cluster_id)
);
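To show how the link table would be used, here is a small Python sketch that joins tag clusters to their annotations. It uses the standard-library sqlite3 module with a cut-down SQLite adaptation of three of the tables, purely as a stand-in for MySQL: the column types are simplified, the secondary indexes are omitted, and the inserted row is the illustrative TSC000001 example from the flat-file sketch rather than real data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified SQLite versions of the proposed tables (assumption: only the
# columns needed for the join are kept).
conn.executescript("""
CREATE TABLE annotation (
    annotation_id INTEGER PRIMARY KEY,
    class TEXT NOT NULL,
    type TEXT NOT NULL,
    external_id TEXT NOT NULL
);
CREATE TABLE tag_cluster (
    tag_cluster_id TEXT PRIMARY KEY,
    chr TEXT NOT NULL,
    pos INTEGER NOT NULL
);
CREATE TABLE tag_cluster_annotation (
    annotation_id INTEGER NOT NULL,
    tag_cluster_id TEXT NOT NULL
);
""")
conn.execute(
    "INSERT INTO annotation VALUES (?, ?, ?, ?)",
    (1, "CORE_PROMOTER", "ENSEMBL_TRANSCRIPT", "ENST00000000001"),
)
conn.execute("INSERT INTO tag_cluster VALUES (?, ?, ?)", ("TSC000001", "chr1", 1234153))
conn.execute("INSERT INTO tag_cluster_annotation VALUES (?, ?)", (1, "TSC000001"))

# All tag clusters carrying an annotation of a given class.
query = """
SELECT tc.tag_cluster_id, tc.chr, tc.pos, a.class, a.type, a.external_id
FROM tag_cluster AS tc
JOIN tag_cluster_annotation AS tca ON tca.tag_cluster_id = tc.tag_cluster_id
JOIN annotation AS a ON a.annotation_id = tca.annotation_id
WHERE a.class = ?
"""
result = conn.execute(query, ("CORE_PROMOTER",)).fetchall()
```

The same SELECT statement should carry over to the MySQL schema with the parameter placeholder changed to whatever the chosen MySQL driver expects.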
Milestones
- Agreement on annotations to use (Working group notes from Piero)
- Set out annotation list and any definitions (e.g. core promoter vs. extended promoter) on this wiki page
- Assignment of annotation types to participants
- Agreement on output format - ASAP
- Annotation of release 009 clusters using agreed strategy
- Are we waiting on the results of the tag cluster competition or is there a test set of clusters that we can start working on?
- Annotation of data freeze 1 - ASAP after freeze