Tag Cluster Annotation

Committed names

Piero Carninci
Laurens Wilming
Timo Lassmann
Richard Baldarelli
Juha Kere
Leonard Lipovich (long ncRNA promoters, sense-antisense pair promoters, bidirectional promoters, global human lncRNAome and sense-antisense coordinates)
Boris Lenhard (enhancers)
Alison Meynert (Ensembl gene models)
Sarah Djebali

Output requirements/formats

Proposal - flat file format

Tab-delimited files in OSCtable format, with separate files for each species. Files could also be split per annotation class, but it should be possible to mash together all files for a species into one without re-formatting.
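To illustrate the "mash together without re-formatting" requirement, here is a minimal Python sketch. It assumes each file consists of `##`-prefixed comment lines, then a single column header line, then data rows, and that all files for a species share the same header. The function name `merge_annotation_files` is hypothetical and not part of any agreed tooling.

```python
def merge_annotation_files(texts):
    """Concatenate several annotation files (given as strings) into one.

    Assumed layout per file: '##' comment lines, one column header line,
    then data rows. Comments from every file are collected at the top,
    the header is written once, and data rows follow in input order.
    """
    comments, header, rows = [], None, []
    for text in texts:
        seen_header = False
        for line in text.splitlines():
            if not line.strip():
                continue  # skip blank lines
            if line.startswith("##"):
                comments.append(line)
            elif not seen_header:
                seen_header = True
                if header is None:
                    header = line
                elif line != header:
                    raise ValueError("column headers differ between files")
            else:
                rows.append(line)
    header_lines = [header] if header is not None else []
    return "\n".join(comments + header_lines + rows) + "\n"
```

Moving all comment lines to the top is one possible choice; keeping per-file meta-data next to its rows would need a convention that has not been decided yet.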

Here is an initial sketch of a possible flat-file format:

##
## Comments and meta-data TBD
## Author: Ann Otator (ann.otator@institute)
## Description: General annotations (or could be just e.g. Ensembl gene models, lncRNAs, enhancers)
## Species: Homo sapiens
## NCBI taxon id: 9606
## FANTOM5 UPDATE_009
## 
Tag_cluster_id Library_id  Annotation_class      Annotation_type      Annotation_id   Cluster_ref_pos_distance
TSC000001      CNhs11772   CORE_PROMOTER         ENSEMBL_TRANSCRIPT   ENST00000000001 -30
TSC000002      .           3_PRIME_UTR           UCSC_TRANSCRIPT      GENE1           .
TSC000003      .           CORE_PROMOTER         LONG_NC_RNA          LEONARD01       -2
TSC000004      CNhs11334   LONG_RANGE_REGULATION VISTA_ENHANCER       VISTA001        .
TSC000005      .           EXTENDED_PROMOTER     ENSEMBL_TRANSCRIPT   ENST00000000002 -503

If the library id is given, the tag cluster is associated with that specific library; otherwise, it is associated with the aggregate of all libraries. It is possible that some annotations will not be required on a per-library basis.

For some types of annotation (e.g. core promoter, extended promoter), we will want to include the distance from the tag cluster reference position to the annotation position (e.g. the annotated TSS of a protein-coding gene). For other types (e.g. 3' UTR, exonic), it is enough to know that the tag cluster reference position overlaps the annotation, and the distance can be left unspecified (".").
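As a rough illustration of how a consumer might read this format, the following Python sketch parses a file into records. It assumes the columns are tab-delimited (the sample above is shown space-aligned for readability), that lines starting with `##` are comments, and that "." marks an unspecified value. The function name `parse_annotations` and the record keys are hypothetical.

```python
import csv
import io

MISSING = "."  # placeholder used in the sketch for unspecified values


def parse_annotations(text):
    """Parse the proposed flat-file format into a list of dicts.

    A library_id of None means the tag cluster is associated with the
    aggregate of all libraries; a distance of None means only overlap
    with the annotation is asserted.
    """
    lines = [ln for ln in text.splitlines()
             if ln.strip() and not ln.startswith("##")]
    reader = csv.DictReader(io.StringIO("\n".join(lines)), delimiter="\t")
    records = []
    for row in reader:
        library = row["Library_id"]
        distance = row["Cluster_ref_pos_distance"]
        records.append({
            "tag_cluster_id": row["Tag_cluster_id"],
            "library_id": None if library == MISSING else library,
            "annotation_class": row["Annotation_class"],
            "annotation_type": row["Annotation_type"],
            "annotation_id": row["Annotation_id"],
            "distance": None if distance == MISSING else int(distance),
        })
    return records
```

Mapping "." to `None` keeps the two cases (per-library vs. aggregate, distance vs. overlap only) explicit for downstream code.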

Proposal - MySQL database schema

Is this something that people would find useful? Here's a very de-normalized schema - another option is to have one table per species and just follow the flat-file format.

CREATE TABLE library (

  library_id  VARCHAR(9) NOT NULL,
  species     VARCHAR(40) NOT NULL,
  taxon_id    INT(10) UNSIGNED NOT NULL,
  source      ENUM('primary cell', 'cell line', 'tissue', 'timecourse', 'quality control') NOT NULL,
  description VARCHAR(100) NOT NULL,

  PRIMARY KEY (library_id)

);

CREATE TABLE annotation (

  annotation_id INT(10) UNSIGNED NOT NULL,
  taxon_id      INT(10) UNSIGNED NOT NULL,
  class         VARCHAR(40) NOT NULL,
  source        VARCHAR(40) NOT NULL,
  type          VARCHAR(40) NOT NULL,
  external_id   VARCHAR(40) NOT NULL,
  chr           VARCHAR(40) NOT NULL,
  pos           VARCHAR(40) NOT NULL,

  PRIMARY KEY (annotation_id),
  KEY annotation_idx (class, source, type, external_id),
  KEY location_idx (chr, pos)

);

CREATE TABLE tag_cluster (

  library_id     VARCHAR(9) NOT NULL,
  tag_cluster_id VARCHAR(40) NOT NULL,
  chr            VARCHAR(40) NOT NULL,
  pos            INT(10) UNSIGNED NOT NULL,

  PRIMARY KEY (tag_cluster_id),
  KEY location_idx (chr, pos)

);

CREATE TABLE tag_cluster_annotation (

  annotation_id  INT(10) UNSIGNED NOT NULL,
  tag_cluster_id VARCHAR(40) NOT NULL,

  KEY annotation_idx (annotation_id),
  KEY tag_cluster_idx (tag_cluster_id)
);
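To show how these tables would be joined in practice, here is a small Python sketch using the standard-library `sqlite3` module. It is not the MySQL schema itself: the tables are cut down to SQLite-compatible columns (no ENUM, no MySQL-specific types, no `library` table), and the chromosome and position values are made-up placeholders. The function name `annotations_for_cluster` is hypothetical.

```python
import sqlite3

# Simplified, SQLite-compatible stand-in for the proposed MySQL tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE annotation (
  annotation_id INTEGER PRIMARY KEY,
  class         TEXT NOT NULL,
  type          TEXT NOT NULL,
  external_id   TEXT NOT NULL
);
CREATE TABLE tag_cluster (
  tag_cluster_id TEXT PRIMARY KEY,
  library_id     TEXT NOT NULL,
  chr            TEXT NOT NULL,
  pos            INTEGER NOT NULL
);
CREATE TABLE tag_cluster_annotation (
  annotation_id  INTEGER NOT NULL,
  tag_cluster_id TEXT NOT NULL
);
""")

# Placeholder rows; the coordinates are illustrative only.
conn.execute("INSERT INTO annotation VALUES (1, 'CORE_PROMOTER', "
             "'ENSEMBL_TRANSCRIPT', 'ENST00000000001')")
conn.execute("INSERT INTO tag_cluster VALUES ('TSC000001', 'CNhs11772', 'chr1', 1000)")
conn.execute("INSERT INTO tag_cluster_annotation VALUES (1, 'TSC000001')")


def annotations_for_cluster(conn, tag_cluster_id):
    """Return (class, type, external_id) tuples linked to one tag cluster."""
    return conn.execute(
        """
        SELECT a.class, a.type, a.external_id
        FROM tag_cluster_annotation AS tca
        JOIN annotation AS a ON a.annotation_id = tca.annotation_id
        WHERE tca.tag_cluster_id = ?
        ORDER BY a.annotation_id
        """,
        (tag_cluster_id,),
    ).fetchall()
```

The same join should carry over to MySQL largely unchanged, with the parameter placeholder adapted to whichever client library is used.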

Milestones

Agreement on annotations to use (Working group notes from Piero)
- Set out annotation list and any definitions (e.g. core promoter vs. extended promoter) on this wiki page
- Assignment of annotation types to participants
- Agreement on output format - ASAP
Annotation of release 009 clusters using agreed strategy
- Are we waiting on the results of the tag cluster competition or is there a test set of clusters that we can start working on?
Annotation of data freeze 1 - ASAP after freeze
