Tag Cluster Annotation
Committed names
- Piero Carninci
- Laurens Wilming
- Timo Lassmann (non-supervised classification into TSS / non-TSS clusters)
- Richard Baldarelli
- Juha Kere
- Leonard Lipovich (long ncRNA promoters, sense-antisense pair promoters, bidirectional promoters; global human lncRNAome: Media:F5_human_lncRNAome(Jia&Lipovich_Gencode_Lander).xls; sense-antisense coordinates: Media:F5_human_sense-antisense_pairs_hg19.zip)
- Boris Lenhard (enhancers)
- Alison Meynert (Ensembl gene models)
- Sarah Djebali
- Yulia Medvedeva (CpG islands, DNA methylation, Repeats)
- Nicolas Bertin (Transcript_model_derived_annotation_protocol)
Annotations & assignments
NB: we will need to be careful about 0-based and 1-based coordinates
Given that the majority of people in FANTOM5 seem to be UCSC focused, we propose using the 0-based UCSC format with the 'chr' prefix. Any Ensembl or other 1-based annotations will need to be adjusted in the output, with a note added to the flat file headers/README.
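To make the coordinate adjustment concrete, here is a minimal sketch of converting a 1-based, fully closed interval (Ensembl style) into a 0-based, half-open interval (UCSC BED style). The function name is hypothetical, and the "MT" to "chrM" renaming is an assumption about how mitochondrial sequences are named in the two resources; any other non-standard sequence names would need their own mapping.

```python
def one_based_to_bed(chrom, start, end):
    """Convert a 1-based, fully closed interval (Ensembl style) into a
    0-based, half-open interval (UCSC BED style) with a 'chr' prefix."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end, got %r-%r" % (start, end))
    if chrom == "MT":
        # Assumption: the mitochondrial genome is 'MT' in Ensembl, 'chrM' in UCSC.
        chrom = "chrM"
    elif not chrom.startswith("chr"):
        chrom = "chr" + chrom
    # Only the start shifts by one; the end coordinate keeps the same value.
    return chrom, start - 1, end
```

For instance, the first 100 bases of chromosome 1 are 1-100 in 1-based coordinates and chr1 0-100 in BED coordinates.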
Promoter annotations of "UPDATE_012 Decomposition-based Peak Identification (DPI) cluster"
Annotation datasets
With the exception of F5_human_lncRNAome and TBP_JASPAR_CORE_MA0108.2, all source files were obtained from:
- F5's UCSC mirror
- F5's ZENBU instance
RefSeq, Ensembl and gencode
Note that Gencode corresponds to gencode V7 (added by Kawaji-san in GencodeV7, while the 2011-06-20 UCSC dump still contains gencode V4).
Non-coding and protein-coding transcripts were split separately into the following sub-categories:
INPUT.non_coding.first_exon.bed INPUT.non_coding.inner_exon.bed INPUT.non_coding.last_exon.bed
INPUT.non_coding.first_intron.bed INPUT.non_coding.inner_intron.bed INPUT.non_coding.last_intron.bed
INPUT.non_coding.tss.bed INPUT.non_coding.tss_upstream1000.bed INPUT.non_coding.tss_upstream100.bed
INPUT.non_coding.tss_upstream500.bed INPUT.protein_coding.tss.bed INPUT.protein_coding.first_exon.bed
INPUT.protein_coding.inner_exon.bed INPUT.protein_coding.last_exon.bed INPUT.protein_coding.first_exon_in_3UTR.bed
INPUT.protein_coding.inner_exon_in_3UTR.bed INPUT.protein_coding.last_exon_in_3UTR.bed INPUT.protein_coding.first_exon_in_5UTR.bed
INPUT.protein_coding.inner_exon_in_5UTR.bed INPUT.protein_coding.last_exon_in_5UTR.bed INPUT.protein_coding.first_exon_in_CDS.bed
INPUT.protein_coding.inner_exon_in_CDS.bed INPUT.protein_coding.last_exon_in_CDS.bed INPUT.protein_coding.first_intron.bed
INPUT.protein_coding.inner_intron.bed INPUT.protein_coding.last_intron.bed INPUT.protein_coding.first_intron_in_3UTR.bed
INPUT.protein_coding.inner_intron_in_3UTR.bed INPUT.protein_coding.last_intron_in_3UTR.bed INPUT.protein_coding.first_intron_in_5UTR.bed
INPUT.protein_coding.inner_intron_in_5UTR.bed INPUT.protein_coding.last_intron_in_5UTR.bed INPUT.protein_coding.first_intron_in_CDS.bed
INPUT.protein_coding.inner_intron_in_CDS.bed INPUT.protein_coding.last_intron_in_CDS.bed INPUT.protein_coding.tss_upstream1000.bed
INPUT.protein_coding.tss_upstream100.bed INPUT.protein_coding.tss_upstream500.bed
In addition to the non-coding and protein-coding classes, the gencode annotations also contain a pseudo-gene class.
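As an illustration of how the tss and tss_upstream100/500/1000 sub-categories relate to a transcript, here is a minimal strand-aware sketch in 0-based BED coordinates. The function names are hypothetical, and it is an assumption that the upstream window excludes the TSS base itself and is clipped at position 0; the actual files may have been built with a different convention (see the annotation set building pipeline).

```python
def tss_interval(chrom, start, end, name, strand):
    """Return the 1 bp BED6 interval covering the TSS of a transcript."""
    if strand == "+":
        return (chrom, start, start + 1, name, 0, strand)
    # On the minus strand the TSS is the last base of the interval.
    return (chrom, end - 1, end, name, 0, strand)


def tss_upstream(chrom, start, end, name, strand, size):
    """Return the BED6 window of `size` bp immediately upstream of the TSS.

    Assumption: the TSS base itself is not part of the window."""
    if strand == "+":
        return (chrom, max(0, start - size), start, name, 0, strand)
    # Upstream of a minus-strand TSS means larger genomic coordinates.
    return (chrom, end, end + size, name, 0, strand)
```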
lncRNA (provided by Leonard)
F5_human_lncRNAome(Jia&Lipovich_Gencode)BED.zip
INPUT.first_exon.bed INPUT.inner_exon.bed INPUT.last_exon.bed INPUT.first_intron.bed INPUT.inner_intron.bed INPUT.last_intron.bed INPUT.tss.bed INPUT.tss_upstream1000.bed INPUT.tss_upstream100.bed INPUT.tss_upstream500.bed
UCSC ESTs
Overlapping EST exonic sequences have been merged; only the names of up to 3 ESTs are reported.
INPUT.exon.bed INPUT.intron.bed INPUT.tss.bed
INPUT.tss_upstream1000.bed INPUT.tss_upstream100.bed INPUT.tss_upstream500.bed
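The EST merging step can be sketched as follows. This is not the script used to build the files; it only illustrates the idea under stated assumptions: intervals are merged when they overlap by at least 1 bp on the same chromosome and strand, and the first 3 names in sorted order are the ones kept. The function name and the tuple layout are hypothetical.

```python
def merge_ests(intervals, max_names=3):
    """Merge overlapping (chrom, start, end, name, strand) intervals.

    Assumptions: merging is strand-specific, book-ended intervals are not
    merged, and at most `max_names` names are reported per merged block."""
    merged = []
    ordered = sorted(intervals, key=lambda r: (r[0], r[4], r[1], r[2]))
    for chrom, start, end, name, strand in ordered:
        last = merged[-1] if merged else None
        if (last and last["chrom"] == chrom and last["strand"] == strand
                and start < last["end"]):
            # Extend the current block and record the name if there is room.
            last["end"] = max(last["end"], end)
            if len(last["names"]) < max_names:
                last["names"].append(name)
        else:
            merged.append({"chrom": chrom, "start": start, "end": end,
                           "names": [name], "strand": strand})
    return [(m["chrom"], m["start"], m["end"], ",".join(m["names"]), m["strand"])
            for m in merged]
```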
UCSC all_mrna
INPUT.first_exon.bed INPUT.inner_exon.bed INPUT.last_exon.bed INPUT.first_intron.bed INPUT.inner_intron.bed INPUT.last_intron.bed INPUT.tss.bed INPUT.tss_upstream1000.bed INPUT.tss_upstream100.bed INPUT.tss_upstream500.bed
CpG island
CpG island coordinates extracted from the F5 UCSC mirror were expanded by 200 bp downstream in order to replicate the analysis in F3 Nat. Gen.:
"... we determined whether a ... CpG (within 200 bp) upstream of the start site of the clusters was present..."
TATA box
TATA box coordinates were extracted from Michiel's whole-genome scan with the JASPAR_CORE TBP_MA0108.2 Position Weight Matrix.
Only JASPAR_CORE TBP_MA0108.2 PWM matches with a ppvalue greater than 3.5 were retained.
This threshold might need to be revisited after Sebastian and Timo's analysis of TSS surrounding motifs. An overview of the impact of this threshold on gencode coding and non-coding derived TSS can be seen in CAGE-Tag-Cluster-Annotation-Building/TBP_JASPAR_CORE_MA0108.2.hg19.scanning_vs_gencodev7.tss.ppval_distrib.pdf
UCSC Repeat masker
repName, repClass and repFamily annotate each cluster
Annotation Set Building pipeline
Details of how the annotation sets have been built, along with the scripts used, can be found in:
- CAGE-Tag-Cluster-Annotation-Building.readme
- CAGE-Tag-Cluster-Annotation-Building
- In particular, the script CAGE-Tag-Cluster-Annotation-Building.split_bed12.pl is useful to break down a transcript BED12 file into sub-components (intron|exon, first|inner|last, 5UTR|CDS|3UTR)
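For readers unfamiliar with BED12, the breakdown into first/inner/last exons and introns can be sketched as below. This is not a port of split_bed12.pl; it is a minimal illustration that only uses the standard BED12 block fields (blockSizes, blockStarts) and leaves out the 5UTR|CDS|3UTR split. The function names and the label format are hypothetical, and it is an assumption that a single-exon transcript has its only exon labelled "first".

```python
def rank(index, count):
    """Label a block as first, inner or last in transcript orientation."""
    if index == 0:
        return "first"
    if index == count - 1:
        return "last"
    return "inner"


def split_bed12(line):
    """Split one BED12 line into labelled exon and intron intervals.

    Returns (chrom, start, end, name, label, strand) tuples, where label is
    e.g. 'first_exon' or 'inner_intron', ordered 5' to 3' along the transcript."""
    f = line.rstrip("\n").split("\t")
    chrom, tx_start, name, strand = f[0], int(f[1]), f[3], f[5]
    sizes = [int(x) for x in f[10].rstrip(",").split(",")]
    starts = [int(x) for x in f[11].rstrip(",").split(",")]
    # Block starts are relative to chromStart in BED12.
    exons = [(tx_start + s, tx_start + s + n) for s, n in zip(starts, sizes)]
    introns = [(exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1)]
    if strand == "-":
        # Minus-strand transcripts are read from the highest coordinate down.
        exons.reverse()
        introns.reverse()
    out = []
    for kind, blocks in (("exon", exons), ("intron", introns)):
        for i, (start, end) in enumerate(blocks):
            label = rank(i, len(blocks)) + "_" + kind
            out.append((chrom, start, end, name, label, strand))
    return out
```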
Annotation Pipeline
Human Decomposition-based Peak Identification (DPI) cluster
Mouse Decomposition-based Peak Identification (DPI) cluster
Annotation Results
Human Decomposition-based Peak Identification (DPI) cluster
- https://fantom5-collaboration.gsc.riken.jp/webdav/home/nbertin/CAGE-Tag-Cluster-Annotation_Aug11/tc.decompose_smoothing_merged.hg19.annotations/
- tc.decompose_smoothing_merged.hg19.CpGislands.annotated.osc.gz
- tc.decompose_smoothing_merged.hg19.EST.annotated.osc.gz
- tc.decompose_smoothing_merged.hg19.Ensembl.non_coding.annotated.sym.osc.gz
- tc.decompose_smoothing_merged.hg19.Ensembl.protein_coding.annotated.sym.osc.gz
- tc.decompose_smoothing_merged.hg19.F5_human_lncRNAome.annotated.osc.gz
- tc.decompose_smoothing_merged.hg19.RefSeq.non_coding.annotated.sym.osc.gz
- tc.decompose_smoothing_merged.hg19.RefSeq.protein_coding.annotated.sym.osc.gz
- tc.decompose_smoothing_merged.hg19.TBP_JASPAR_CORE_MA0108.2.annotated.osc.gz
- tc.decompose_smoothing_merged.hg19.gencode-pseudo.annotated.sym.osc.gz
- tc.decompose_smoothing_merged.hg19.gencode.non_coding.annotated.sym.osc.gz
- tc.decompose_smoothing_merged.hg19.gencode.protein_coding.annotated.sym.osc.gz
- tc.decompose_smoothing_merged.hg19.knownGene.non_coding.annotated.sym.osc.gz
- tc.decompose_smoothing_merged.hg19.knownGene.protein_coding.annotated.sym.osc.gz
- tc.decompose_smoothing_merged.hg19.mRNA.annotated.osc.gz
- tc.decompose_smoothing_merged.hg19.rmsk.annotated.repClass.repFamily.osc.gz
Mouse Decomposition-based Peak Identification (DPI) cluster
- https://fantom5-collaboration.gsc.riken.jp/webdav/home/nbertin/CAGE-Tag-Cluster-Annotation_Aug11/tc.decompose_smoothing_merged.mm9.annotations/
- tc.decompose_smoothing_merged.mm9.CpGislands.annotated.osc.gz
- tc.decompose_smoothing_merged.mm9.EST.annotated.osc.gz
- tc.decompose_smoothing_merged.mm9.Ensembl.non_coding.annotated.sym.osc.gz
- tc.decompose_smoothing_merged.mm9.Ensembl.protein_coding.annotated.sym.osc.gz
- tc.decompose_smoothing_merged.mm9.RefSeq.non_coding.annotated.sym.osc.gz
- tc.decompose_smoothing_merged.mm9.RefSeq.protein_coding.annotated.sym.osc.gz
- tc.decompose_smoothing_merged.mm9.TBP_JASPAR_CORE_MA0108.2.annotated.osc.gz
- tc.decompose_smoothing_merged.mm9.knownGene.non_coding.annotated.sym.osc.gz
- tc.decompose_smoothing_merged.mm9.knownGene.protein_coding.annotated.sym.osc.gz
- tc.decompose_smoothing_merged.mm9.mRNA.annotated.osc.gz
- tc.decompose_smoothing_merged.mm9.rmsk.annotated.repClass.repFamily.osc.gz
Using CAGE-Tag-Cluster-Annotation.sh
File:CAGE-Tag-Cluster-Annotation.sh.zip
CAGE-Tag-Cluster-Annotation.sh
File:CAGE-Tag-Cluster-Annotation-Bin.zip
CAGE-Tag-Cluster-Annotation.sh accompanying required BedTools binaries
CAGE-Tag-Cluster-Annotation.sh annotation files
Please note that this is now version 1.8, which brings a significant speed improvement when dealing with input that is already in a BED-like format, and also allows adding a column numbering the lines according to their input order (the annotated output otherwise has a different order, following that of the ordered matching annotations).
usage: ./CAGE-Tag-Cluster-Annotation.sh -o <OSCfile to annotate> -a <file listing the annotation to be used>
Uses a combination of calls to BedTools (intersectBed, groupBy, etc.) to annotate clusters
produced by the F5 HelicosCAGE clustering pipeline. The input is either an OSCfile (-o or
STDIN), which assumes a precise ordering of the genome coordinate columns followed by
tab-delimited feature-wise metadata (see details of the -o option below), or a BED-like file
(-b), which uses the BED ordering of the genome coordinate columns and also accommodates
additional tab-delimited columns (see details of the -b option below).
Clusters are annotated with respect to an ordered list of BED6 formatted annotation files.
Those files are to be listed in a tab-delimited text file (-a) which, along with the path to
each BED6 formatted annotation file, contains the relative orientation of the cluster and
annotation to be considered ('sense' or 'antisense'), the name to be reported, and whether
clusters overlapping a given annotation should be filtered out or still be taken into
consideration when overlapping another annotation further down the ordered list
(for more details, see the description of the -a option or the illustrative examples
below).
Additionally, it is possible to trigger the summation of (numerical) columns (-s) by
aggregating over the annotation names (-n) and/or annotation classes (-c) of the list of
BED6 formatted annotations files (See examples below for more details).
version: 1.8
OPTIONS:
-h Show this message.
-v Verbose level (1-9).
-w Show the commented lines of this script.
SIMPLE CLUSTER ANNOTATIONS
-o Path to the (optionally gzip-compressed) OSCfile to be annotated
The OSCfile is likely obtained from F5 HelicosCAGE clustering pipeline,
therefore, it is assumed that :
* The file contains a header, each line of which is marked by "##".
* Comment lines, which are reported but whose position relative to
the data might be altered, are marked by "#".
* The first non "#|##" line contains a short description of the column
content.
* Columns are tab delimited.
* The ordering of the columns is fixed and corresponds to :
id, chrom, start.0base, end, strand, raw.<exp_name>, norm.<exp_name>
See the -b parameter for alternative BED compliant column ordering
Note: this actually works as long as the first columns are ordered as
<something>, chrom, start.0base, end, strand
* Positions are 0-based (see column ordering expectations).
* 'chrom' names must be similar to those defined in the annotation files.
* The file may be in a gzip-compressed format.
-b Path to the (optionally gzip-compressed) BED6-like formatted OSCfile.
The -b option is an alternative to using F5 HelicosCAGE clustering pipeline OSCfile,
in which the ordering of the columns is fixed and corresponds to :
chrom, start.0base, end, id, score, strand
Note that the column order is preserved in the annotated output OSCfile.
Similarly to the -o parameter: 'chrom' names must be similar to those of the annotation
files, the BED6-like file can contain comment lines marked by '#' and can be in a
gzip-compressed form.
-a Path to the file containing the list of sorted BED6 annotation file paths
This list is to be sorted in the order in which potential annotations need to be
looked up. This ordering is crucial and allows for prioritizing one of several
possible annotations: for example, a cluster matching the promoter of a transcript
may also overlap the intron of another transcript, the latter annotation being
somewhat less relevant than the first one (note that this can be turned off by setting
the filter flag to anything but '1', see below).
* This file must be a tab delimited txt file with the following 4 columns
1) A name for the annotation (e.g. prom500,S gencode_exon,AS ...) which
succinctly describes the content of this annotation.
This string will be appended to the annotated OSCfile sent to STDOUT,
in its second-to-last column.
2) The path to the BED6 formatted file holding the annotation
3) The relative orientation of the annotation and input data (sense or
antisense).
Accepted values are "sense", "1", "S", or "antisense", "-1", "AS".
4) A flag which, when set to "1", filters the overlapping data out of the
subsequent sequential annotation look-ups.
* This file may contain comments marked by the character "#" that will be ignored
when parsed for execution of this script
* The basename of this file will be used as the header associated metadata in :
- ParameterValue[annotation_list_path] = <basename of this file>
- ParameterValue[annotation_list] = <basename of this file>
- ColumnVariable[annotation.<basename of this file>.class] = the class of ...
- ColumnVariable[annotation.<basename of this file>.names] = comma delimited ...
-g Flag allowing the grouping of annotation names sharing the same class (e.g. prom, UTR).
It is recommended to use this flag in order to obtain a single concise annotation for
each given cluster.
-z Flag triggering the reporting of the annotation classes only (i.e. without annotation names).
-x Specifies the string used to mark "intergenic" clusters, i.e. clusters that did not
overlap with any of the listed annotations.
-y Adds a column holding the line number (not counting comment and header lines) of the input OSC or
BED6-like file. This option is useful to recover the initial sorting of the input file from
the annotated output OSCfile (which is otherwise sorted in part along the order of the
matched annotations provided by the -a parameter).
This line number is added just after the last column of the input data and right before
the annotations.
ColumnVariable[InputFile.DataLineNum] = <line_number> is added to the output OSC header.
ADDITIONAL OPTIONS FOR CLUSTER/ANNOTATIONS-DERIVED EXPRESSION LEVEL
-s Comma-delimited list of the indexes of the columns to be summed up. When given, one of the flags -n, -c
or -nc becomes mandatory.
-n Flag triggering the aggregation by the name of the overlapping annotation (e.g. NM_2345).
-c Flag triggering the aggregation by the class of the overlapping annotation (e.g.
prom500,S UTR,AS etc.). Note that giving the list of comma-delimited column indexes to
be summed up (option -s) becomes mandatory.
BUG NEEDING FIXING
[Feb 2nd 2011] When piping successive calls to ./CAGE-Tag-Cluster-Annotation.sh,
blank lines are inserted between the header and the actual data.
While this is not an issue for most uses of the script, it may cause
problems when parsing the file into ZENBU.
EXAMPLES
1) ./CAGE-Tag-Cluster-Annotation.sh -o cluster.osc -a refgene.some_annotation_list.txt -g > cluster.refgene_annotated.osc
Will annotate clusters with respect to refgene coding and non-coding, proximal
promoter regions, introns, exons, UTRs, ...
Clusters overlapping the annotations of one or more transcripts (say XYZ overlapping the
proximal promoter regions of NM_12345 and NM_23456, and ABC overlapping an exon of NM_56789) will be annotated as
"XYZ XYZ_coord prom500,S NM_12345,NM_23456"
"ABC ABC_coord exon,S NM_56789,"
2) ./CAGE-Tag-Cluster-Annotation.sh -o cluster.osc -a refgene.some_annotation_list.txt > cluster.refgene_split_annotation.osc
Same as example 1) but omitting the -g flag;
clusters will be annotated as "XYZ XYZ_coord prom500,S NM_12345"
"XYZ XYZ_coord prom500,S NM_23456"
"ABC ABC_coord exon,S NM_56789,"
3) ./CAGE-Tag-Cluster-Annotation.sh -o cluster.osc -a refgene.some_annotation_list.txt -g | ./CAGE-Tag-Cluster-Annotation.sh \
-a repeat.annotation_list.txt -g -x NA > cluster.refgene_and_repeat_annotated.osc
Pipes the result of example 1) into a second round of annotations
(e.g. repeat elements), reporting "NA" (specified using -x) when no overlap
was found. Clusters will be annotated as
"XYZ XYZ_coord prom500,S NM_12345,NM_23456 repeat LINE"
"XYZ XYZ_coord prom500,S NM_23456 repeat LINE"
"ABC ABC_coord exon,S NM_56789, NA NA"
4) ./CAGE-Tag-Cluster-Annotation.sh -o cluster.osc -a refgene.prom_UTR.txt -n -s 6,7 > cluster.refgene_expression.osc
Provided refgene.prom_UTR.txt lists a BED file of refgene promoters
and a BED file of UTRs, this will gather the gene expression level
in tag count and tpm (6th and 7th columns in cluster.osc), in other
words sum up the expression of all the clusters overlapping each
annotation. This results in an OSC file containing "NM_12345 50 0.25"
"NM_23456 26 0.13"
5) ./CAGE-Tag-Cluster-Annotation.sh -o cluster.osc -a refgene.prom_UTR.txt -nc -s 6,7 > cluster.refgene_subfeature_expression.osc
Same as above but adding the flag -c;
expression will be reported as "prom500,S NM_12345 45 0.225"
"5UTR,S NM_12345 4 0.02"
"5UTR,AS NM_12345 1 0.005"
"prom500,S NM_23456 25 0.125"
"5UTR,S NM_23456 1 0.005"
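The role of the ordered annotation list (-a) and of its filter flag can be illustrated without BedTools. The sketch below is not the script's implementation: it is a pure-Python approximation in which the function names, the in-memory feature dictionary and the default "intergenic" label are all hypothetical. It parses the 4-column list, then walks the annotations in order, dropping a cluster from later look-ups once it has matched an annotation whose filter flag is "1".

```python
SENSE = {"sense", "1", "S"}
ANTISENSE = {"antisense", "-1", "AS"}


def parse_annotation_list(lines):
    """Parse the tab-delimited annotation list; '#' comment lines are skipped.

    Returns (name, path, sense, filter_out) tuples in file order."""
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue
        name, path, orientation, flag = line.split("\t")[:4]
        if orientation in SENSE:
            sense = True
        elif orientation in ANTISENSE:
            sense = False
        else:
            raise ValueError("unknown orientation: %r" % orientation)
        entries.append((name, path, sense, flag == "1"))
    return entries


def overlaps(cluster, feature, sense):
    """True if cluster and feature overlap in the requested orientation."""
    c_chrom, c_start, c_end, c_strand = cluster
    f_chrom, f_start, f_end, _name, f_strand = feature
    same_strand = c_strand == f_strand
    return (c_chrom == f_chrom and c_start < f_end and f_start < c_end
            and same_strand == sense)


def annotate(clusters, entries, features_by_path, intergenic="intergenic"):
    """Annotate clusters {id: (chrom, start, end, strand)} in list order.

    features_by_path maps each listed path to (chrom, start, end, name,
    strand) tuples, standing in for the BED6 files (an assumption made to
    keep the sketch self-contained)."""
    result = {cid: [] for cid in clusters}
    filtered = set()
    for name, path, sense, filter_out in entries:
        hit_ids = set()
        for cid, cluster in clusters.items():
            if cid in filtered:
                continue
            hits = [f[3] for f in features_by_path[path]
                    if overlaps(cluster, f, sense)]
            if hits:
                result[cid].append((name, ",".join(hits)))
                hit_ids.add(cid)
        if filter_out:
            # Matched clusters are excluded from annotations further down.
            filtered |= hit_ids
    for annots in result.values():
        if not annots:
            annots.append((intergenic, ""))
    return result
```

With a promoter list placed before an intron list and both flagged "1", a cluster overlapping both is reported as a promoter cluster only, which mirrors the prioritization described for the -a option.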
Proposed decision tree
Based on the notes Piero took during the meeting, here is the potential decision tree.
A version integrating CAGEscan to fine-tune the association cluster to transcript:
And a more detailed splitting up of transcript substructures (exonic | intronic, 5'UTR | 3'UTR, CDS, etc.):
Output requirements/formats
Proposal - flat file format
OSCtable format tab-delimited file, separate files for each species, could also be separate files per annotation class, but should be able to mash together all files for a species into one without re-formatting.
Here is an initial sketch of a possible flat-file format:
##
## Contact name = Ann Otator
## Contact e-mail = ann.otator@institute
## Description = General annotations (or could be just e.g. Ensembl gene models, lncRNAs, enhancers)
## Species = Homo sapiens
## NCBI taxon id = 9606
## Release = FANTOM5 UPDATE_009
##
Tag_cluster_id  Library_id  Annotation_class       Annotation_type     Annotation_id    Distance  Chr   Tag_cluster_pos  Tag_cluster_strand  Annotation_start  Annotation_end  Annotation_strand
TSC000001       CNhs11772   CORE_PROMOTER          ENSEMBL_TRANSCRIPT  ENST00000000001  -30       chr1  1234153          +                   1234183           1238888         +
TSC000002       .           3_PRIME_UTR            UCSC_TRANSCRIPT     GENE1            .         chr2  1234124          +                   1234124           1289193         +
TSC000003       .           CORE_PROMOTER          LONG_NC_RNA         LEONARD01        -2        chr3  4848484          -                   4848400           4848482         -
TSC000004       CNhs11334   LONG_RANGE_REGULATION  VISTA_ENHANCER      VISTA001         .         chr4  3893493          +                   3893400           3893499         -
TSC000005       .           EXTENDED_PROMOTER      ENSEMBL_TRANSCRIPT  ENST00000000002  -503      chr5  3485928          +                   3482000           3486431         -
If the library id is given, the tag cluster is associated with that specific library; otherwise, it is associated with the aggregate of all libraries. It is possible that some annotations will not be required on a per-library basis.
For some types of annotation (e.g. core promoter, extended promoter), we will want to include the distance from the tag cluster reference position to the annotation position (e.g. annotated protein-coding gene TSS). For other types (e.g. 3' UTR, exonic), it's enough to know that the tag cluster reference position overlaps that annotation, and the distance can be left unspecified.
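The distance convention has not been agreed yet. As a starting point for discussion, here is a minimal sketch of one possible convention: the annotation TSS is taken as Annotation_start on the plus strand and Annotation_end on the minus strand, and the sign of the offset is oriented along the tag cluster strand. This is an assumption, not a decision; it merely happens to reproduce the Distance values of the three promoter rows in the sketch above. The function names are hypothetical.

```python
def annotation_tss(ann_start, ann_end, ann_strand):
    """TSS of an annotation: its start on '+', its end on '-' (assumption)."""
    return ann_start if ann_strand == "+" else ann_end


def signed_distance(cluster_pos, cluster_strand, ann_start, ann_end, ann_strand):
    """Offset from the annotation TSS to the tag cluster reference position.

    Negative values mean the cluster lies upstream of the TSS when read
    along the tag cluster strand (assumed convention)."""
    offset = cluster_pos - annotation_tss(ann_start, ann_end, ann_strand)
    return offset if cluster_strand == "+" else -offset
```

For annotation types where only the overlap matters (e.g. 3' UTR), the Distance column would simply stay as ".".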
[nicolas] The cluster will probably be in OSCtable format, it might be better to take advantage of the few existing columns (coordinates, expression level and confidence level) and just append annotations to them. You can have a look at the already existing F5 annotation pipeline output.
Proposal - MySQL database schema
Is this something that people would find useful? Here's a very de-normalized schema - another option is to have one table per species and just follow the flat-file format.
CREATE TABLE library (
library_id VARCHAR(9) NOT NULL,
species VARCHAR(40) NOT NULL,
taxon_id INT(10) UNSIGNED NOT NULL,
source ENUM('primary cell', 'cell line', 'tissue', 'timecourse', 'quality control') NOT NULL,
description VARCHAR(100) NOT NULL,
PRIMARY KEY (library_id)
);
CREATE TABLE annotation (
annotation_id INT(10) UNSIGNED NOT NULL,
taxon_id INT(10) UNSIGNED NOT NULL,
class VARCHAR(40) NOT NULL,
source VARCHAR(40) NOT NULL,
type VARCHAR(40) NOT NULL,
external_id VARCHAR(40) NOT NULL,
chr VARCHAR(40) NOT NULL,
pos VARCHAR(40) NOT NULL,
PRIMARY KEY (annotation_id),
KEY annotation_idx (class, source, type, external_id),
KEY location_idx (chr, pos)
);
CREATE TABLE tag_cluster (
library_id VARCHAR(9) NOT NULL,
tag_cluster_id VARCHAR(40) NOT NULL,
chr VARCHAR(40) NOT NULL,
pos INT(10) UNSIGNED NOT NULL,
PRIMARY KEY (tag_cluster_id),
KEY location_idx (chr, pos)
);
CREATE TABLE tag_cluster_annotation (
annotation_id INT(10) UNSIGNED NOT NULL,
tag_cluster_id VARCHAR(40) NOT NULL,
KEY annotation_idx (annotation_id),
KEY tag_cluster_idx (tag_cluster_id)
);
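To help judge whether such a schema would be useful, here is a small sketch of the kind of query it supports: listing all annotations attached to a given tag cluster. To stay self-contained it uses Python's built-in sqlite3 module with an in-memory database rather than MySQL, so the table definitions are adapted (no ENUM or inline KEY clauses, and the library table is left out); the index names, the demo row values and the function names are hypothetical.

```python
import sqlite3

# SQLite adaptation of three of the proposed tables (an assumption made for
# this sketch; the proposal itself targets MySQL).
SCHEMA = """
CREATE TABLE annotation (
    annotation_id INTEGER PRIMARY KEY,
    taxon_id INTEGER NOT NULL,
    class TEXT NOT NULL,
    source TEXT NOT NULL,
    type TEXT NOT NULL,
    external_id TEXT NOT NULL,
    chr TEXT NOT NULL,
    pos TEXT NOT NULL
);
CREATE TABLE tag_cluster (
    library_id TEXT NOT NULL,
    tag_cluster_id TEXT PRIMARY KEY,
    chr TEXT NOT NULL,
    pos INTEGER NOT NULL
);
CREATE TABLE tag_cluster_annotation (
    annotation_id INTEGER NOT NULL,
    tag_cluster_id TEXT NOT NULL
);
CREATE INDEX tca_annotation_idx ON tag_cluster_annotation (annotation_id);
CREATE INDEX tca_tag_cluster_idx ON tag_cluster_annotation (tag_cluster_id);
"""


def build_demo_db():
    """Create an in-memory database holding one made-up cluster/annotation pair."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO tag_cluster VALUES (?, ?, ?, ?)",
                 ("CNhs11772", "TSC000001", "chr1", 1234153))
    conn.execute("INSERT INTO annotation VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
                 (1, 9606, "CORE_PROMOTER", "ENSEMBL", "ENSEMBL_TRANSCRIPT",
                  "ENST00000000001", "chr1", "1234183"))
    conn.execute("INSERT INTO tag_cluster_annotation VALUES (?, ?)",
                 (1, "TSC000001"))
    return conn


def annotations_for_cluster(conn, tag_cluster_id):
    """Return (class, type, external_id) rows for one tag cluster."""
    cursor = conn.execute(
        """SELECT a.class, a.type, a.external_id
           FROM tag_cluster_annotation AS tca
           JOIN annotation AS a ON a.annotation_id = tca.annotation_id
           WHERE tca.tag_cluster_id = ?
           ORDER BY a.annotation_id""",
        (tag_cluster_id,))
    return cursor.fetchall()
```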
Milestones
- Agreement on annotations to use (Working group notes from Piero)
- Set out annotation list and any definitions (e.g. core promoter vs. extended promoter) on this wiki page
- Assignment of annotation types to participants
- Agreement on output format - ASAP
- Annotation of release 009 clusters using agreed strategy
- Are we waiting on the results of the tag cluster competition or is there a test set of clusters that we can start working on?
- Annotation of data freeze 1 - ASAP after freeze