User:Nbertin/Transcript model derived annotation protocol: Difference between revisions

Latest revision as of 20:15, 4 August 2011

Proposed here is an annotation of promoters based on their overlap with features of model transcripts (RefSeq for all organisms and gencode for human samples): proximal promoter region, 5' and 3' UTR, exon (coding, located in a UTR region, or associated with non-coding transcripts), and intron (similarly coding, located in a UTR region, or associated with non-coding transcripts).
The protocol and script described here use a combination of calls to BedTools (intersectBed, groupBy, etc.) to annotate clusters produced by the F5 HelicosCAGE clustering pipeline and formatted as an OSCfile (see User:Tlassmann/clustering for the details of the clustering pipeline) with respect to an ordered series of BED6 formatted annotations. This script will be run as part of the CAGE post-processing pipeline, and data files for each independent library will be provided in datasets updates.

Alternatively, you can use the script CAGE-Tag-Cluster-Annotation.sh to redo this annotation step with your own clustered data, with a different list of annotations or a different ordering of them, or to obtain promoter / 'gene' expression aggregated over the provided annotations.

Promoter annotations of "UPDATE_012 Decomposition-based Peak Identification (DPI) cluster "

https://fantom5-collaboration.gsc.riken.jp/webdav/home/nbertin/CAGE-Tag-Cluster-Annotation_Aug11/readme.txt

Annotation datasets

With the exception of F5_human_lncRNAome and TBP_JASPAR_CORE_MA0108.2, all source files were obtained from

F5's UCSC mirror :
- hgTables browser
- F5 UCSC mirror datafiles
- F5 UCSC mirror gencode V7 only files
F5's ZENBU instance

All the BED6 annotation files

RefSeq, Ensembl and gencode

Note that Gencode corresponds to gencode V7 (added by Kawaji-san in GencodeV7, while the 2011-06-20 UCSC dump still contains gencodeV4).
Non-coding and protein-coding transcripts were split separately into the following sub-categories :

    INPUT.non_coding.first_exon.bed                INPUT.non_coding.inner_exon.bed		  INPUT.non_coding.last_exon.bed
    INPUT.non_coding.first_intron.bed	            INPUT.non_coding.inner_intron.bed		  INPUT.non_coding.last_intron.bed
    INPUT.non_coding.tss.bed		            INPUT.non_coding.tss_upstream1000.bed	  INPUT.non_coding.tss_upstream100.bed
    INPUT.non_coding.tss_upstream500.bed           INPUT.protein_coding.tss.bed                  INPUT.protein_coding.first_exon.bed
    INPUT.protein_coding.inner_exon.bed            INPUT.protein_coding.last_exon.bed            INPUT.protein_coding.first_exon_in_3UTR.bed
    INPUT.protein_coding.inner_exon_in_3UTR.bed    INPUT.protein_coding.last_exon_in_3UTR.bed    INPUT.protein_coding.first_exon_in_5UTR.bed
    INPUT.protein_coding.inner_exon_in_5UTR.bed    INPUT.protein_coding.last_exon_in_5UTR.bed    INPUT.protein_coding.first_exon_in_CDS.bed
    INPUT.protein_coding.inner_exon_in_CDS.bed     INPUT.protein_coding.last_exon_in_CDS.bed     INPUT.protein_coding.first_intron.bed
    INPUT.protein_coding.inner_intron.bed          INPUT.protein_coding.last_intron.bed          INPUT.protein_coding.first_intron_in_3UTR.bed
    INPUT.protein_coding.inner_intron_in_3UTR.bed  INPUT.protein_coding.last_intron_in_3UTR.bed  INPUT.protein_coding.first_intron_in_5UTR.bed
    INPUT.protein_coding.inner_intron_in_5UTR.bed  INPUT.protein_coding.last_intron_in_5UTR.bed  INPUT.protein_coding.first_intron_in_CDS.bed
    INPUT.protein_coding.inner_intron_in_CDS.bed   INPUT.protein_coding.last_intron_in_CDS.bed   INPUT.protein_coding.tss_upstream1000.bed
    INPUT.protein_coding.tss_upstream100.bed       INPUT.protein_coding.tss_upstream500.bed

In addition to non-coding and protein-coding, gencode annotations also contain the pseudo-gene class
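The actual building scripts live in CAGE-Tag-Cluster-Annotation-Building. Purely as an illustration of what the tss and tss_upstreamN files might contain, here is a minimal sketch that assumes BED6 transcripts and assumes the upstream window is the N bases immediately 5' of the TSS. The toy transcript names, file names and the window definition are hypothetical and not taken from the pipeline.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: derive TSS and 500 bp upstream windows from transcript BED6.
# The exact window definition used by the real pipeline may differ.
set -euo pipefail
workdir=$(mktemp -d)

# Two toy transcripts (tab delimited BED6), one per strand.
printf 'chr1\t1000\t5000\tNM_TOY1\t0\t+\n'  > "$workdir/transcripts.bed"
printf 'chr1\t8000\t9000\tNM_TOY2\t0\t-\n' >> "$workdir/transcripts.bed"

# TSS: first transcribed base (start for '+', end-1 for '-'), 0-based half-open.
awk 'BEGIN { FS = OFS = "\t" }
     $6 == "+" { print $1, $2, $2 + 1, $4, $5, $6 }
     $6 == "-" { print $1, $3 - 1, $3, $4, $5, $6 }' \
    "$workdir/transcripts.bed" > "$workdir/toy.tss.bed"

# Upstream window: the w bases 5' of the TSS, clipped at position 0.
awk -v w=500 'BEGIN { FS = OFS = "\t" }
     $6 == "+" { s = $2 - w; if (s < 0) s = 0; print $1, s, $2, $4, $5, $6 }
     $6 == "-" { print $1, $3, $3 + w, $4, $5, $6 }' \
    "$workdir/transcripts.bed" > "$workdir/toy.tss_upstream500.bed"

cat "$workdir/toy.tss.bed" "$workdir/toy.tss_upstream500.bed"
```

Changing `w` to 100 or 1000 would give the other two window sizes under the same assumption.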

lncRNA (provided by Leonard)

F5_human_lncRNAome(Jia&Lipovich_Gencode)BED.zip

  INPUT.first_exon.bed       INPUT.inner_exon.bed        INPUT.last_exon.bed
  INPUT.first_intron.bed     INPUT.inner_intron.bed      INPUT.last_intron.bed
  INPUT.tss.bed              INPUT.tss_upstream1000.bed  INPUT.tss_upstream100.bed
  INPUT.tss_upstream500.bed

UCSC ESTs

Overlapping EST exonic sequences have been merged; only the names of up to 3 ESTs are reported.

    INPUT.exon.bed              INPUT.intron.bed           INPUT.tss.bed
    INPUT.tss_upstream1000.bed  INPUT.tss_upstream100.bed  INPUT.tss_upstream500.bed

UCSC all_mrna

  INPUT.first_exon.bed       INPUT.inner_exon.bed        INPUT.last_exon.bed
  INPUT.first_intron.bed     INPUT.inner_intron.bed      INPUT.last_intron.bed
  INPUT.tss.bed              INPUT.tss_upstream1000.bed  INPUT.tss_upstream100.bed
  INPUT.tss_upstream500.bed

CpG island

CpG island coordinates extracted from the F5 UCSC mirror were expanded by 200 bp downstream in order to replicate the analysis in F3 Nat. Gen. :
"... we determined whether a ... CpG (within 200 bp) upstream of the start site of the clusters was present..."

TATA box

TATA box coordinates were extracted from Michiel's whole-genome scan with the JASPAR_CORE TBP_MA0108.2 Position Weight Matrix.
Only JASPAR_CORE TBP_MA0108.2 PWM hits with a ppvalue greater than 3.5 were retained.
This threshold might need to be revisited after Sebastian and Timo's analysis of the motifs surrounding TSSs. An overview of the impact of this threshold on gencode coding and non-coding derived TSSs can be seen in CAGE-Tag-Cluster-Annotation-Building/TBP_JASPAR_CORE_MA0108.2.hg19.scanning_vs_gencodev7.tss.ppval_distrib.pdf
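The thresholding step can be sketched as a one-line awk filter. This is only an illustration: it assumes the scan results are available as BED6 with the ppvalue stored in the score column (column 5), which the page does not state, and the hit names and values are invented.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: keep only PWM hits whose score (assumed column 5) exceeds a threshold.
set -euo pipefail
workdir=$(mktemp -d)

# Toy scan output; scores are invented for illustration.
printf 'chr1\t100\t115\tTBP_hit1\t2.9\t+\n'  > "$workdir/tbp_scan.bed"
printf 'chr1\t400\t415\tTBP_hit2\t3.5\t-\n' >> "$workdir/tbp_scan.bed"
printf 'chr2\t900\t915\tTBP_hit3\t4.2\t+\n' >> "$workdir/tbp_scan.bed"

# Strictly greater than the threshold, so a hit scoring exactly 3.5 is dropped.
awk -v min=3.5 'BEGIN { FS = OFS = "\t" } ($5 + 0) > min' \
    "$workdir/tbp_scan.bed" > "$workdir/tbp_scan.filtered.bed"

cat "$workdir/tbp_scan.filtered.bed"
```

Revisiting the threshold then only requires changing the value passed with `-v min=`.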

UCSC Repeat masker

repName, repClass and repFamily annotate each cluster

Annotation Set Building pipeline

Details of how the annotation sets were built, along with the scripts used, can be found in

CAGE-Tag-Cluster-Annotation-Building.readme
CAGE-Tag-Cluster-Annotation-Building
In particular, the script CAGE-Tag-Cluster-Annotation-Building.split_bed12.pl is useful for breaking down a transcript BED12 file into sub-components (intron|exon, first|inner|last, 5UTR|CDS|3UTR).
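To give an idea of the kind of splitting involved, the following awk sketch breaks toy BED12 transcripts into first, inner and last exon BED6 files. It is not the split_bed12.pl script: it ignores introns and the 5UTR/CDS/3UTR distinction, it labels the only exon of a single-exon transcript as "first" (an arbitrary choice), and the transcript and file names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: split BED12 transcripts into first/inner/last exon BED6 files.
set -euo pipefail
workdir=$(mktemp -d)

# Toy BED12 records: a 3-exon '+' transcript and a 2-exon '-' transcript.
printf 'chr1\t1000\t5000\tNM_TOY1\t0\t+\t1200\t4800\t0\t3\t100,200,300,\t0,1000,3700,\n'  > "$workdir/transcripts.bed12"
printf 'chr1\t8000\t9000\tNM_TOY2\t0\t-\t8000\t8000\t0\t2\t400,300,\t0,700,\n' >> "$workdir/transcripts.bed12"

awk -v out="$workdir" 'BEGIN { FS = OFS = "\t" }
{
    n = $10 + 0                  # blockCount
    split($11, size, ",")        # blockSizes
    split($12, offset, ",")      # blockStarts, relative to chromStart
    for (i = 1; i <= n; i++) {
        # Rank exons in transcription order: reversed on the minus strand.
        rank = ($6 == "-") ? n - i + 1 : i
        if (rank == 1) { label = "first" }
        else if (rank == n) { label = "last" }
        else { label = "inner" }
        start = $2 + offset[i]
        print $1, start, start + size[i], $4, $5, $6 > (out "/toy." label "_exon.bed")
    }
}' "$workdir/transcripts.bed12"

head "$workdir"/toy.*_exon.bed
```

Ranking exons by strand rather than by genomic position is what makes the "first exon" of a minus-strand transcript the rightmost block.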

Annotation Pipeline

Human Decomposition-based Peak Identification (DPI) cluster

CAGE-Tag-Cluster-Annotation of tc.decompose_smoothing_merged.hg19.bed readme

Mouse Decomposition-based Peak Identification (DPI) cluster

CAGE-Tag-Cluster-Annotation of tc.decompose_smoothing_merged.mm9.bed readme

Annotation Results

Human Decomposition-based Peak Identification (DPI) cluster

https://fantom5-collaboration.gsc.riken.jp/webdav/home/nbertin/CAGE-Tag-Cluster-Annotation_Aug11/tc.decompose_smoothing_merged.hg19.annotations/
- tc.decompose_smoothing_merged.hg19.CpGislands.annotated.osc.gz
- tc.decompose_smoothing_merged.hg19.EST.annotated.osc.gz
- tc.decompose_smoothing_merged.hg19.Ensembl.non_coding.annotated.sym.osc.gz
- tc.decompose_smoothing_merged.hg19.Ensembl.protein_coding.annotated.sym.osc.gz
- tc.decompose_smoothing_merged.hg19.F5_human_lncRNAome.annotated.osc.gz
- tc.decompose_smoothing_merged.hg19.RefSeq.non_coding.annotated.sym.osc.gz
- tc.decompose_smoothing_merged.hg19.RefSeq.protein_coding.annotated.sym.osc.gz
- tc.decompose_smoothing_merged.hg19.TBP_JASPAR_CORE_MA0108.2.annotated.osc.gz
- tc.decompose_smoothing_merged.hg19.gencode-pseudo.annotated.sym.osc.gz
- tc.decompose_smoothing_merged.hg19.gencode.non_coding.annotated.sym.osc.gz
- tc.decompose_smoothing_merged.hg19.gencode.protein_coding.annotated.sym.osc.gz
- tc.decompose_smoothing_merged.hg19.knownGene.non_coding.annotated.sym.osc.gz
- tc.decompose_smoothing_merged.hg19.knownGene.protein_coding.annotated.sym.osc.gz
- tc.decompose_smoothing_merged.hg19.mRNA.annotated.osc.gz
- tc.decompose_smoothing_merged.hg19.rmsk.annotated.repClass.repFamily.osc.gz

Mouse Decomposition-based Peak Identification (DPI) cluster

https://fantom5-collaboration.gsc.riken.jp/webdav/home/nbertin/CAGE-Tag-Cluster-Annotation_Aug11/tc.decompose_smoothing_merged.mm9.annotations/
- tc.decompose_smoothing_merged.mm9.CpGislands.annotated.osc.gz
- tc.decompose_smoothing_merged.mm9.EST.annotated.osc.gz
- tc.decompose_smoothing_merged.mm9.Ensembl.non_coding.annotated.sym.osc.gz
- tc.decompose_smoothing_merged.mm9.Ensembl.protein_coding.annotated.sym.osc.gz
- tc.decompose_smoothing_merged.mm9.RefSeq.non_coding.annotated.sym.osc.gz
- tc.decompose_smoothing_merged.mm9.RefSeq.protein_coding.annotated.sym.osc.gz
- tc.decompose_smoothing_merged.mm9.TBP_JASPAR_CORE_MA0108.2.annotated.osc.gz
- tc.decompose_smoothing_merged.mm9.knownGene.non_coding.annotated.sym.osc.gz
- tc.decompose_smoothing_merged.mm9.knownGene.protein_coding.annotated.sym.osc.gz
- tc.decompose_smoothing_merged.mm9.mRNA.annotated.osc.gz
- tc.decompose_smoothing_merged.mm9.rmsk.annotated.repClass.repFamily.osc.gz

Annotating your own clusters

Using CAGE-Tag-Cluster-Annotation.sh

~~File:CAGE-Tag-Cluster-Annotation.sh.zip~~ CAGE-Tag-Cluster-Annotation.sh
~~File:CAGE-Tag-Cluster-Annotation-Bin.zip~~ CAGE-Tag-Cluster-Annotation.sh accompanying required BedTools binaries
~~CAGE-Tag-Cluster-Annotation.sh annotation files~~ CAGE-Tag-Cluster-Annotation.sh annotation files

Please note that this is now version 1.8,
which is significantly faster when dealing with input that is already BED-like formatted, and which also allows adding a column holding the input line number (the annotated output otherwise follows a different order, corresponding to that of the ordered matching annotations).

usage: ./CAGE-Tag-Cluster-Annotation.sh -o <OSCfile to annotate> -a <file listing the annotation to be used>

   Use a combination of calls to BedTools (intersectBed, groupBy, etc) to annotate clusters 
   produced by F5 HelicosCAGE clustering pipeline and formatted as an OSCfile (-o or STDIN for
   F5 HelicosCAGE clustering pipeline output files which assumes a precise ordering of genome
   coordinates columns (see details of the -o option below) and the addition of tab-delimited
   feature-wise metadata, or -b for BED like ordering of genome coordinates columns, note that 
   it will also accommodate additional tab delimited columns see details of the -b option below)
   with respect to an ordered list of BED6 formatted annotations files, those files are to 
   be listed in a tab delimited text file (-a) which along with the path to the actual BED6
   formatted annotation file contains information regarding the respective orientation of
   the cluster and annotation to be considered (aka in 'sense' or 'antisense'), the name to
   be reported and if clusters overlapping a first annotation should be filtered out or taken
   into consideration or not if overlapping a second annotation further down the ordered list
   (for more details, see the description of the -a option or a few of the illustrative examples
   below).
   Additionally, it is possible to trigger the summation of (numerical) columns (-s) by
   aggregating over the annotation names (-n) and/or annotation classes (-c) of the list of
   BED6 formatted annotations files (See examples below for more details).

version: 1.8

OPTIONS:
   -h      Show this message.
   -v      Verbose level (1-9).
   -w      Show the commented lines of this script.

   SIMPLE CLUSTER ANNOTATIONS
   -o      Path to the (optionally gzip-compressed) OSCfile to be annotated
           The OSCfile is likely obtained from F5 HelicosCAGE clustering pipeline,
           therefore, it is assumed that : 
               * The file contains a header, each line of which is marked by "##".
               * Comment line that will be reported but whose position relative to
                 the data might be altered are marked by "#".
               * The first non "#|##" line contains a short description of the column
                 content.
               * Columns are tab delimited.
               * The ordering of the columns is fixed and corresponds to :
                    id, chrom, start.0base, end, strand, raw.<exp_name>, norm.<exp_name>
                 See the -b parameter for alternative BED compliant column ordering
                 Note: actually works as long as ordering of the 1st columns are
                    <something>, chrom, start.0base, end, strand
               * Positions are 0-based (see column ordering expectations).
               * 'chrom' names must be similar to those defined in the annotation files.
               * The file may be in a gzip-compressed format.
   -b      Path to the (optionally gzip-compressed) BED6-like formatted OSCfile.
           The -b option is an alternative to using F5 HelicosCAGE clustering pipeline OSCfile,
           in which the ordering of the columns is fixed and corresponds to :
                   chrom, start.0base, end, id, score, strand
           Note that the column order is preserved in the output annotation added OSCfile
           Similarly to the -o parameter: 'chrom' names must be similar to those of the annotation
           files, the BED6-like file can contain comment lines marked by '#' and can be in a 
           gzip-compressed form.
   -a      Path to the file containing the list of sorted BED6 annotation file paths
           This list is to be sorted in the order in which potential annotations need to be
           looked-up, this ordering is crucial and allows for prioritizing one of several
           possible annotations, for example a cluster matching the promoter of a transcript
           may also overlap the intron of another transcript, this alternative annotation being
           somehow less relevant than the first one (note that this can be turned off by setting
           the filter flag as anything but '1', see below).
               * This file must be a tab delimited txt file with the following 4 columns
                     1) A name for the annotation (aka prom500,S  gencode_exon,AS ...) which
                       describes synthetically the content of this annotation.
                        This string will be appended to the annotated OSCFile sent to STDOUT
                       in its one-before-last column.
                     2) The path to the BED6 formatted file holding the annotation 
                     3) The relative orientation of the annotation and input data (sense or 
                       antisense).
                       Accepted values are "sense", "1", "S", or "antisense", "-1", "AS".
                     4) A flag set to "1" for the filtering out of the overlapping data in
                       subsequent sequential annotation look up.
               * This file may contain comments marked by the character "#" that will be ignored
                 when parsed for execution of this script 
               * The basename of this file will be used as the header associated metadata in :
                     - ParameterValue[annotation_list_path] = <basename of this file>
                     - ParameterValue[annotation_list] = <basename of this file>
                     - ColumnVariable[annotation.<basename of this file>.class] = the class of ...
                     - ColumnVariable[annotation.<basename of this file>.names] = comma delimited ...
   -g      Flag allowing the grouping of annotation names sharing the same class (aka prom, UTR)
           It is recommended to use this flag in order to obtain a single concise annotation for
           each given cluster.
   -z      Flag triggering the reporting of the sole annotation classes (aka without annotation names)
   -x      allows for specifying the string used to mark "intergenic" aka cluster that did not 
           overlap with any of the listed annotation.
   -y      Adds a column corresponding to the (comment and header less) line number of the input OSC or
           BED6-like file. This option is useful to recover the initial sorting of the input file from
           the annotated output OSCfile (which otherwise is sorted in part along the order of the
           matched annotations provided by the -a parameter)
           This line number is added just after the last column of the input data and right before
           the annotations
           the ColumnVariable[InputFile.DataLineNum] = <line_number> is added to the output OSCheader

   ADDITIONAL OPTIONS FOR CLUSTER/ANNOTATIONS-DERIVED EXPRESSION LEVEL
   -s      Comma delimited list of the column index to be summed up. In which case, the flag -n, -c 
           or -nc becomes mandatory
   -n      Flag triggering the aggregation using the overlapping annotation name (aka NM_2345) 
   -c      Flag triggering the aggregation using the class of the overlapping annotation (aka
           prom500,S  UTR,AS etc). Note that giving the list of comma delimited column indexes to
           be summed up (option -s) becomes mandatory.


    BUG NEEDING FIXING

          [Feb 2nd 2011] When piping successive calls to ./CAGE-Tag-Cluster-Annotation.sh,
                         blank lines are inserted between the header and the actual data.
                         While this is not an issue for most uses of the script, it may cause
                         problems when parsing the file into ZENBU


    EXAMPLES

          1) ./CAGE-Tag-Cluster-Annotation.sh -o cluster.osc -a refgene.some_annotation_list.txt -g > cluster.refgene_annotated.osc
             Will annotate clusters with respect to refgene coding and non-coding, proximal
             promoter regions, introns, exons, UTRs, ...
             Clusters (XYZ and ABC) overlapping the proximal promoter regions of more than
             one transcript (say NM_12345 and NM_23456 or NM_56789) will be annotated as
                  "XYZ  XYZ_coord  prom500,S  NM_12345,NM_23456"
                  "ABC  ABC_coord  exon,S     NM_56789,"

          2) ./CAGE-Tag-Cluster-Annotation.sh -o cluster.osc -a refgene.some_annotation_list.txt > cluster.refgene_split_annotation.osc
             Same as example 1) but omitting the -g flag
             clusters will be annotated as "XYZ  XYZ_coord  prom500,S  NM_12345"
                                           "XYZ  XYZ_coord  prom500,S  NM_23456"
                                           "ABC  ABC_coord  exon,S     NM_56789,"

          3) ./CAGE-Tag-Cluster-Annotation.sh -o cluster.osc -a refgene.some_annotation_list.txt -g | ./CAGE-Tag-Cluster-Annotation.sh \
              -a repeat.annotation_list.txt -g -x NA > cluster.refgene_and_repeat_annotated.osc
             By piping the result of example 1) into a second round of annotations 
             (e.g. repeat element), with "NA" (specified using -x) when no overlap
             was found clusters will be annotated as
                   "XYZ  XYZ_coord  prom500,S  NM_12345,NM_23456 repeat      LINE"
                   "XYZ  XYZ_coord  prom500,S  NM_23456          repeat      LINE"
                   "ABC  ABC_coord  exon,S     NM_56789,         NA          NA"


          4) ./CAGE-Tag-Cluster-Annotation.sh -o cluster.osc -a refgene.prom_UTR.txt -n -s 6,7 > cluster.refgene_expression.osc
             Provided refgene.prom_UTR.txt lists a BED file of refgene promoters 
             and a BED file of UTRs. This will allow the gathering of the gene 
             expression level in tagcount and tpm (6th and 7th column in cluster.osc),
             in other words summing up the expression of all annotation-overlapping
             clusters. Resulting in an OSC file containing "NM_12345   50   0.25"
                                                           "NM_23456   26   0.13"

          5) ./CAGE-Tag-Cluster-Annotation.sh -o cluster.osc -a refgene.prom_UTR.txt -nc -s 6,7 > cluster.refgene_subfeature_expression.osc
             Same as above but adding the flag -c 
             expression will be reported as "prom500,S   NM_12345   45  0.225"
                                            "5UTR,S      NM_12345    4  0.02"
                                            "5UTR,AS     NM_12345    1  0.005"
                                            "prom500,S   NM_23456   25  0.125"
                                            "5UTR,S      NM_23456    1  0.005"

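As an illustration of the four-column annotation list described for the -a option, the sketch below writes a small tab-delimited list and runs a basic sanity check on it. The BED paths are placeholders and the check is not part of CAGE-Tag-Cluster-Annotation.sh; only the column layout and the accepted orientation values come from the help text above.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: write an annotation list for -a and sanity-check its columns.
set -euo pipefail
workdir=$(mktemp -d)
list="$workdir/refgene.toy_annotation_list.txt"

# Columns: name, path to BED6 file, orientation, filter flag. Paths are placeholders.
{
  printf '# name\tpath\torientation\tfilter\n'
  printf 'prom500,S\t/path/to/refGene.protein_coding.tss_upstream500.bed\tsense\t1\n'
  printf '5UTR,S\t/path/to/refGene.protein_coding.first_exon_in_5UTR.bed\tS\t1\n'
  printf '5UTR,AS\t/path/to/refGene.protein_coding.first_exon_in_5UTR.bed\tAS\t1\n'
  printf 'intron,S\t/path/to/refGene.protein_coding.inner_intron.bed\tsense\t0\n'
} > "$list"

# Count non-comment lines that lack 4 columns or carry an unknown orientation value.
bad=$(awk 'BEGIN { FS = "\t" }
     /^#/ { next }
     NF != 4 || $3 !~ /^(sense|1|S|antisense|-1|AS)$/ { n++ }
     END { print n + 0 }' "$list")
echo "malformed lines: $bad"
```

The line order matters: with the filter flag set to 1, a cluster matched by prom500,S is not considered again for the entries further down the list.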
Using the latest BedTools "annotateBed" tool

The latest version of Bedtools (v2.11.2, 31st January) includes a novel tool called "annotateBed" that annotates one BED/VCF/GFF file with the coverage and number of overlaps observed from multiple other BED/VCF/GFF files. In this way, it allows one to ask to what degree one feature coincides with multiple other feature types with a single command. This is a potentially good alternative to CAGE-Tag-Cluster-Annotation.sh, provided the file you want to annotate is BED, VCF or GFF formatted and you only care about one quantitative value (which you would store in the BED6 score column).

Here is a quick way to transform the OSCfile provided in FANTOM5 updates into BED, followed by a (rather simplistic) example of annotateBed usage that counts the number of entries of each annotation overlapping each cluster. Note that the score is set to "1" here but could be replaced by any column containing a quantitative value of interest; for more advanced uses of the tool, please have a look at the BedTools man page.

awk 'BEGIN{FS=OFS="\t"}{print $2,$3,$4,$1,"1",$5}' cluster_file.osc > cluster_file.bed
annotateBed -s -counts -i cluster_file.bed -files refseq_prom500.bed refseq_exon.bed refseq_intron.bed > cluster_file.annotated
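An OSCfile also carries "##" header lines, "#" comment lines and a column description line, none of which belong in a BED file. A minimal sketch of a conversion that skips them is given below. The toy file content is invented, and treating the first non-comment line as the column description is an assumption based on the OSCfile layout described in the script's help text.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: convert an OSCfile to BED6 while skipping header and comment lines.
set -euo pipefail
workdir=$(mktemp -d)

# Toy OSCfile: header, comment, column description, then two clusters.
{
  printf '##ColumnVariable[id] = cluster identifier\n'
  printf '# a free comment line\n'
  printf 'id\tchrom\tstart.0base\tend\tstrand\traw.toy\tnorm.toy\n'
  printf 'XYZ\tchr1\t100\t120\t+\t50\t0.25\n'
  printf 'ABC\tchr2\t300\t330\t-\t26\t0.13\n'
} > "$workdir/cluster_file.osc"

awk 'BEGIN { FS = OFS = "\t" }
     /^#/ { next }                              # "##" header and "#" comment lines
     !seen_columns { seen_columns = 1; next }   # column description line
     { print $2, $3, $4, $1, "1", $5 }' \
    "$workdir/cluster_file.osc" > "$workdir/cluster_file.bed"

cat "$workdir/cluster_file.bed"
```

Setting FS and OFS in a BEGIN block makes sure the very first record is already split on tabs.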

Side note: since the ordering of the data in cluster_file.bed should be preserved in cluster_file.annotated, antisense counts can be obtained by subtracting the sense-only counts (-s) from the strand-agnostic counts:

awk 'BEGIN{FS=OFS="\t"}{print $2,$3,$4,$1,"1",$5}' file.osc > file.bed
annotateBed    -counts -i file.bed -files refseq_prom500.bed refseq_exon.bed refseq_intron.bed > file.annotated_SAS
annotateBed -s -counts -i file.bed -files refseq_prom500.bed refseq_exon.bed refseq_intron.bed > file.annotated_S
paste file.annotated_SAS file.annotated_S | awk 'BEGIN{FS=OFS="\t"}{print $1,$2,$3,$4,$5,$6,$7-$16,$8-$17,$9-$18}' > file.annotated_AS
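With six BED columns followed by three count columns per file, each pasted row has 18 columns: the strand-agnostic counts sit in columns 7 to 9 and the sense-only counts in columns 16 to 18. The sketch below checks the subtraction on hand-written stand-ins for the two annotateBed outputs. The counts are invented, and the row-alignment check on the cluster id is an added safeguard rather than something the page prescribes.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: antisense counts = strand-agnostic counts minus sense-only counts.
set -euo pipefail
workdir=$(mktemp -d)

# Stand-ins for "annotateBed -counts" (SAS) and "annotateBed -s -counts" (S) outputs.
printf 'chr1\t100\t120\tXYZ\t1\t+\t2\t1\t0\n'  > "$workdir/file.annotated_SAS"
printf 'chr2\t300\t330\tABC\t1\t-\t0\t3\t1\n' >> "$workdir/file.annotated_SAS"
printf 'chr1\t100\t120\tXYZ\t1\t+\t2\t0\t0\n'  > "$workdir/file.annotated_S"
printf 'chr2\t300\t330\tABC\t1\t-\t0\t1\t1\n' >> "$workdir/file.annotated_S"

paste "$workdir/file.annotated_SAS" "$workdir/file.annotated_S" |
awk 'BEGIN { FS = OFS = "\t" }
     # Column 4 and column 13 hold the cluster id of each file; they must agree.
     $4 != $13 { print "row mismatch at line " NR > "/dev/stderr"; exit 1 }
     { print $1, $2, $3, $4, $5, $6, $7 - $16, $8 - $17, $9 - $18 }' \
    > "$workdir/file.annotated_AS"

cat "$workdir/file.annotated_AS"
```

If the two outputs were ever sorted differently, the id check would stop the run instead of silently subtracting counts from unrelated clusters.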

Promoter / 'gene' expression

   Complementing the annotation of promoters based on their overlap with features of model transcripts (RefSeq, Gencode; proximal promoter region, UTR, exon, intron), an extension of the script aggregates the values of an OSCfile according to the associated annotation category. It can be used to obtain promoter or 'transcript' expression levels from an input OSCfile.

This extension allows "complex" transcript expression levels to be obtained,
    i.e. reporting the expression level of a transcript as the sum of the tags/clusters overlapping the proximal promoter and 5'UTR exons only,
    or as the sum of the tags/clusters overlapping the proximal promoter and the entire first 5'UTR exon and intron, etc.
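The aggregation idea can be sketched with plain awk on an already annotated table. This is not how the script does it (it relies on BedTools groupBy); the column layout of the toy table (id, chrom, start, end, strand, raw, norm, class, name) and all values are assumptions made for the illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: sum expression columns 6 and 7 by annotation name, then by class and name.
set -euo pipefail
workdir=$(mktemp -d)

# Toy annotated clusters: id, chrom, start, end, strand, raw, norm, class, name.
{
  printf 'c1\tchr1\t100\t120\t+\t45\t0.225\tprom500,S\tNM_12345\n'
  printf 'c2\tchr1\t200\t220\t+\t4\t0.02\t5UTR,S\tNM_12345\n'
  printf 'c3\tchr1\t230\t240\t-\t1\t0.005\t5UTR,AS\tNM_12345\n'
  printf 'c4\tchr2\t100\t120\t+\t25\t0.125\tprom500,S\tNM_23456\n'
  printf 'c5\tchr2\t200\t210\t+\t1\t0.005\t5UTR,S\tNM_23456\n'
} > "$workdir/annotated.tsv"

# In the spirit of "-n -s 6,7": one row per annotation name.
by_name=$(awk 'BEGIN { FS = OFS = "\t" }
     { raw[$9] += $6; norm[$9] += $7 }
     END { for (k in raw) print k, raw[k], norm[k] }' "$workdir/annotated.tsv" | LC_ALL=C sort)
echo "$by_name"

# In the spirit of "-nc -s 6,7": one row per (class, name) pair.
by_class_and_name=$(awk 'BEGIN { FS = OFS = "\t" }
     { key = $8 OFS $9; raw[key] += $6; norm[key] += $7 }
     END { for (k in raw) print k, raw[k], norm[k] }' "$workdir/annotated.tsv" | LC_ALL=C sort)
echo "$by_class_and_name"
```

Restricting the input to rows whose class is a promoter or 5'UTR class before summing would give the "proximal promoter and 5'UTR exons only" expression level mentioned above.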

usage: ./CAGE-Tag-Cluster-Annotation.sh -o <OSCfile to annotate> -a <file listing the annotation to be used>

   Use a combination of calls to BedTools (intersectBed, groupBy, etc) to annotate clusters 
   produced by F5 HelicosCAGE clustering pipeline and formatted as an OSCfile (-o or STDIN for
   F5 HelicosCAGE clustering pipeline output files...
   
   Additionally, it is possible to trigger the summation of (numerical) columns (-s) by
   aggregating over the annotation names (-n) and/or annotation classes (-c) of the list of
   BED6 formatted annotations files (See examples below for more details).

version: 1.8

OPTIONS:
   -h      Show this message.
   -v      Verbose level (1-9).
   -w      Show the commented lines of this script.

   SIMPLE CLUSTER ANNOTATIONS
   -o      Path to the (optionally gzip-compressed) OSCfile to be annotated
   -b      Path to the (optionally gzip-compressed) BED6-like formatted OSCfile.
   -a      Path to the file containing the list of sorted BED6 annotation file paths
   -g      Flag allowing the grouping of annotation names sharing the same class (aka prom, UTR)
           It is recommended to use this flag in order to obtain a single concise annotation for
           each given cluster.

   ADDITIONAL OPTIONS FOR CLUSTER/ANNOTATIONS-DERIVED EXPRESSION LEVEL
   -s      Comma delimited list of the column index to be summed up. In which case, the flag -n, -c 
           or -nc becomes mandatory
   -n      Flag triggering the aggregation using the overlapping annotation name (aka NM_2345) 
   -c      Flag triggering the aggregation using the class of the overlapping annotation (aka
           prom500,S  UTR,AS etc). Note that giving the list of comma delimited column indexes to
           be summed up (option -s) becomes mandatory.


    EXAMPLES

          1) ./CAGE-Tag-Cluster-Annotation.sh -o cluster.osc -a refgene.prom_UTR.txt -n -s 6,7 > cluster.refgene_expression.osc
             Provided refgene.prom_UTR.txt lists a BED file of refgene promoters 
             and a BED file of UTRs. This will allow the gathering of the gene 
             expression level in tagcount and tpm (6th and 7th column in cluster.osc),
             in other words summing up the expression of all annotation-overlapping
             clusters. Resulting in an OSC file containing "NM_12345   50   0.25"
                                                           "NM_23456   26   0.13"

          2) ./CAGE-Tag-Cluster-Annotation.sh -o cluster.osc -a refgene.prom_UTR.txt -nc -s 6,7 > cluster.refgene_subfeature_expression.osc
             Same as above but adding the flag -c 
             expression will be reported as "prom500,S   NM_12345   45  0.225"
                                            "5UTR,S      NM_12345    4  0.02"
                                            "5UTR,AS     NM_12345    1  0.005"
                                            "prom500,S   NM_23456   25  0.125"
                                            "5UTR,S      NM_23456    1  0.005"

Future directions / Suggestions

~~Add repeat elements (UCSC rmsk table) to the list of provided default annotations~~
~~Add CpG island (from UCSC cpgIslandExt table)~~
~~Add ESTs (obtained from UCSC all_est table)~~
~~Add TATA box (in particular take advantage of Michiel's Motif Activity derived TBP scanning to define TATA box)~~
For TATA box and CpG island, write a script / wrapper around BedTools "closestBed" to provide not only the CpG/TATA status of promoters but also, more informatively, the distance to the closest CpG/TATA
A few ENCODE supertracks
Julian Gough's group protein domain annotations (via Ensembl transcripts)

Do not hesitate to add your own suggestions here (this is a wiki page) or email me at nbertin@gsc.riken.jp

...
...

(DEPRECATED) promoter annotations "up to UPDATE_011"

Annotation datasets

RefSeq

Concerns were raised over the completeness of the RefSeq annotation. In future UPDATES, it will be replaced by Ensembl.

Ensembl

Instead of RefSeq for cross-species annotation / expression comparison

Gencode (hg19 only)

lncRNA (provided by Leonard)

The current reference nonredundant list of 8858 human lncRNA genes (chr, direction, TSS, end): Media:F5_human_lncRNAome(Jia&Lipovich_Gencode_Lander).xls

FANTOM5 annotation pipeline

Timo and Hasegawa-san have added the CAGE-Tag-Cluster-Annotation.sh script and the annotation sets mentioned above to their pipeline implementation, automating the annotation of HelicosCAGE clusters with respect to refseq transcript models (for all genomes) and gencode transcript models (hg19 only).

Below is the graphical representation of this annotation pipeline :

+   INPUT.tss.bed              INPUT.tss_upstream1000.bed  INPUT.tss_upstream100.bed
+   INPUT.tss_upstream500.bed
+==== UCSC ESTs  ====
-<br>
+Overlapping EST exonic sequences have been merged; only the names of up to 3 ESTs are reported<br>
-<br>
+     INPUT.exon.bed              INPUT.intron.bed           INPUT.tss.bed
-== Annotating your own clusters (CAGE-Tag-Cluster-Annotation.sh)<br>  ==
+     INPUT.tss_upstream1000.bed  INPUT.tss_upstream100.bed  INPUT.tss_upstream500.bed
-<blockquote>CAGE-Tag-Cluster-Annotation.sh -o &lt;OSCfile to annotate&gt; -a &lt;file listing the annotation to be used&gt;<br> OPTIONS: -h Show this message<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; -v Verbose level (1-9)<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; -o Path to the (optionally gzip-compressed) OSCfile to be annotated<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; -a Path to the file containing the list of sorted BED6 annotation file paths<br> </blockquote>
+==== UCSC all_mrna  ====
-This script will send to STDOUT the exact content of the input OSCfile with a slihgty modified header reflecting the annotation step and two additional columns
-*The class of the annotation (as given in the file containing the list of sorted BED6)
-*A comma delimited list of the (BED6)name of the evidences (aka transcript names in the case of the BED6 file processed as part of the CAGEcluster post-processing pipeline)
-'''Path to the (optionally gzip-compressed) OSCfile to be annotated. '''<br> This file is likely obtained from F5 HelicosCAGE clustering pipeline, therefore, it assumes that&nbsp;:
+   INPUT.first_exon.bed       INPUT.inner_exon.bed        INPUT.last_exon.bed
-*The file contain a header, each line of which is marked by "##" <span style="display: none;" id="1295956474143E">&nbsp;</span>
+   INPUT.first_intron.bed     INPUT.inner_intron.bed      INPUT.last_intron.bed
-*Comment lines that will be reported but whose position relative to the data might be altered are marked by "##"
+   INPUT.tss.bed              INPUT.tss_upstream1000.bed  INPUT.tss_upstream100.bed
-*The first non "#|##" line contains a short description of the column content
+   INPUT.tss_upstream500.bed
-*Columns are tab delimited
-*'''IMPORTANT&nbsp;:''' although this is not dicted stricto-sensu in the [http://fantom.gsc.riken.jp/4/download/Tables/doc/090703-osctable.rtf OSCtable specification]&nbsp;:
-**the ordering of the column of the F5 HelicosCAGE clustering pipeline is fixed and corresponds to&nbsp;: id, chrom, start.0base, end, strand, raw.&lt;exp_name&gt;, norm.&lt;exp_name&gt;
-**'''this script will only work if the 2nd to 5th columns are&nbsp; "chrom", "start.0base", "end"&nbsp; and "strand"'''
-**positions are 0-based
-**chrom names are similar to that defined in the annotation files (aka similar to UCSC)
-*Data is sorted by position
-*The file may be in a gzip-compressed format
+==== CpG island  ====
-'''Path to the file containing the list of sorted BED6 annotation file paths'''<br>This list is to be sorted in the order in which potential annotations need to be looked-up, this ordering is crucial and allows for prioritarizing one of several possible annotations, for example a cluster matching the promoter of a transcript may also overlap the intron of another transcript,this alternative annotation being somehow less relevant than the 1st one (note that this can be turned of by setting the filter flag as anything but '1', see below)
+CpG island coordinates extracted from F5 UCSC mirror were expanded 200bp downstream in order to replicate the analysis in F3 Nat. Gen.&nbsp;:<br> "... we determined whether a ... CpG (within 200 bp) upstream of the start site of the clusters was present..."
-*This file must be a tab delimited txt file with the following 4 columns<br> 1) a name for the annotaion (aka prom500,S gencode_exon,AS ...) which describes synthetically the content of this annotation this string will be appended to the annotated OSCFile send to STDOUT in its one-before-last column<br> 2) the path to the BED6 formatted file holding the annotation <br> 3) the relative orientation of the annotation and input data (sense or&nbsp; antisense) accepted values are "sense", "1", "S", or "antisense", "-1", "AS"<br> 4) a flag set to "1" for the filtering out of the overlaping data in subsequent sequential annotation look up<br>this file may contain comments marked by the character "#" that will be ignored when parsed for execution of this script
-*The basename of this file will be used as the header associated metadata in&nbsp;:<br> - ParameterValue[annotation_list_path] = &lt;basename of this file&gt;<br> - ParameterValue[annotation_list] = &lt;basename of this file&gt;<br> - ColumnVariable[annot.&lt;basename of this file&gt;.class] = the class of ...<br> - ColumnVariable[annot.&lt;basename of this file&gt;.names] = comma delimited ...<br>
+==== TATA box  ====
+TATA box coordinates were extracted from Michiel's whole-genome scan with the JASPAR_CORE TBP_MA0108.2 Position Weight Matrix.<br> Only JASPAR_CORE TBP_MA0108.2 PWM matches with a ppvalue greater than 3.5 were retained.<br> This threshold might need to be revisited after Sebastian and Timo's analysis of TSS surrounding motifs. An overview of the impact of this threshold on gencode coding and non-coding derived TSS can be seen in CAGE-Tag-Cluster-Annotation-Building/TBP_JASPAR_CORE_MA0108.2.hg19.scanning_vs_gencodev7.tss.ppval_distrib.pdf
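The thresholding step described above can be illustrated with a small filter. This is only a sketch: it assumes the scan results are available as BED6 with the ppvalue stored in the score column (column 5), which this page does not state, and the function name filter_by_score is made up for the illustration.

```shell
# Keep only BED6 records whose score (column 5) is strictly greater than a threshold.
# Assumption: the PWM ppvalue is stored in the BED6 score column.
filter_by_score() {  # usage: filter_by_score THRESHOLD [file.bed ...]
  t=$1
  shift
  awk -v t="$t" 'BEGIN { FS = OFS = "\t" } !/^#/ && ($5 + 0) > t' "$@"
}

# Two made-up TBP matches; only the second one passes the 3.5 threshold.
printf 'chr1\t10\t25\tTBP\t3.2\t+\nchr1\t50\t65\tTBP\t4.1\t-\n' | filter_by_score 3.5
```

Changing the first argument makes it easy to see how many sites survive alternative thresholds.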
+==== UCSC Repeat masker  ====
+repName, repClass and repFamily annotate each cluster
+<br><br><br>
+=== Annotation Set Building pipeline ===
+Details of how the annotation sets were built (and the scripts used) can be found in
+*[https://fantom5-collaboration.gsc.riken.jp/webdav/home/nbertin/CAGE-Tag-Cluster-Annotation_Aug11/CAGE-Tag-Cluster-Annotation-Building/CAGE-Tag-Cluster-Annotation-Building.readme.sh CAGE-Tag-Cluster-Annotation-Building.readme]
+*[https://fantom5-collaboration.gsc.riken.jp/webdav/home/nbertin/CAGE-Tag-Cluster-Annotation_Aug11/CAGE-Tag-Cluster-Annotation-Building/ CAGE-Tag-Cluster-Annotation-Building]
+*In particular the script [https://fantom5-collaboration.gsc.riken.jp/webdav/home/nbertin/CAGE-Tag-Cluster-Annotation_Aug11/CAGE-Tag-Cluster-Annotation-Building/CAGE-Tag-Cluster-Annotation-Building.split_bed12.pl CAGE-Tag-Cluster-Annotation-Building.split_bed12.pl] is useful to break down a transcript BED12 file into sub-components (intron|exon, first|inner|last, 5UTR|CDS|3UTR)
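To give an idea of what splitting a BED12 transcript into sub-components involves, here is a minimal awk sketch. It is not the split_bed12.pl script linked above: it only handles exons, ignores the UTR/CDS distinction and introns, and the function name bed12_to_exons and the choice to label single-exon transcripts as first_exon are assumptions made for the illustration.

```shell
# Split BED12 transcripts into one BED6 line per exon, with a 7th column
# giving the exon rank class (first/inner/last) in transcript orientation.
bed12_to_exons() {  # usage: bed12_to_exons [file.bed12 ...]
  awk 'BEGIN { FS = OFS = "\t" }
  !/^#/ && NF >= 12 {
    n = $10                      # blockCount
    split($11, size, ",")        # blockSizes
    split($12, offset, ",")      # blockStarts, relative to chromStart
    for (i = 1; i <= n; i++) {
      # on the minus strand the last block is the first exon
      rank = ($6 == "-") ? n - i + 1 : i
      if (rank == 1)      label = "first_exon"
      else if (rank == n) label = "last_exon"
      else                label = "inner_exon"
      s = $2 + offset[i]
      print $1, s, s + size[i], $4, $5, $6, label
    }
  }' "$@"
}

# A made-up 3-exon transcript on the plus strand.
printf 'chr1\t100\t1000\tNM_1\t0\t+\t150\t900\t0\t3\t100,200,100,\t0,300,800,\n' | bed12_to_exons
```

The 7th column can then be used to dispatch the lines into per-class files such as INPUT.first_exon.bed.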
+=== Annotation Pipeline ===
+====Human Decomposition-based Peak Identification (DPI) cluster====
+*[https://fantom5-collaboration.gsc.riken.jp/webdav/home/nbertin/CAGE-Tag-Cluster-Annotation_Aug11/CAGE-Tag-Cluster-Annotation.tc.decompose_smoothing_merged.hg19.29Jul11.readme.sh CAGE-Tag-Cluster-Annotation of tc.decompose_smoothing_merged.hg19.bed readme]
+====Mouse Decomposition-based Peak Identification (DPI) cluster====
+*[https://fantom5-collaboration.gsc.riken.jp/webdav/home/nbertin/CAGE-Tag-Cluster-Annotation_Aug11/CAGE-Tag-Cluster-Annotation.tc.decompose_smoothing_merged.mm9.29Jul11.readme.sh CAGE-Tag-Cluster-Annotation of tc.decompose_smoothing_merged.mm9.bed readme]
+=== Annotation Results ===
+==== Human Decomposition-based Peak Identification (DPI) cluster  ====
+*https://fantom5-collaboration.gsc.riken.jp/webdav/home/nbertin/CAGE-Tag-Cluster-Annotation_Aug11/tc.decompose_smoothing_merged.hg19.annotations/
+**tc.decompose_smoothing_merged.hg19.CpGislands.annotated.osc.gz
+**tc.decompose_smoothing_merged.hg19.EST.annotated.osc.gz
+**tc.decompose_smoothing_merged.hg19.Ensembl.non_coding.annotated.sym.osc.gz
+**tc.decompose_smoothing_merged.hg19.Ensembl.protein_coding.annotated.sym.osc.gz
+**tc.decompose_smoothing_merged.hg19.F5_human_lncRNAome.annotated.osc.gz
+**tc.decompose_smoothing_merged.hg19.RefSeq.non_coding.annotated.sym.osc.gz
+**tc.decompose_smoothing_merged.hg19.RefSeq.protein_coding.annotated.sym.osc.gz
+**tc.decompose_smoothing_merged.hg19.TBP_JASPAR_CORE_MA0108.2.annotated.osc.gz
+**tc.decompose_smoothing_merged.hg19.gencode-pseudo.annotated.sym.osc.gz
+**tc.decompose_smoothing_merged.hg19.gencode.non_coding.annotated.sym.osc.gz
+**tc.decompose_smoothing_merged.hg19.gencode.protein_coding.annotated.sym.osc.gz
+**tc.decompose_smoothing_merged.hg19.knownGene.non_coding.annotated.sym.osc.gz
+**tc.decompose_smoothing_merged.hg19.knownGene.protein_coding.annotated.sym.osc.gz
+**tc.decompose_smoothing_merged.hg19.mRNA.annotated.osc.gz
+**tc.decompose_smoothing_merged.hg19.rmsk.annotated.repClass.repFamily.osc.gz
+==== Mouse Decomposition-based Peak Identification (DPI) cluster ====
+*https://fantom5-collaboration.gsc.riken.jp/webdav/home/nbertin/CAGE-Tag-Cluster-Annotation_Aug11/tc.decompose_smoothing_merged.mm9.annotations/
+**tc.decompose_smoothing_merged.mm9.CpGislands.annotated.osc.gz
+**tc.decompose_smoothing_merged.mm9.EST.annotated.osc.gz
+**tc.decompose_smoothing_merged.mm9.Ensembl.non_coding.annotated.sym.osc.gz
+**tc.decompose_smoothing_merged.mm9.Ensembl.protein_coding.annotated.sym.osc.gz
+**tc.decompose_smoothing_merged.mm9.RefSeq.non_coding.annotated.sym.osc.gz
+**tc.decompose_smoothing_merged.mm9.RefSeq.protein_coding.annotated.sym.osc.gz
+**tc.decompose_smoothing_merged.mm9.TBP_JASPAR_CORE_MA0108.2.annotated.osc.gz
+**tc.decompose_smoothing_merged.mm9.knownGene.non_coding.annotated.sym.osc.gz
+**tc.decompose_smoothing_merged.mm9.knownGene.protein_coding.annotated.sym.osc.gz
+**tc.decompose_smoothing_merged.mm9.mRNA.annotated.osc.gz
+**tc.decompose_smoothing_merged.mm9.rmsk.annotated.repClass.repFamily.osc.gz
+<br><br><br>
+== Annotating your own clusters<br>  ==
+=== Using CAGE-Tag-Cluster-Annotation.sh<br>  ===
+<strike>[[Image:CAGE-Tag-Cluster-Annotation.sh.zip|CAGE-Tag-Cluster-Annotation.sh]]</strike>
+[https://fantom5-collaboration.gsc.riken.jp/webdav/home/nbertin/CAGE-Tag-Cluster-Annotation_Aug11/CAGE-Tag-Cluster-Annotation.sh CAGE-Tag-Cluster-Annotation.sh]<br> <strike>[[Image:CAGE-Tag-Cluster-Annotation-Bin.zip|CAGE-Tag-Cluster-Annotation.sh accompanying required BedTools binaries]]</strike>
+[https://fantom5-collaboration.gsc.riken.jp/webdav/home/nbertin/CAGE-Tag-Cluster-Annotation_Aug11/CAGE-Tag-Cluster-Annotation-Bin/  CAGE-Tag-Cluster-Annotation.sh accompanying required BedTools binaries]<br>
+<strike> [[Image:CAGE-Tag-Cluster-Annotation-Data.zip|CAGE-Tag-Cluster-Annotation.sh annotation files]]</strike>
+[https://fantom5-collaboration.gsc.riken.jp/webdav/home/nbertin/CAGE-Tag-Cluster-Annotation_Aug11/CAGE-Tag-Cluster-Annotation-Data/ CAGE-Tag-Cluster-Annotation.sh annotation files]
+'''Please note that this is now version 1.8''' <br>which brings a significant speed improvement when dealing with already BED-like formatted input and also allows for adding a column holding the input line number (the annotated output follows a different order, corresponding to that of the ordered matching annotations)
+<pre>usage: ./CAGE-Tag-Cluster-Annotation.sh -o <OSCfile to annotate> -a <file listing the annotation to be used>
+   Use a combination of calls to BedTools (intersectBed, groupBy, etc) to annotate clusters
+   produced by F5 HelicosCAGE clustering pipeline and formatted as an OSCfile (-o or STDIN for
+   F5 HelicosCAGE clustering pipeline output files which assumes a precise ordering of genome
+   coordinates columns (see details of the -o option below) and the addition of tab-delimited
+   feature-wise metadata, or -b for BED-like ordering of genome coordinates columns; note that
+   it will also accommodate additional tab-delimited columns, see details of the -b option below)
+   with respect to an ordered list of BED6 formatted annotation files. Those files are to
+   be listed in a tab-delimited text file (-a) which, along with the path to the actual BED6
+   formatted annotation file, contains information regarding the respective orientation of
+   the cluster and annotation to be considered (i.e. 'sense' or 'antisense'), the name to
+   be reported, and whether clusters overlapping a first annotation should be filtered out or
+   still be considered when overlapping a second annotation further down the ordered list
+   (for more details, see the description of the -a option or a few of the illustrative examples
+   below).
+   Additionally, it is possible to trigger the summation of (numerical) columns (-s) by
+   aggregating over the annotation names (-n) and/or annotation classes (-c) of the list of
+   BED6 formatted annotations files (See examples below for more details).
+version: 1.8
+OPTIONS:
+   -h      Show this message.
+   -v      Verbose level (1-9).
+   -w      Show the commented lines of this script.
+   SIMPLE CLUSTER ANNOTATIONS
+   -o      Path to the (optionally gzip-compressed) OSCfile to be annotated
+           The OSCfile is likely obtained from F5 HelicosCAGE clustering pipeline,
+           therefore, it is assumed that :
+               * The file contains a header, each line of which is marked by "##".
+               * Comment line that will be reported but whose position relative to
+                 the data might be altered are marked by "#".
+               * The first non "#|##" line contains a short description of the column
+                 content.
+               * Columns are tab delimited.
+               * The ordering of the columns is fixed and corresponds to :
+                    id, chrom, start.0base, end, strand, raw.<exp_name>, norm.<exp_name>
+                 See the -b parameter for alternative BED compliant column ordering
+                 Note: actually works as long as ordering of the 1st columns are
+                    <something>, chrom, start.0base, end, strand
+               * Positions are 0-based (see column ordering expectations).
+               * 'chrom' names must be similar to those defined in the annotation files.
+               * The file may be in a gzip-compressed format.
+   -b      Path to the (optionally gzip-compressed) BED6-like formatted OSCfile.
+           The -b option is an alternative to using F5 HelicosCAGE clustering pipeline OSCfile,
+           in which the ordering of the columns is fixed and corresponds to :
+                   chrom, start.0base, end, id, score, strand
+           Note that the column order is preserved in the output annotation-added OSCfile
+           Similarly to the -o parameter: 'chrom' names must be similar to those of the annotation
+           files, the BED6-like file can contain comment lines marked by '#' and can be in a
+           gzip-compressed form.
+   -a      Path to the file containing the list of sorted BED6 annotation file paths
+           This list is to be sorted in the order in which potential annotations need to be
+           looked-up, this ordering is crucial and allows for prioritizing one of several
+           possible annotations, for example a cluster matching the promoter  of a transcript
+           may also overlap the intron of another transcript,this alternative annotation being
+           somehow less relevant than the first one (note that this can be turned off by setting
+           the filter flag to anything but '1', see below).
+               * This file must be a tab delimited txt file with the following 4 columns
+                    1) A name for the annotation (aka prom500,S  gencode_exon,AS ...) which
+                       describes synthetically the content of this annotation.
+                       This string will be appended to the annotated OSCFile sent to STDOUT
+                       in its one-before-last column.
+                    2) The path to the BED6 formatted file holding the annotation
+                    3) The relative orientation of the annotation and input data (sense or
+                       antisense).
+                       Accepted values are "sense", "1", "S", or "antisense", "-1", "AS".
+                    4) A flag set to "1" for the filtering out of the overlapping data in
+                       subsequent sequential annotation look up.
+               * This file may contain comments marked by the character "#" that will be ignored
+                 when parsed for execution of this script
+               * The basename of this file will be used as the header associated metadata in :
+                     - ParameterValue[annotation_list_path] = <basename of this file>
+                     - ParameterValue[annotation_list] = <basename of this file>
+                     - ColumnVariable[annotation.<basename of this file>.class] = the class of ...
+                     - ColumnVariable[annotation.<basename of this file>.names] = comma delimited ...
+   -g      Flag allowing the grouping of annotation names sharing the same class (aka prom, UTR)
+           It is recommended to use this flag in order to obtain a single concise annotation for
+           each given cluster.
+   -z      Flag triggering the reporting of the sole annotation classes (aka without annotation names)
+   -x      allows for specifying the string used to mark "intergenic" aka cluster that did not
+           overlap with any of the listed annotation.
+   -y      Adds a column corresponding to the (comment and header less) line number of the input OSC or
+           BED6-like file. This option is useful to recover the initial sorting of the input file from
+           the annotated output OSCfile (which otherwise is sorted in part along the order of the
+           matched annotations provided by the -a parameter)
+           This line number is added just after the last column of the input data and right before
+           the annotations
+           the ColumnVariable[InputFile.DataLineNum] = <line_number> is added to the output OSCheader
+   ADDITIONAL OPTIONS FOR CLUSTER/ANNOTATIONS-DERIVED EXPRESSION LEVEL
+   -s      Comma delimited list of the column index to be summed up. In which case, the flag -n, -c
+           or -nc becomes mandatory
+   -n      Flag triggering the aggregation using the overlapping annotation name (aka NM_2345)
+   -c      Flag triggering the aggregation using the class of the overlapping annotation (aka
+           prom500,S  UTR,AS etc). Note that giving the list of comma delimited column indexes to
+           be summed up (option -s) becomes mandatory.
+    BUG NEEDING FIXING
+          [Feb 2nd 2011] When piping successive calls to ./CAGE-Tag-Cluster-Annotation.sh,
+                         blank lines are inserted between the header and the actual data.
+                         while this is not an issue for most usage of the script, it may cause
+                         problem when parsing the file into ZENBU
+    EXAMPLES
+     1) ./CAGE-Tag-Cluster-Annotation.sh -o cluster.osc -a refgene.some_annotation_list.txt -g > cluster.refgene_annotated.osc
+             Will annotate clusters with respect to refgene coding and non-coding, proximal
+             promoter regions, introns, exons, UTRs, ...
+             Clusters (XYZ and ABC) overlapping the proximal promoter regions of more than
+             one transcript (say NM_12345 and NM_23456 or NM_56789) will be annotated as
+                  "XYZ  XYZ_coord  prom500,S  NM_12345,NM_23456"
+                  "ABC  ABC_coord  exon,S     NM_56789,"
+     2) ./CAGE-Tag-Cluster-Annotation.sh -o cluster.osc -a refgene.some_annotation_list.txt > cluster.refgene_split_annotation.osc
+             Same as example 1) but omitting the -g flag
+             clusters will be annotated as "XYZ  XYZ_coord  prom500,S  NM_12345"
+                                           "XYZ  XYZ_coord  prom500,S  NM_23456"
+                                           "ABC  ABC_coord  exon,S     NM_56789,"
+     3) ./CAGE-Tag-Cluster-Annotation.sh -o cluster.osc -a refgene.some_annotation_list.txt -g | ./CAGE-Tag-Cluster-Annotation.sh \
+              -a repeat.annotation_list.txt -g -x NA > cluster.refgene_and_repeat_annotated.osc
+             By piping the result of example 1) into a second round of annotations
+             (e.g. repeat element), with "NA" (specified using -x) when no overlap
+             was found, clusters will be annotated as
+                   "XYZ  XYZ_coord  prom500,S  NM_12345,NM_23456 repeat      LINE"
+                   "XYZ  XYZ_coord  prom500,S  NM_23456          repeat      LINE"
+                   "ABC  ABC_coord  exon,S     NM_56789,         NA          NA"
+     4) ./CAGE-Tag-Cluster-Annotation.sh -o cluster.osc -a refgene.prom_UTR.txt -n -s 6,7 > cluster.refgene_expression.osc
+             Provided refgene.prom_UTR.txt lists a BED file of refgene promoters
+             and a BED file of UTRs, this will gather the gene expression level
+             in tagcount and tpm (6th and 7th columns in cluster.osc), in other
+             words summing up the expression of all annotation-overlapping
+             clusters. Resulting in an OSC file containing "NM_12345   50   0.25"
+                                                           "NM_23456   26   0.13"
+     5) ./CAGE-Tag-Cluster-Annotation.sh -o cluster.osc -a refgene.prom_UTR.txt -nc -s 6,7 > cluster.refgene_subfeature_expression.osc
+             Same as above but adding the flag -c
+             expression will be reported as "prom500,S   NM_12345   45  0.225"
+                                            "5UTR,S      NM_12345    4  0.02"
+                                            "5UTR,AS     NM_12345    1  0.005"
+                                            "prom500,S   NM_23456   25  0.125"
+                                            "5UTR,S      NM_23456    1  0.005"
+</pre>
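The annotation list expected by -a is a plain 4-column tab-delimited file, so a quick sanity check before running the script can save time. The following is a hypothetical helper, not part of CAGE-Tag-Cluster-Annotation.sh; the function name validate_annotation_list is invented here, and the three example entries reuse annotation names and BED file names that appear elsewhere on this page purely as placeholders.

```shell
# Check that every non-comment line has 4 tab-delimited columns and that the
# orientation (3rd column) is one of the values accepted according to the help text.
validate_annotation_list() {  # usage: validate_annotation_list list.txt
  awk 'BEGIN { FS = "\t"; bad = 0 }
  /^#/ || /^[ \t]*$/ { next }
  NF != 4 {
    printf "line %d: expected 4 tab-delimited columns, found %d\n", NR, NF
    bad = 1
    next
  }
  $3 !~ /^(sense|1|S|antisense|-1|AS)$/ {
    printf "line %d: unknown orientation \"%s\"\n", NR, $3
    bad = 1
  }
  END { exit bad }' "$1"
}

# Example list: promoters first (sense, then antisense), exons last,
# each with the filter flag set to 1.
list=$(mktemp)
printf '%s\t%s\t%s\t%s\n' \
  'prom500,S'  refseq_prom500.bed sense     1 \
  'prom500,AS' refseq_prom500.bed antisense 1 \
  'exon,S'     refseq_exon.bed    sense     1 > "$list"
validate_annotation_list "$list" && echo "annotation list looks well-formed"
```

The check does not verify that the listed BED files exist or that they are valid BED6.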
 <br>
 <br>
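The role of the annotation ordering and of the filter flag can be illustrated with a naive re-implementation in plain awk. This is a sketch of the idea only, not the logic of CAGE-Tag-Cluster-Annotation.sh (which relies on BedTools): it assumes BED6 clusters as input, compares every cluster with every annotation record (so it is only suitable for toy data), and the function name annotate_first_match as well as the "intergenic" / "NA" placeholders are choices made for this example.

```shell
# For each BED6 cluster, walk through the annotation sets in list order and
# report class + comma-joined names of overlapping records. When a set with
# filter flag "1" matches, later sets are not looked up for that cluster.
annotate_first_match() {  # usage: annotate_first_match clusters.bed annotation_list.txt
  awk -v list="$2" '
  BEGIN {
    FS = OFS = "\t"
    n = 0; m = 0
    while ((getline line < list) > 0) {
      if (line ~ /^#/ || line == "") continue
      split(line, f, "\t")
      n++
      class[n] = f[1]
      bedfile  = f[2]
      anti[n]  = (f[3] ~ /^(antisense|-1|AS)$/)
      stop[n]  = (f[4] == "1")
      while ((getline rec < bedfile) > 0) {
        if (rec ~ /^#/ || rec == "") continue
        split(rec, b, "\t")
        m++
        aset[m] = n; achrom[m] = b[1]
        astart[m] = b[2] + 0; aend[m] = b[3] + 0
        aname[m] = b[4]; astrand[m] = b[6]
      }
      close(bedfile)
    }
  }
  !/^#/ {
    hit = 0
    for (i = 1; i <= n; i++) {
      names = ""
      for (j = 1; j <= m; j++) {
        if (aset[j] != i || achrom[j] != $1) continue
        if (astart[j] >= $3 || aend[j] <= $2) continue    # no overlap (0-based, half-open)
        same = (astrand[j] == $6)
        if (anti[i] ? same : !same) continue               # wrong relative orientation
        names = names (names == "" ? "" : ",") aname[j]
      }
      if (names != "") {
        print $0, class[i], names
        hit = 1
        if (stop[i]) break
      }
    }
    if (!hit) print $0, "intergenic", "NA"
  }' "$1"
}
```

With a promoter set listed before an intron set and both flags set to 1, a cluster overlapping both is reported once, with the promoter class; setting the first flag to anything else lets the intron annotation be reported as well.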
+=== Using the latest BedTools "annotateBed" tool<br>  ===
+&nbsp;&nbsp; The [http://code.google.com/p/bedtools/#Latest_news_%28Version_2.11.2,_31-January-2011%29 latest version of BedTools (v2.11.2, 31st January 2011)] includes a new tool called "annotateBed" that annotates one BED/VCF/GFF file with the coverage and number of overlaps observed from multiple other BED/VCF/GFF files. In this way, it allows one to ask to what degree one feature coincides with multiple other feature types with a single command. This is a potentially good alternative to CAGE-Tag-Cluster-Annotation.sh provided the file you want to annotate is BED, VCF or GFF formatted and you only care about one quantitative value (that you would bury in the BED6::score column).&nbsp; <br><br>
+&nbsp;&nbsp; Here is a quick way to transform the OSCfile provided in FANTOM5 updates into BED, and a (rather simplistic) example of annotateBed usage counting the number of entries of each annotation for each cluster (note that here the score is set to "1" but could be replaced by any of the columns containing a quantitative value of interest; for more clever uses of the tool please have a look at the BedTools man page)<br>
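+<pre># keep data lines only (numeric start position), skipping the "#"-marked header and the column description line
+awk 'BEGIN{FS=OFS="\t"} $3 ~ /^[0-9]+$/ {print $2,$3,$4,$1,"1",$5}' cluster_file.osc &gt; cluster_file.bed
+annotateBed -s -counts -i cluster_file.bed -files refseq_prom500.bed refseq_exon.bed refseq_intron.bed &gt; cluster_file.annotated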
+</pre>
+Side note: since the ordering of the data in cluster_file.bed should be preserved in cluster_file.annotated, antisense counts can be obtained by subtracting the sense-only counts (-s) from the strand-agnostic counts&nbsp;:<br>
+<pre>awk 'BEGIN{FS=OFS="\t"} $3 ~ /^[0-9]+$/ {print $2,$3,$4,$1,"1",$5}' file.osc &gt; file.bed
+annotateBed    -counts -i file.bed -files refseq_prom500.bed refseq_exon.bed refseq_intron.bed &gt; file.annotated_SAS
+annotateBed -s -counts -i file.bed -files refseq_prom500.bed refseq_exon.bed refseq_intron.bed &gt; file.annotated_S
+# each annotated file holds 6 BED columns + 3 count columns, hence 18 columns after paste
+paste file.annotated_SAS file.annotated_S | awk 'BEGIN{FS=OFS="\t"}{print $1,$2,$3,$4,$5,$6,$7-$16,$8-$17,$9-$18}' &gt; file.annotated_AS
+</pre>
+<br><br><br>
 == Promoter / 'gene' expression<br>  ==
+<br>
+&nbsp;&nbsp; Complementing the annotation of promoters based on their overlap with model transcripts (RefSeq, Gencode) feature (proximal promoter region, UTR, exon, intron), an extension of the script which '''aggregates the value of a particular OSCfile given the associated annotation category''' can be used to obtain '''promoter or 'transcript' expression levels''' from an input OSCfile<br>
+<br>
+This extension allows for "complex" transcript expression levels to be obtained <br>&nbsp;&nbsp;&nbsp;&nbsp;i.e. report the expression level of a transcript as the '''sum of the tags/clusters overlapping the proximal promoter and 5'UTR exons only'''.<br> &nbsp;&nbsp;&nbsp;&nbsp;Or as '''the sum of the tags/clusters overlapping the proximal promoter and entire first 5'UTR exon and intron''', etc
+<br>
+<pre>usage: ./CAGE-Tag-Cluster-Annotation.sh -o <OSCfile to annotate> -a <file listing the annotation to be used>
-&nbsp;&nbsp; Complementing the annotation of promoters based on their overlap with model transcripts (RefSeq, Gencode) feature (proximal promoter region, UTR, exon, intron), an extension of the script can be which aggregates the value of a particular OSCfile given the associated annotation category <br>
-<blockquote>CAGE-Tag-Cluster-Annotation.sh -o &lt;OSCfile to annotate&gt; -a &lt;file listing the annotation to be used&gt;<br>OPTIONS: -h Show this message<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; -s The column number that will be summed, aggregating data according to the BED6 annotation class<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; -o Path to the (optionally gzip-compressed) OSCfile to be annotated<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; -a Path to the file containing the list of sorted BED6 annotation file paths<br><br></blockquote>
+   Use a combination of calls to BedTools (intersectBed, groupBy, etc) to annotate clusters
+   produced by F5 HelicosCAGE clustering pipeline and formatted as an OSCfile (-o or STDIN for
+   F5 HelicosCAGE clustering pipeline output files...
+   Additionally, it is possible to trigger the summation of (numerical) columns (-s) by
+   aggregating over the annotation names (-n) and/or annotation classes (-c) of the list of
+   BED6 formatted annotations files (See examples below for more details).
+version: 1.8
+OPTIONS:
+   -h      Show this message.
+   -v      Verbose level (1-9).
+   -w      Show the commented lines of this script.
+   SIMPLE CLUSTER ANNOTATIONS
+   -o      Path to the (optionally gzip-compressed) OSCfile to be annotated
+   -b      Path to the (optionally gzip-compressed) BED6-like formatted OSCfile.
+   -a      Path to the file containing the list of sorted BED6 annotation file paths
+   -g      Flag allowing the grouping of annotation names sharing the same class (aka prom, UTR)
+           It is recommended to use this flag in order to obtain a single concise annotation for
+           each given cluster.
+   ADDITIONAL OPTIONS FOR CLUSTER/ANNOTATIONS-DERIVED EXPRESSION LEVEL
+   -s      Comma delimited list of the column index to be summed up. In which case, the flag -n, -c
+           or -nc becomes mandatory
+   -n      Flag triggering the aggregation using the overlapping annotation name (aka NM_2345)
+   -c      Flag triggering the aggregation using the class of the overlapping annotation (aka
+           prom500,S  UTR,AS etc). Note that giving the list of comma delimited column indexes to
+           be summed up (option -s) becomes mandatory.
+    EXAMPLES
+     1) ./CAGE-Tag-Cluster-Annotation.sh -o cluster.osc -a refgene.prom_UTR.txt -n -s 6,7 > cluster.refgene_expression.osc
+             Provided refgene.prom_UTR.txt lists a BED file of refgene promoters
+             and a BED file of UTRs, this will gather the gene expression level
+             in tagcount and tpm (6th and 7th columns in cluster.osc), in other
+             words summing up the expression of all annotation-overlapping
+             clusters. Resulting in an OSC file containing "NM_12345   50   0.25"
+                                                           "NM_23456   26   0.13"
+     2) ./CAGE-Tag-Cluster-Annotation.sh -o cluster.osc -a refgene.prom_UTR.txt -nc -s 6,7 > cluster.refgene_subfeature_expression.osc
+             Same as above but adding the flag -c
+             expression will be reported as "prom500,S   NM_12345   45  0.225"
+                                            "5UTR,S      NM_12345    4  0.02"
+                                            "5UTR,AS     NM_12345    1  0.005"
+                                            "prom500,S   NM_23456   25  0.125"
+                                            "5UTR,S      NM_23456    1  0.005"
+</pre>
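The relation between the per-class output (-nc) and the per-transcript output (-n) is a plain group-wise sum, which can be reproduced with awk. The sketch below is not the script's own code: it assumes a 4-column tab-delimited table (class, annotation name, tag count, normalized value), uses values modelled on the help-text examples, and the function name sum_by_name is made up.

```shell
# Sum the 3rd and 4th columns for each annotation name (2nd column).
sum_by_name() {  # usage: sum_by_name [file.tsv ...]
  awk 'BEGIN { FS = OFS = "\t" }
  !/^#/ { tags[$2] += $3; norm[$2] += $4 }
  END { for (k in tags) print k, tags[k], norm[k] }' "$@" | sort
}

# Per-class values for two transcripts, aggregated per transcript.
printf '%s\t%s\t%s\t%s\n' \
  'prom500,S' NM_12345 45 0.225 \
  '5UTR,S'    NM_12345 4  0.02 \
  '5UTR,AS'   NM_12345 1  0.005 \
  'prom500,S' NM_23456 25 0.125 \
  '5UTR,S'    NM_23456 1  0.005 | sum_by_name
```

Restricting the input to selected classes before summing (for instance promoter and 5'UTR lines only) gives the kind of "complex" transcript expression level mentioned in this section.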
+<br>
+== Future directions / Suggestions<br>  ==
+*<strike>Add repeat elements (UCSC rmsk table) to the list of provided default annotations</strike>
+*<strike>Add CpG island (from UCSC cpgIslandExt table)</strike><br>
+*<strike>Add ESTs (obtained from UCSC all_est table)</strike>
+*<strike>Add TATA box (in particular take advantage of Michiel's Motif Activity derived TBP scanning to define TATA box)</strike><br>
+*for TATA box and CpG&nbsp;island, write a script / wrapper around BedTools "closestBed" to not only provide the CpG/TATA status of promoters but, more informatively, report the distance to the closest CpG/TATA
+*a few ENCODE supertracks
+*Julian Gough's group protein domains annotations (via Ensembl transcripts)
+<br> ''Do not hesitate to add your own suggestions here (this is a wiki page) or email me at nbertin@gsc.riken.jp ''
+*...
+*...
+<br>
+== (DEPRECATED) promoter annotations "[https://fantom5-collaboration.gsc.riken.jp/files/data/shared/LATEST_UPDATE/ up to UPDATE_011]"<br>  ==
+=== Annotation datasets<br>  ===
+==== <strike>RefSeq</strike>  ====
+Concerns were raised over the completeness of the RefSeq annotation. In future UPDATES, it will be replaced by Ensembl<br>
+==== Ensembl  ====
+Instead of RefSeq for cross-species annotation / expression comparison<br>
+==== Gencode (hg19 only)  ====
+==== lncRNA (provided by Leonard) <br> ====
+The current reference nonredundant list of 8858 human lncRNA genes (chr, direction, TSS, end):
+[[Media:F5_human_lncRNAome%28Jia%26Lipovich_Gencode_Lander%29.xls]]
+=== FANTOM5 annotation pipeline <br>  ===
+&nbsp;&nbsp; Timo and Hasegawa-san have appended the CAGE-Tag-Cluster-Annotation.sh script and the annotation sets mentioned above to their pipeline implementation, automating the annotation of HelicosCAGE clusters with respect to refseq transcript models (for all genomes) and gencode transcript models (hg19 only).<br>
+<u>Below is the graphical representation of this annotation pipeline&nbsp;:</u>[[Image:Cluster annotation pipeline.02feb11.png|graphical representation of the cluster annotation pipeline]]
+<br>
+<br>
+[[Category:Methods_and_Protocols]]
