User:Nbertin/Transcript model derived annotation protocol

Revision as of 16:24, 4 August 2011


    Proposed here is an annotation of promoters based on their overlap with features of model transcripts (RefSeq for all organisms, and Gencode for human samples): proximal promoter region, 5' and 3' UTR, exon (coding, located in a UTR, or belonging to a non-coding transcript), and intron (likewise coding, located in a UTR, or belonging to a non-coding transcript).
    The protocol and script described here combine calls to BedTools (intersectBed, groupBy, etc.) to annotate clusters produced by the F5 HelicosCAGE clustering pipeline and formatted as an OSCfile (see User:Tlassmann/clustering for the details of the clustering pipeline) with respect to an ordered series of BED6-formatted annotations. The script will be run as part of the CAGE post-processing pipeline, and a data file for each independent library will be provided in the datasets updates.

   Alternatively, you can use the script CAGE-Tag-Cluster-Annotation.sh to redo this annotation step with your own clustered data, with a different list of annotations or a different ordering of them, or to obtain promoter / 'gene' expression aggregated over the provided annotations.



Promoter annotations provided in "datasets updates up to UPDATE_011"

Annotation datasets

RefSeq

Concerns were raised over the completeness of the RefSeq annotation. In future updates, it will be replaced by Ensembl.

Ensembl

To be used instead of RefSeq for cross-species annotation and expression comparison.

Gencode (hg19 only)

lncRNA (provided by Leonard)

The current reference nonredundant list of 8858 human lncRNA genes (chr, direction, TSS, end): Media:F5_human_lncRNAome(Jia&Lipovich_Gencode_Lander).xls

FANTOM5 annotation pipeline

   Timo and Hasegawa-san have added the CAGE-Tag-Cluster-Annotation.sh script and the annotation sets mentioned above to their pipeline implementation, automating the annotation of HelicosCAGE clusters with respect to RefSeq transcript models (for all genomes) and Gencode transcript models (hg19 only).

Below is the graphical representation of this annotation pipeline: (figure: graphical representation of the cluster annotation pipeline)



Annotating your own clusters

Using CAGE-Tag-Cluster-Annotation.sh

File:CAGE-Tag-Cluster-Annotation.sh.zip
File:CAGE-Tag-Cluster-Annotation-Bin.zip
CAGE-Tag-Cluster-Annotation.sh annotation files

usage: ./CAGE-Tag-Cluster-Annotation.sh -o <OSCfile to annotate> -a <file listing the annotation to be used>

Use a combination of calls to BedTools (intersectBed, groupBy, etc) to annotate clusters
produced by the F5 HelicosCAGE clustering pipeline and formatted as an OSCfile (-o or STDIN)
with respect to an ordered list of BED6 formatted annotation files. Those files are to
be listed in a tab delimited text file (-a) which, along with the path to each BED6
formatted annotation file, gives the relative orientation of cluster and annotation
to be considered ('sense' or 'antisense'), the name to be reported, and whether clusters
overlapping one annotation should be filtered out or still be considered when they
overlap another annotation further down the ordered list (for more details, see the
description of the -a option or the illustrative examples below).
Additionally, it is possible to trigger the summation of (numerical) columns (-s) by
aggregating over the annotation names (-n) and/or annotation classes (-c) of the list of
BED6 formatted annotation files (see the examples below for more details).

version: 1.6

OPTIONS:
-h Show this message.
-v Verbose level (1-9).
-w Show the commented lines of this script.

SIMPLE CLUSTER ANNOTATIONS
-o Path to the (optionally gzip-compressed) OSCfile to be annotated
   The OSCfile is likely obtained from the F5 HelicosCAGE clustering pipeline,
   therefore it is assumed that:
      * The file contains a header, each line of which is marked by "##".
      * Comment lines, which will be reported but whose position relative to
      the data might be altered, are marked by "##".
      * The first non "#|##" line contains a short description of the column
      content.
      * Columns are tab delimited.
      * The ordering of the columns is fixed and corresponds to:
         id, chrom, start.0base, end, strand, raw.<exp_name>, norm.<exp_name>
      side note: this script will work as long as the first columns are ordered as
        <something>, chrom, start.0base, end, strand
      * Positions are 0-based (see column ordering expectations).
      * 'chrom' names are consistent with those defined in the annotation files.
      * The file may be in a gzip-compressed format.
-a Path to the file containing the list of sorted BED6 annotation file paths
   This list is to be sorted in the order in which potential annotations need to be
   looked up. This ordering is crucial and allows for prioritizing one of several
   possible annotations: for example, a cluster matching the promoter of a transcript
   may also overlap the intron of another transcript, the latter annotation being
   less relevant than the former (note that this can be turned off by setting
   the filter flag to anything but '1', see below).
      * This file must be a tab delimited txt file with the following 4 columns
          1) A name for the annotation (e.g. prom500,S gencode_exon,AS ...) which
        concisely describes the content of this annotation.
        This string will be appended to the annotated OSCfile sent to STDOUT
        in its second-to-last column.
          2) The path to the BED6 formatted file holding the annotation
          3) The relative orientation of the annotation and input data (sense or
        antisense).
        Accepted values are "sense", "1", "S", or "antisense", "-1", "AS".
          4) A flag set to "1" to filter out the overlapping data from
        subsequent sequential annotation look-ups.
      * This file may contain comments marked by the character "#", which are ignored
        when the file is parsed by this script
      * The basename of this file will be used as the header associated metadata in :
          - ParameterValue[annotation_list_path] = <basename of this file>
          - ParameterValue[annotation_list] = <basename of this file>
          - ColumnVariable[annotation.<basename of this file>.class] = the class of ...
          - ColumnVariable[annotation.<basename of this file>.names] = comma delimited ...
-g Flag allowing the grouping of annotation names sharing the same class (e.g. prom, UTR)
   It is recommended to use this flag in order to obtain a single concise annotation for
   each given cluster.
-z Flag triggering the reporting of annotation classes only (i.e. without annotation names)
-x String used to mark "intergenic" clusters, i.e. clusters that did not overlap any of
   the listed annotations.

ADDITIONAL OPTIONS FOR CLUSTER/ANNOTATIONS-DERIVED EXPRESSION LEVEL
-s Comma delimited list of the column indexes to be summed up. When used, one of the
   flags -n, -c or -nc becomes mandatory
-n Flag triggering the aggregation using the overlapping annotation name (e.g. NM_2345)
-c Flag triggering the aggregation using the class of the overlapping annotation (e.g.
   prom500,S UTR,AS etc). Note that the comma delimited list of column indexes to
   be summed up (option -s) then becomes mandatory.
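
As an illustration of the four-column list expected by -a, the sketch below writes a small annotation list and checks it before use. The BED file names (refseq_prom500.bed, refseq_5UTR.bed, refseq_intron.bed), the list file name and the checking step itself are assumptions made for this example; they are not shipped with, or performed by, the script.

```shell
#!/bin/sh
# Hypothetical annotation list: name, BED6 path, orientation, filter flag (tab delimited).
workdir=$(mktemp -d)
list="$workdir/refgene.annotation_list.txt"
printf '%s\n' '# promoters are looked up first, then UTRs, then introns' > "$list"
printf 'prom500,S\trefseq_prom500.bed\tsense\t1\n'    >> "$list"
printf '5UTR,S\trefseq_5UTR.bed\tS\t1\n'              >> "$list"
printf 'intron,AS\trefseq_intron.bed\tantisense\t0\n' >> "$list"

# Sanity check (not part of the script): 4 tab delimited columns and an
# orientation value among those accepted by the -a option.
check=$(awk -F '\t' '
    /^#/ || NF == 0 { next }    # comments and empty lines are ignored
    NF != 4 { printf "line %d: expected 4 columns, got %d\n", NR, NF; bad = 1; next }
    $3 !~ /^(sense|1|S|antisense|-1|AS)$/ { printf "line %d: bad orientation %s\n", NR, $3; bad = 1 }
    END { if (!bad) print "ok" }
' "$list")
echo "$check"
rm -rf "$workdir"
```

A list that passes this check would then be given to the script with -a.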

BUG NEEDING FIXING

[Feb 2nd 2011] When piping successive calls to ./CAGE-Tag-Cluster-Annotation.6.sh,
blank lines are inserted between the header and the actual data.
While this is not an issue for most uses of the script, it may cause
problems when parsing the file into ZENBU.
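
Until this is fixed, one possible workaround (a sketch, not something the script does itself) is to drop the empty lines from the piped output before loading the file elsewhere. The OSC content below is invented for the demonstration.

```shell
#!/bin/sh
# Mock annotated OSC stream with two spurious blank lines after the header.
# awk 'NF > 0' keeps only lines that contain at least one field.
cleaned=$(printf '##ParameterValue[annotation_list] = demo\n\n\nid\tchrom\tstart.0base\tend\tstrand\nXYZ\tchr1\t100\t120\t+\n' |
    awk 'NF > 0')
printf '%s\n' "$cleaned"
```

The same filter can be appended to a real pipeline, just before the final redirection.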

EXAMPLES

1) ./CAGE-Tag-Cluster-Annotation.6.sh -o cluster.osc -a refgene.some_annotation_list.txt -g > cluster.refgene_annotated.osc
Will annotate clusters with respect to refgene coding and non-coding, proximal
promoter regions, introns, exons, UTRs, ...
A cluster XYZ overlapping the proximal promoter regions of more than one
transcript (say NM_12345 and NM_23456) and a cluster ABC overlapping an exon
of NM_56789 will be annotated as
"XYZ XYZ_coord prom500,S NM_12345,NM_23456"
"ABC ABC_coord exon,S NM_56789,"

2) ./CAGE-Tag-Cluster-Annotation.6.sh -o cluster.osc -a refgene.some_annotation_list.txt > cluster.refgene_split_annotation.osc
Same as example 1) but omitting the -g flag.
Clusters will be annotated as
"XYZ XYZ_coord prom500,S NM_12345"
"XYZ XYZ_coord prom500,S NM_23456"
"ABC ABC_coord exon,S NM_56789,"

3) ./CAGE-Tag-Cluster-Annotation.6.sh -o cluster.osc -a refgene.some_annotation_list.txt -g | ./CAGE-Tag-Cluster-Annotation.6.sh -a repeat.annotation_list.txt -g -x NA > cluster.refgene_and_repeat_annotated.osc
Pipes the result of example 1) into a second round of annotation
(e.g. repeat elements), with "NA" (specified using -x) reported when no
overlap was found. Clusters will be annotated as
"XYZ XYZ_coord prom500,S NM_12345,NM_23456 repeat LINE"
"XYZ XYZ_coord prom500,S NM_23456 repeat LINE"
"ABC ABC_coord exon,S NM_56789, NA NA"

4) ./CAGE-Tag-Cluster-Annotation.6.sh -o cluster.osc -a refgene.prom_UTR.txt -n -s 6,7 > cluster.refgene_expression.osc
Provided refgene.prom_UTR.txt lists a BED file of refgene promoters
and a BED file of UTRs, this gathers the gene expression level in tag
counts and tpm (6th and 7th columns of cluster.osc), in other words it
sums the expression of all clusters overlapping each annotation. The
result is an OSC file containing
"NM_12345 50 0.25"
"NM_23456 26 0.13"

5) ./CAGE-Tag-Cluster-Annotation.6.sh -o cluster.osc -a refgene.prom_UTR.txt -nc -s 6,7 > cluster.refgene_subfeature_expression.osc
Same as above but adding the flag -c.
Expression will be reported as
"prom500,S   NM_12345   45  0.225"
"5UTR,S      NM_12345    4  0.02"
"5UTR,AS     NM_12345    1  0.005"
"prom500,S   NM_23456   25  0.125"
"5UTR,S      NM_23456    1  0.005"


Using the latest BedTools "annotateBed" tool

   The latest version of BedTools (v2.11.2, 31st January) includes a new tool called "annotateBed" that annotates one BED/VCF/GFF file with the coverage and number of overlaps observed from multiple other BED/VCF/GFF files. It thus lets one ask, with a single command, to what degree one feature coincides with multiple other feature types. This is a potentially good alternative to CAGE-Tag-Cluster-Annotation.sh, provided the file you want to annotate is in BED, VCF or GFF format and you only care about one quantitative value (which you would store in the BED6 score column).

   Here is a quick way to transform the OSCfile provided in the FANTOM5 updates into BED, followed by a (rather simplistic) example of annotateBed usage that counts the number of entries of each annotation overlapping each cluster. Note that the score is set to "1" here but could be replaced by any column containing a quantitative value of interest; for more advanced uses of the tool, see the BedTools manual.

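# Set the field separators in BEGIN so they also apply to the first line, and keep only
# data rows (numeric start position), skipping the OSC header and column description lines.
awk 'BEGIN{FS=OFS="\t"} $3 ~ /^[0-9]+$/ {print $2,$3,$4,$1,"1",$5}' cluster_file.osc > cluster_file.bed
annotateBed -s -counts -i cluster_file.bed -files refseq_prom500.bed refseq_exon.bed refseq_intron.bed > cluster_file.annotated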
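
To make the column mapping concrete, here is a small self-contained sketch of the OSC to BED6 conversion. The OSC content is invented, and skipping every line whose third column is not numeric is an assumption about how header and column description lines are best excluded.

```shell
#!/bin/sh
# Build a tiny mock OSC file: one "##" header line, the column description line, two clusters.
osc=$(mktemp)
{
    printf '##ColumnVariable[id] = cluster identifier\n'
    printf 'id\tchrom\tstart.0base\tend\tstrand\traw.demo\tnorm.demo\n'
    printf 'XYZ\tchr1\t100\t120\t+\t50\t0.25\n'
    printf 'ABC\tchr2\t300\t340\t-\t26\t0.13\n'
} > "$osc"

# OSC columns: id, chrom, start, end, strand -> BED6: chrom, start, end, name, score, strand
bed=$(awk 'BEGIN { FS = OFS = "\t" } $3 ~ /^[0-9]+$/ { print $2, $3, $4, $1, "1", $5 }' "$osc")
printf '%s\n' "$bed"
rm -f "$osc"
```

Replacing the constant "1" by $6 or $7 would carry the raw or normalized expression value in the score column instead.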

Side note: since the ordering of the data in cluster_file.bed should be preserved in cluster_file.annotated, antisense counts can be obtained by subtracting the same-strand counts (-s) from the counts on both strands:

awk 'BEGIN{FS=OFS="\t"} $3 ~ /^[0-9]+$/ {print $2,$3,$4,$1,"1",$5}' file.osc > file.bed
annotateBed    -counts -i file.bed -files refseq_prom500.bed refseq_exon.bed refseq_intron.bed > file.annotated_SAS
annotateBed -s -counts -i file.bed -files refseq_prom500.bed refseq_exon.bed refseq_intron.bed > file.annotated_S
# Each output has 9 columns (BED6 + 3 counts), so after paste the sense counts are in columns 16-18.
paste file.annotated_SAS file.annotated_S | awk 'BEGIN{FS=OFS="\t"} !/^#/ {print $1,$2,$3,$4,$5,$6,$7-$16,$8-$17,$9-$18}' > file.annotated_AS
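
The subtraction step can be checked without running BedTools. The sketch below uses two mock files standing in for the annotateBed -counts outputs; their layout (BED6 columns followed by one count per annotation file) and all values are assumptions made for the example.

```shell
#!/bin/sh
tmp=$(mktemp -d)
# Mock outputs: BED6 columns, then counts for prom500, exon and intron.
printf 'chr1\t100\t120\tXYZ\t1\t+\t2\t1\t3\n' > "$tmp/file.annotated_SAS"   # both strands
printf 'chr1\t100\t120\tXYZ\t1\t+\t2\t0\t1\n' > "$tmp/file.annotated_S"     # same strand only

# After paste, columns 1-9 come from the first file and 10-18 from the second,
# so antisense = (both strands) - (same strand) uses columns 7-9 and 16-18.
antisense=$(paste "$tmp/file.annotated_SAS" "$tmp/file.annotated_S" |
    awk 'BEGIN { FS = OFS = "\t" } !/^#/ { print $1, $2, $3, $4, $5, $6, $7 - $16, $8 - $17, $9 - $18 }')
printf '%s\n' "$antisense"
rm -rf "$tmp"
```

With a different number of annotation files the column offsets change accordingly (6 + number of files columns per output).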



Promoter / 'gene' expression

   Complementing the annotation of promoters based on their overlap with features of model transcripts (RefSeq, Gencode: proximal promoter region, UTR, exon, intron), an extension of the script aggregates the values of a given OSCfile column over the associated annotation category. It can be used to obtain promoter or 'gene' expression levels from an input OSCfile.

CAGE-Tag-Cluster-Annotation.sh -o <OSCfile to annotate> -a <file listing the annotation to be used>
OPTIONS: -h Show this message
         -s The column number that will be summed, aggregating data according to the BED6 annotation class
         -o Path to the (optionally gzip-compressed) OSCfile to be annotated
         -a Path to the file containing the list of sorted BED6 annotation file paths
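
The idea behind the aggregation can be sketched with plain awk. This is only a stand-in for illustration, not the script's actual implementation; the rows below are illustrative values chosen to add up to the totals shown in example 4 of the help text.

```shell
#!/bin/sh
# Per-cluster rows: annotation class, annotation name, raw tag count, normalized value.
rows=$(printf '%s\t%s\t%s\t%s\n' \
    'prom500,S' NM_12345 45 0.225 \
    '5UTR,S'    NM_12345  4 0.02 \
    '5UTR,AS'   NM_12345  1 0.005 \
    'prom500,S' NM_23456 25 0.125 \
    '5UTR,S'    NM_23456  1 0.005)

# Sum both numeric columns per annotation name, ignoring the class.
per_name=$(printf '%s\n' "$rows" |
    awk 'BEGIN { FS = OFS = "\t" }
         { raw[$2] += $3; norm[$2] += $4 }
         END { for (n in raw) print n, raw[n], norm[n] }' | sort)
printf '%s\n' "$per_name"
```

Grouping on both the class and the name (using $1 and $2 together as the array key) would give the per-subfeature breakdown instead.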


Future directions / Suggestions

  • Add repeat elements (UCSC rmsk table) to the list of provided default annotations
  • Add CpG islands (from the UCSC cpgIslandExt table)
  • Add ESTs (obtained from the UCSC all_est table)
  • Add TATA boxes (in particular, take advantage of Michiel's Motif Activity derived TBP scanning to define TATA boxes)
  • For TATA boxes and CpG islands, write a script / wrapper around BedTools "closestBed" that reports not only the CpG/TATA status of promoters but, more informatively, the distance to the closest CpG island / TATA box


Do not hesitate to add your own suggestions here (this is a wiki page) or email me at nbertin@gsc.riken.jp

  • ...
  • ...