User:Nbertin/Transcript model derived annotation protocol

Revision as of 21:05, 2 February 2011 by Nbertin


    Proposed here is an annotation of promoters based on their overlap with features (proximal promoter region, UTR, exon, intron) of model transcripts (RefSeq, Gencode).
    The protocol and script described here combine calls to BedTools (intersectBed, groupBy, etc.) to annotate clusters produced by the F5 HelicosCAGE clustering pipeline, formatted as an OSCfile (see User:Tlassmann/clustering), with respect to an ordered series of BED6-formatted annotations. This script will be run as part of the CAGE post-processing pipeline, and a data file for each independent library will be provided in datasets updates.

   Alternatively, you can use the script CAGE-Tag-Cluster-Annotation.sh to redo this annotation step with your own clustered data, with a different list of annotations or a different ordering of them, or to obtain promoter / 'gene' expression aggregated over the provided annotations.



Promoter annotations provided in "datasets updates"

Annotation datasets

RefSeq

Gencode (hg19 only)

FANTOM5 annotation pipeline

Timo and Hasegawa-san have appended the CAGE-Tag-Cluster-Annotation.sh script and the annotation sets mentioned above to their pipeline implementation, automating the annotation of HelicosCAGE clusters with respect to RefSeq transcript models (for all genomes) and Gencode transcript models (hg19 only).

Below is the graphical representation of this annotation pipeline: [figure: graphical representation of the cluster annotation pipeline]



Annotating your own clusters

Using CAGE-Tag-Cluster-Annotation.sh

CAGE-Tag-Cluster-Annotation.sh -o <OSCfile to annotate> -a <file listing the annotation to be used>
OPTIONS:
    -h  Show this message
    -v  Verbose level (1-9)
    -o  Path to the (optionally gzip-compressed) OSCfile to be annotated
    -a  Path to the file containing the list of sorted BED6 annotation file paths

This script sends to STDOUT the exact content of the input OSCfile, with a slightly modified header reflecting the annotation step and two additional columns:

  • The class of the annotation (as given in the file containing the list of sorted BED6 files)
  • A comma-delimited list of the (BED6) names of the evidence (i.e. transcript names in the case of the BED6 files processed as part of the CAGE cluster post-processing pipeline)
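
As an illustration of how these two extra columns can be consumed downstream, here is a minimal sketch that expands the comma-delimited names into one row per cluster and transcript. The file content, the column names and the transcript names are hypothetical toy values, and the sketch assumes that the two annotation columns are the last two columns of the file.

```shell
#!/bin/sh
# Toy annotated OSCfile (hypothetical content): "##" header lines, one
# column-name line, then data whose last two columns are the annotation
# class and the comma-delimited evidence names.
tmpdir=$(mktemp -d)
annotated="$tmpdir/toy_clusters.annotated.osc"
{
  printf '##ParameterValue[annotation_list] = toy_list\n'
  printf 'id\tchrom\tstart.0base\tend\tstrand\traw.lib1\tannot.toy_list.class\tannot.toy_list.names\n'
  printf 'c1\tchr1\t100\t120\t+\t10\trefseq_prom500_S\tNM_000001,NM_000002\n'
  printf 'c2\tchr1\t500\t510\t-\t3\trefseq_exon_AS\tNM_000003\n'
} > "$annotated"

# Skip "#"/"##" lines and the column-name line, then print
# cluster id, annotation class and one evidence name per output row.
long_format=$(awk 'BEGIN { FS = OFS = "\t" }
  /^#/ { next }
  !seen_header { seen_header = 1; next }
  {
    n = split($NF, names, ",")
    for (i = 1; i <= n; i++) print $1, $(NF - 1), names[i]
  }' "$annotated")

printf '%s\n' "$long_format"
```

The long format is convenient when joining clusters to a transcript or gene table, since each row then carries a single transcript name.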

Path to the (optionally gzip-compressed) OSCfile to be annotated.
This file is most likely obtained from the F5 HelicosCAGE clustering pipeline; the script therefore assumes that:

  • The file contains a header, each line of which is marked by "##"
  • Comment lines, which are reported but whose position relative to the data may be altered, are marked by "#"
  • The first non "#|##" line contains a short description of the column content
  • Columns are tab delimited
  • IMPORTANT: although this is not strictly dictated by the OSCtable specification:
    • the column ordering of the F5 HelicosCAGE clustering pipeline is fixed and corresponds to: id, chrom, start.0base, end, strand, raw.<exp_name>, norm.<exp_name>
    • this script will only work if the 2nd to 5th columns are "chrom", "start.0base", "end" and "strand"
    • positions are 0-based
    • chrom names follow the same naming as in the annotation files (i.e. UCSC-style)
  • Data is sorted by position
  • The file may be in a gzip-compressed format
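
These assumptions can be checked before running the annotation. The following is a minimal sketch, not part of CAGE-Tag-Cluster-Annotation.sh: the function name check_osc_layout and the toy file are hypothetical, and gzip compression is detected from the ".gz" file extension only.

```shell
#!/bin/sh
# Toy OSCfile (hypothetical content) following the layout described above.
tmpdir=$(mktemp -d)
osc="$tmpdir/toy_clusters.osc"
{
  printf '##ColumnVariable[id] = toy cluster identifier\n'
  printf '# a free comment line\n'
  printf 'id\tchrom\tstart.0base\tend\tstrand\traw.lib1\tnorm.lib1\n'
  printf 'c1\tchr1\t100\t120\t+\t10\t2.5\n'
} > "$osc"

# Succeeds (exit status 0) when the first line that does not start with "#"
# names columns 2 to 5 "chrom", "start.0base", "end" and "strand".
check_osc_layout() {
  case "$1" in
    *.gz) gzip -cd "$1" ;;
    *)    cat "$1" ;;
  esac | awk 'BEGIN { FS = "\t" }
    /^#/ { next }
    { ok = ($2 == "chrom" && $3 == "start.0base" && $4 == "end" && $5 == "strand"); exit }
    END { exit (ok ? 0 : 1) }'
}

if check_osc_layout "$osc"; then
  layout_status="ok"
else
  layout_status="unexpected column layout"
fi
echo "$layout_status"
```

The check only looks at the column-name line; it does not verify that the data are sorted by position or that positions are 0-based.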

Path to the file containing the list of sorted BED6 annotation file paths
This list must be sorted in the order in which potential annotations are to be looked up. The ordering is crucial because it prioritizes one of several possible annotations: for example, a cluster matching the promoter of one transcript may also overlap the intron of another transcript, the latter annotation being less relevant than the former. (This prioritization can be turned off by setting the filter flag to anything but '1', see below.)
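
The effect of this ordering can be illustrated with a toy, pure-awk sketch. It is not how CAGE-Tag-Cluster-Annotation.sh is implemented (the script relies on BedTools); the function name annotate_by_priority, the file names, coordinates, transcript names and the "NA" placeholder for unannotated clusters are all hypothetical. The sketch only handles sense overlaps and behaves as if every annotation had its filter flag set to 1.

```shell
#!/bin/sh
tmpdir=$(mktemp -d)
# Toy BED6 annotations and clusters (hypothetical coordinates and names).
printf 'chr1\t50\t150\tNM_A\t0\t+\n' > "$tmpdir/prom500.bed"
printf 'chr1\t90\t400\tNM_B\t0\t+\nchr1\t300\t600\tNM_C\t0\t-\n' > "$tmpdir/intron.bed"
printf 'chr1\t100\t120\tc1\t1\t+\nchr1\t350\t360\tc2\t1\t+\nchr1\t900\t910\tc3\t1\t+\n' > "$tmpdir/clusters.bed"

# $1 = clusters (BED6); remaining arguments = class=path pairs, highest priority first.
annotate_by_priority() {
  clusters=$1; shift
  rank=0
  # Tag every annotation line with the rank and class of its file.
  for pair in "$@"; do
    rank=$((rank + 1))
    class=${pair%%=*}
    bed_path=${pair#*=}
    awk -v r="$rank" -v c="$class" 'BEGIN { FS = OFS = "\t" } { print r, c, $0 }' "$bed_path"
  done > "$tmpdir/ranked.tsv"
  # For each cluster keep only the best-ranked class with a same-strand overlap,
  # collecting the names of all overlapping features of that class.
  awk 'BEGIN { FS = OFS = "\t" }
    NR == FNR { n++; rk[n] = $1 + 0; cl[n] = $2; ch[n] = $3; s[n] = $4 + 0; e[n] = $5 + 0; nm[n] = $6; st[n] = $8; next }
    {
      best = 0; class = "NA"; names = "NA"
      for (i = 1; i <= n; i++) {
        # half-open interval overlap on the same chromosome and strand
        if (ch[i] == $1 && st[i] == $6 && s[i] < $3 + 0 && e[i] > $2 + 0) {
          if (best == 0) { best = rk[i]; class = cl[i]; names = nm[i] }
          else if (rk[i] == best) names = names "," nm[i]
        }
      }
      print $4, class, names
    }' "$tmpdir/ranked.tsv" "$clusters"
}

priority_result=$(annotate_by_priority "$tmpdir/clusters.bed" \
  "prom500=$tmpdir/prom500.bed" "intron=$tmpdir/intron.bed")
printf '%s\n' "$priority_result"
```

In this toy data, cluster c1 overlaps both a promoter region and an intron on the same strand, and only the promoter class is reported because it comes first in the list.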

  • This file must be a tab-delimited text file with the following 4 columns
    1) a name for the annotation (e.g. prom500,S gencode_exon,AS ...) that concisely describes its content; this string is reported in the second-to-last column of the annotated OSCfile sent to STDOUT
    2) the path to the BED6-formatted file holding the annotation
    3) the relative orientation of the annotation and the input data (sense or antisense); accepted values are "sense", "1", "S", or "antisense", "-1", "AS"
    4) a flag; when set to "1", data overlapping this annotation are filtered out of the subsequent sequential annotation look-ups
    This file may contain comments marked by the character "#"; they are ignored when the file is parsed by the script
  • The basename of this file will be used in the header metadata:
    - ParameterValue[annotation_list_path] = <basename of this file>
    - ParameterValue[annotation_list] = <basename of this file>
    - ColumnVariable[annot.<basename of this file>.class] = the class of ...
    - ColumnVariable[annot.<basename of this file>.names] = comma delimited ...
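
For illustration, here is a minimal sketch that writes such a list and parses it the way the description above suggests. The class names and the /path/to/... locations are hypothetical placeholders, and the parsing is an assumption about the format, not an excerpt of CAGE-Tag-Cluster-Annotation.sh.

```shell
#!/bin/sh
tmpdir=$(mktemp -d)
list="$tmpdir/refseq_annotation_list.txt"
# Four tab-delimited columns: name, BED6 path, orientation, filter flag.
{
  printf '# name\tBED6 path\torientation\tfilter flag\n'
  printf 'refseq_prom500_S\t/path/to/refseq_prom500.bed\tsense\t1\n'
  printf 'refseq_exon_S\t/path/to/refseq_exon.bed\tS\t1\n'
  printf 'refseq_intron_AS\t/path/to/refseq_intron.bed\t-1\t0\n'
} > "$list"

# Ignore "#" comments, require 4 columns, and normalise the accepted
# orientation spellings to "sense" or "antisense".
parsed_list=$(awk 'BEGIN { FS = OFS = "\t" }
  /^#/ || NF == 0 { next }
  NF != 4 { print "line " NR ": expected 4 columns" > "/dev/stderr"; bad = 1; next }
  {
    o = $3
    if (o == "sense" || o == "1" || o == "S") o = "sense"
    else if (o == "antisense" || o == "-1" || o == "AS") o = "antisense"
    else { print "line " NR ": unknown orientation " o > "/dev/stderr"; bad = 1; next }
    print $1, $2, o, ($4 == "1" ? "filter" : "keep")
  }
  END { exit (bad ? 1 : 0) }' "$list")

printf '%s\n' "$parsed_list"
```

With this list, the header of the annotated OSCfile would refer to the basename of the list file, as described above.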


Using the latest BedTools "annotateBed" tool

The latest version of BedTools (v2.11.2, 31st January) includes a new tool called "annotateBed" that annotates one BED/VCF/GFF file with the coverage and number of overlaps observed from multiple other BED/VCF/GFF files. In this way, it allows one to ask, with a single command, to what degree one feature coincides with multiple other feature types. This is a potentially good alternative to CAGE-Tag-Cluster-Annotation.sh, provided the file you want to annotate is BED, VCF or GFF formatted and you only care about one quantitative value (which you would store in the BED6 score column).

Here is a quick way to transform the OSCfile provided in FANTOM5 updates into BED, followed by a (rather simplistic) example of annotateBed usage that counts the number of entries of each annotation for each cluster. Note that the score is set to "1" here but could be replaced by any column containing a quantitative value of interest; for more advanced uses of the tool, please have a look at the BedTools man page.

awk 'BEGIN{FS=OFS="\t"} !/^#/ && $3 ~ /^[0-9]+$/ {print $2,$3,$4,$1,"1",$5}' cluster_file.osc > cluster_file.bed
annotateBed -s -counts -i cluster_file.bed -files refseq_prom500.bed refseq_exon.bed refseq_intron.bed > cluster_file.annotated

Side note: since the ordering of the data in cluster_file.bed should be preserved in cluster_file.annotated, antisense counts can be obtained by subtracting the same-strand counts (-s) from the strand-agnostic counts:

awk 'BEGIN{FS=OFS="\t"} !/^#/ && $3 ~ /^[0-9]+$/ {print $2,$3,$4,$1,"1",$5}' file.osc > file.bed
annotateBed    -counts -i file.bed -files refseq_prom500.bed refseq_exon.bed refseq_intron.bed > file.annotated_SAS
annotateBed -s -counts -i file.bed -files refseq_prom500.bed refseq_exon.bed refseq_intron.bed > file.annotated_S
# each annotated file holds the 6 BED columns followed by 3 count columns, so after paste the same-strand counts are in columns 16 to 18
paste file.annotated_SAS file.annotated_S | awk 'BEGIN{FS=OFS="\t"} {print $1,$2,$3,$4,$5,$6,$7-$16,$8-$17,$9-$18}' > file.annotated_AS

Promoter / 'gene' expression

   Complementing the annotation of promoters based on their overlap with features (proximal promoter region, UTR, exon, intron) of model transcripts (RefSeq, Gencode), an extension of the script aggregates the values of a chosen OSCfile column by annotation category. It can be used to obtain promoter or 'gene' expression levels from an input OSCfile.

CAGE-Tag-Cluster-Annotation.sh -o <OSCfile to annotate> -a <file listing the annotation to be used> -s <column number to sum>
OPTIONS:
    -h  Show this message
    -s  The column number that will be summed, aggregating data according to the BED6 annotation class
    -o  Path to the (optionally gzip-compressed) OSCfile to be annotated
    -a  Path to the file containing the list of sorted BED6 annotation file paths
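
To make the aggregation concrete, the following minimal sketch sums one column of an already annotated OSCfile per annotation class. It only illustrates the kind of result the -s option is described as producing and is not the script's own implementation; the function name sum_by_class, the toy file content and the assumption that the class sits in the second-to-last column are all hypothetical.

```shell
#!/bin/sh
tmpdir=$(mktemp -d)
annotated="$tmpdir/toy_clusters.annotated.osc"
# Toy annotated OSCfile (hypothetical content); column 6 holds raw tag counts.
{
  printf '##ParameterValue[annotation_list] = toy_list\n'
  printf 'id\tchrom\tstart.0base\tend\tstrand\traw.lib1\tannot.toy_list.class\tannot.toy_list.names\n'
  printf 'c1\tchr1\t100\t120\t+\t10\trefseq_prom500_S\tNM_000001\n'
  printf 'c2\tchr1\t500\t510\t-\t3\trefseq_exon_AS\tNM_000003\n'
  printf 'c3\tchr2\t40\t60\t+\t7\trefseq_prom500_S\tNM_000004\n'
} > "$annotated"

# $1 = annotated OSCfile, $2 = number of the column to sum.
# Prints one "class<TAB>sum" line per annotation class, sorted by class.
sum_by_class() {
  awk -v col="$2" 'BEGIN { FS = OFS = "\t" }
    /^#/ { next }
    !seen_header { seen_header = 1; next }
    { total[$(NF - 1)] += $col }
    END { for (c in total) print c, total[c] }' "$1" | sort
}

class_sums=$(sum_by_class "$annotated" 6)
printf '%s\n' "$class_sums"
```

Grouping on the last column instead (after splitting the comma-delimited names) would give a per-transcript rather than a per-class total.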