User:Nbertin/Transcript model derived annotation protocol
Proposed here is an annotation of promoters based on their overlap with features (proximal promoter region, UTR, exon, intron) of model transcripts (RefSeq, Gencode).
The protocol and script described here use a combination of calls to BedTools (intersectBed, groupBy, etc.) to annotate clusters produced by the F5 HelicosCAGE clustering pipeline, formatted as an OSCfile, with respect to an ordered series of BED6-formatted annotations. This script will be run as part of the CAGE post-processing pipeline, and a data file for each independent library will be provided in datasets updates.
Alternatively, you can use the script CAGE-Tag-Cluster-Annotation.sh to redo this annotation step with your own clustered data, with a different list of annotations or a different ordering of them, or to obtain promoter / 'gene' expression aggregated over the provided annotations.
Promoter annotations provided in datasets updates
RefSeq
Gencode (hg19 only)
Annotating your own clusters (CAGE-Tag-Cluster-Annotation.sh)
CAGE-Tag-Cluster-Annotation.sh -o <OSCfile to annotate> -a <file listing the annotation to be used>
OPTIONS: -h Show this message
-v Verbose level (1-9)
-o Path to the (optionally gzip-compressed) OSCfile to be annotated
-a Path to the file containing the list of sorted BED6 annotation file paths
Path to the (optionally gzip-compressed) OSCfile to be annotated.
This file is likely obtained from the F5 HelicosCAGE clustering pipeline; the script therefore assumes that:
- The file contains a header, each line of which is marked by "##"
- Comment lines, which will be reported but whose position relative to the data might be altered, are marked by "#"
- The first non "#|##" line contains a short description of the column content
- Columns are tab delimited
- IMPORTANT: although this is not strictly dictated by the OSCtable specification:
- the ordering of the columns in the output of the F5 HelicosCAGE clustering pipeline is fixed and corresponds to: id, chrom, start.0base, end, strand, raw.<exp_name>, norm.<exp_name>
- this script will only work if the 2nd to 5th columns are "chrom", "start.0base", "end" and "strand"
- positions are 0-based
- chrom names follow the same convention as those defined in the annotation files (i.e. UCSC-style names)
- Data is sorted by position
- The file may be in a gzip-compressed format
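Before running the script on your own data, it can help to check that a file meets the layout assumptions listed above. The following is a minimal Python sketch of such a check, not part of CAGE-Tag-Cluster-Annotation.sh; the function names and the example lines (including the library name libA) are hypothetical.

```python
import gzip

# Expected names of columns 2 to 5, as listed above.
REQUIRED = ["chrom", "start.0base", "end", "strand"]


def open_maybe_gzip(path):
    """Open a plain or gzip-compressed text file, detected by the gzip magic bytes."""
    with open(path, "rb") as handle:
        magic = handle.read(2)
    if magic == b"\x1f\x8b":
        return gzip.open(path, "rt")
    return open(path, "rt")


def read_column_names(lines):
    """Return the tab-split column names from the first line not starting with '#'."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        return line.rstrip("\n").split("\t")
    raise ValueError("no column description line found")


def check_osc_columns(lines):
    """Raise ValueError unless columns 2 to 5 are chrom, start.0base, end, strand."""
    names = read_column_names(lines)
    if names[1:5] != REQUIRED:
        raise ValueError(f"unexpected columns 2-5: {names[1:5]}")
    return names


# Hypothetical file content: header line, comment line, column description, one data row.
example = [
    "## header line\n",
    "# comment line\n",
    "id\tchrom\tstart.0base\tend\tstrand\traw.libA\tnorm.libA\n",
    "c1\tchr1\t100\t120\t+\t5\t1.2\n",
]
columns = check_osc_columns(example)
```

For a file on disk, `check_osc_columns(open_maybe_gzip(path))` would apply the same check to either a plain or a gzip-compressed OSCfile.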
Path to the file containing the list of sorted BED6 annotation file paths
This list must be sorted in the order in which potential annotations are to be looked up. This ordering is crucial because it prioritizes one of several possible annotations. For example, a cluster matching the promoter of a transcript may also overlap the intron of another transcript, the latter annotation being less relevant than the first (note that this filtering can be turned off by setting the filter flag to anything but '1', see below).
- This file must be a tab-delimited text file with the following 4 columns:
1) a name for the annotation (e.g. prom500,S gencode_exon,AS ...) that concisely describes the content of this annotation; this string will be appended to the annotated OSCfile sent to STDOUT, in its second-to-last column
2) the path to the BED6-formatted file holding the annotation
3) the relative orientation of the annotation and input data (sense or antisense); accepted values are "sense", "1", "S", or "antisense", "-1", "AS"
4) a flag which, when set to "1", filters the overlapping data out of the subsequent sequential annotation look-ups
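The ordered look-up and the filter flag described above can be illustrated with a minimal pure-Python sketch. It is not the script's implementation (which relies on BedTools calls); the function names, the in-memory tuples and the example coordinates are all assumptions made for illustration, and coordinates are taken as 0-based, half-open as in BED.

```python
def overlaps(cluster, feature, orientation):
    """True if cluster and feature overlap on the same chrom in the requested orientation.

    Both are (chrom, start, end, strand) tuples with 0-based, half-open coordinates.
    """
    c_chrom, c_start, c_end, c_strand = cluster
    f_chrom, f_start, f_end, f_strand = feature
    if c_chrom != f_chrom or c_start >= f_end or f_start >= c_end:
        return False
    same_strand = c_strand == f_strand
    return same_strand if orientation == "sense" else not same_strand


def annotate(clusters, annotation_list):
    """Collect, for each cluster id, the annotation names it matches, in list order.

    annotation_list holds (name, features, orientation, filter_flag) entries.
    When filter_flag is "1", clusters matched by an entry are skipped in later look-ups.
    """
    result = {cid: [] for cid in clusters}
    remaining = dict(clusters)
    for name, features, orientation, filter_flag in annotation_list:
        matched = [cid for cid, cl in remaining.items()
                   if any(overlaps(cl, f, orientation) for f in features)]
        for cid in matched:
            result[cid].append(name)
            if filter_flag == "1":
                del remaining[cid]
    return result


# Hypothetical data: c1 overlaps both the promoter and the intron, c2 only the intron.
clusters = {"c1": ("chr1", 100, 120, "+"), "c2": ("chr1", 5000, 5020, "+")}
promoters = [("chr1", 0, 500, "+")]
introns = [("chr1", 50, 6000, "+")]

filtered = annotate(clusters, [("promoter", promoters, "sense", "1"),
                               ("intron", introns, "sense", "1")])
unfiltered = annotate(clusters, [("promoter", promoters, "sense", "0"),
                                 ("intron", introns, "sense", "1")])
print(filtered)    # {'c1': ['promoter'], 'c2': ['intron']}
print(unfiltered)  # {'c1': ['promoter', 'intron'], 'c2': ['intron']}
```

With the flag set to "1" on the promoter entry, c1 keeps only its higher-priority annotation; with any other value it also receives the intron annotation.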
This file may contain comments, marked by the character "#", that will be ignored when the file is parsed for execution of this script.
- The basename of this file will be used in the header-associated metadata:
- ParameterValue[annotation_list_path] = <basename of this file>
- ParameterValue[annotation_list] = <basename of this file>
- ColumnVariable[annot.<basename of this file>.class] = the class of ...
- ColumnVariable[annot.<basename of this file>.names] = comma delimited ...
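As an illustration of the annotation list format described above, here is a minimal Python sketch that parses such a list. It is not the parser used by the script: the function names, annotation names and file paths in the example are hypothetical, and it assumes that comments occupy whole lines.

```python
import os

# Accepted orientation values, mapped to a canonical form.
ORIENTATIONS = {
    "sense": "sense", "1": "sense", "S": "sense",
    "antisense": "antisense", "-1": "antisense", "AS": "antisense",
}


def parse_annotation_list(lines):
    """Parse the 4-column, tab-delimited annotation list, skipping '#' comment lines."""
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # assumption: a comment takes up a whole line
        fields = line.split("\t")
        if len(fields) != 4:
            raise ValueError(f"expected 4 tab-delimited columns, got {len(fields)}")
        name, bed_path, orientation, filter_flag = fields
        if orientation not in ORIENTATIONS:
            raise ValueError(f"unknown orientation: {orientation!r}")
        entries.append({
            "name": name,
            "bed_path": bed_path,
            "orientation": ORIENTATIONS[orientation],
            "filter": filter_flag == "1",  # anything but "1" disables filtering
        })
    return entries


def annotation_column_names(list_path):
    """Build the two annot.<basename>.* column names from the list file path."""
    base = os.path.basename(list_path)
    return [f"annot.{base}.class", f"annot.{base}.names"]


# Hypothetical list content; names and paths are made up for illustration.
example = [
    "# promoters first, then exons on the opposite strand\n",
    "promoter_sense\tannotations/promoter.bed\tS\t1\n",
    "exon_antisense\tannotations/exon.bed\t-1\t0\n",
]
entries = parse_annotation_list(example)
column_names = annotation_column_names("lists/my_annotations.txt")
```

The order of `entries` follows the order of the lines in the file, which is the priority order of the look-up.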
Promoter / 'gene' expression
Complementing the annotation of promoters based on their overlap with features (proximal promoter region, UTR, exon, intron) of model transcripts (RefSeq, Gencode), an extension of the script can be used to aggregate the values of a particular OSCfile column according to the associated annotation category.
CAGE-Tag-Cluster-Annotation.sh -s <column number to be summed> -o <OSCfile to annotate> -a <file listing the annotation to be used>
OPTIONS: -h Show this message
-s The column number that will be summed, aggregating data according to the BED6 annotation class
-o Path to the (optionally gzip-compressed) OSCfile to be annotated
-a Path to the file containing the list of sorted BED6 annotation file paths
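The aggregation performed with -s can be pictured as a grouped sum. The sketch below is only an approximation in Python, not the script's implementation: it assumes 1-based column numbers and takes the position of the annotation class column as an explicit parameter, and the rows are made-up values.

```python
from collections import defaultdict


def aggregate_by_class(rows, value_column, class_column):
    """Sum value_column over rows grouped by class_column (both 1-based, an assumption)."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[class_column - 1]] += float(row[value_column - 1])
    return dict(totals)


# Hypothetical annotated rows: id, chrom, start.0base, end, strand, raw count, class.
rows = [
    ["c1", "chr1", "100", "120", "+", "5", "promoter"],
    ["c2", "chr1", "300", "320", "+", "3", "promoter"],
    ["c3", "chr1", "5000", "5020", "+", "2", "intron"],
]
totals = aggregate_by_class(rows, value_column=6, class_column=7)
print(totals)  # {'promoter': 8.0, 'intron': 2.0}
```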