User:Nbertin/Transcript model derived annotation protocol

Proposed here is an annotation of promoters based on their overlap with features (proximal promoter region, UTR, exon, intron) of model transcripts (RefSeq, Gencode).
The protocol and script described here use a combination of calls to BedTools (intersectBed, groupBy, etc.) to annotate clusters produced by the F5 HelicosCAGE clustering pipeline, formatted as an OSCfile, against an ordered series of BED6-formatted annotations. This script will be run as part of the CAGE post-processing pipeline, and a data file for each independent library will be provided in datasets updates.

Alternatively, you can use the script CAGE-Tag-Cluster-Annotation.sh to redo this annotation step with your own clustered data, with a different list of annotations or a different ordering of them, or to obtain promoter / 'gene' expression aggregated over the provided annotations.

Promoter annotations provided in datasets updates

RefSeq

Gencode (hg19 only)

Annotating your own clusters (CAGE-Tag-Cluster-Annotation.sh)

CAGE-Tag-Cluster-Annotation.sh -o <OSCfile to annotate> -a <file listing the annotation to be used>
OPTIONS: -h Show this message
                  -v Verbose level (1-9)
                  -o Path to the (optionally gzip-compressed) OSCfile to be annotated
                  -a Path to the file containing the list of sorted BED6 annotation file paths

Path to the (optionally gzip-compressed) OSCfile to be annotated.
This file is likely to come from the F5 HelicosCAGE clustering pipeline; the script therefore assumes that:

The file contains a header, each line of which is marked by "##"
Comment lines, which will be reported but whose position relative to the data might be altered, are marked by "#"
The first non "#|##" line contains a short description of the column content
Columns are tab delimited
IMPORTANT: although this is not strictly dictated by the OSCtable specification:
- the column order of the F5 HelicosCAGE clustering pipeline output is fixed and corresponds to: id, chrom, start.0base, end, strand, raw.<exp_name>, norm.<exp_name>
- this script will only work if the 2nd to 5th columns are "chrom", "start.0base", "end" and "strand"
- positions are 0-based
- chrom names match those used in the annotation files (i.e. UCSC-style names)
Data is sorted by position
The file may be in a gzip-compressed format
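The layout assumptions above can be checked before running the annotation. The following is a minimal Python sketch, not part of CAGE-Tag-Cluster-Annotation.sh; the function names open_maybe_gzip and check_osc_layout are hypothetical. It skips every line starting with "#" (which covers "##" lines as well) and verifies that columns 2 to 5 of the column description line are the expected ones.

```python
import gzip

# Columns 2 to 5 that the annotation script relies on.
EXPECTED = ["chrom", "start.0base", "end", "strand"]


def open_maybe_gzip(path):
    """Open a plain or gzip-compressed text file, detected by its magic bytes."""
    with open(path, "rb") as handle:
        magic = handle.read(2)
    if magic == b"\x1f\x8b":
        return gzip.open(path, "rt")
    return open(path, "rt")


def check_osc_layout(lines):
    """Return the column names if columns 2 to 5 match the expected layout.

    `lines` is any iterable of text lines, e.g. an open file handle.
    """
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # header, comment, or blank line
        columns = line.rstrip("\n").split("\t")
        if columns[1:5] != EXPECTED:
            raise ValueError(f"unexpected columns 2-5: {columns[1:5]}")
        return columns
    raise ValueError("no column description line found")
```

A typical call would be `check_osc_layout(open_maybe_gzip(path))`, where `path` points to the OSCfile to be annotated.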

Path to the file containing the list of sorted BED6 annotation file paths
This list must be sorted in the order in which potential annotations are to be looked up. This ordering is crucial because it prioritizes one of several possible annotations: for example, a cluster matching the promoter of one transcript may also overlap the intron of another transcript, the latter annotation being less relevant than the first. (Note that this can be turned off by setting the filter flag to anything but '1', see below.)
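The effect of the ordering and of the filter flag can be illustrated with a small pure-Python sketch. It is not the script's implementation (which relies on BedTools); the function names and the data layout are assumptions made for illustration, and only same-strand ("sense") matching is shown. How the real script reports a cluster that matches several annotations when filtering is off is not stated here, so the sketch simply collects every matching name.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """Half-open, 0-based intervals overlap if each starts before the other ends."""
    return a_start < b_end and b_start < a_end


def annotate(clusters, annotation_sets):
    """Assign annotation names to clusters, honouring the order of the list.

    clusters: dict mapping cluster id -> (chrom, start, end, strand)
    annotation_sets: ordered list of (name, features, filter_flag), where
        features is a list of (chrom, start, end, strand) tuples.
    """
    result = {cid: [] for cid in clusters}
    remaining = set(clusters)
    for name, features, filter_flag in annotation_sets:
        hits = set()
        for cid in remaining:
            chrom, start, end, strand = clusters[cid]
            for f_chrom, f_start, f_end, f_strand in features:
                same_place = (chrom, strand) == (f_chrom, f_strand)
                if same_place and overlaps(start, end, f_start, f_end):
                    hits.add(cid)
                    break
        for cid in hits:
            result[cid].append(name)
        if filter_flag == "1":
            # Annotated clusters are not looked up in later annotations.
            remaining -= hits
    return result


# Hypothetical data: c1 overlaps both a promoter and an intron.
clusters = {"c1": ("chr1", 100, 120, "+"), "c2": ("chr1", 5000, 5020, "+")}
promoters = [("chr1", 0, 500, "+")]
introns = [("chr1", 50, 6000, "+")]
print(annotate(clusters, [("promoter", promoters, "1"), ("intron", introns, "1")]))
```

With the flag set to "1" for the promoter annotation, c1 is reported as a promoter only; with any other value it would also be looked up against the introns.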

This file must be a tab-delimited text file with the following 4 columns:
1) a name for the annotation (e.g. prom500,S gencode_exon,AS ...) that concisely describes its content; this string will be appended to the annotated OSCfile sent to STDOUT, in its second-to-last column
2) the path to the BED6-formatted file holding the annotation
3) the relative orientation of the annotation and input data (sense or antisense); accepted values are "sense", "1", "S", or "antisense", "-1", "AS"
4) a flag which, when set to "1", filters the overlapping data out of subsequent sequential annotation look-ups
This file may contain comments marked by the character "#", which are ignored when the file is parsed by this script
The basename of this file will be used in the following header metadata:
- ParameterValue[annotation_list_path] = <basename of this file>
- ParameterValue[annotation_list] = <basename of this file>
- ColumnVariable[annot.<basename of this file>.class] = the class of ...
- ColumnVariable[annot.<basename of this file>.names] = comma delimited ...
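As an illustration of the four-column annotation list described above, here is a minimal Python sketch that parses such a file and normalizes the orientation values. It is not taken from the script: parse_annotation_list is a hypothetical helper, the BED paths in the example are invented, and treating only whole lines starting with "#" as comments is an assumption.

```python
SENSE = {"sense", "1", "S"}
ANTISENSE = {"antisense", "-1", "AS"}


def parse_annotation_list(text):
    """Parse the annotation list into dicts, keeping the order of the file."""
    entries = []
    for line in text.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # blank line or comment
        fields = line.split("\t")
        if len(fields) != 4:
            raise ValueError(f"expected 4 tab-delimited columns: {line!r}")
        name, path, orientation, flag = fields
        if orientation in SENSE:
            orientation = "sense"
        elif orientation in ANTISENSE:
            orientation = "antisense"
        else:
            raise ValueError(f"unknown orientation: {orientation!r}")
        entries.append(
            {
                "name": name,
                "path": path,
                "orientation": orientation,
                "filter": flag == "1",  # anything but "1" disables filtering
            }
        )
    return entries


# Hypothetical list; the names follow the examples above, the paths are invented.
example = (
    "# promoters first, then exons on the opposite strand\n"
    "prom500,S\tannot/prom500.bed\tsense\t1\n"
    "gencode_exon,AS\tannot/gencode_exon.bed\tAS\t1\n"
)
for entry in parse_annotation_list(example):
    print(entry["name"], entry["orientation"], entry["filter"])
```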

Promoter / 'gene' expression

Complementing the annotation of promoters based on their overlap with features (proximal promoter region, UTR, exon, intron) of model transcripts (RefSeq, Gencode), an extension of the script can be used to aggregate the values of a particular OSCfile column according to the associated annotation category.

CAGE-Tag-Cluster-Annotation.sh -o <OSCfile to annotate> -a <file listing the annotation to be used>
OPTIONS: -h Show this message
                  -s The column number that will be summed, aggregating data according to the BED6 annotation class
                 -o Path to the (optionally gzip-compressed) OSCfile to be annotated
                 -a Path to the file containing the list of sorted BED6 annotation file paths
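The aggregation performed with -s can be pictured with the following Python sketch. It is an illustration under stated assumptions rather than the script's own logic: sum_by_class is a hypothetical function, the column number is taken to be 1-based, and the annotation class is read from the second-to-last column of an already annotated OSCfile.

```python
from collections import defaultdict


def sum_by_class(lines, sum_column, class_column=-2):
    """Sum one column of an annotated OSCfile per annotation class.

    sum_column is a 1-based column number (assumed convention);
    class_column is a Python index, second-to-last column by default.
    """
    totals = defaultdict(float)
    header_seen = False
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # header, comment, or blank line
        if not header_seen:
            header_seen = True  # first remaining line holds the column names
            continue
        fields = line.rstrip("\n").split("\t")
        totals[fields[class_column]] += float(fields[sum_column - 1])
    return dict(totals)
```

For an annotated file whose 6th column holds raw.<exp_name>, calling `sum_by_class(handle, 6)` would return one summed value per annotation class.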
