Data Analysis and Integration
Please edit this section to fit to the needs of Data Analysis and Integration.
General topics
Mapping and QC: Timo
Promoter clustering: Michiel/Timo
The current promoter clustering is produced in a FANTOM3-style manner with level-1 promoters ("transcription start sites"), level-2 promoters ("promoters"), and level-3 promoters ("promoter regions"). Level-1 promoters are produced by determining all positions in the genome to which the 5' end of at least one CAGE tag maps. This corresponds to the individual transcription start sites at 1 base resolution. Whereas typically we require a tag count of at least 1 in any of the libraries, for the Helicos data this doesn't matter since all CAGE tags are single-mapping. This means that effectively we don't apply any threshold on the level-1 promoters. For each level-1 promoter, we calculate the tag count and tags-per-million [tpm] in each of the libraries.
The level-2 promoters are constructed by joining all level-1 promoters that are within 20 bp of each other into one level-2 promoter. For each level-2 promoter, we add up the expression of the individual level-1 promoters it is made up of to arrive at the absolute tag count and tags-per-million in each library. Here, we require that a level-2 promoter is expressed at a level of at least 10 tpm in at least one library. Consequently, not all level-1 promoters are represented in a level-2 promoter.
Next, the level-3 promoters are constructed by joining all level-2 promoters that are within 400 bp of each other into one level-3 promoter. We do not apply any additional thresholding on level-3 promoters, so each level-2 promoter is represented in one level-3 promoter.
Python scripts to produce level-1,2,3 promoters are available here: Media:Level123.zip. Each of these scripts need Numerical Python (numpy; http://sourceforge.net/projects/numpy/files/NumPy/) to be installed; in addition, the level-1 script needs to pysam (http://code.google.com/p/pysam/), which is used to parse SAM/BAM files. If you want to use parameter values (such as thresholds) other than the default, you can do so by specifying them on the command line. Running a script without any arguments will display each of the options and arguments. Note that since we don't apply any thresholding on level-1 promoters, you can create level-2 or level-3 promoters for a subset of libraries by starting from the level-1 promoters that were precalculated for each library separately.
Expression normalization and differential expression (based on genes, peaks, or whatever) : Kawaji
- Differential expression
- Negative binomial model (adopted in edgeR PMID:19910308 and DEseq PMID:20979621) fits to HeliScopeCAGE data reasonably. The relative error (=sqrt(common dispersion)) among technical replicates is less than 9%, which suggest the data is very close to Poisson dist.
- Normalization
- Naive TPM normalization on HeliScopeCAGE works reasonably in a limited number of samples during the development.
- We would have to consider how to treat low quantity protocol.
- Since F5 deals with a very wide range of samples, which may has unexpected composition of gene expressions, we have to be careful to apply F4 way normalization.
- When we achieved a wide range of expression profiles, TMM normalization (PMID:20196867, implemented in edgeR) or DEseq fitting (PMID:20979621) could be an option.
See also here: Expression_normalization_method
Annotation of promoter / 'gene' expression : Nicolas
Transcript models derived annotation
Proposed here is an annotation of promoter based on their overlap with model transcripts (RefSeq, Gencode) feature (proximal promoter region, UTR, exon, intron).
Details of the procedure, datasets and scripts can be found in the Transcript model derived annotation protocol page
Promoter / 'gene' expression
Complementing the annotation of promoters based on their overlap with model transcripts (RefSeq, Gencode) feature (proximal promoter region, UTR, exon, intron), an extension of the script used in the promoter/5'UTRs/alternative transcript expression page can be used which aggregates the value of a particular OSCfile given the associated annotation category
CAGEscan driven annotations
CAGEscan paired-end data link TSSes to downstream sequences. By annotating those downstream sequence with respect to know transcript(s) and transcript feature(s) (exons, introns, UTRs) it is possible to extend the annotations of (HelicosCAGE derived) TSS or TC by comparing them to such annotated CAGEscan derived TSS. Note that his can be accomplished either by comparing data arising from similar biological samples or comparing it to the full body of CAGEscan-derived TSS annotation files
Details of the procedure, datasets and scripts can be found in the CAGEscan driven annotation protocol page
Motif activity : Michiel
Motif activity calculations are performed in a similar, though not identical, approach as in FANTOM4. The motif activity calculations consist of the following steps:
- Alignment of promoter regions between homologous organisms
- Prediction of transcription factor binding sites based on these aligned promoters, taking homology into account
- Association of transcription factor binding sites to promoters, taking the distance from the promoter to the predicted transcription factor binding site into account
- Calculation of the motif activity, using the measured CAGE expression levels of each promoter
- Prediction of the gene regulatory network, using the measured CAGE expression levels of each promoter, its predicted transcription factor binding sites, and the calculated motif activities.
The alignments of promoter regions simply use the whole-genome alignments as precalculated by UCSC. While running T-Coffee on each promoter region may give a better prediction, in practice we found that typically the improvement is very small.
Transcription factor binding sites are predicted using a heavily modified version of the Monkey software. This software was modified such that it can use arbitrary selection patterns of TFBS conservation in the phylogenetic tree. This allows us to use the selection patterns of MotEvo for consistency with the FANTOM4 motif activity calculations. Monkey was also converted into a Python extension module. This was done for reasons of speed, as it enables us to keep precalculated information in memory, while streaming all the sequences for which the TFBS is to be calculated.
To associate transcription factor binding sites to promoters, in FANTOM4 an iterative approach was used in which the prior probability of finding a transcription factor binding site was calculated as a function of the distance to the transcription start site. This prior probability is different for different transcription factors. Instead, I am using a fixed prior probability; the distribution of this prior probability is then calculated (separately for each transcription factor) using a kernel density approach.
The calculation of the motif activity is done in the same way as in FANTOM4, using tau = 0.1 as the smoothing parameter. The gene regulatory network is then also predicted in the same way as in FANTOM4.
Conservation / orthologous mapping (variome, basins) : Max
Providing a set of orthologs for mapping across species for investigators interested in making quick comparisons. Two separate sets are provided; one containing complete set of genes and another containing only transcription factors. Genome coordinates are consistent with the FANTOM5-selected genome assemblies. The sets are human-centric in that they contain orthologs across species for human genes. Orthologs which exclude humans are not currently included. This should change, but it is currently a bit low on the priority list.
The complete gene set is based on Homologene definitions, which are in turn based on refseq gene models. Non-coding RNA is not currently included, this should change at some point. The second set is based on the manually-curated set established by Roach JC, et al. (PMID: 17913878) which was also used in FANTOM4. Entries in the old set which have no coordinates in hg19, which are no longer defined as genes in the current refseq model, or which are currently defined as pseudogenes have been removed from the original list. Gene names have been updated to reflect current definitions. Several newly-identified transcription factors were added to the list.
Please see the following file for some current numbers: Media:Orthologs.pdf.
Updated general gene list: Media:General_ortholog_table_updated.txt.gz
Updated transcription factor gene list: Media:Tf_ortholog_list_updated.txt.gz
Columns are defined as follows, tab delimited: 1) orthologous group ID 2) human gene name 3) human protein id 4) human transcript id 5) human chromosome 6) human chromosome start site 7) human chromosome stop site 8) human chromosome strand
Columns 2-8 are then repeated for each genome in the following order: mouse, rat, dog, chicken.
UPDATE 7/27/2011: The transcription factor list has been updated in two ways: 1) several more tfs (~30) have been added based on a list from Swiss-Regulon that were not found in the above-mentioned Roach list and 2) all Refseq IDs that associate with particular tf gene locus have been added by appending them to column 4, seperated by a comma. The genome start and stop sites for the newly-added identifiers have also been appended to columns 6 and 7, respectively, with the order consistent across all columns (in other words, the first identifier in column 4 corresponds to the first start and stop sites listed in columns 6 and 7). This has also been extended to mouse, rat, dog, and chicken in the appropriate columns. #2 has also been applied to the updated general gene list.
The purpose of the second change is two-fold: 1) it provides the complete scope of Refseq-defined 5' ends at each gene locus, giving better (if still limited) 5' resolution and 2) it makes it easier to join lists with only one refseq identifier for a given locus.
The primary outstanding task is extension of the list to include other gene definitions that are frequently employed in FANTOM, e.g. ENSEMBLE gene definitions.
HeliScopeCAGE analysis
Sample annotation and relationships (Win)
Version 1.0 Tree of relationships.
- 1. Manhattan distances generated using pathway fingerprints derived from CAGE
- 2. Grouping performed using simple distance matrix and Neighbor joining
- 3. Rooted with Mesenchymal stem cell
- 4. Colored arbritrarily by hand and some clades collapsed for simplification.
- 5. Actual tree drawn using the FIGTREE phylogenetic tree visualizer for download of software
- 6. PDF of tree
- 7. treefile
- 8. distance matrix and sample names
Version 1.0 of the Plum Tree contains the current samples and is made up of a phylogenetic analysis of the relationships of the molecular distances between each sample. It shows in a pdf of the first draft both functional profiling relationships (basically pathway utilization distance between each sample) and we think inferred developmental relationships think 'develops_from'. FANTOM Sample accessions are included.
Brain and associated spinal tissues cluster discretely.
Stem cells (in deep purple) (mesenchymal stem cell is the outgroup) group according to their related cells or to a more primitive type according to their relationship to mesenchymal stem cells. Cancers are usually discrete from the cell types and tissues they relate to. A clean example is melanocyte which is ancestral in terms of function (and cell type) to melanomas (search term 'mel' in pdf). Melanomas and gliomas represent one of the most 'similar' cancers to mesenchymal stem cells.
Blood
Mast cells are ancestral to all blood and immune samples but not blood cancers which are grouped in the same parent clade. Macrophages group in this clade as well but interestingly, the macrophages from THP1 that have been induced in the FANTOM4 time series are discrete from these groupings, do not share functional similarity with the blood clade and in fact are some of the most distant samples from a mesenchymal stem cell. They are more similar to smooth muscle than blood in this analysis.
To determine which pathways, and ultimately which genes differentiate each of the samples we need to perform more detailed phylogenetic and clustering analyses and this work is in process.
To search the tree it is possible to search within the pdf. For manipulation and further editing it is necessary to download the software (links below) and to load up the treefile.
All analysis groups page
The link above will take you to a separate page expanding on all groups' analysis done so far, including:
- 1) Win
input data is gene expression
phylogenetic algorithm is neighbour-joining (Manhattan distance)
- 2) Robin
input data is level 2 promotor expression
phylogenetic technique is neighbour-joining (KL divergence distance)
- 3) Owen
input data is gene expression (presence/absence using a threshold)
phylogenetic technique is maximum likelihood
- 4) Owen --preliminary result--
input data is presence/absence of TF network edges based on Motif activity (same as FANTOM4 Nature Genetics paper)
phylogenetic technique is maximum likelihood
input data is domain-level expression (converted from gene-level)
phylogenetic technique is average linkage clustering
- 6) Kawaji
input data is gene expression
phylogenetic technique is average linkage clustering (Pearson correlation)
Main paper specific analysis
- TSS classification - Timo Lassmann see: https://fantom5-collaboration.gsc.riken.jp/webdav/home/Lassmann/TSS_classification/TSSprediction_Sep5.pdf
- Tissue specificity / cell type specificity - ontology consious
- Annotation of transcript (ncRNA)
- Alternative promoters overlaid with other information (tissue/cell/protein content, ncRNAs, etc)
For long-ncRNA work, please consider using the human lncRNAome catalog: Media:F5_human_lncRNAome(Jia&Lipovich_Gencode_Lander).xls Media:F5_human_lncRNAome(Jia&Lipovich_Gencode)BED.zip Also it would be interesting to look at mRNA-lncRNA bidirectional promoters, including alternative promoters of conserved protein-coding genes that give rise to lncRNAs transcribed in the opposite orientation. - LL