SampleClassification
Latest revision as of 20:18, 4 April 2011
The link above will take you to a separate page expanding on all groups' analysis done so far, including:
- 1) Win
input data is gene expression
phylogenetic algorithm is neighbour-joining (Manhattan distance)
- 2) Robin
input data is level 2 promoter expression
phylogenetic technique is neighbour-joining (KL divergence distance)
- 3) Owen
input data is gene expression (presence/absence using a threshold)
phylogenetic technique is maximum likelihood
- 4) Owen --preliminary result--
input data is presence/absence of TF network edges based on Motif activity (same as FANTOM4 Nature Genetics paper)
phylogenetic technique is maximum likelihood
- 5) Hai
input data is domain-level expression (converted from gene-level)
phylogenetic technique is average linkage clustering
- 6) Kawaji
input data is gene expression
phylogenetic technique is average linkage clustering (Pearson correlation)
Comparing these trees (attached and numbered), we can see that 1 and 6 use the same input data, and 2 is very similar (adding in non-coding). 3 takes the same data as 1 and 6 but applies a threshold to convert it to binary (discarding information), and 4 uses edge information instead of node information, for transcription factors only. 5 is the same data as 1 and 6, but converted from genes to domains.
1 and 2 use the same phylogenetic algorithm, 3 and 4 use a different one, and 5 and 6 use a third. The information contained in the input data should make more difference than the phylogenetic algorithm applied to it.
If the trees are compared, the results are in general very similar in 1, 2, 5 and 6. They all group obvious clades such as macrophage, brain and blood. The trees in 3 and 4 fail to separate primary cells from tissues, although they do separate the obvious clades in a similar way to the others. The tree in 5 is unique in that it separates organisms perfectly as well as separating the sample type and obvious clades within organisms (where available).
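The three distance measures named in the list of methods above (Manhattan distance, KL divergence and Pearson correlation) can be sketched as follows. This is only an illustration, not any group's actual code: the pseudocount in the KL divergence, the normalisation of each profile to sum to 1, and the use of 1 - r as the correlation-based distance are all assumptions.

```python
import math

def manhattan(a, b):
    # Sum of absolute differences between two expression profiles.
    return sum(abs(x - y) for x, y in zip(a, b))

def kl_divergence(a, b, pseudocount=1e-9):
    # KL divergence of profile a from profile b, after adding a small
    # pseudocount (assumption) and normalising each profile to sum to 1.
    pa = [x + pseudocount for x in a]
    pb = [y + pseudocount for y in b]
    ta, tb = sum(pa), sum(pb)
    return sum((p / ta) * math.log((p / ta) / (q / tb)) for p, q in zip(pa, pb))

def pearson_distance(a, b):
    # 1 - Pearson correlation, so perfectly correlated profiles have distance 0.
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return 1.0 - cov / math.sqrt(va * vb)

profile_1 = [1.0, 2.0, 3.0]
profile_2 = [2.0, 4.0, 6.0]
d_manhattan = manhattan(profile_1, profile_2)
d_pearson = pearson_distance(profile_1, profile_2)
d_kl = kl_divergence(profile_1, profile_2)
```

Note that the two example profiles differ only in scale, so the Pearson-based distance and the KL divergence on normalised profiles treat them as identical while the Manhattan distance does not. This is one way the choice of distance can change a tree even when the input data are the same.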
Cross-species sample classifications based on structural domain-ome expression information
- Motivations:
Complex systems are modular. In proteins, modularity can be found at the level of structural domains. More importantly, these structural domains can harbour significant functional and evolutionary signals, as defined by SCOP at the superfamily level.
For a given organism, the domain-ome is the whole collection of SCOP domains present. From the evo-devo point of view, the binary presence/absence pattern of the domain-ome is shaped by evolution, and the numeric expression pattern of the domain-ome can reflect tissue origins and cell types. Intuitively, it would be very interesting to see whether we can perform inter-species/intra-species comparative analysis based on domain-ome expression.
Ideally, we need protein-level expression data and a protein-domain mapping matrix to generate domain-level expression data. Practically, domain-ome expression for a panel of tissues and cell types across different species can be converted from CAGE data in FANTOM 5 or from other microarray-based high-throughput data in public-access sources.
- Methods and Results:
Using CAGE data from FANTOM 5, we extract the CTSS (CAGE tag start sites) of known Refgenes and map them to Ensembl proteins and SUPERFAMILY domain assignments at the SCOP superfamily level, resulting in a domain-sample expression matrix. Each entry in this matrix measures the average expression level per domain, calculated as: SUM(tags for a refgene X number of domains in this refgene) divided by the total number of domains. This calculation accounts for the domain copy numbers in different genomes, which is important for comparative study.
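The calculation described above can be sketched as below. This is a minimal illustration under stated assumptions, not the original pipeline: the function name, the dictionary layout, and the gene and superfamily labels are hypothetical, and "total number of domains" is read here as the total copy number of that superfamily over all annotated genes.

```python
from collections import defaultdict

def domain_expression(gene_tags, gene_domains):
    """Convert gene-level tag counts to average expression per domain.

    gene_tags:    {gene: {sample: tag count}}
    gene_domains: {gene: {superfamily: copy number in that gene}}
    """
    weighted = defaultdict(lambda: defaultdict(float))  # superfamily -> sample -> sum(tags x copies)
    copies = defaultdict(int)                           # superfamily -> total copies over all genes
    for gene, domains in gene_domains.items():
        for superfamily, n in domains.items():
            copies[superfamily] += n
            for sample, tags in gene_tags.get(gene, {}).items():
                weighted[superfamily][sample] += tags * n
    # Divide each weighted sum by the total number of copies of the domain.
    return {sf: {sample: total / copies[sf] for sample, total in per_sample.items()}
            for sf, per_sample in weighted.items()}

# Toy input: geneA carries two copies of SF1, geneB one copy each of SF1 and SF2.
gene_tags = {
    "geneA": {"brain": 10.0, "liver": 2.0},
    "geneB": {"brain": 4.0, "liver": 8.0},
}
gene_domains = {
    "geneA": {"SF1": 2},
    "geneB": {"SF1": 1, "SF2": 1},
}
matrix = domain_expression(gene_tags, gene_domains)
```

For SF1 in the brain sample this gives (10 x 2 + 4 x 1) / 3 = 8, so a gene with more copies of a domain contributes proportionally more to that domain's expression.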
To bring the data closer to a normal distribution and to correct the systematic biases introduced during the conversion from gene level to domain level, the domain-sample expression matrix is first log-transformed and then standardized in a column-wise (sample-wise) manner so that the median of each column is 0 and its variation equals 1.
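A rough sketch of this transformation is given below. Several details are assumptions, since the text does not specify them: the log base (2), the pseudocount of 1 added before taking logs, and the reading of "variation" as the population variance of each column.

```python
import math
import statistics

def standardize_columns(matrix, pseudocount=1.0):
    """Log-transform each sample column, then centre it on its median
    and scale it to unit variance.

    matrix: {sample: [expression value per domain]}
    """
    result = {}
    for sample, values in matrix.items():
        logged = [math.log2(v + pseudocount) for v in values]  # pseudocount avoids log(0)
        centre = statistics.median(logged)
        centred = [x - centre for x in logged]                 # column median becomes 0
        spread = statistics.pstdev(centred)                    # population standard deviation
        result[sample] = [x / spread if spread > 0 else 0.0 for x in centred]
    return result

raw = {"brain": [0.0, 1.0, 3.0, 7.0, 15.0]}
scaled = standardize_columns(raw)
```

Dividing by the standard deviation leaves the median at 0, so both conditions hold for each column at once.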
Using this transformed and standardized domain-level expression matrix, we apply a conventional hierarchical clustering algorithm for sample classification (Euclidean distance as the distance metric, average linkage as the clustering method). See the result Media:Hierarchical_tree.pdf, and its correlation matrix Media:correlation_matrix.pdf.
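For readers unfamiliar with average linkage clustering, a small pure-Python sketch follows. It is a naive illustration of the technique with Euclidean distance, not the software used for the trees above; the sample names and values are made up.

```python
import math
from itertools import combinations

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(profiles):
    """Agglomerative clustering with average linkage.

    profiles: {sample: [standardized expression values]}
    Returns the tree as nested tuples of sample names.
    """
    # Each cluster is (subtree, list of member samples).
    clusters = [(name, [name]) for name in profiles]
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            # Average linkage: mean distance over all cross-cluster sample pairs.
            pairs = [(a, b) for a in clusters[i][1] for b in clusters[j][1]]
            d = sum(euclidean(profiles[a], profiles[b]) for a, b in pairs) / len(pairs)
            if best is None or d < best[0]:
                best = (d, i, j)
        _, i, j = best
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]

profiles = {
    "human_brain": [0.0, 0.0],
    "human_blood": [0.0, 1.0],
    "mouse_brain": [5.0, 5.0],
    "mouse_blood": [5.0, 6.0],
}
tree = average_linkage(profiles)
```

In this toy input the samples of each species are much closer to each other than to the other species, so the two species form the top-level split of the tree.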
From these figures, several features stand out:
1. Samples are robustly grouped together first according to their species identity. Since this trend is so consistent, we expect it to be preserved as more samples become available for rat, dog and chicken.
2. Samples are generally further grouped based on the sample types. For example, in human, the primary cells are almost entirely separated from the tissues.
3. Looking in greater detail at samples of the same type within the same species, comparable quality is achieved using gene-level expression data (under the same parameters for sample classification: Media:Hierarchical_tree_gene.pdf).
We also apply projection techniques (see Media:Sammon_projection.pdf and Media:SOM_projection.pdf) to capture their relationships. The relationships revealed above appear to be independent of the algorithm used.
Why domain-centric expression? One possible answer: domain-centric expression information may be another important measure of increasing phenotypic diversity, since domains are the direct basis of interacting protein networks (which are considered to be molecular carriers of phenotype information).
Binary Gene Expression Tree Building
- Motivations:
We were interested to see what the data could tell us if we ignored the gene expression levels and simply looked at which genes were being expressed. It was our intention to follow this with the presence/absence of regulatory interactions, and this allowed us to see the additional information that we could extract from doing so.
- Methods:
Using the data in the gene expression table provided on the wiki (https://fantom5-collaboration.gsc.riken.jp/files/data/shared/contrib/110120-gene_expression_table-WP4/UPDATE_010/), we thresholded the expression of each gene at 2 TPM. Genes with an expression of 2 TPM or less were given a 0 and those above were given a 1. So for each sample we had a string that described the gene presence/absence, e.g.:
Brain_sample_1 0010101101110001
Liver_sample_1 1010001010101110
.
.
.
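The thresholding step can be sketched as follows. The function names and the toy TPM values are hypothetical. The second helper writes the strings in a relaxed PHYLIP-style layout (a header line with the number of samples and the string length, then one named string per line), which is a common input layout for phylogenetic tools; whether exactly this layout was used here is an assumption.

```python
def binarize(expression, threshold=2.0):
    """Turn TPM values into presence/absence strings.

    expression: {sample: [TPM per gene, same gene order in every sample]}
    Genes at or below the threshold become '0', genes above it become '1'.
    """
    return {sample: "".join("1" if tpm > threshold else "0" for tpm in values)
            for sample, values in expression.items()}

def to_phylip(strings):
    # Header: number of samples and number of characters per sample.
    length = len(next(iter(strings.values())))
    lines = [f"{len(strings)} {length}"]
    lines += [f"{name} {bits}" for name, bits in strings.items()]
    return "\n".join(lines)

expression = {
    "Brain_sample_1": [0.0, 2.0, 2.5, 140.0],
    "Liver_sample_1": [3.1, 0.4, 2.0, 57.0],
}
strings = binarize(expression)
alignment = to_phylip(strings)
```

A gene at exactly 2 TPM is scored as absent, matching the "2 TPM or less" rule in the text.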
We used RAxML, a widely used maximum likelihood software tool for creating trees. A description of how it works can be found in this paper: (http://bioinformatics.oxfordjournals.org/content/22/21/2688.short). Since this process is stochastic, the resulting trees from separate runs are not always the same. We can use consensus trees and bootstrapping to estimate the most likely tree and to get an idea of which samples always cluster together and which are rarely members of the same cluster. This is the tree provided here: Media:tree.pdf
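The idea behind bootstrap support and a majority-rule consensus can be illustrated with a few lines of Python. This is not how RAxML is implemented or invoked; it only shows the counting step, assuming each replicate tree has already been reduced to the set of clades (groups of samples) it contains. The sample names are made up.

```python
from collections import Counter

def clade_support(replicate_trees):
    """Fraction of replicate trees in which each clade appears.

    replicate_trees: list of trees, each given as a set of clades,
    where a clade is a frozenset of sample names.
    """
    counts = Counter(clade for tree in replicate_trees for clade in tree)
    n = len(replicate_trees)
    return {clade: count / n for clade, count in counts.items()}

def majority_rule(replicate_trees, cutoff=0.5):
    # Keep only the clades found in more than `cutoff` of the replicates.
    support = clade_support(replicate_trees)
    return {clade for clade, s in support.items() if s > cutoff}

brain = frozenset({"brain_1", "brain_2"})
macrophage = frozenset({"macrophage_1", "macrophage_2"})
odd = frozenset({"macrophage_1", "liver_1"})

replicates = [
    {brain, macrophage},
    {brain, macrophage},
    {brain, odd},
    {brain, macrophage},
]
support = clade_support(replicates)
consensus = majority_rule(replicates)
```

Here the brain clade appears in every replicate (support 1.0), the macrophage clade in three of four (0.75), and the odd grouping in only one (0.25), so only the first two survive in the consensus.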
- Results:
This tree contains some of the characteristics that we have seen in other trees; for instance, the brain and macrophage samples cluster together robustly. However, there is no separation of cell lines and primary cells of the kind shown in other trees on the mailing list. The samples that do not cluster at all using this method are probably worth investigating in the other methods, since you would expect that simply by looking at the genes that are being expressed you would get some robust similarities to other samples.
Interaction based Tree Building
- Motivations:
We wanted to see if we could cluster the data simply based on the interactions that were present according to the Motif activity work that Michiel had done. It is our intention to use the idea of network motifs to explore this further, since there have been a number of success stories in the literature about small network motifs distinguishing tissue type. In the early stages of this investigation we have simply been looking at the presence/absence of interactions (in network motif terms, motifs involving two nodes), but we intend to extend this to triangles (3 nodes) and squares (4 nodes).
- Methods:
Michiel provided the results of his MASA and MARA analysis on the FANTOM5 snapshot data. We used this to construct networks in the same way as in the FANTOM4 Nature Genetics paper (fig 4). Once these were constructed (they can be viewed here; further usage guidance will follow), we created binary strings in the same way as above, but where the presence/absence was for edges in the TF-TF network. RAxML was then used again and the consensus tree can be seen here: Media:edge_net_consensus.pdf
If you wish to look at a complete clustering, the best tree (i.e. the one that was most likely according to the ML algorithm) can be seen here: Media:BestTree_edgenet.pdf
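The edge-based strings can be built in the same spirit as the gene-based ones. In this sketch each sample's network is assumed to be available as a set of directed (source TF, target TF) edges; the function name and the TF labels are placeholders, not real results.

```python
def edge_strings(networks):
    """Encode each sample's TF-TF network as a presence/absence string.

    networks: {sample: set of (source TF, target TF) edges}
    Returns the ordered list of all edges and one string per sample,
    with one character per edge in that order.
    """
    # The union of edges over all samples defines the string positions.
    all_edges = sorted(set().union(*networks.values()))
    strings = {sample: "".join("1" if edge in edges else "0" for edge in all_edges)
               for sample, edges in networks.items()}
    return all_edges, strings

networks = {
    "Brain_sample_1": {("TF_A", "TF_B"), ("TF_B", "TF_C")},
    "Macrophage_sample_1": {("TF_A", "TF_B"), ("TF_C", "TF_A")},
}
all_edges, strings = edge_strings(networks)
```

Sorting the edges fixes the position of each edge in every string, so the strings of different samples are directly comparable.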
- Results:
As above, this tree contains sections that make sense (e.g. brain, epithelial and macrophage clusters) but also contains areas that are not well defined or don't seem to make sense. It is our hope that by extending this work we can identify not only important tissue-specific genes but also tissue-specific regulatory programs that can help our efforts to identify genes to use in reprogramming experiments.