Dataset introduction

Rough overview of HeliScopeCAGE

We started HeliScopeCAGE with manual operation (using an 8-channel pipette), and we are now automating the preparation of the CAGE libraries to scale up the number of profiles (96 RNAs can be processed at once). The sequencing results of the libraries are computationally post-processed in the following steps:

  • Discard apparent artifacts and too-short sequences with filterSMS (a utility of the Helicos software, HeliSphere).
  • Discard the CAGE tags derived from the ribosomal DNA repeating unit, which is not contained in the genome assembly, with rRNAdust (developed by Timo Lassmann).
  • Align the remaining CAGE tags to the genome sequence with DELVE (developed by Timo Lassmann), which generates BAM files containing a single mapped position per read, with mapping quality and alignment.
  • Aggregate the 5'-ends of the mapped CAGE tags into CAGE tag starting sites (CTSS).
  • Load them into the genome browsers (ZENBU F5 instance or the local mirror of the UCSC Genome Browser) and include them in the file release.
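
The aggregation step can be illustrated with a minimal Python sketch. It is not the pipeline's actual code: the function name aggregate_ctss, the tuple layout (chrom, start, end, strand), and the 0-based half-open coordinates are assumptions made for illustration only.

```python
from collections import Counter

def aggregate_ctss(alignments):
    """Count CAGE tag 5'-ends per (chrom, position, strand).

    Each alignment is (chrom, start, end, strand) with 0-based,
    half-open coordinates (an assumption made for this sketch).
    """
    counts = Counter()
    for chrom, start, end, strand in alignments:
        # On the plus strand the 5'-end is the leftmost aligned base;
        # on the minus strand it is the rightmost aligned base.
        five_prime = start if strand == "+" else end - 1
        counts[(chrom, five_prime, strand)] += 1
    return counts

# Made-up alignments, for illustration only.
example = [
    ("chr1", 100, 127, "+"),
    ("chr1", 100, 125, "+"),
    ("chr1", 90, 120, "-"),
]
ctss = aggregate_ctss(example)
for (chrom, pos, strand), n in sorted(ctss.items()):
    print(chrom, pos, pos + 1, strand, n)
```

Two plus-strand tags starting at the same base collapse into one CTSS with a count of 2, while the minus-strand tag contributes its rightmost base.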

Note that the post-processing of the mapping is not finalized yet. We are now working on two aspects: (i) discarding problematic alignments, and (ii) deciding the start position of the CAGE tags on the genome in a better way. The CTSS will be updated after these steps are included.

How do I get access to the genome browser?

See Main_Page

Which files would be required for my analysis?

All the shared data are placed here:

 https://fantom5-collaboration.gsc.riken.jp/files/data/shared/
  • For gene expression analysis (in a manner similar to gene expression microarrays), the gene expression tables https://fantom5-collaboration.gsc.riken.jp/files/data/shared/contrib/110120-gene_expression_table-WP4/ would be enough. These files contain the raw and normalized numbers of reads from each profile.
  • For most promoter/TSS analyses, the CTSS files (*.ctss.bed.gz), which contain genomic coordinates and the counts of CAGE reads aligned there, would be good enough.
  • For discussion of alignment methods and the definition of TSS, the BAM files (*.bam) would be required.
  • To obtain all the reads (sequences), *.rdna.fa.gz + *.bam would be required. The former contains the reads that match the ribosomal DNA repeat unit, and the latter (*.bam) contains all the remaining reads according to the BAM format specification (that is, both mapped and unmapped reads are included; note that all the information in FASTQ is also included in BAM).
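
As a rough illustration of working with a CTSS file, the Python sketch below reads BED-like lines and sums the read counts. The column layout (chromosome, start, end, name, read count, strand) is an assumption based on the six-column BED format and should be checked against the actual files; the function name read_ctss and the sample lines are made up.

```python
import io

def read_ctss(handle):
    """Yield (chrom, start, strand, count) from BED-like CTSS lines.

    Assumed columns: chrom, start, end, name, read count, strand.
    """
    for line in handle:
        # Skip blank lines and BED header lines, if any.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, start, _end, _name, count, strand = fields[:6]
        yield chrom, int(start), strand, int(count)

# Made-up lines standing in for the content of a *.ctss.bed.gz file.
sample = io.StringIO(
    "chr1\t564571\t564572\tchr1:564571..564572,+\t3\t+\n"
    "chr1\t564639\t564640\tchr1:564639..564640,+\t1\t+\n"
)
records = list(read_ctss(sample))
total_reads = sum(count for _, _, _, count in records)
print(len(records), total_reads)
```

For a real file, the handle could be opened with gzip.open(path, "rt") from the Python standard library instead of io.StringIO.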

As a rough estimate, the total sizes of all the BAM files for phase 1 and phase 2 would be 1 Tbyte and 2 Tbyte, respectively. The sizes of all the CTSS/level1 files would be 30 Gbyte and 60 Gbyte ( File:110104-datasize-estimation.pdf ).

For genome annotation, we made a snapshot of the UCSC Genome Browser database and several relevant databases. We strongly recommend using these data wherever possible, rather than downloading from outside, so that our analyses stay synchronized with each other.

 https://fantom5-collaboration.gsc.riken.jp/files/data/shared/external/

The UCSC Table Browser on the RIKEN mirror is also available and is an easy entry point to these data.

I found strange data in the released files. Correct!

We are continuously cleaning up the data set, and we are aware of some of the strange data and are in the process of fixing them. Please check File_release and let us know if your problem is not listed there.

When will the data set be fixed?

Since everything is in progress, it is not possible to specify a single data set for every analysis. We maintain the data sets in the following way:

  • [UPDATE] Continuous update with versioning: daily/weekly updates are versioned as UPDATE. These updates can be found in the data directory. You can use any of them for your preliminary analysis, referring to the UPDATE number.
  • Recommended version for FANTOM5 meeting(s): it would be good for most of the results to be based on the same data set, so that we can combine and overlay each other's analysis results at the meeting. We are going to recommend a specific version for each meeting.
  • [DATA FREEZE]: For paper submission/publication, we are going to make a fixed data set, which is termed a data freeze.

Is it possible to ship/FedEx the data on a hard drive?

Possible, in principle. There is no official procedure; please contact fantom5-wp4@gsc.riken.jp.

However, we recommend downloading only the files you need, rather than obtaining all the files, as far as possible. Since we are continuously updating the data, a shipped copy would not be timely, and shipping takes time and effort.

Why are the file names so strange?

CNhs* represents the library ID, which is a trackable identifier in RIKEN OSC. To make the data files easier to recognize, we add the sample name in addition to the library ID. Since the sample name can contain spaces and symbols, it is URL-encoded ( http://en.wikipedia.org/wiki/Percent-encoding ). To recover the original annotation, you can use, for example, the "URLdecode()" function in R.

 R> URLdecode("Burkitt%27s%20lymphoma%20cell%20line%3aDAUDI.CNhs10739.10422-106C8.hg19.ctss.bed.gz")
 [1] "Burkitt's lymphoma cell line:DAUDI.CNhs10739.10422-106C8.hg19.ctss.bed.gz"

An equivalent function should be available in your language.
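
In Python, for instance, the standard-library function urllib.parse.unquote does the same decoding. The sketch below also splits the decoded name into its parts. The regular expression and the field names (sample, library_id, extra_id, assembly) are assumptions inferred from the single example above, not an official naming specification.

```python
import re
from urllib.parse import unquote

encoded = "Burkitt%27s%20lymphoma%20cell%20line%3aDAUDI.CNhs10739.10422-106C8.hg19.ctss.bed.gz"
decoded = unquote(encoded)
print(decoded)

# Hypothetical pattern: <sample name>.<library ID>.<another ID>.<assembly>.ctss.bed.gz
pattern = re.compile(
    r"^(?P<sample>.+)\.(?P<library_id>CNhs\d+)\.(?P<extra_id>[^.]+)"
    r"\.(?P<assembly>[^.]+)\.ctss\.bed\.gz$"
)
match = pattern.match(decoded)
if match:
    parts = match.groupdict()
    print(parts["sample"], parts["library_id"], parts["assembly"])
```

Anchoring on the CNhs library ID, rather than splitting on every dot, keeps the sketch usable if a sample name itself contains a dot.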

Where can I get all the details of the libraries?

The assay files in SDRF (*.assay_sdrf.txt) contain all the details after RNA extraction. For example,

 https://fantom5-collaboration.gsc.riken.jp/files/data/shared/UPDATE_007/f5pipeline/human.cell_line.hCAGE/00_human.cell_line.hCAGE.hg19.assay_sdrf.txt
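
SDRF is a tab-delimited table with a header row, so it can be read with a generic parser. The Python sketch below is one way to do it. The column names and values in the sample are made up for illustration and may not match the actual assay files; read_sdrf is a hypothetical helper name. Each row is kept as a list of (column, value) pairs because an SDRF header can repeat a column name such as "Protocol REF".

```python
import csv
import io

def read_sdrf(handle):
    """Return (header, rows); each row is a list of (column, value) pairs."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    # Pairs, not a dict, so that repeated column names are not overwritten.
    rows = [list(zip(header, row)) for row in reader if row]
    return header, rows

# Made-up content standing in for a *.assay_sdrf.txt file.
sample = io.StringIO(
    "Source Name\tProtocol REF\tExtract Name\tProtocol REF\tLibrary Name\n"
    "sample_A\tprotocol_1\textract_A\tprotocol_2\tlibrary_A\n"
)
header, rows = read_sdrf(sample)
protocols = [value for name, value in rows[0] if name == "Protocol REF"]
print(protocols)
```

For a real file, open it with open(path, newline="") and pass the handle to read_sdrf.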

We are summarizing the details of the cell annotation. Please wait for a while.

How are the samples/data treated in detail?

See the SDRF files (see above) to understand which protocol is used for which step. The Protocols page will contain descriptions of the protocols.

How to download a series of files?

To download a batch of files, you can use lftp http://lftp.yar.ru/ over HTTPS on UNIX/Linux. For example, you can log in with lftp like this:

 % lftp -u oscf5  https://fantom5-collaboration.gsc.riken.jp/files/data/shared/UPDATE_007/f5pipeline/rat.tissue.hCAGE/

After logging in, type 'mget *.ctss.bed.gz' to get all the CTSS data in the directory:

 > mget *.ctss.bed.gz

You can type any lftp command, for example 'mirror' to get everything. Note that the BAM files are huge, since they include all the alignments and unmapped reads. We recommend downloading only the CTSS files unless you need to check the alignments in detail.

How to share my analysis results with the consortium members?

Great! We really encourage you to share your results with the consortium members. We have prepared a place for this purpose (below); please contact WP4 (fantom5-wp4@gsc.riken.jp).

 https://fantom5-collaboration.gsc.riken.jp/files/data/shared/contrib/

There are several options for sending data files: (i) e-mail (<1 Mbyte), (ii) attaching the results to this wiki (<100 Mbyte), (iii) WebDAV, and (iv) FedEx. WP4 will take care of putting your results into the contrib directory above.


Keywords, jargon

  • CAGE: cap analysis of gene expression http://www.pnas.org/content/100/26/15776
  • HeliScope : Sequencer developed by Helicos BioSciences http://www.helicosbio.com
  • HeliScopeCAGE : CAGE protocol for HeliScope sequencer
  • Lane : One region of a sequencer's flow cell that can be used to sequence a library. When multiplexing is not used, it is the smallest unit of sequencing. Also called channel, depending on the platform.
  • CAGEscan : a technology where paired-ends of random-primed 5'-capped molecules are sequenced, thus associating TSS to collections of downstream exonic sequences scanning the transcript. See also http://www.nature.com/nmeth/journal/v7/n7/full/nmeth.1470.html
  • BAO sequence : base-addition order sequence. HeliScope sequences bases in the order of C, T, A, and G. One of the representative artifacts in HeliScope is the CTAG repeat.
  • TPM : a unit of gene expression; tags per million ( = 1e6 * read counts / library size )
  • TSS: transcription start site.
  • CTSS: CAGE tag starting site, a genomic coordinate (single base pair) where the 5'-ends of aligned CAGE tags start. CTSS was introduced in FANTOM3 ( http://www.nature.com/ng/journal/v38/n6/full/ng1789.html ), and it is identical to the level1 cluster below.
  • TC: tag cluster, a continuous region on the genome where CAGE tags start. It was first defined in FANTOM3 as a group of overlapping CAGE tags on the same strand ( http://www.nature.com/ng/journal/v38/n6/full/ng1789.html )
  • level1 cluster: the same as CTSS. In FANTOM4, CAGE tags are grouped into three levels: level1 (transcription starting site), level2 (promoter), level3 (promoter region) (http://www.nature.com/ng/journal/v41/n5/full/ng.375.html). For FANTOM5, we will have to re-discuss the definitions of these genomic entities based on the latest data and knowledge.
  • recapping: a process that adds a cap to RNAs that do not have a cap at the 5'-end. A minor population of CAGE tags can be a consequence of this.
  • SAM format : SAM (Sequence Alignment/Map) format, which describes alignments of short reads to the genome. See http://samtools.sourceforge.net/
  • BAM format : the compressed binary version of SAM. See http://samtools.sourceforge.net/.
  • BED format : BED (Browser Extensible Data) format. http://genome.ucsc.edu/FAQ/FAQformat.html#format1
  • indexDP: alignment tool developed by Helicos BioSciences http://open.helicosbio.com/helisphere_user_guide/ch07s11.html . Initially we used indexDP to align HeliScopeCAGE reads, but we now use DELVE instead.
  • DELVE : alignment tool developed by Timo Lassmann, which is used for the HeliScopeCAGE reads in FANTOM5.
  • ISA-Tab format : A format to describe metadata proposed by Investigation/Study/Assay (ISA) infrastructure http://isatab.sourceforge.net/. It uses SDRF to describe relationships between samples and data.
  • SDRF : Sample and Data Relationship Format (SDRF) to describe metadata, which is introduced as a part of MAGE-tab. http://www.biomedcentral.com/1471-2105/7/489 . We are using SDRF to describe our experiments in detail.
  • CL : Cell type ontology http://www.obofoundry.org/cgi-bin/detail.cgi?id=cell
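
As a worked example of the TPM definition above ( 1e6 * read counts / library size ), here is a minimal Python sketch. The function name tags_per_million and the counts are made up, and the library size is taken here as the sum of the counts passed in, which is an assumption; the pipeline may define library size differently (for example, as the total number of mapped reads).

```python
def tags_per_million(read_counts):
    """Convert raw read counts to TPM: 1e6 * count / library size."""
    # Library size is assumed to be the sum of the given counts.
    library_size = sum(read_counts.values())
    if library_size == 0:
        raise ValueError("library has no reads")
    return {key: 1e6 * n / library_size for key, n in read_counts.items()}

# Made-up counts for three positions in one library.
counts = {"ctss_a": 30, "ctss_b": 10, "ctss_c": 60}
tpm = tags_per_million(counts)
print(tpm["ctss_a"])
```

With a library size of 100 reads, a position with 30 reads corresponds to 300,000 TPM, and the TPM values of one library always sum to one million.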