Discussion/Promoter definition

  • See ML threads starting from [fantom5:00032] and [fantom5:00064]

Data format

proposal 1

An OSCtable format. Specifically:

  • A tab delimited file
  • At the beginning of the file, these lines should appear:
 ##ProtocolREF = ARBITRARY_NAME_TO_DISTINGUISH_METHODS
 ##Date = 2011-01-12
 ##InputFile = https://fantom5-collaboration.gsc.riken.jp/files/data/shared/UPDATE_008/f5pipeline/human.cell_line.hCAGE/*.ctss.bed.gz
 ##InputFile = https://fantom5-collaboration.gsc.riken.jp/files/data/shared/UPDATE_008/f5pipeline/human.primary_cell.hCAGE/*.ctss.bed.gz
 ...
 ##ContactName = YOUR_NAME
 ##ContactEmail = YOUR_MAIL
  • The first 4 columns should be: chrom, start.0base, stop, strand
  • The subsequent columns (in arbitrary order) should include the number of reads, named as follows: counts.RNA_DESCRIPTION.CNhsXXXX.XXXX-XXXX
  • The subsequent columns (in arbitrary order) should include the tpm (tags per million), named as follows: tpm.RNA_DESCRIPTION.CNhsXXXX.XXXX-XXXX

For example:

chrom start.0base stop strand counts.Burkitt%27s%20lymphoma%20cell%20line%3aDAUDI.CNhs10739.10422-106C8 counts.acute%20lymphoblastic%20leukemia%20%28B-ALL%29%20cell%20line%3aBALL-1.CNhs11251.10455-106G5 counts.acute%20lymphoblastic%20leukemia%20%28B-ALL%29%20cell%20line%3aNALM-6.CNhs11282.10534-107G3 tpm.Burkitt%27s%20lymphoma%20cell%20line%3aDAUDI.CNhs10739.10422-106C8 tpm.acute%20lymphoblastic%20leukemia%20%28B-ALL%29%20cell%20line%3aBALL-1.CNhs11251.10455-106G5 tpm.acute%20lymphoblastic%20leukemia%20%28B-ALL%29%20cell%20line%3aNALM-6.CNhs11282.10534-107G3
chr1 10 20 + 8 14 103 0.8 1.4 10.3
chr1 30 40 - 24 3 7 2.4 0.3 0.7
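
As a rough illustration of how a file in this layout could be read, here is a minimal Python sketch. It assumes the "##Key = Value" header lines, a tab-separated column header and tab-separated data rows as described above; the function names parse_osctable and split_sample_column are hypothetical, not part of any existing OSCtable tool. It also assumes, from the example header, that RNA_DESCRIPTION is percent-encoded.

```python
from urllib.parse import unquote


def parse_osctable(lines):
    """Parse an iterable of text lines in the proposed layout.

    Returns (metadata, columns, rows). metadata maps each '##Key' to a
    list of values, because keys such as InputFile may appear repeatedly.
    """
    metadata = {}
    columns = None
    rows = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        if line.startswith("##"):
            # Header line of the form '##Key = Value'
            key, sep, value = line[2:].partition("=")
            if sep:
                metadata.setdefault(key.strip(), []).append(value.strip())
            continue
        fields = line.split("\t")
        if columns is None:
            # First non-header line holds the column names
            columns = fields
            continue
        rows.append(dict(zip(columns, fields)))
    return metadata, columns, rows


def split_sample_column(name):
    """Split 'counts.RNA_DESCRIPTION.CNhsXXXX.XXXX-XXXX' into its parts.

    The description is assumed to be percent-encoded, so it is decoded here.
    Splitting from the right keeps any '.' inside the description intact.
    """
    kind, rest = name.split(".", 1)
    description, library_id, rna_id = rest.rsplit(".", 2)
    return kind, unquote(description), library_id, rna_id
```

All values are kept as strings; converting counts to integers and tpm to floats is left to the caller, since the column order is arbitrary and the prefix (counts or tpm) has to be inspected first.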

proposal 2

Suggestions/requests

  • Many analyses will require a single-point coordinate for a TSS rather than a range. Previously we have used the modal tag position. The modal tag may or may not end up being the right choice for FANTOM5, but the reference position probably won't be the 5' or 3' edge of the tag distribution. So a useful additional field in this format would be "refPos".
    • Good point. How about treating this as a distinct clustering method? For example, one file for the 'tag cluster' method and one for the 'refPos' method. [kawaji]
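
To make the "modal tag position" idea concrete, here is a small Python sketch. It assumes per-position tag counts for one cluster on one strand are already available as a mapping. The function name modal_position is hypothetical, and the tie-breaking rule (smallest coordinate wins) is an arbitrary assumption, not something agreed in this discussion.

```python
def modal_position(ctss_counts):
    """Return the position with the highest tag count in one cluster.

    ctss_counts: dict mapping a 0-based position to its tag count.
    Ties are broken by taking the smallest coordinate (assumed rule).
    """
    if not ctss_counts:
        raise ValueError("cluster has no tag positions")
    # Sort key: highest count first, then lowest position
    return min(ctss_counts, key=lambda pos: (-ctss_counts[pos], pos))
```

For a cluster spanning positions 10 to 20 with most tags at position 13, this would give 13 as the candidate refPos, which is neither edge of the range.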