TFBS i1


1. Integrating data from different sources [Ivan]

The strategy from http://www.springerlink.com/content/q86n48151u25278w/ was based on equal weighting of independent data sources. We suggest ln(N)-based weighting instead, where N is the number of sequences in the set. This way big sequence sets get comparable weights, while small sets of 1-2 sequences do not have too big an impact on the motifs.
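A minimal sketch of the ln(N) weighting, to make the scaling concrete. The function name and the example set sizes are hypothetical; how the weight is then applied (per set or spread over its sequences) is not fixed by the text above and is left open here.

```python
import math


def source_weight(n_sequences: int) -> float:
    """Weight of one data source as ln(N), N = number of sequences in the set."""
    if n_sequences < 1:
        raise ValueError("a sequence set must contain at least one sequence")
    return math.log(n_sequences)


# Hypothetical set sizes, for illustration only.
set_sizes = {"small_set": 2, "medium_set": 100, "big_set": 10000}
weights = {name: source_weight(n) for name, n in set_sizes.items()}

for name, weight in weights.items():
    print(f"{name}: N={set_sizes[name]}, weight={weight:.3f}")
```

With plain ln(N), a 100-fold difference in size (100 vs 10000 sequences) gives only a 2-fold difference in weight. Note that a set with a single sequence gets weight ln(1) = 0, i.e. it is ignored entirely; whether that is desired is a decision the sketch does not make.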


For 'big' datasets (>100 sequences) - take data only for the selected species/TF.

For other cases - use family-wide motifs and join data from mammalian/vertebrate species (mostly from TRANSFAC).

For ChIP-Seq - take the top X peaks for each cell line/condition etc. (X to be estimated, approx. 1000-10000 sequences) and consider the result a 'big' dataset.
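The ChIP-Seq rule could be sketched as follows. The text does not say how peaks are ranked, so ranking by a single peak score (e.g. peak height) is an assumption, as are the record layout and the names `Peak` and `top_peaks`.

```python
from typing import List, Tuple

# Hypothetical record layout: (sequence, peak score).
Peak = Tuple[str, float]


def top_peaks(peaks: List[Peak], x: int) -> List[Peak]:
    """Return the x highest-scoring peaks, best first."""
    if x < 0:
        raise ValueError("x must be non-negative")
    return sorted(peaks, key=lambda peak: peak[1], reverse=True)[:x]


# Toy data for one cell line/condition; real sets would hold thousands of peaks.
peaks = [("ACGTAC", 12.0), ("TTGACA", 55.5), ("GGCATG", 31.0)]
selected = top_peaks(peaks, 2)
```

The selection would be run separately for each cell line/condition, so that every condition contributes its own top X peaks.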


2. Naming conventions

TF naming is according to the UniProt/SwissProt 'Entry name' (http://www.uniprot.org/manual/entry_name)


File naming:

TF_TYPE_ID.mfa


where

TF = transcription factor name (includes a reference to the species, like NR4A1_HUMAN). For poorly annotated data use UNKNOWN, MAMMAL, VERTEBRATE (useful mostly for TRANSFAC data).

TYPE = W or S or P (weighted, simple or peak data, see below). For this iteration - S or W.

ID = some unique ID (datasource-based)
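A small sketch of building and parsing names of the form TF_TYPE_ID.mfa. Because the TF part itself contains an underscore (NR4A1_HUMAN), the sketch assumes that the ID contains no underscores; this restriction is not stated above. The function names and the ID value `ID123` are hypothetical.

```python
import re
from typing import Tuple

VALID_TYPES = ("W", "S", "P")  # weighted, simple, peak

# Assumption: the ID has no underscores, so the last two fields are unambiguous.
_NAME_RE = re.compile(r"^(?P<tf>.+)_(?P<type>[WSP])_(?P<id>[^_]+)\.mfa$")


def build_name(tf: str, data_type: str, dataset_id: str) -> str:
    """Compose a file name TF_TYPE_ID.mfa."""
    if data_type not in VALID_TYPES:
        raise ValueError(f"TYPE must be one of {VALID_TYPES}, got {data_type!r}")
    if not dataset_id or "_" in dataset_id:
        raise ValueError("ID must be non-empty and free of underscores (assumption)")
    return f"{tf}_{data_type}_{dataset_id}.mfa"


def parse_name(filename: str) -> Tuple[str, str, str]:
    """Split a file name back into (TF, TYPE, ID)."""
    match = _NAME_RE.match(filename)
    if match is None:
        raise ValueError(f"not a TF_TYPE_ID.mfa name: {filename!r}")
    return match.group("tf"), match.group("type"), match.group("id")


name = build_name("NR4A1_HUMAN", "S", "ID123")
fields = parse_name(name)
```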


Input data format: extended multifasta

Internal structure:


> weight

sequence


for data with some relevant weights (sequence quality, peak height etc.)


or

> positional_weight positional_weight ...

sequence


for ChIP-Seq base coverage peak data (where each sequence position has its own weight)
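The two header variants can be read with one small parser, sketched below. It assumes numeric, whitespace-separated weights, allows a sequence to span several lines as in ordinary FASTA, and treats a header with more than one value as positional weights that must match the sequence length. The names `Record` and `parse_extended_mfa` are hypothetical, and an empty header (no weights) is accepted as an assumption for unweighted data.

```python
from typing import Iterable, List, NamedTuple


class Record(NamedTuple):
    weights: List[float]  # [] = none, [w] = per-sequence, else one per position
    sequence: str


def _make_record(weights: List[float], parts: List[str]) -> Record:
    sequence = "".join(parts)
    if len(weights) > 1 and len(weights) != len(sequence):
        raise ValueError(
            f"{len(weights)} positional weights for a sequence of length {len(sequence)}"
        )
    return Record(weights, sequence)


def parse_extended_mfa(lines: Iterable[str]) -> List[Record]:
    """Parse '> weight' / '> w1 w2 ...' headers, each followed by its sequence."""
    records: List[Record] = []
    weights = None  # weights of the record being read, None before the first header
    parts: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if weights is not None:
                records.append(_make_record(weights, parts))
            weights = [float(token) for token in line[1:].split()]
            parts = []
        elif weights is None:
            raise ValueError("sequence data before the first header line")
        else:
            parts.append(line)
    if weights is not None:
        records.append(_make_record(weights, parts))
    return records


example = ["> 2.5", "ACGT", "", "> 1 2 3 4", "AC", "GT"]
records = parse_extended_mfa(example)
```

In the toy input, the first record carries a single per-sequence weight and the second carries one weight per position.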


More examples here:

http://line.imb.ac.ru/smbsm/librettos/libretto_chipmunk/chipmunk_run.rhtml


3. Preparing data

(a) Parsing TRANSFAC [Ivan, work-in-progress]

(b) Downloading and processing UCSC ChIP-Seq collections [?]

(c) Downloading and processing High-Throughput SELEX data from http://genome.cshlp.org/content/20/6/861/suppl/DC1

(d) Methyl-binding proteins data [Yulia]


4. Motif representation

aligned words + position matrices (can be used as a basis for more complex motif models if necessary)
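As an illustration of how the two representations relate, the sketch below derives a position count matrix from a set of aligned words. It assumes gapless words of equal length over the ACGT alphabet and optional per-word weights; the function name and the toy words are hypothetical, and this is not meant as the project's actual matrix construction.

```python
from typing import Dict, List, Optional, Sequence

ALPHABET = "ACGT"


def position_count_matrix(
    words: Sequence[str], weights: Optional[Sequence[float]] = None
) -> Dict[str, List[float]]:
    """Sum (weighted) base counts per alignment column."""
    if not words:
        raise ValueError("at least one aligned word is required")
    length = len(words[0])
    if any(len(word) != length for word in words):
        raise ValueError("aligned words must all have the same length")
    if weights is None:
        weights = [1.0] * len(words)  # unweighted: every word counts once
    if len(weights) != len(words):
        raise ValueError("need exactly one weight per word")

    matrix = {base: [0.0] * length for base in ALPHABET}
    for word, weight in zip(words, weights):
        for pos, base in enumerate(word.upper()):
            if base not in matrix:
                raise ValueError(f"unexpected symbol {base!r} in {word!r}")
            matrix[base][pos] += weight
    return matrix


# Toy alignment, for illustration only.
pcm = position_count_matrix(["ACG", "ACT", "TCG"])
```

Keeping the aligned words alongside the matrix preserves information the matrix alone discards (e.g. dependencies between positions), which is what makes them a possible basis for more complex models.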