TFBS i1

1. Integrating data from different sources [Ivan]

The strategy from http://www.springerlink.com/content/q86n48151u25278w/ was based on equal weighting of independent data sources. We suggest ln(N)-based weighting instead, where N is the number of sequences in the set. This way big sequence sets get comparable weights, while small sets of 1-2 sequences do not have too large an impact on the motifs.
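As a rough illustration of the ln(N) weighting, the sketch below computes one weight per data source. The function names and the dataset sizes are hypothetical, and splitting the dataset weight evenly across its sequences is an assumption, not something these notes specify.

```python
import math

def dataset_weight(n_sequences: int) -> float:
    """Weight of one data source holding n_sequences sequences: ln(N)."""
    if n_sequences < 1:
        raise ValueError("a dataset needs at least one sequence")
    return math.log(n_sequences)

def per_sequence_weight(n_sequences: int) -> float:
    """Assumption: the dataset weight is shared equally by its sequences."""
    return dataset_weight(n_sequences) / n_sequences

# Hypothetical dataset sizes, for illustration only.
sizes = {"tiny": 2, "medium": 100, "large": 10000}
weights = {name: dataset_weight(n) for name, n in sizes.items()}
```

With plain ln(N), a set of 10000 sequences weighs only twice as much as a set of 100, and a single-sequence set gets weight 0. Whether a zero weight for N = 1 is intended is not stated here.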

For 'big' datasets (>100 sequences): take data only for the selected species/TF.

For other cases: use family-wide motifs and join data from mammalian/vertebrate species (mostly from TRANSFAC).

For ChIP-Seq: take the top X peaks for each cell line/condition etc. (X to be estimated, approximately 1000-10000 sequences) and treat the result as a 'big' dataset.
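Selecting the top X peaks per cell line/condition could look like the sketch below. The tuple layout (condition, score, sequence), the function name and the toy records are assumptions made for illustration. X is left as a parameter because its value is still to be estimated.

```python
import heapq
from collections import defaultdict

def top_peaks(peaks, x):
    """Keep the x highest-scoring peaks for each condition.

    peaks: iterable of (condition, score, sequence) tuples.
    Returns a dict: condition -> list of (score, sequence), best first.
    """
    by_condition = defaultdict(list)
    for condition, score, sequence in peaks:
        by_condition[condition].append((score, sequence))
    return {
        condition: heapq.nlargest(x, items, key=lambda item: item[0])
        for condition, items in by_condition.items()
    }

# Hypothetical toy records, not real peak data.
peaks = [
    ("HepG2", 50.0, "ACGTACGT"),
    ("HepG2", 75.5, "TTGACGTA"),
    ("HepG2", 10.2, "GGGACGTC"),
    ("K562", 33.0, "CACGTGAA"),
]
selected = top_peaks(peaks, x=2)
```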

2. Naming conventions

TF naming follows the UniProt/SwissProt 'Entry name' (http://www.uniprot.org/manual/entry_name).

File naming:

TF_TYPE_ID.mfa

where

TF = transcription factor name (includes a reference to the species, like NR4A1_HUMAN). For poorly annotated data use UNKNOWN, MAMMAL or VERTEBRATE (useful mostly for TRANSFAC data).

TYPE = W, S or P (weighted, simple or peak data, see below). For this iteration: S or W.

ID = some unique ID (datasource-based)
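The naming scheme can be sketched as a pair of helpers that build and parse file names. The helper names and the example ID M00123 are hypothetical. The regular expression assumes that a TF name is either one of the three placeholders or two upper-case alphanumeric parts joined by a single underscore, and that everything after TYPE belongs to the ID.

```python
import re

# Assumption: TF is a placeholder or NAME_SPECIES; the ID is whatever follows TYPE.
FILENAME_RE = re.compile(
    r"(?P<tf>UNKNOWN|MAMMAL|VERTEBRATE|[A-Z0-9]+_[A-Z0-9]+)"
    r"_(?P<type>[WSP])"
    r"_(?P<id>.+)\.mfa"
)

def make_filename(tf, data_type, dataset_id):
    """Build a TF_TYPE_ID.mfa file name."""
    if data_type not in ("W", "S", "P"):
        raise ValueError("TYPE must be W, S or P")
    return f"{tf}_{data_type}_{dataset_id}.mfa"

def parse_filename(name):
    """Split a TF_TYPE_ID.mfa file name into (tf, type, id)."""
    match = FILENAME_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"not a TF_TYPE_ID.mfa name: {name!r}")
    return match.group("tf", "type", "id")

name = make_filename("NR4A1_HUMAN", "W", "M00123")  # hypothetical ID
parts = parse_filename(name)
```

Because the TF part itself contains an underscore, splitting the name on underscores alone would be ambiguous. Matching the TF pattern first avoids that.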

Input data format: extended multifasta

Internal structure:

> weight
sequence

for data with some relevant weights (sequence quality, peak height etc.)

or

> positional_weight positional_weight ...
sequence

for ChIP-Seq base coverage peak data (where each sequence position has its own weight).

More examples here: http://line.imb.ac.ru/smbsm/librettos/libretto_chipmunk/chipmunk_run.rhtml
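The two header forms described above could be read with a small parser like the one below. This is a minimal sketch, not the reader used by any existing tool. It assumes every header holds only whitespace-separated numbers, that a sequence may span several lines, and that positional data carries exactly one weight per base. The function name and the toy input are hypothetical.

```python
def parse_extended_multifasta(text):
    """Parse records into (weights, sequence) pairs.

    weights is a list of floats: a single value for a per-sequence weight,
    or one value per base for positional (peak coverage) data.
    """
    records = []
    for block in text.split(">")[1:]:
        lines = block.strip().splitlines()
        if not lines:
            raise ValueError("empty record")
        header, *sequence_lines = lines
        # float() raises ValueError if the header is not purely numeric.
        weights = [float(value) for value in header.split()]
        sequence = "".join(line.strip() for line in sequence_lines)
        if len(weights) not in (1, len(sequence)):
            raise ValueError("expected 1 weight or one weight per base")
        records.append((weights, sequence))
    return records

# Toy input: one weighted sequence and one positional-weight sequence.
text = "> 12.5\nACGTAC\n> 1 3 7 7 2\nCACGT\n"
records = parse_extended_multifasta(text)
```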

3. Preparing data

(a) Parsing TRANSFAC [Ivan, work-in-progress]

(b) Downloading and processing UCSC ChIP-Seq collections [?]

(c) Downloading and processing High-Throughput SELEX data from http://genome.cshlp.org/content/20/6/861/suppl/DC1

(d) Methyl-binding proteins data [Yulia]

4. Motif representation

Aligned words + position matrices (can serve as a basis for more complex motif models if necessary).
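To illustrate how a position matrix follows from aligned words, the sketch below builds a (optionally weighted) position count matrix. The function name, the list-of-dicts layout and the toy words are assumptions. Only the bases A, C, G and T are handled, so any other symbol raises a KeyError.

```python
def position_count_matrix(words, weights=None):
    """Build a position count matrix from aligned words of equal length.

    Returns one dict per position mapping A, C, G, T to (weighted) counts.
    """
    if not words:
        raise ValueError("no words given")
    length = len(words[0])
    if any(len(word) != length for word in words):
        raise ValueError("aligned words must have equal length")
    if weights is None:
        weights = [1.0] * len(words)  # unweighted: every word counts once
    matrix = [dict.fromkeys("ACGT", 0.0) for _ in range(length)]
    for word, weight in zip(words, weights):
        for position, base in enumerate(word.upper()):
            matrix[position][base] += weight
    return matrix

# Hypothetical aligned words, for illustration only.
words = ["ACGT", "ACGA", "TCGT"]
pcm = position_count_matrix(words)
```

Passing per-word weights (for example, the sequence weights from a W-type file) gives a weighted count matrix with the same layout.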
