TFBS i1
1. Integrating data from different sources [Ivan]
The strategy from
http://www.springerlink.com/content/q86n48151u25278w/ was based on
equal weighting of independent data sources. We suggest ln(N)-based
weighting instead, where N is the number of sequences in the set.
With this scheme big sequence sets get comparable weights, while small
sets of 1-2 sequences do not have too large an impact on the motifs.
For 'big' datasets (>100 sequences): take data only for the selected
species/TF.
For other cases: use family-wide motifs and join data from
mammalian/vertebrate species (mostly from TRANSFAC).
For ChIP-Seq: take the top X peaks for each cell line/condition etc.
(X to be estimated, approximately 1000-10000 sequences) and treat the
result as a 'big' dataset.
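A minimal sketch of the ln(N)-based weighting, to make the effect on big and small sets concrete. The set names and sizes are hypothetical, and the normalization step (dividing by the sum of weights) is an assumption; the notes above only fix the ln(N) form itself.

```python
import math


def dataset_weights(set_sizes):
    """Return the ln(N) weight for each named sequence set."""
    weights = {}
    for name, n in set_sizes.items():
        if n < 1:
            raise ValueError(f"set {name!r} must contain at least one sequence")
        weights[name] = math.log(n)  # natural logarithm of the set size
    return weights


def normalized(weights):
    """Scale weights so they sum to 1 (an assumed convention, not from the notes)."""
    total = sum(weights.values())
    if total == 0:
        raise ValueError("all weights are zero; nothing to normalize")
    return {name: w / total for name, w in weights.items()}


# Hypothetical data sources and sizes, for illustration only.
sizes = {"chipseq_top_peaks": 5000, "selex": 500, "small_site_list": 2}
raw = dataset_weights(sizes)
for name, share in normalized(raw).items():
    print(f"{name}: N={sizes[name]}, ln(N)={raw[name]:.3f}, share={share:.3f}")
```

Under this sketch a tenfold difference in set size changes the weight by only ln(10), which is the "comparable weights" behaviour described above. Note that ln(1) = 0, so a literal ln(N) gives a single-sequence set no weight at all; whether that is intended is not stated here.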
2. Naming conventions
TF naming according to the UniProt/SwissProt 'Entry name'
(http://www.uniprot.org/manual/entry_name)
File naming:
TF_TYPE_ID.mfa
where
TF = transcription factor name (includes reference to the species,
like NR4A1_HUMAN)
For poorly annotated data use UNKNOWN, MAMMAL, VERTEBRATE (useful
mostly for TRANSFAC data)
TYPE = W, S or P (weighted, simple or peak data, see below). For
this iteration only S or W is used.
ID = some unique ID (datasource-based)
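The TF_TYPE_ID.mfa convention can be sketched as a pair of helper functions. Because the TF part itself contains an underscore (as in NR4A1_HUMAN), this sketch assumes that the ID contains no underscores so that the name can be split unambiguously; that restriction, the function names, and the example ID T00123 are all hypothetical.

```python
import re
from typing import NamedTuple


class DatasetName(NamedTuple):
    tf: str    # UniProt-style entry name, e.g. NR4A1_HUMAN
    kind: str  # W (weighted), S (simple) or P (peak)
    id: str    # datasource-based unique ID


# Greedy TF part, then a single-letter type, then an ID without underscores
# (the no-underscore rule is an assumption of this sketch).
_NAME_RE = re.compile(r"(?P<tf>.+)_(?P<kind>[WSP])_(?P<id>[^_]+)\.mfa")


def build_name(tf, kind, dataset_id):
    """Compose a file name following TF_TYPE_ID.mfa."""
    if kind not in ("W", "S", "P"):
        raise ValueError(f"unknown TYPE: {kind!r}")
    if not dataset_id or "_" in dataset_id:
        raise ValueError("ID must be non-empty and contain no underscores")
    return f"{tf}_{kind}_{dataset_id}.mfa"


def parse_name(filename):
    """Split a TF_TYPE_ID.mfa file name back into its three parts."""
    match = _NAME_RE.fullmatch(filename)
    if match is None:
        raise ValueError(f"not a TF_TYPE_ID.mfa name: {filename!r}")
    return DatasetName(match["tf"], match["kind"], match["id"])


print(parse_name(build_name("NR4A1_HUMAN", "W", "T00123")))
```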
Input data format: extended multifasta
Internal structure:
> weight
sequence
for data with some relevant weights (sequence quality, peak height etc.)
or
> positional_weight positional_weight ...
sequence
for ChIP-Seq base coverage peak data (where each sequence position has
its own weight)
More examples here:
http://line.imb.ac.ru/smbsm/librettos/libretto_chipmunk/chipmunk_run.rhtml
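A possible reader for the extended multifasta structure described above, given as a sketch under several assumptions: header lines contain only whitespace-separated numbers after ">", a single number is a per-sequence weight, several numbers are positional weights that must match the sequence length, and a sequence may span multiple lines. The names Record and parse_extended_mfa are hypothetical; the linked ChIPMunk page remains the reference for the actual format.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Record(NamedTuple):
    weights: List[float]  # one value (sequence weight) or one value per position
    sequence: str


def _make_record(weights, chunks):
    sequence = "".join(chunks)
    # More than one weight is read as positional weights: one per base.
    if len(weights) > 1 and len(weights) != len(sequence):
        raise ValueError(
            f"{len(weights)} positional weights for a sequence of length {len(sequence)}"
        )
    return Record(weights, sequence)


def parse_extended_mfa(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (weights, sequence) records from extended multifasta lines."""
    weights = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if weights is not None:
                yield _make_record(weights, chunks)
            weights = [float(token) for token in line[1:].split()]
            chunks = []
        else:
            if weights is None:
                raise ValueError("sequence line found before any header")
            chunks.append(line)
    if weights is not None:
        yield _make_record(weights, chunks)


example = ["> 2.5", "ACGT", "> 1 3 7 2", "TTGA"]
for record in parse_extended_mfa(example):
    print(record)
```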
3. Preparing data
(a) Parsing TRANSFAC [Ivan, work-in-progress]
(b) Downloading and processing UCSC ChIP-Seq collections [?]
(c) Downloading and processing High-Throughput SELEX data from http://genome.cshlp.org/content/20/6/861/suppl/DC1
(d) Methyl-binding proteins data [Yulia]
4. Motif representation
aligned words + position matrices (these can serve as a basis for
more complex motif models if necessary)
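To illustrate how the two representations relate, the sketch below builds a position matrix from a set of aligned words. It assumes that "position matrix" means a per-position count matrix over A, C, G and T, optionally weighted per word; the function name and the example words are hypothetical, and other matrix types (frequencies, log-odds) would need a further step.

```python
def position_count_matrix(words, weights=None):
    """Build a (weighted) position count matrix from equal-length aligned words."""
    if not words:
        raise ValueError("at least one aligned word is required")
    length = len(words[0])
    if any(len(word) != length for word in words):
        raise ValueError("aligned words must all have the same length")
    if weights is None:
        weights = [1.0] * len(words)  # unweighted: every word counts once
    if len(weights) != len(words):
        raise ValueError("need exactly one weight per word")

    # One dict of base counts per alignment column.
    matrix = [{base: 0.0 for base in "ACGT"} for _ in range(length)]
    for word, weight in zip(words, weights):
        for position, base in enumerate(word.upper()):
            if base not in matrix[position]:
                raise ValueError(f"unexpected symbol {base!r} in word {word!r}")
            matrix[position][base] += weight
    return matrix


# Hypothetical aligned words, for illustration only.
for column in position_count_matrix(["ACG", "ACT", "TCG"]):
    print(column)
```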