
Construction of the BTB domain database

Why build a separate database for this domain?

Source of sequences:

The sequences used to identify BTB domains are drawn from the most up-to-date versions of Ensembl and Uniprot. The database is automatically updated as new versions of these databases are released (usually on a monthly basis).

Construction of HMMs and sequence searches:

Profile HMMs were used to detect BTB domains in sequence collections. This allowed sequence searches using our in-house definition of the BTB domain, which is based on structural alignment. This definition differs from those used by the SMART and PFAM databases, and therefore leads to slightly different numbers of identified BTB domains. Specifically, our HMMs detect the beta1 and alpha1 secondary structure elements (the "N-terminal extension") that are missed by SMART and PFAM. Furthermore, the sequence identity between the various BTB domain subfamilies (zinc fingers, kelch, T1, Skp1, etc.) is negligible, so careful alignment was required when training our custom HMMs. Once constructed, these HMMs gave better coverage in the identification of BTB domains than other databases. Lastly, the custom HMMs allowed alignments of higher quality than those available in SMART and PFAM to be generated once BTB domain sequences were identified.

For more about the panel of HMMs that was generated, click here.

hmmsearch was employed to search the Uniprot and Ensembl sequence collections.

Filtering of hits and identification of other domains:

hmmsearch results from the SWISSPROT/SpTrEMBL and Ensembl sequence collections were filtered by E-value and checked for duplicate sequences. Sequences scoring with E-values less than 10 against any HMM were retained. Of these, sequences scoring in the range 0.1 to 10 were evaluated manually and checked against annotation at Ensembl or Uniprot to confirm the presence of a BTB domain. Sequences with identical E-values and a pairwise percent identity of 100% were combined into a single record, retaining accession numbers and aliases.
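
The filtering rules above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the pipeline actually used: the `Hit` record, the function names, and the use of exact string equality as a stand-in for 100% pairwise identity are all hypothetical.

```python
from dataclasses import dataclass, field

# Thresholds taken from the text: E-values below 10 are retained,
# and those in the range 0.1 to 10 are flagged for manual evaluation.
REVIEW_FROM = 0.1
RETAIN_BELOW = 10.0


@dataclass
class Hit:
    """Hypothetical record for one hmmsearch hit."""
    accession: str
    sequence: str
    evalue: float
    aliases: list = field(default_factory=list)


def classify(hit):
    """Return 'retain', 'manual_review' or 'discard' for a hit."""
    if hit.evalue < REVIEW_FROM:
        return "retain"
    if hit.evalue < RETAIN_BELOW:
        return "manual_review"
    return "discard"


def merge_duplicates(hits):
    """Combine hits with identical E-values and identical sequences.

    Assumption: identical sequence strings stand in for a pairwise
    percent identity of 100%. The first accession seen is kept as the
    primary one; later accessions and aliases are kept as aliases.
    """
    merged = {}
    for hit in hits:
        key = (hit.sequence, hit.evalue)
        if key in merged:
            merged[key].aliases.extend([hit.accession] + hit.aliases)
        else:
            merged[key] = Hit(hit.accession, hit.sequence,
                              hit.evalue, list(hit.aliases))
    return list(merged.values())
```

In this sketch a Uniprot and an Ensembl entry for the same protein collapse into one record, with the second accession preserved in `aliases`.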

Domains other than the BTB domain in the full-length protein were identified using the Interpro and/or PFAM field from Ensembl.

Clustering of splice variants:

Splice variants were collapsed into clusters with a single BTB ID by checking Ensembl Gene IDs and/or pairwise BLAST searches.
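
One way to picture this clustering step is a union-find over accessions, sketched below in Python. The function name, the input layout, and the idea of passing BLAST matches as precomputed accession pairs are assumptions made for illustration; the criteria used to decide that two sequences match by BLAST are not specified here.

```python
def cluster_variants(records, blast_pairs=()):
    """Assign one BTB ID per cluster of splice variants.

    records: list of (accession, ensembl_gene_id or None) tuples.
    blast_pairs: iterable of (accession_a, accession_b) pairs judged
        to be variants of the same gene by pairwise BLAST (assumed to
        be computed elsewhere).
    Returns a dict mapping each accession to an integer BTB ID.
    """
    parent = {acc: acc for acc, _ in records}

    def find(x):
        # Follow parent links to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Sequences sharing an Ensembl Gene ID belong to one cluster.
    first_with_gene = {}
    for acc, gene_id in records:
        if gene_id is None:
            continue
        if gene_id in first_with_gene:
            union(first_with_gene[gene_id], acc)
        else:
            first_with_gene[gene_id] = acc

    # Sequences linked by a BLAST match also belong to one cluster.
    for a, b in blast_pairs:
        union(a, b)

    # Number clusters in order of first appearance.
    btb_ids = {}
    result = {}
    for acc, _ in records:
        root = find(acc)
        if root not in btb_ids:
            btb_ids[root] = len(btb_ids) + 1
        result[acc] = btb_ids[root]
    return result
```

Union-find is used here so that the two kinds of evidence can be combined: a Uniprot entry with no Ensembl Gene ID can still join a gene-based cluster through a BLAST link.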

FASTA definition lines:

Several FASTA definition lines have been assigned to each protein sequence to facilitate later analyses, such as alignment and clustering, which are easier with informative definition lines. For example, Treeview can display only eight characters to label a protein sequence when viewing a phylogenetic clustering diagram; in this case the Short FASTA definition line is used, which still allows identification of the species and family of the sequence.

Three types of FASTA definition line were created for each sequence:

  1. Uniprot or Ensembl assigned FASTA definition line
  2. Short FASTA definition: 8 characters according to the formula

>btb_id(4 numbers) species(2 letters) otherdomains(2 letters)

For example, btb_id = 0001 is Homo sapiens PLZF, a protein containing C2H2 zinc-finger domains. This sequence received the short FASTA code >0001HSZF

  3. Named FASTA definition: according to the formula

>species(2 letters) . protein_name(4 letters)

PLZF would receive the code >HS.PLZF
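
The two custom formulas above can be written as small Python helpers. This is a minimal sketch: the function names are hypothetical, and truncating a longer protein name to its first four characters is an assumption, since the handling of longer names is not described.

```python
def short_defline(btb_id, species_code, domain_code):
    """Build the 8-character Short FASTA definition line.

    Layout: 4-digit BTB ID, 2-letter species code, 2-letter code for
    the other domains in the protein.
    """
    if not 0 <= btb_id <= 9999:
        raise ValueError("btb_id must fit in four digits")
    if len(species_code) != 2 or len(domain_code) != 2:
        raise ValueError("species and domain codes must be two letters")
    return f">{btb_id:04d}{species_code.upper()}{domain_code.upper()}"


def named_defline(species_code, protein_name):
    """Build the Named FASTA definition line.

    Layout: 2-letter species code, a dot, 4-letter protein name.
    Assumption: names longer than four characters are truncated.
    """
    if len(species_code) != 2:
        raise ValueError("species code must be two letters")
    return f">{species_code.upper()}.{protein_name.upper()[:4]}"


# The PLZF example from the text:
print(short_defline(1, "HS", "ZF"))   # >0001HSZF
print(named_defline("HS", "PLZF"))    # >HS.PLZF
```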

Loading into Oracle database and construction of interface:

Tables were loaded into Oracle. PHP was utilized to process entries from the HTML interface, and pass these along as queries to the Oracle database.

This page last updated: June 14, 2005.