[Frontiers in Bioscience E5, 785-797, January 1, 2013]

MotifOrganizer: a scalable model-based motif clustering tool for mammalian genomes

Zhaohui S. Qin1,2,3, Misha Bilenky4, Gang Su5, Steven J. M. Jones4

1Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, Atlanta GA 30322, USA, 2Department of Biomedical Informatics, Emory University School of Medicine, Atlanta GA 30322, USA, 3Center for Comprehensive Informatics, Emory University, Atlanta GA 30322, USA, 4British Columbia Cancer Agency Genome Sciences Centre, Vancouver, BC, V5Z 4E6, Canada, 5Department of Computational Medicine & Bioinformatics, University of Michigan, Ann Arbor MI 48109, USA

TABLE OF CONTENTS

1. Abstract
2. Introduction
3. Material and Methods
3.1. Input Data
3.2. BMCES
3.3. MotifOrganizer
3.4. Distance-based clustering approaches
3.5. Quality and uncertainty measures
3.6. Clustering accuracy
3.7. Motif matching tools
4. Results
4.1. JASPAR
4.2. TRANSFAC
4.3. cisRED
5. Discussion
6. Acknowledgement
7. References

1. ABSTRACT

Assembling a comprehensive catalog of all transcription factors (TFs) and the genes that they regulate (their regulons) is important for understanding gene regulation. The sequence-specific conserved binding profiles of TFs can be characterized from whole genome sequences with phylogenetic approaches, and a large number of such profiles have been released. Effective mining of these data sources could reveal novel functional elements computationally. Because binding sites are variable, profiles pertinent to the same TF need to be generalized by clustering. The resulting summarized familial profile is effective in identifying unknown binding sites and thus leads to the prediction of gene co-regulation. Here we report MotifOrganizer, a scalable model-based clustering algorithm designed for grouping motifs identified from large-scale comparative genomics studies on mammalian species. The new algorithm allows grouping of motifs with variable widths, and a novel two-stage operation scheme further increases its scalability. MotifOrganizer performed favorably compared with distance-based and single-stage model-based clustering tools on simulated data. Tests on approximately 150k motifs from the cisRED human database demonstrated that MotifOrganizer can effectively cluster whole genome sets of mammalian motifs.