(653c) Computational Methods of a Generic Motif Discovery Algorithm for Sequential Data

Conference

AIChE Annual Meeting

Year

2006

Proceeding

2006 Annual Meeting

Group

Computing and Systems Technology Division

Session

Advances in Computational Methods and Numerical Analysis II

Time

Friday, November 17, 2006 - 1:20pm to 1:45pm

Authors

Jensen, K. L. - Presenter, MIT

Stephanopoulos, G. - Presenter, Massachusetts Institute of Technology

Styczynski, M. - Presenter, MIT

Rigoutsos, I. - Presenter, IBM

Motif discovery in sequential data is a problem of great interest, with many applications in biology and chemical engineering. However, previous methods for motif discovery have been unable to combine exhaustive search with complex motif representations, and each is typically applicable only to a certain class of problems. While in some cases simply finding a good set of pattern occurrences is acceptable, in many cases we would like to find all patterns in a dataset that meet some threshold of importance, as each set of pattern occurrences may have distinct biochemical relevance.

Here we present a Generic Motif Discovery Algorithm (Gemoda) for sequential data. Gemoda can be applied to any dataset with a sequential character, including both categorical and real-valued data. Gemoda deterministically discovers motifs that are maximal in composition and length. In addition, the algorithm allows any choice of similarity metric for finding motifs. Finally, Gemoda's output motifs are representation-agnostic: they can be represented using regular expressions, position weight matrices, or any number of other models for any type of sequential data. Since motif discovery tools with even two or three of these qualities are exceedingly rare, Gemoda is particularly novel in having all four of these characteristics.
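To make the idea of a user-chosen similarity metric concrete, the following is a minimal sketch of window-by-window comparison across sequences. It is not Gemoda's implementation, which this abstract does not describe in detail. The function names (`similar_windows`, `identity_score`, `neg_max_abs_diff`), the fixed window length, and the restriction to cross-sequence comparisons are all assumptions made for illustration. The sketch only shows how one comparison routine could serve both categorical and real-valued data when the metric is passed in as a parameter; it does not attempt the maximality guarantees described above.

```python
from itertools import combinations
from typing import Callable, List, Sequence, Tuple

# A window is identified by (sequence index, start offset). Hypothetical type.
Window = Tuple[int, int]


def similar_windows(
    seqs: Sequence[Sequence],
    length: int,
    similarity: Callable[[Sequence, Sequence], float],
    threshold: float,
) -> List[Tuple[Window, Window]]:
    """Return every pair of fixed-length windows, taken from different
    sequences, whose similarity score is at least `threshold`.

    The search is exhaustive over all window pairs, so the result is
    deterministic for a given input. Illustrative sketch only.
    """
    windows = [
        (s, i)
        for s, seq in enumerate(seqs)
        for i in range(len(seq) - length + 1)
    ]
    pairs = []
    for (s1, i1), (s2, i2) in combinations(windows, 2):
        if s1 == s2:
            continue  # simplification: compare across sequences only
        a = seqs[s1][i1:i1 + length]
        b = seqs[s2][i2:i2 + length]
        if similarity(a, b) >= threshold:
            pairs.append(((s1, i1), (s2, i2)))
    return pairs


def identity_score(a: Sequence, b: Sequence) -> float:
    """Categorical metric: number of positions with identical symbols."""
    return sum(x == y for x, y in zip(a, b))


def neg_max_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Real-valued metric: negated largest pointwise difference,
    so that higher scores mean more similar windows."""
    return -max(abs(x - y) for x, y in zip(a, b))


# Categorical data: only the shared "BCD" window matches exactly.
categorical = similar_windows(["ABCDE", "XBCDY"], 3, identity_score, 3)

# Real-valued data: windows must agree to within 0.15 at every position.
real_valued = similar_windows(
    [[0.0, 1.0, 2.0, 5.0], [9.0, 1.1, 2.1, 5.2]], 2, neg_max_abs_diff, -0.15
)
```

Swapping `identity_score` for `neg_max_abs_diff` changes the data type handled without touching the search routine, which is the design point the sketch is meant to illustrate.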

We demonstrate a number of applications of the algorithm, ranging from the discovery of motifs in amino acid and nucleotide sequences to the discovery of conserved protein substructures. We also briefly highlight the potential applications of Gemoda for metabolomic studies and for simple classification tasks. Finally, we discuss some of the issues faced in implementing the algorithm and the computational concepts taken from chemical engineering that were adapted or used as a basis for methods implemented in Gemoda. These methods are of particular importance, as they frequently bring otherwise intractable problems into the realm of feasibility through intelligent reduction of the search space.
