Covers a broad range of topics from the fundamentals of pattern discovery algorithms to basic statistics to state-of-the-art implementation techniquesClarifies how computational methods work and what problems they solveIntroduces basic algorithmic concepts, such as trees, before exploring them in detailFocuses on models of biological sequences, including DNA, RNA, and protein sequencesIncludes illustrative proofs to justify concise factual statements as well as exercises at the end of each chapter for self-study or classroom teaching
The computational methods of bioinformatics are being used more and more to process the large volume of current biological data. Promoting an understanding of the underlying biology that produces this data, Pattern Discovery in Bioinformatics: Theory and Algorithms provides the tools to study regularities in biological data.
Taking a systematic approach to pattern discovery, the book supplies sound mathematical definitions and efficient algorithms to explain vital information about biological data. It explores various data patterns, including strings, clusters, permutations, topology, partial orders, and boolean expressions. Each of these classes captures a different form of regularity in the data, providing possible answers to a wide range of questions. The book also reviews basic statistics, including probability, information theory, and the central limit theorem.
This self-contained book provides a solid foundation in computational methods, enabling the solution of difficult biological questions.
Table of Contents
Ubiquity of Patterns
Motivations Form Biology
The Need for Rigor
Who Is a Reader of This Book?
Tree Problem 1: (Minimum Spanning Tree)
Tree Problem 2: (Steiner Tree)
Tree Problem 3: (Minimum Mutation Labeling)
Storing and Retrieving Elements
NP-Complete Class of Problems
The Bare Truth about Inferential Statistics
WHAT ARE PATTERNS?
When Is a Pattern Specification Non-Trivial?
Classes of Patterns
PATTERNS ON LINEAR STRINGS
MODELING THE STREAM OF LIFE
Modeling a Biopolymer
Hidden Markov Model (HMM)
Comparison of the Schemes
STRING PATTERN SPECIFICATIONS
ALGORITHMS AND PATTERN STATISTICS
Measure of Surprise
Introduction: Local Multiple Alignment
Probabilistic Model: Motif Profile
The Learning Problem
Algorithms to Learn a Motif Profile
An Expectation Maximization Framework
A Gibbs Sampling Strategy
Interpreting the Motif Profile in Terms of p
THE SUBTLE MOTIF
Introduction: Consensus Motif
Combinatorial Model: Subtle Motif
Distance between Motifs
Statistics of Subtle Motifs
A Combinatorial Algorithm
A Probabilistic Algorithm
A Modular Solution
PATTERNS ON META-DATA
How Many Permutation Patterns?
Parikh Mapping-Based Algorithm
Intervals to PQ Trees
PERMUTATION PATTERN PROBABILITIES
What Are Topological Motifs?
The Topological Motif
Compact Topological Motifs
The Discovery Method
Related Classical Problems
SET-THEORETIC ALGORITHMIC TOOLS
Some Basic Properties of Finite Sets
Partial Order Graph G(S,E) of Sets
Boolean Closure of Sets
Consecutive (Linear) Arrangement of Set Members
Maximal Set Intersection Problem (maxSIP)
Minimal Set Intersection Problem (minSIP)
Adapting the Enumeration Scheme
EXPRESSION AND PARTIAL ORDER MOTIFS
Extracting (monotone CNF) Boolean Expressions
Extracting Partial Orders
Statistics of Partial Orders
Application: Partial Order of Expressions
Exercises appear at the end of every chapter.
"Pattern Discovery in Bioinformatics provides an important message to all bioinformatics students and researchers. In contrast to some books that present computational biology as a combination of a BLAST user manual and storytelling, Laxmi Parida's new book makes the crucial point that most bioinformatics success stories require algorithmic and statistical ingenuity. It is a comprehensive and excellent book on the subject and I recommend it to everyone who has an interest in bioinformatics."
-Pavel Pevzner, Ronald R. Taylor Professor of Computer Science, University of California, San Diego, USA
"Laxmi Parida has written a comprehensive and very readable treatment of the discovery and analysis of patterns. Not just another textbook on string theory or combinatorics, nor an arcane reference book on the design of algorithms, nor a source book of sequence analysis software, but a coherent treatment of the kinds of patterns that computational biologists discover at the genomic level, in nucleic acid sequence, in protein structure, in microarray data, and in other linear orders and more general formal structures. Eclectically drawing on the whole gamut of discrete mathematical domains, Parida has built on her own numerous contributions during a distinguished career in putting together her unique take on this rapidly growing field. She insists throughout on mathematical rigour but makes the material eminently approachable by careful development from first principles, clear motivation and exposition, and a wealth of interesting and illustrative examples and exercises ranging from easy to challenging. Any enterprising student or neophyte in this field will gain a clear advantage by working through this book and adopting the wide-ranging but precise and sophisticated attitude to the many new problems that are continually emerging as biological data become more diverse and complex, as well as numerous. And established researchers seeking to broaden their perspective will enjoy the leisurely and insightful presentation of both new and familiar aspects of currently relevant research techniques."
-David Sankoff, University of Ottawa, Canada
"Biology has been transformed from a data poor to a data rich field, with massive accumulation of disparate types of data, for example huge databases of sequences (DNA, RNA, or protein). This data allows important biological insights to be made, partly by finding patterns and motifs that are conserved across many individuals or species; there is now a huge biological literature reporting on such conserved patterns and motifs that have been found in biological datasets. In contrast to the area of pattern matching, the patterns and motifs are generally not known ahead of time, but must be identified or discovered from the data; this task is often very subtle and difficult because the patterns and motifs may be short, may be highly degenerate (containing wildcards and variable length elements), may be ordered differently in different genomes, and are generally hidden in that they make up a small fraction of the data. For particular biological applications, even the definition of a relevant pattern may be difficult to state clearly, or may be unresolved. This has lead to a wonderful opportunity to develop a rigorous mathematical and algorithmic theory of patterns and pattern discovery in biological data. Laxmi Parida has been a leader in this effort, making significant contributions to both deep theory and to practical methods in biological pattern discovery. Her book explores a large subset of that theory, motivated by biological sequence and network data. I believe this book will be viewed as a seminal contribution and hope it draws many new people into the area of rigorous pattern discovery in bioinformatics."
-Dan Gusfield, University of California, Davis, USA