Data Mining for Bioinformatics

ISBN 9780849328015
Cat# 2801



eBook (VitalSource)
ISBN 9781420004304
Cat# E2801




  • Discusses design principles of data mining systems for bioinformatics
  • Examines data mining of protein data, gene expression data, and medical image data
  • Includes data cleansing, translation and transformation, and dimensionality reduction


Covering theory, algorithms, and methodologies, as well as data mining technologies, Data Mining for Bioinformatics provides a comprehensive discussion of data-intensive computations used in data mining with applications in bioinformatics. It supplies a broad, yet in-depth, overview of the application domains of data mining for bioinformatics to help readers from both biology and computer science backgrounds gain an enhanced understanding of this cross-disciplinary field.

The book offers authoritative coverage of data mining techniques, technologies, and frameworks used for storing, analyzing, and extracting knowledge from large databases in the bioinformatics domains, including genomics and proteomics. It begins by describing the evolution of bioinformatics and highlighting the challenges that can be addressed using data mining techniques. Introducing the various data mining techniques that can be employed in biological databases, the text is organized into four sections:

  1. Supplies a complete overview of the evolution of the field and its intersection with computational learning
  2. Describes the role of data mining in analyzing large biological databases, explaining the breadth of the feature selection and feature extraction techniques that data mining has to offer
  3. Focuses on concepts of unsupervised learning using clustering techniques and their application to large biological data
  4. Covers supervised learning using the classification techniques most commonly used in bioinformatics, and addresses the need for validation and benchmarking of inferences derived using either clustering or classification

The book describes the biological databases most often referred to in bioinformatics and includes a detailed list of the applications of advanced clustering algorithms used in bioinformatics. Highlighting the challenges encountered when applying classification to biological databases, it considers both single-classifier and ensemble-classifier systems and shares effort-saving tips for model selection and performance estimation strategies.

Table of Contents

Introduction to Bioinformatics
Transcription and Translation
     The Central Dogma of Molecular Biology
The Human Genome Project
Beyond the Human Genome Project
     Sequencing Technology
          Dideoxy Sequencing
          Cyclic Array Sequencing
          Sequencing by Hybridization
          Mass Spectrometry
          Nanopore Sequencing
     Next-Generation Sequencing
          Challenges of Handling NGS Data
     Sequence Variation Studies
          Kinds of Genomic Variations
          SNP Characterization
     Functional Genomics
          Splicing and Alternative Splicing
          Microarray-Based Functional Genomics
     Comparative Genomics
     Functional Annotation
          Function Prediction Aspects

Biological Databases and Integration
Introduction: Scientific Work Flows and Knowledge Discovery
Biological Data Storage and Analysis
     Challenges of Biological Data
     Classification of Bioscience Databases
          Primary versus Secondary Databases
          Deep versus Broad Databases
          Point Solution versus General Solution Databases
     Gene Expression Omnibus (GEO) Database
     The Protein Data Bank (PDB)
The Curse of Dimensionality
Data Cleaning
     Problems of Data Cleaning
     Challenges of Handling Evolving Databases
          Problems Associated with Single-Source Techniques
          Problems Associated with Multisource Integration
     Data Argumentation: Cleaning at the Schema Level
     Knowledge-Based Framework: Cleaning at the Instance Level
     Data Integration
          Sequence Retrieval System (SRS)
          IBM’s DiscoveryLink
          Wrappers: Customizable Database Software
          Data Warehousing: Data Management with Query Optimization
          Data Integration in the PDB

Knowledge Discovery in Databases
Analysis of Data Using Large Databases
     Distance Metrics
     Data Cleaning and Data Preprocessing
Challenges in Data Cleaning
     Models of Data Cleaning
          Proximity-Based Techniques
          Parametric Methods
          Nonparametric Methods
          Semiparametric Methods
          Neural Networks
          Machine Learning
          Hybrid Systems
Data Integration
     Data Integration and Data Linkage
     Schema Integration Issues
     Field Matching Techniques
          Character-Based Similarity Metrics
          Token-Based Similarity Metrics
          Data Linkage/Matching Techniques
Data Warehousing
     Online Analytical Processing
     Differences between OLAP and OLTP
     OLAP Tasks
     Life Cycle of a Data Warehouse

Section II

Feature Selection and Extraction Strategies in Data Mining
Data Transformation
     Data Smoothing by Discretization
          Discretization of Continuous Attributes
     Normalization and Standardization
          Min-Max Normalization
          z-Score Standardization
          Normalization by Decimal Scaling
Features and Relevance
     Strongly Relevant Features
     Weakly Relevant to the Dataset/Distribution
     Pearson Correlation Coefficient
     Information Theoretic Ranking Criteria
Overview of Feature Selection
     Filter Approaches
     Wrapper Approaches
Filter Approaches for Feature Selection
     FOCUS Algorithm
     Relief Method: Weight-Based Approach
Feature Subset Selection Using Forward Selection
     Gram-Schmidt Forward Feature Selection
Other Nested Subset Selection Methods
Feature Construction and Extraction
     Matrix Factorization
          LU Decomposition
          QR Factorization to Extract Orthogonal Features
          Eigenvalues and Eigenvectors of a Matrix
     Other Properties of a Matrix
     A Square Matrix and Matrix Diagonalization
          Symmetric Real Matrix: Spectral Theorem
          Singular Value Decomposition (SVD)
     Principal Component Analysis (PCA)
          Jordan Decomposition of a Matrix
          Principal Components
     Partial Least-Squares-Based Dimension Reduction (PLS)
     Factor Analysis (FA)
     Independent Component Analysis (ICA)
     Multidimensional Scaling (MDS)

Feature Interpretation for Biological Learning
Normalization Techniques for Gene Expression Analysis
     Normalization and Standardization Techniques
          Expression Ratios
          Intensity-Based Normalization
          Total Intensity Normalization
          Intensity-Based Filtering of Array Elements
     Identification of Differentially Expressed Genes
     Selection Bias of Gene Expression Data
Data Preprocessing of Mass Spectrometry Data
     Data Transformation Techniques
          Baseline Subtraction (Smoothing)
          Peak Detection
          Peak Alignment
     Application of Dimensionality Reduction Techniques for MS Data Analysis
     Feature Selection Techniques
          Univariate Methods
          Multivariate Methods
Data Preprocessing for Genomic Sequence Data
     Feature Selection for Sequence Analysis
Ontologies in Bioinformatics
     The Role of Ontologies in Bioinformatics
          Description Logics
          Gene Ontology (GO)
          Open Biomedical Ontologies (OBO)

Section III

Clustering Techniques in Bioinformatics
Clustering in Bioinformatics
Clustering Techniques
     Distance-Based Clustering and Measures
          Mahalanobis Distance
          Minkowski Distance
          Pearson Correlation
          Binary Features
          Nominal Features
          Mixed Variables
     Distance Measure Properties
     k-Means Algorithm
     k-Modes Algorithm
     Genetic Distance Measure (GDM)
Applications of Distance-Based Clustering in Bioinformatics
     New Distance Metric in Gene Expressions for Coexpressed Genes
     Gene Expression Clustering Using Mutual Information Distance Measure
     Gene Expression Data Clustering Using a Local Shape-Based Clustering
          Exact Similarity Computation
          Approximate Similarity Computation
Implementation of k-Means in WEKA
Hierarchical Clustering
     Agglomerative Hierarchical Clustering
     Cluster Splitting and Merging
     Calculate Distance between Clusters
     Applications of Hierarchical Clustering Techniques in Bioinformatics
          Hierarchical Clustering Based on Partially Overlapping and Irregular Data
          Cluster Stability Estimation for Microarray Data
          Comparing Gene Expression Sequences Using Pairwise Average Linking
Implementation of Hierarchical Clustering
Self-Organizing Maps Clustering
     SOM Algorithm
     Application of SOM in Bioinformatics
          Identifying Distinct Gene Expression Patterns Using SOM
          SOTA: Combining SOM and Hierarchical Clustering for Representation of Genes
Fuzzy Clustering
     Fuzzy c-Means (FCM)
     Application of Fuzzy Clustering in Bioinformatics
          Clustering Genes Using Fuzzy J-Means and VNS Methods
          Fuzzy k-Means Clustering on Gene Expression
          Comparison of Fuzzy Clustering Algorithms
Implementation of Expectation Maximization Algorithm

Advanced Clustering Techniques
Graph-Based Clustering
     Graph-Based Cluster Properties
     Cut in a Graph
     Intracluster and Intercluster Density
Measures for Identifying Clusters
     Identifying Clusters by Computing Values for the Vertices or Vertex Similarity
          Distance and Similarity Measure
          Adjacency-Based Measures
          Connectivity Measures
     Computing the Fitness Measure
          Density Measure
          Cut-Based Measures
Determining a Split in the Graph
     Spectral Methods
Graph-Based Algorithms
     Chameleon Algorithm
     CLICK Algorithm
Application of Graph-Based Clustering in Bioinformatics
     Analysis of Gene Expression Data Using Shortest Path (SP)
     Construction of Genetic Linkage Maps Using Minimum Spanning Tree of a Graph
     Finding Isolated Groups in a Random Graph Process
     Implementation in Cytoscape
          Seeding Method
Kernel-Based Clustering
     Kernel Functions
     Gaussian Function
Application of Kernel Clustering in Bioinformatics
     Kernel Clustering
     Kernel-Based Support Vector Clustering
     Analyzing Gene Expression Data Using SOM and Kernel-Based Clustering
Model-Based Clustering for Gene Expression Data
     Gaussian Mixtures
     Diagonal Model
     Model Selection
Relevant Number of Genes
     A Resampling-Based Approach for Identifying Stable and Tight Patterns
     Overcoming the Local Minimum Problem in k-Means Clustering
     Tight Clustering
     Tight Clustering of Gene Expression Time Courses
Higher-Order Mining
     Clustering for Association Rule Discovery
     Clustering of Association Rules
     Clustering Clusters

Section IV

Classification Techniques in Bioinformatics
     Bias-Variance Trade-Off in Supervised Learning
     Linear and Nonlinear Classifiers
     Model Complexity and Size of Training Data
     Dimensionality of Input Space
Supervised Learning in Bioinformatics
Support Vector Machines (SVMs)
     Large Margin of Separation
     Soft Margin of Separation
     Kernel Functions
     Applications of SVM in Bioinformatics
          Gene Expression Analysis
          Remote Protein Homology Detection
Bayesian Approaches
     Bayes’ Theorem
     Naive Bayes Classification
          Handling of Prior Probabilities
          Handling of Posterior Probability
     Bayesian Networks
          Capturing Data Distributions Using Bayesian Networks
          Equivalence Classes of Bayesian Networks
          Learning Bayesian Networks
          Bayesian Scoring Metric
     Application of Bayesian Classifiers in Bioinformatics
          Binary Classification
          Multiclass Classification
          Computational Challenges for Gene Expression Analysis
Decision Trees
     Tree Pruning
Ensemble Approaches
          Unweighted Voting Methods
          Confidence Voting Methods
          Ranked Voting Methods
          Seeking Prospective Classifiers to Be Part of the Ensemble
          Choosing an Optimal Set of Classifiers
          Assigning Weight to the Chosen Classifier
     Random Forest
     Application of Ensemble Approaches in Bioinformatics
Computational Challenges of Supervised Learning

Validation and Benchmarking
Introduction: Performance Evaluation Techniques
Classifier Validation
     Model Selection
          Challenges of Model Selection
     Performance Estimation Strategies
          Three-Way Split
          k-Fold Cross-Validation
          Random Subsampling
Performance Measures
     Sensitivity and Specificity
     Precision, Recall, and f-Measure
     ROC Curve
Cluster Validation Techniques
     The Need for Cluster Validation
          External Measures
          Internal Measures
     Performance Evaluation Using Validity Indices
          Silhouette Index (SI)
          Davies-Bouldin and Dunn’s Index
          Calinski-Harabasz (CH) Index
          Rand Index


Author Bio(s)

Sumeet Dua is an Upchurch endowed professor of computer science and interim director of computer science, electrical engineering, and electrical engineering technology in the College of Engineering and Science at Louisiana Tech University. He obtained his PhD in computer science from Louisiana State University in 2002. He has coauthored or edited 3 books, has published over 50 research papers in leading journals and conferences, and has advised over 22 graduate theses and dissertations in the areas of data mining, knowledge discovery, and computational learning in high-dimensional datasets. NIH, NSF, AFRL, AFOSR, NASA, and LA-BOR have supported his research. He frequently serves as a panelist for the NSF and NIH (over 17 panels) and has presented over 25 keynotes, invited talks, and workshops at international conferences and educational institutions. He has also served as the overall program chair for three international conferences and as a chair for multiple conference tracks in the areas of data mining applications and information intelligence. He is a senior member of the IEEE and the ACM. His research interests include information discovery in heterogeneous and distributed datasets, semisupervised learning, content-based feature extraction and modeling, and pattern tracking.

Pradeep Chowriappa is a research assistant professor in the College of Engineering and Science at Louisiana Tech University. His research focuses on the application of data mining algorithms and frameworks to biological and clinical data. Before obtaining his PhD in computer analysis and modeling from Louisiana Tech University in 2008, he pursued a yearlong internship at the Indian Space Research Organization (ISRO), Bangalore, India. He received his master’s in computer applications from the University of Madras, Chennai, India, in 2003 and his bachelor’s in science and engineering from Loyola Academy, Secunderabad, India, in 2000. His research interests include design and analysis of algorithms for knowledge discovery and modeling in high-dimensional data domains in computational biology, distributed data mining, and domain integration.