2nd Edition

Statistics and Data Analysis for Microarrays Using R and Bioconductor

By Sorin Drăghici
Copyright 2012
    1036 Pages, 344 B/W Illustrations
    by Chapman & Hall

    Richly illustrated in color, Statistics and Data Analysis for Microarrays Using R and Bioconductor, Second Edition provides a clear and rigorous description of powerful analysis techniques and algorithms for mining and interpreting biological information. Omitting tedious details, heavy formalisms, and cryptic notations, the text takes a hands-on, example-based approach that teaches students the basics of R and microarray technology as well as how to choose and apply the proper data analysis tool to specific problems.

    New to the Second Edition
    Completely updated and double the size of its predecessor, this timely second edition replaces the commercial software with the open source R and Bioconductor environments. Fourteen new chapters cover such topics as the basic mechanisms of the cell, reliability and reproducibility issues in DNA microarrays, basic statistics and linear models in R, experiment design, multiple comparisons, quality control, data pre-processing and normalization, Gene Ontology analysis, pathway analysis, and machine learning techniques. Methods are illustrated with toy examples and real data, and the R code for all routines is available on an accompanying downloadable resource.

    With all the necessary prerequisites included, this best-selling book guides students from very basic notions to advanced analysis techniques in R and Bioconductor. The first half of the text presents an overview of microarrays and the statistical elements that form the building blocks of any data analysis. The second half introduces the techniques most commonly used in the analysis of microarray data.

    Introduction
    Bioinformatics — An Emerging Discipline

    The Cell and Its Basic Mechanisms
    The Cell
    The Building Blocks of Genomic Information
    Expression of Genetic Information
    The Need for High-Throughput Methods

    Microarrays
    Microarrays — Tools for Gene Expression Analysis
    Fabrication of Microarrays
    Applications of Microarrays
    Challenges in Using Microarrays in Gene Expression Studies
    Sources of Variability

    Reliability and Reproducibility Issues in DNA Microarray Measurements
    Introduction
    What Is Expected from Microarrays?
    Basic Considerations of Microarray Measurements
    Sensitivity
    Accuracy
    Reproducibility
    Cross Platform Consistency
    Sources of Inaccuracy and Inconsistencies in Microarray Measurements
    The MicroArray Quality Control (MAQC) Project

    Image Processing
    Introduction
    Basic Elements of Digital Imaging
    Microarray Image Processing
    Image Processing of cDNA Microarrays
    Image Processing of Affymetrix Arrays

    Introduction to R
    Introduction to R
    The Basic Concepts
    Data Structures and Functions
    Other Capabilities
    The R Environment
    Installing Bioconductor
    Graphics
    Control Structures in R
    Programming in R vs C/C++/Java

    Bioconductor: Principles and Illustrations
    Overview
    The Portal
    Some Explorations and Analyses

    Elements of Statistics
    Introduction
    Some Basic Concepts
    Elementary Statistics
    Degrees of Freedom
    Probabilities
    Bayes’ Theorem
    Testing for (or Predicting) a Disease

    Probability Distributions
    Probability Distributions
    Central Limit Theorem
    Are Replicates Useful?

    Basic Statistics in R
    Introduction
    Descriptive Statistics in R
    Probabilities and Distributions in R
    Central Limit Theorem

    Statistical Hypothesis Testing
    Introduction
    The Framework
    Hypothesis Testing and Significance
    "I Do Not Believe God Does Not Exist"
    An Algorithm for Hypothesis Testing
    Errors in Hypothesis Testing

    Classical Approaches to Data Analysis
    Introduction
    Tests Involving a Single Sample
    Tests Involving Two Samples

    Analysis of Variance (ANOVA)
    Introduction
    One-Way ANOVA
    Two-Way ANOVA
    Quality Control

    Linear Models in R
    Introduction and Model Formulation
    Fitting Linear Models in R
    Extracting Information from a Fitted Model: Testing Hypotheses and Making Predictions
    Some Limitations of the Linear Models
    Dealing with Multiple Predictors and Interactions in the Linear Models, and Interpreting Model Coefficients

    Experiment Design
    The Concept of Experiment Design
    Comparing Varieties
    Improving the Production Process
    Principles of Experimental Design
    Guidelines for Experimental Design
    A Short Synthesis of Statistical Experiment Designs
    Some Microarray Specific Experiment Designs

    Multiple Comparisons
    Introduction
    The Problem of Multiple Comparisons
    A More Precise Argument
    Corrections for Multiple Comparisons
    Corrections for Multiple Comparisons in R

    Analysis and Visualization Tools
    Introduction
    Box Plots
    Gene Pies
    Scatter Plots
    Volcano Plots
    Histograms
    Time Series
    Time Series Plots in R
    Principal Component Analysis (PCA)
    Independent Component Analysis (ICA)

    Cluster Analysis
    Introduction
    Distance Metric
    Clustering Algorithms
    Partitioning around Medoids (PAM)
    Biclustering
    Clustering in R

    Quality Control
    Introduction
    Quality Control for Affymetrix Data
    Quality Control of Illumina Data

    Data Pre-Processing and Normalization
    Introduction
    General Pre-Processing Techniques
    Normalization Issues Specific to cDNA Data
    Normalization Issues Specific to Affymetrix Data
    Other Approaches to the Normalization of Affymetrix Data
    Useful Pre-Processing and Normalization Sequences
    Normalization Procedures in R
    Batch Pre-Processing
    Normalization Functions and Procedures for Illumina Data

    Methods for Selecting Differentially Regulated Genes
    Introduction
    Criteria
    Fold Change
    Unusual Ratio
    Hypothesis Testing, Corrections for Multiple Comparisons, and Resampling
    ANOVA
    Noise Sampling
    Model-Based Maximum Likelihood Estimation Methods
    Affymetrix Comparison Calls
    Significance Analysis of Microarrays (SAM)
    A Moderated t-Statistic
    Other Methods
    Reproducibility
    Selecting Differentially Expressed (DE) Genes in R

    The Gene Ontology (GO)
    Introduction
    The Need for an Ontology
    What Is the Gene Ontology (GO)?
    What Does GO Contain?
    Access to GO
    Other Related Resources

    Functional Analysis and Biological Interpretation of Microarray Data
    Over-Representation Analysis (ORA)
    Onto-Express
    Functional Class Scoring
    The Gene Set Enrichment Analysis (GSEA)

    Uses, Misuses, and Abuses in GO Profiling
    Introduction
    "Known Unknowns"
    Which Way Is Up?
    Negative Annotations
    Common Mistakes in Functional Profiling
    Using a Custom Level of Abstraction through the GO Hierarchy
    Correlation between GO Terms
    GO Slims and Subsets

    A Comparison of Several Tools for Ontological Analysis
    Introduction
    Existing Tools for Ontological Analysis
    Comparison of Existing Functional Profiling Tools
    Drawbacks and Limitations of the Current Approach

    Focused Microarrays — Comparison and Selection
    Introduction
    Criteria for Array Selection
    Onto-Compare
    Some Comparisons

    ID Mapping Issues
    Introduction
    Name Space Issues in Annotation Databases
    A Comparison of Some ID Mapping Tools

    Pathway Analysis
    Terms and Problem Definition
    Over-Representation and Functional Class Scoring Approaches in Pathway Analysis
    An Approach for the Analysis of Metabolic Pathways
    An Impact Analysis of Signaling Pathways
    Variations on the Impact Analysis Theme
    Pathway Guide
    Kinetic Models vs. Impact Analysis
    Conclusions
    Data Sets and Software Availability

    Machine Learning Techniques
    Introduction
    Main Concepts and Definitions
    Supervised Learning
    Practicalities Using R

    The Road Ahead
    What Next?

    References

    A Summary appears at the end of each chapter.

    Biography

    Sorin Drăghici is the Robert J. Sokol MD Endowed Chair in Systems Biology in the Department of Obstetrics and Gynecology, professor in the Department of Clinical and Translational Science and Department of Computer Science, and head of the Intelligent Systems and Bioinformatics Laboratory at Wayne State University. He is also the chief of the Bioinformatics and Data Analysis Section in the Perinatology Research Branch of the National Institute for Child Health and Development. A senior member of IEEE, Dr. Drăghici is an editor of IEEE/ACM Transactions on Computational Biology and Bioinformatics, Journal of Biomedicine and Biotechnology, and International Journal of Functional Informatics and Personalized Medicine. He earned a Ph.D. in computer science from the University of St. Andrews.

    Praise for the First Edition
    The book by Draghici is an excellent choice to be used as a textbook for a graduate-level bioinformatics course. This well-written book with two accompanying CD-ROMs will create much-needed enthusiasm among statisticians.
    Journal of Statistical Computation and Simulation, Vol. 74

    I really like Draghici's book. As the author explains in the Preface, the book is intended to serve both the statistician who knows very little about DNA microarrays and the biologist who has no expertise in data analysis. The author lays out a study plan for the statistician that excludes 5 of the 17 chapters (4-8). These chapters present the basics of statistical distributions, estimation, hypothesis testing, ANOVA, and experimental design. What that leaves for the statistician is the three-chapter primer on microarrays and image processing, plus all of the data analysis tools specific to the microarray situation. … it includes two CDs with trial versions of several specialised software packages. Anyone who uses microarray data should certainly own a copy.
    Technometrics, Vol. 47, No. 1, February 2005