616 Pages 16 Color & 84 B/W Illustrations
    by Chapman & Hall

    Praise for the Second Edition:
    "The authors present an intuitive and easy-to-read book. … accompanied by many examples, proposed exercises, good references, and comprehensive appendices that initiate the reader unfamiliar with MATLAB."
    —Adolfo Alvarez Pinto, International Statistical Review

    "Practitioners of EDA who use MATLAB will want a copy of this book. … The authors have done a great service by bringing together so many EDA routines, but their main accomplishment in this dynamic text is providing the understanding and tools to do EDA."
    —David A. Huckaby, MAA Reviews

    Exploratory Data Analysis (EDA) is an important part of the data analysis process. The methods presented in this text are ones that should be in the toolkit of every data scientist. As computational sophistication has increased and data sets have grown in size and complexity, EDA has become an even more important process for visualizing and summarizing data before making assumptions to generate hypotheses and models.

    Exploratory Data Analysis with MATLAB, Third Edition presents EDA methods from a computational perspective and uses numerous examples and applications to show how the methods are used in practice. The authors use MATLAB code, pseudo-code, and algorithm descriptions to illustrate the concepts. The MATLAB code for examples, data sets, and the EDA Toolbox are available for download on the book’s website.

    New to the Third Edition

    • Random projections and estimating local intrinsic dimensionality
    • Deep learning autoencoders and stochastic neighbor embedding
    • Minimum spanning tree and additional cluster validity indices
    • Kernel density estimation
    • Plots for visualizing data distributions, such as beanplots and violin plots
    • A chapter on visualizing categorical data

    Part I

    Introduction to Exploratory Data Analysis

    What is Exploratory Data Analysis

    Overview of the Text

    A Few Words about Notation

    Data Sets Used in the Book

    Unstructured Text Documents

    Gene Expression Data

    Oronsay Data Set

    Software Inspection

    Transforming Data

    Power Transformations

    Standardization

    Sphering the Data

    Further Reading

    Exercises

    Part II

    EDA as Pattern Discovery

    Dimensionality Reduction — Linear Methods

    Introduction

    Principal Component Analysis — PCA

    PCA Using the Sample Covariance Matrix

    PCA Using the Sample Correlation Matrix

    How Many Dimensions Should We Keep?

    Singular Value Decomposition — SVD

    Nonnegative Matrix Factorization

    Factor Analysis

    Fisher’s Linear Discriminant

    Random Projections

    Intrinsic Dimensionality

    Nearest Neighbor Approach

    Correlation Dimension

    Maximum Likelihood Approach

    Estimation Using Packing Numbers

    Estimation of Local Dimension

    Summary and Further Reading

    Exercises


    Dimensionality Reduction — Nonlinear Methods

    Multidimensional Scaling — MDS

    Metric MDS

    Nonmetric MDS

    Manifold Learning

    Locally Linear Embedding

    Isometric Feature Mapping — ISOMAP

    Hessian Eigenmaps

    Artificial Neural Network Approaches

    Self-Organizing Maps

    Generative Topographic Maps

    Curvilinear Component Analysis

    Autoencoders

    Stochastic Neighbor Embedding

    Summary and Further Reading

    Exercises


    Data Tours

    Grand Tour

    Torus Winding Method

    Pseudo Grand Tour

    Interpolation Tours

    Projection Pursuit

    Projection Pursuit Indexes

    Posse Chi-Square Index

    Moment Index

    Independent Component Analysis

    Summary and Further Reading

    Exercises

    Finding Clusters

    Introduction

    Hierarchical Methods

    Optimization Methods — k-Means

    Spectral Clustering

    Document Clustering

    Nonnegative Matrix Factorization — Revisited

    Probabilistic Latent Semantic Analysis

    Minimal Spanning Trees and Clustering

    Definitions

    Minimum Spanning Tree Clustering

    Evaluating the Clusters

    Rand Index

    Cophenetic Correlation

    Upper Tail Rule

    Silhouette Plot

    Gap Statistic

    Cluster Validity Indices

    Summary and Further Reading

    Exercises

    Model-Based Clustering

    Overview of Model-Based Clustering

    Finite Mixtures

    Multivariate Finite Mixtures

    Component Models — Constraining the Covariances

    Expectation-Maximization Algorithm

    Hierarchical Agglomerative Model-Based Clustering

    Model-Based Clustering

    MBC for Density Estimation and Discriminant Analysis

    Introduction to Pattern Recognition

    Bayes Decision Theory

    Estimating Probability Densities with MBC

    Generating Random Variables from a Mixture Model

    Summary and Further Reading

    Exercises

    Smoothing Scatterplots

    Introduction

    Loess

    Robust Loess

    Residuals and Diagnostics with Loess

    Residual Plots

    Spread Smooth

    Loess Envelopes — Upper and Lower Smooths

    Smoothing Splines

    Regression with Splines

    Smoothing Splines

    Smoothing Splines for Uniformly Spaced Data

    Choosing the Smoothing Parameter

    Bivariate Distribution Smooths

    Pairs of Middle Smoothings

    Polar Smoothing

    Curve Fitting Toolbox

    Summary and Further Reading

    Exercises


    Part III

    Graphical Methods for EDA

    Visualizing Clusters

    Dendrogram

    Treemaps

    Rectangle Plots

    ReClus Plots

    Data Image

    Summary and Further Reading

    Exercises

    Distribution Shapes

    Histograms

    Univariate Histograms

    Bivariate Histograms

    Kernel Density

    Univariate Kernel Density Estimation

    Multivariate Kernel Density Estimation

    Boxplots

    The Basic Boxplot

    Variations of the Basic Boxplot

    Violin Plots

    Beeswarm Plot

    Bean Plot

    Quantile Plots

    Probability Plots

    Quantile-Quantile Plot

    Quantile Plot

    Bagplots

    Rangefinder Boxplot

    Summary and Further Reading

    Exercises

    Multivariate Visualization

    Glyph Plots

    Scatterplots

    2-D and 3-D Scatterplots

    Scatterplot Matrices

    Scatterplots with Hexagonal Binning

    Dynamic Graphics

    Identification of Data

    Linking

    Brushing

    Coplots

    Dot Charts

    Basic Dot Chart

    Multiway Dot Chart

    Plotting Points as Curves

    Parallel Coordinate Plots

    Andrews’ Curves

    Andrews’ Images

    More Plot Matrices

    Data Tours Revisited

    Grand Tour

    Permutation Tour

    Biplots

    Summary and Further Reading

    Exercises

    Visualizing Categorical Data

    Discrete Distributions

    Binomial Distribution

    Poisson Distribution

    Exploring Distribution Shapes

    Poissonness Plot

    Binomialness Plot

    Hanging Rootogram

    Contingency Tables

    Background

    Bar Plots

    Spine Plots

    Mosaic Plots

    Sieve Diagrams

    Log Odds Plot

    Summary and Further Reading

    Exercises


    Appendix A

    Proximity Measures

    Appendix B

    Software Resources for EDA

    Appendix C

    Appendix D

    MATLAB® Basics

    Biography

    Wendy L. Martinez is a mathematical statistician with the U.S. Bureau of Labor Statistics. She is a fellow of the American Statistical Association, a co-author of several popular Chapman & Hall/CRC books, and a MATLAB® user for more than 20 years. Her research interests include text data mining, probability density estimation, signal processing, scientific visualization, and statistical pattern recognition. She earned an M.S. in aerospace engineering from George Washington University and a Ph.D. in computational sciences and informatics from George Mason University.

    Angel R. Martinez is fully retired after a long career with the U.S. federal government and as an adjunct professor at Strayer University, where he taught undergraduate and graduate courses in statistics and mathematics. Before retiring from government service, he worked for the U.S. Navy as an operations research analyst and a computer scientist. He earned an M.S. in systems engineering from the Virginia Polytechnic Institute and State University and a Ph.D. in computational sciences and informatics from George Mason University.

    Since 1984, Jeffrey L. Solka has been working in statistical pattern recognition for the Department of the Navy. He has published over 120 journal, conference, and technical papers; has won numerous awards; and holds 4 patents. He earned an M.S. in mathematics from James Madison University, an M.S. in physics from Virginia Polytechnic Institute and State University, and a Ph.D. in computational sciences and informatics from George Mason University.

    “This book presents an extensive coverage in exploratory data analysis (EDA) using the software Matlab. Although this software is used throughout the book, readers can modify the algorithms for different statistical packages. … This book is intended for a wide audience including statisticians, computer scientists, and engineers. A wide range of topics along with Matlab codes are given. Each chapter ends with a good number of exercises which would be very helpful to complement the knowledge learned from the chapter. It is a great source for the students/researchers. It is suitable for a course in the targeted areas at the senior undergraduate or graduate courses. Although Matlab is used throughout the book, the algorithms can easily be converted in other platforms.”
    —Morteza Marzjarani in Technometrics, November 2019