1st Edition

Statistical Foundations of Data Science

    774 Pages 100 B/W Illustrations
    by Chapman & Hall

    Statistical Foundations of Data Science gives a thorough introduction to commonly used statistical models and contemporary statistical machine learning techniques and algorithms, along with their mathematical insights and statistical theories. It aims to serve as a graduate-level textbook and a research monograph on high-dimensional statistics, sparsity and covariance learning, machine learning, and statistical inference. It includes ample exercises involving both theoretical studies and empirical applications.

    The book begins with an introduction to the stylized features of big data and their impact on statistical analysis. It then introduces multiple linear regression and extends the techniques of model building via nonparametric regression and kernel tricks. It gives a comprehensive account of sparsity exploration and model selection for multiple regression, generalized linear models, quantile regression, robust regression, and hazards regression, among others. High-dimensional inference is thoroughly addressed, as is feature screening. The book also gives a comprehensive account of high-dimensional covariance estimation and the learning of latent factors and hidden structures, together with their applications to statistical estimation, inference, prediction, and machine learning problems. Finally, it provides a thorough treatment of statistical machine learning theory and methods for classification, clustering, and prediction, including CART, random forests, boosting, support vector machines, clustering algorithms, sparse PCA, and deep learning.
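    To give a concrete flavor of the penalized least-squares methods developed in Chapters 3 and 4, the sketch below fits a lasso by cyclic coordinate descent, one of the algorithms surveyed in Chapter 3, on simulated sparse data. This is a minimal illustration in plain NumPy rather than code from the book; the simulated design, the penalty level, and all function and variable names are our own assumptions.

        import numpy as np

        def soft_threshold(z, t):
            # Soft-thresholding operator: the univariate lasso solution.
            return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

        def lasso_cd(X, y, lam, n_iter=200):
            # Lasso via cyclic coordinate descent for the objective
            #   (1/(2n)) * ||y - X b||^2 + lam * ||b||_1.
            n, p = X.shape
            beta = np.zeros(p)
            resid = y - X @ beta
            col_norms = (X ** 2).sum(axis=0)
            for _ in range(n_iter):
                for j in range(p):
                    resid += X[:, j] * beta[j]      # drop j-th contribution
                    z_j = X[:, j] @ resid           # partial-residual correlation
                    beta[j] = soft_threshold(z_j, n * lam) / col_norms[j]
                    resid -= X[:, j] * beta[j]      # restore updated contribution
            return beta

        # Simulated sparse regression: only 3 of 50 coefficients are nonzero.
        rng = np.random.default_rng(0)
        n, p = 100, 50
        X = rng.standard_normal((n, p))
        beta_true = np.zeros(p)
        beta_true[:3] = [3.0, -2.0, 1.5]
        y = X @ beta_true + rng.standard_normal(n)

        beta_hat = lasso_cd(X, y, lam=0.25)
        print("selected features:", np.nonzero(np.abs(beta_hat) > 1e-8)[0])

    Note that the lasso shrinks the retained coefficients toward zero; the folded-concave penalties (SCAD, MCP) treated in Chapters 3 and 4 are designed to reduce exactly this bias.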

    1. Introduction

    Rise of Big Data and Dimensionality

    Biological Sciences

    Health Sciences

    Computer and Information Sciences

    Economics and Finance

    Business and Program Evaluation

    Earth Sciences and Astronomy

    Impact of Big Data

    Impact of Dimensionality

    Computation

    Noise Accumulation

    Spurious Correlation

    Statistical theory

    Aim of High-dimensional Statistical Learning

    What big data can do

    Scope of the book

    2. Multiple and Nonparametric Regression

    Introduction

    Multiple Linear Regression

    The Gauss-Markov Theorem

    Statistical Tests

    Weighted Least-Squares

    Box-Cox Transformation

    Model Building and Basis Expansions

    Polynomial Regression

    Spline Regression

    Multiple Covariates

    Ridge Regression

    Bias-Variance Tradeoff

    Penalized Least Squares

    Bayesian Interpretation

    Ridge Regression Solution Path

    Kernel Ridge Regression

    Regression in Reproducing Kernel Hilbert Space

    Leave-one-out and Generalized Cross-validation

    Exercises

    3. Introduction to Penalized Least-Squares

    Classical Variable Selection Criteria

    Subset selection

    Relation with penalized regression

    Selection of regularization parameters

    Folded-concave Penalized Least Squares

    Orthonormal designs

    Penalty functions

    Thresholding by SCAD and MCP

    Risk properties

    Characterization of folded-concave PLS

    Lasso and L1 Regularization

    Nonnegative garrote

    Lasso

    Adaptive Lasso

    Elastic Net

    Dantzig selector

    SLOPE and Sorted Penalties

    Concentration inequalities and uniform convergence

    A brief history of model selection

    Bayesian Variable Selection

    Bayesian view of the PLS

    A Bayesian framework for selection

    Numerical Algorithms

    Quadratic programs

    Least angle regression

    Local quadratic approximations

    Local linear algorithm

    Penalized linear unbiased selection

    Cyclic coordinate descent algorithms

    Iterative shrinkage-thresholding algorithms

    Projected proximal gradient method

    ADMM

    Iterative Local Adaptive Majorization and Minimization

    Other Methods and Timeline

    Regularization parameters for PLS

    Degrees of freedom

    Extension of information criteria

    Application to PLS estimators

    Residual variance and refitted cross-validation

    Residual variance of Lasso

    Refitted cross-validation

    Extensions to Nonparametric Modeling

    Structured nonparametric models

    Group penalty

    Applications

    Bibliographical notes

    Exercises

    4. Penalized Least Squares: Properties

    Performance Benchmarks

    Performance measures

    Impact of model uncertainty

    Bayes lower bounds for orthogonal design

    Minimax lower bounds for general design

    Performance goals, sparsity and sub-Gaussian noise

    Penalized L0 Selection

    Lasso and Dantzig Selector

    Selection consistency

    Prediction and coefficient estimation errors

    Model size and least squares after selection

    Properties of the Dantzig selector

    Regularity conditions on the design matrix

    Properties of Concave PLS

    Properties of penalty functions

    Local and oracle solutions

    Properties of local solutions

    Global and approximate global solutions

    Smaller and Sorted Penalties

    Sorted concave penalties and their local approximation

    Approximate PLS with smaller and sorted penalties

    Properties of LLA and LCA

    Bibliographical notes

    Exercises

    5. Generalized Linear Models and Penalized Likelihood

    Generalized Linear Models

    Exponential family

    Elements of generalized linear models

    Maximum likelihood

    Computing MLE: Iteratively reweighted least squares

    Deviance and Analysis of Deviance

    Residuals

    Examples

    Bernoulli and binomial models

    Models for count responses

    Models for nonnegative continuous responses

    Normal error models

    Sparsest solution in high confidence set

    A general setup

    Examples

    Properties

    Variable Selection via Penalized Likelihood

    Algorithms

    Local quadratic approximation

    Local linear approximation

    Coordinate descent

    Iterative Local Adaptive Majorization and Minimization

    Tuning parameter selection

    An Application

    Sampling Properties in Low Dimensions

    Notation and regularity conditions

    The oracle property

    Sampling Properties with Diverging Dimensions

    Asymptotic properties of GIC selectors

    Properties under Ultrahigh Dimensions

    The Lasso penalized estimator and its risk property

    Strong oracle property

    Numeric studies

    Risk properties

    Bibliographical notes

    Exercises

    6. Penalized M-estimators

    Penalized quantile regression

    Quantile regression

    Variable selection in quantile regression

    A fast algorithm for penalized quantile regression

    Penalized composite quantile regression

    Variable selection in robust regression

    Robust regression

    Variable selection in Huber regression

    Rank regression and its variable selection

    Rank regression

    Penalized weighted rank regression

    Variable Selection for Survival Data

    Partial likelihood

    Variable selection via penalized partial likelihood and its properties

    Theory of folded-concave penalized M-estimator

    Conditions on penalty and restricted strong convexity

    Statistical accuracy of penalized M-estimators with folded-concave penalties

    Computational accuracy

    Bibliographical notes

    Exercises

    7. High Dimensional Inference

    Inference in linear regression

    Debiasing regularized regression estimators

    Choices of weights

    Inference for the noise level

    Inference in generalized linear models

    Desparsified Lasso

    Decorrelated score estimator

    Test of linear hypotheses

    Numerical comparison

    An application

    Asymptotic efficiency

    Statistical efficiency and Fisher information

    Linear regression with random design

    Partial linear regression

    Gaussian graphical models

    Inference via penalized least squares

    Sample size in regression and graphical models

    General solutions

    Local semi-LD decomposition

    Data swap

    Gradient approximation

    Bibliographical notes

    Exercises

    8. Feature Screening

    Correlation Screening

    Sure screening property

    Connection to multiple comparison

    Iterative SIS

    Generalized and Rank Correlation Screening

    Feature Screening for Parametric Models

    Generalized linear models

    A unified strategy for parametric feature screening

    Conditional sure independence screening

    Nonparametric Screening

    Additive models

    Varying coefficient models

    Heterogeneous nonparametric models

    Model-free Feature Screening

    Sure independent ranking screening procedure

    Feature screening via distance correlation

    Feature screening for high-dimensional categorical data

    Screening and Selection

    Feature screening via forward regression

    Sparse maximum likelihood estimate

    Feature screening via partial correlation

    Refitted Cross-Validation

    RCV algorithm

    RCV in linear models

    RCV in nonparametric regression

    An Illustration

    Bibliographical notes

    Exercises

    9. Covariance Regularization and Graphical Models

    Basic facts about matrices

    Sparse Covariance Matrix Estimation

    Covariance regularization by thresholding and banding

    Asymptotic properties

    Nearest positive definite matrices

    Robust covariance inputs

    Sparse Precision Matrix and Graphical Models

    Gaussian graphical models

    Penalized likelihood and M-estimation

    Penalized least-squares

    CLIME and its adaptive version

    Latent Gaussian Graphical Models

    Technical Proofs

    Proof of Theorem

    Proof of Theorem

    Proof of Theorem

    Proof of Theorem

    Bibliographical notes

    Exercises

    10. Covariance Learning and Factor Models

    Principal Component Analysis

    Introduction to PCA

    Power Method

    Factor Models and Structured Covariance Learning

    Factor model and high-dimensional PCA

    Extracting latent factors and POET

    Methods for selecting number of factors

    Covariance and Precision Learning with Known Factors

    Factor model with observable factors

    Robust initial estimation of covariance matrix

    Augmented factor models and projected PCA

    Asymptotic Properties

    Properties for estimating loading matrix

    Properties for estimating covariance matrices

    Properties for estimating realized latent factors

    Properties for estimating idiosyncratic components

    Technical Proofs

    Proof of Theorem

    Proof of Theorem

    Proof of Theorem

    Proof of Theorem

    Bibliographical Notes

    Exercises

    11. Applications of Factor Models and PCA

    Factor-adjusted Regularized Model Selection

    Importance of factor adjustments

    FarmSelect

    Application to forecasting bond risk premia

    Application to a neuroblastoma data

    Asymptotic theory for FarmSelect

    Factor-adjusted robust multiple testing

    False discovery rate control

    Multiple testing under dependent measurements

    Power of factor adjustments

    FarmTest

    Application to neuroblastoma data

    Factor Augmented Regression Methods

    Principal Component Regression

    Augmented Principal Component Regression

    Application to Forecast Bond Risk Premia

    Applications to Statistical Machine Learning

    Community detection

    Topic model

    Matrix completion

    Item ranking

    Gaussian Mixture models

    Bibliographical Notes

    Exercises

    12. Supervised Learning

    Model-based Classifiers

    Linear and quadratic discriminant analysis

    Logistic regression

    Kernel Density Classifiers and Naive Bayes

    Nearest Neighbor Classifiers

    Classification Trees and Ensemble Classifiers

    Classification trees

    Bagging

    Random forests

    Boosting

    Support Vector Machines

    The standard support vector machine

    Generalizations of SVMs

    Sparse Classifiers via Penalized Empirical Loss

    The importance of sparsity under high-dimensionality

    Sparse support vector machines

    Sparse large margin classifiers

    Sparse Discriminant Analysis

    Nearest shrunken centroids classifier

    Features annealed independent rule

    Selection bias of sparse independence rules

    Regularized optimal affine discriminant

    Linear programming discriminant

    Direct sparse discriminant analysis

    Solution path equivalence between ROAD and DSDA

    Feature Augmentation and Sparse Additive Classifiers

    Feature augmentation

    Penalized additive logistic regression

    Semiparametric sparse discriminant analysis

    Bibliographical notes

    Exercises

    13. Unsupervised Learning

    Cluster Analysis

    K-means clustering

    Hierarchical clustering

    Model-based clustering

    Spectral clustering

    Data-driven choices of the number of clusters

    Variable Selection in Clustering

    Sparse clustering

    Sparse model-based clustering

    Sparse mixture of experts model

    An Introduction to High Dimensional PCA

    Inconsistency of the regular PCA

    Consistency under sparse eigenvector model

    Sparse Principal Component Analysis

    Sparse PCA

    An iterative SVD thresholding approach

    A penalized matrix decomposition approach

    A semidefinite programming approach

    A generalized power method

    Bibliographical notes

    Exercises

    14. An Introduction to Deep Learning

    Rise of Deep Learning

    Feed-forward neural networks

    Model setup

    Back-propagation in computational graphs

    Popular models

    Convolutional neural networks

    Recurrent neural networks

    Vanilla RNNs

    GRUs and LSTM

    Multilayer RNNs

    Modules

    Deep unsupervised learning

    Autoencoders

    Generative adversarial networks

    Sampling view of GANs

    Minimum distance view of GANs

    Training deep neural nets

    Stochastic gradient descent

    Mini-batch SGD

    Momentum-based SGD

    SGD with adaptive learning rates

    Easing numerical instability

    ReLU activation function

    Skip connections

    Batch normalization

    Regularization techniques

    Weight decay

    Dropout

    Data augmentation

    Example: image classification

    Bibliographical notes

    Biography

    The authors are international authorities and leaders on the presented topics. All are fellows of the Institute of Mathematical Statistics and the American Statistical Association.

    Jianqing Fan is Frederick L. Moore Professor at Princeton University. He is co-editor of the Journal of Business & Economic Statistics and was co-editor of The Annals of Statistics, Probability Theory and Related Fields, and the Journal of Econometrics. His honors include the 2000 COPSS Presidents' Award, fellowship in the AAAS, a Guggenheim Fellowship, the Guy Medal in Silver, the Noether Senior Scholar Award, and election as Academician of Academia Sinica.

    Runze Li is Eberly Family Chair Professor at Pennsylvania State University and an AAAS fellow. He was co-editor of The Annals of Statistics.

    Cun-Hui Zhang is Distinguished Professor at Rutgers University and was co-editor of Statistical Science.

    Hui Zou is Professor at the University of Minnesota and was an action editor of the Journal of Machine Learning Research.

    "This book delivers a very comprehensive summary of the development of statistical foundations of data science. The authors no doubt are doing frontier research and have made several crucial contributions to the field. Therefore, the book offers a very good account of the most cutting-edge development. The book is suitable for both master and Ph.D. students in statistics, and also for researchers in both applied and theoretical data science. Researchers can take this book as an index of topics, as it summarizes in brief many significant research articles in an accessible way. Each chapter can be read independently by experienced researchers. It provides a nice cover of key concepts in those topics and researchers can benefit from reading the specific chapters and paragraphs to get a big picture rather than diving into many technical articles. There are altogether 14 chapters. It can serve as a textbook for two semesters. The book also provides handy codes and data sets, which is a great treasure for practitioners."
    ~Journal of Time Series Analysis

    "This text—collaboratively authored by renowned statisticians Fan (Princeton Univ.), Li (Pennsylvania State Univ.), Zhang (Rutgers Univ.), and Zhou (Univ. of Minnesota)—laboriously compiles and explains theoretical and methodological achievements in data science and big data analytics. Amid today's flood of coding-based cookbooks for data science, this book is a rare monograph addressing recent advances in mathematical and statistical principles and the methods behind regularized regression, analysis of high-dimensional data, and machine learning. The pinnacle achievement of the book is its comprehensive exploration of sparsity for model selection in statistical regression, considering models such as generalized linear regression, penalized least squares, quantile and robust regression, and survival regression. The authors discuss sparsity not only in terms of various types of penalties but also as an important feature of numerical optimization algorithms, now used in manifold applications including deep learning. The text extensively probes contemporary high-dimensional data modeling methods such as feature screening, covariate regularization, graphical modeling, and principal component and factor analysis. The authors conclude by introducing contemporary statistical machine learning, spanning a range of topics in supervised and unsupervised learning techniques and deep learning. This book is a must-have bookshelf item for those with a thirst for learning about the theoretical rigor of data science."
    ~Choice Review, S-T. Kim, North Carolina A&T State University, August 2021