1st Edition
Statistical Foundations of Data Science
Statistical Foundations of Data Science gives a thorough introduction to commonly used statistical models and contemporary statistical machine learning techniques and algorithms, along with their mathematical insights and statistical theories. It aims to serve as a graduate-level textbook and a research monograph on high-dimensional statistics, sparsity and covariance learning, machine learning, and statistical inference. It includes ample exercises that involve both theoretical studies and empirical applications.
The book begins with an introduction to the stylized features of big data and their impacts on statistical analysis. It then introduces multiple linear regression and expands the techniques of model building via nonparametric regression and kernel tricks. It provides a comprehensive account of sparsity explorations and model selection for multiple regression, generalized linear models, quantile regression, robust regression, and hazards regression, among others. High-dimensional inference is also thoroughly addressed, and so is feature screening. The book further gives a comprehensive account of high-dimensional covariance estimation, learning latent factors and hidden structures, and their applications to statistical estimation, inference, prediction, and machine learning problems. It also thoroughly introduces statistical machine learning theory and methods for classification, clustering, and prediction, including CART, random forests, boosting, support vector machines, clustering algorithms, sparse PCA, and deep learning.
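To give readers a concrete taste of the book's central theme, penalized least squares with the Lasso (Chapters 3 and 4), here is a minimal sketch in Python. It uses scikit-learn's Lasso on simulated sparse data; the library, data, and parameter choices are illustrative assumptions, not the book's own code.

```python
# Minimal Lasso sketch: sparse variable selection when p > n (illustrative only).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, s = 100, 200, 5                 # n samples, p features, s true signals
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:s] = 2.0                        # sparse truth: only 5 nonzero coefficients
y = X @ beta + 0.5 * rng.standard_normal(n)

# scikit-learn minimizes ||y - Xb||^2 / (2n) + alpha * ||b||_1,
# where alpha plays the role of the regularization parameter lambda.
fit = Lasso(alpha=0.1).fit(X, y)
print("selected variables:", np.flatnonzero(fit.coef_))
```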
1. Introduction
Rise of Big Data and Dimensionality
Biological Sciences
Health Sciences
Computer and Information Sciences
Economics and Finance
Business and Program Evaluation
Earth Sciences and Astronomy
Impact of Big Data
Impact of Dimensionality
Computation
Noise Accumulation
Spurious Correlation
Statistical theory
Aim of High-dimensional Statistical Learning
What big data can do
Scope of the book
2. Multiple and Nonparametric Regression
Introduction
Multiple Linear Regression
The Gauss-Markov Theorem
Statistical Tests
Weighted Least-Squares
Box-Cox Transformation
Model Building and Basis Expansions
Polynomial Regression
Spline Regression
Multiple Covariates
Ridge Regression
Bias-Variance Tradeoff
Penalized Least Squares
Bayesian Interpretation
Ridge Regression Solution Path
Kernel Ridge Regression
Regression in Reproducing Kernel Hilbert Space
Leave-one-out and Generalized Cross-validation
Exercises
3. Introduction to Penalized Least-Squares
Classical Variable Selection Criteria
Subset selection
Relation with penalized regression
Selection of regularization parameters
Folded-concave Penalized Least Squares
Orthonormal designs
Penalty functions
Thresholding by SCAD and MCP
Risk properties
Characterization of folded-concave PLS
Lasso and L1 Regularization
Nonnegative garrote
Lasso
Adaptive Lasso
Elastic Net
Dantzig selector
SLOPE and Sorted Penalties
Concentration inequalities and uniform convergence
A brief history of model selection
Bayesian Variable Selection
Bayesian view of the PLS
A Bayesian framework for selection
Numerical Algorithms
Quadratic programs
Least angle regression
Local quadratic approximations
Local linear algorithm
Penalized linear unbiased selection
Cyclic coordinate descent algorithms
Iterative shrinkage-thresholding algorithms
Projected proximal gradient method
ADMM
Iterative Local Adaptive Majorization and Minimization
Other Methods and Timeline
Regularization parameters for PLS
Degrees of freedom
Extension of information criteria
Application to PLS estimators
Residual variance and refitted cross-validation
Residual variance of Lasso
Refitted cross-validation
Extensions to Nonparametric Modeling
Structured nonparametric models
Group penalty
Applications
Bibliographical notes
Exercises
4. Penalized Least Squares: Properties
Performance Benchmarks
Performance measures
Impact of model uncertainty
Bayes lower bounds for orthogonal design
Minimax lower bounds for general design
Performance goals, sparsity and sub-Gaussian noise
Penalized L0 Selection
Lasso and Dantzig Selector
Selection consistency
Prediction and coefficient estimation errors
Model size and least squares after selection
Properties of the Dantzig selector
Regularity conditions on the design matrix
Properties of Concave PLS
Properties of penalty functions
Local and oracle solutions
Properties of local solutions
Global and approximate global solutions
Smaller and Sorted Penalties
Sorted concave penalties and their local approximation
Approximate PLS with smaller and sorted penalties
Properties of LLA and LCA
Bibliographical notes
Exercises
5. Generalized Linear Models and Penalized Likelihood
Generalized Linear Models
Exponential family
Elements of generalized linear models
Maximum likelihood
Computing MLE: Iteratively reweighted least squares
Deviance and Analysis of Deviance
Residuals
Examples
Bernoulli and binomial models
Models for count responses
Models for nonnegative continuous responses
Normal error models
Sparsest solution in high confidence set
A general setup
Examples
Properties
Variable Selection via Penalized Likelihood
Algorithms
Local quadratic approximation
Local linear approximation
Coordinate descent
Iterative Local Adaptive Majorization and Minimization
Tuning parameter selection
An Application
Sampling Properties in Low Dimensions
Notation and regularity conditions
The oracle property
Sampling Properties with Diverging Dimensions
Asymptotic properties of GIC selectors
Properties under Ultrahigh Dimensions
The Lasso penalized estimator and its risk property
Strong oracle property
Numerical studies
Risk properties
Bibliographical notes
Exercises
6. Penalized M-estimators
Penalized quantile regression
Quantile regression
Variable selection in quantile regression
A fast algorithm for penalized quantile regression
Penalized composite quantile regression
Variable selection in robust regression
Robust regression
Variable selection in Huber regression
Rank regression and its variable selection
Rank regression
Penalized weighted rank regression
Variable Selection for Survival Data
Partial likelihood
Variable selection via penalized partial likelihood and its properties
Theory of folded-concave penalized M-estimator
Conditions on penalty and restricted strong convexity
Statistical accuracy of penalized M-estimator with folded-concave penalties
Computational accuracy
Bibliographical notes
Exercises
7. High Dimensional Inference
Inference in linear regression
Debiasing regularized regression estimators
Choices of weights
Inference for the noise level
Inference in generalized linear models
Desparsified Lasso
Decorrelated score estimator
Test of linear hypotheses
Numerical comparison
An application
Asymptotic efficiency
Statistical efficiency and Fisher information
Linear regression with random design
Partial linear regression
Gaussian graphical models
Inference via penalized least squares
Sample size in regression and graphical models
General solutions
Local semi-LD decomposition
Data swap
Gradient approximation
Bibliographical notes
Exercises
8. Feature Screening
Correlation Screening
Sure screening property
Connection to multiple comparison
Iterative SIS
Generalized and Rank Correlation Screening
Feature Screening for Parametric Models
Generalized linear models
A unified strategy for parametric feature screening
Conditional sure independence screening
Nonparametric Screening
Additive models
Varying coefficient models
Heterogeneous nonparametric models
Model-free Feature Screening
Sure independent ranking screening procedure
Feature screening via distance correlation
Feature screening for high-dimensional categorical data
Screening and Selection
Feature screening via forward regression
Sparse maximum likelihood estimate
Feature screening via partial correlation
Refitted Cross-Validation
RCV algorithm
RCV in linear models
RCV in nonparametric regression
An Illustration
Bibliographical notes
Exercises
9. Covariance Regularization and Graphical Models
Basic facts about matrices
Sparse Covariance Matrix Estimation
Covariance regularization by thresholding and banding
Asymptotic properties
Nearest positive definite matrices
Robust covariance inputs
Sparse Precision Matrix and Graphical Models
Gaussian graphical models
Penalized likelihood and M-estimation
Penalized least-squares
CLIME and its adaptive version
Latent Gaussian Graphical Models
Technical Proofs
Proof of Theorem
Proof of Theorem
Proof of Theorem
Proof of Theorem
Bibliographical notes
Exercises
10. Covariance Learning and Factor Models
Principal Component Analysis
Introduction to PCA
Power Method
Factor Models and Structured Covariance Learning
Factor model and high-dimensional PCA
Extracting latent factors and POET
Methods for selecting number of factors
Covariance and Precision Learning with Known Factors
Factor model with observable factors
Robust initial estimation of covariance matrix
Augmented factor models and projected PCA
Asymptotic Properties
Properties for estimating loading matrix
Properties for estimating covariance matrices
Properties for estimating realized latent factors
Properties for estimating idiosyncratic components
Technical Proofs
Proof of Theorem
Proof of Theorem
Proof of Theorem
Proof of Theorem
Bibliographical Notes
Exercises
11. Applications of Factor Models and PCA
Factor-adjusted Regularized Model Selection
Importance of factor adjustments
FarmSelect
Application to forecasting bond risk premia
Application to a neuroblastoma data
Asymptotic theory for FarmSelect
Factor-adjusted robust multiple testing
False discovery rate control
Multiple testing under dependent measurements
Power of factor adjustments
FarmTest
Application to neuroblastoma data
Factor Augmented Regression Methods
Principal Component Regression
Augmented Principal Component Regression
Application to Forecast Bond Risk Premia
Applications to Statistical Machine Learning
Community detection
Topic model
Matrix completion
Item ranking
Gaussian mixture models
Bibliographical Notes
Exercises
12. Supervised Learning
Model-based Classifiers
Linear and quadratic discriminant analysis
Logistic regression
Kernel Density Classifiers and Naive Bayes
Nearest Neighbor Classifiers
Classification Trees and Ensemble Classifiers
Classification trees
Bagging
Random forests
Boosting
Support Vector Machines
The standard support vector machine
Generalizations of SVMs
Sparse Classifiers via Penalized Empirical Loss
The importance of sparsity under high-dimensionality
Sparse support vector machines
Sparse large margin classifiers
Sparse Discriminant Analysis
Nearest shrunken centroids classifier
Features annealed independent rule
Selection bias of sparse independence rules
Regularized optimal affine discriminant
Linear programming discriminant
Direct sparse discriminant analysis
Solution path equivalence between ROAD and DSDA
Feature Augmentation and Sparse Additive Classifiers
Feature augmentation
Penalized additive logistic regression
Semiparametric sparse discriminant analysis
Bibliographical notes
Exercises
13. Unsupervised Learning
Cluster Analysis
K-means clustering
Hierarchical clustering
Model-based clustering
Spectral clustering
Data-driven choices of the number of clusters
Variable Selection in Clustering
Sparse clustering
Sparse model-based clustering
Sparse mixture of experts model
An Introduction to High Dimensional PCA
Inconsistency of the regular PCA
Consistency under sparse eigenvector model
Sparse Principal Component Analysis
Sparse PCA
An iterative SVD thresholding approach
A penalized matrix decomposition approach
A semidefinite programming approach
A generalized power method
Bibliographical notes
Exercises
14. An Introduction to Deep Learning
Rise of Deep Learning
Feed-forward neural networks
Model setup
Back-propagation in computational graphs
Popular models
Convolutional neural networks
Recurrent neural networks
Vanilla RNNs
GRUs and LSTM
Multilayer RNNs
Modules
Deep unsupervised learning
Autoencoders
Generative adversarial networks
Sampling view of GANs
Minimum distance view of GANs
Training deep neural nets
Stochastic gradient descent
Mini-batch SGD
Momentum-based SGD
SGD with adaptive learning rates
Easing numerical instability
ReLU activation function
Skip connections
Batch normalization
Regularization techniques
Weight decay
Dropout
Data augmentation
Example: image classification
Bibliographical notes
Biography
The authors are international authorities and leaders on the presented topics. All are fellows of the Institute of Mathematical Statistics and the American Statistical Association.
Jianqing Fan is Frederick L. Moore Professor at Princeton University. He is a co-editor of the Journal of Business & Economic Statistics and was co-editor of The Annals of Statistics, Probability Theory and Related Fields, and the Journal of Econometrics. His honors include the 2000 COPSS Presidents' Award, the Guy Medal in Silver, the Noether Senior Scholar Award, fellowship in the AAAS, a Guggenheim Fellowship, and election as Academician of Academia Sinica.
Runze Li is Eberly Family Chair Professor at Pennsylvania State University and a fellow of the AAAS, and was co-editor of The Annals of Statistics.
Cun-Hui Zhang is a distinguished professor at Rutgers University and was co-editor of Statistical Science.
Hui Zou is a professor at the University of Minnesota and was an action editor of the Journal of Machine Learning Research.
"This book delivers a very comprehensive summary of the development of statistical foundations of data science. The authors no doubt are doing frontier research and have made several crucial contributions to the field. Therefore, the book offers a very good account of the most cutting-edge development. The book is suitable for both master and Ph.D. students in statistics, and also for researchers in both applied and theoretical data science. Researchers can take this book as an index of topics, as it summarizes in brief many significant research articles in an accessible way. Each chapter can be read independently by experienced researchers. It provides a nice cover of key concepts in those topics and researchers can benefit from reading the specific chapters and paragraphs to get a big picture rather than diving into many technical articles. There are altogether 14 chapters. It can serve as a textbook for two semesters. The book also provides handy codes and data sets, which is a great treasure for practitioners."
~Journal of Time Series Analysis
"This text—collaboratively authored by renowned statisticians Fan (Princeton Univ.), Li (Pennsylvania State Univ.), Zhang (Rutgers Univ.), and Zou (Univ. of Minnesota)—laboriously compiles and explains theoretical and methodological achievements in data science and big data analytics. Amid today's flood of coding-based cookbooks for data science, this book is a rare monograph addressing recent advances in mathematical and statistical principles and the methods behind regularized regression, analysis of high-dimensional data, and machine learning. The pinnacle achievement of the book is its comprehensive exploration of sparsity for model selection in statistical regression, considering models such as generalized linear regression, penalized least squares, quantile and robust regression, and survival regression. The authors discuss sparsity not only in terms of various types of penalties but also as an important feature of numerical optimization algorithms, now used in manifold applications including deep learning. The text extensively probes contemporary high-dimensional data modeling methods such as feature screening, covariate regularization, graphical modeling, and principal component and factor analysis. The authors conclude by introducing contemporary statistical machine learning, spanning a range of topics in supervised and unsupervised learning techniques and deep learning. This book is a must-have bookshelf item for those with a thirst for learning about the theoretical rigor of data science."
~Choice Review, S-T. Kim, North Carolina A&T State University, August 2021