FEATURES
Bridges the gap between the traditional statistical methods and computational tools for small genetic and epigenetic data analysis and the modern advanced statistical methods for big data
Provides tools for high-dimensional data reduction
Discusses search algorithms for model and variable selection, including randomization algorithms, proximal methods, and matrix subset selection
Provides real-world examples and case studies
Will have an accompanying website with R code
Provides a natural extension and companion volume to Big Data in Omics and Imaging: Association Analysis, but can be read independently
Introduces causal inference theory to genomic, epigenomic, and imaging data analysis
Develops novel statistics for genome-wide causation studies and epigenome-wide causation studies
Bridges the gap between traditional association analysis and modern causation analysis
Uses combinatorial optimization methods and various causal models as a general framework for inferring multilevel omic and image causal networks
Presents statistical methods and computational algorithms for searching causal paths from genetic variant to disease
Develops causal machine learning methods integrating causal inference and machine learning
Develops statistics for testing for significant differences in directed edges, paths, and graphs, and for assessing causal relationships between two networks
The book is designed for graduate students and researchers in genomics, bioinformatics, and data science. It represents the paradigm shift in genetic studies of complex diseases: from shallow to deep genomic analysis, from low-dimensional to high-dimensional and from multivariate to functional data analysis with next-generation sequencing (NGS) data, and from homogeneous populations to heterogeneous population and pedigree data analysis. Topics covered include advanced matrix theory, convex optimization algorithms, generalized low rank models, functional data analysis techniques, deep learning principles, and machine learning methods for modern association, interaction, pathway, and network analysis of rare and common variants, biomarker identification, and disease risk and drug response prediction.
Mathematical Foundation
Sparsity-Inducing Norms, Dual Norms and Fenchel Conjugate
Subdifferential
Definition of Subgradient
Subgradients of Differentiable Functions
Calculus of Subgradients
Proximal Methods
Introduction
Basics of Proximal Methods
Properties of the Proximal Operator
Proximal Algorithms
Computing the Proximal Operator
Matrix Calculus
Derivative of a Function with Respect to a Vector
Derivative of a Function with Respect to a Matrix
Derivative of a Matrix with Respect to a Scalar
Derivative of a Matrix with Respect to a Matrix or a Vector
Derivative of a Vector Function of a Vector
Chain Rules
Widely Used Formulae
Functional Principal Component Analysis (FPCA)
Principal Component Analysis (PCA)
Basic Mathematical Tools for Functional Principal Component Analysis
Unsmoothed Functional Principal Component Analysis
Smoothed Principal Component Analysis
Computations for the Principal Component Function and the Principal Component Score
Canonical Correlation Analysis
Linkage Disequilibrium
Concepts of Linkage Disequilibrium
Measures of Two-locus Linkage Disequilibrium
Linkage Disequilibrium Coefficient D
Normalized Measure of Linkage Disequilibrium
Correlation Coefficient r
Composite Measure of Linkage Disequilibrium
The Relationship Between the Measure of LD and Physical Distance
Haplotype Reconstruction
Clark’s Algorithm
EM Algorithm
Bayesian and Coalescence-based Methods
Multi-locus Measures of Linkage Disequilibrium
Mutual Information Measure of LD
Multi-Information and Multi-locus Measure of LD
Joint Mutual Information and a Measure of LD between a Marker and a Haplotype Block or Between Two Haplotype Blocks
Interaction Information
Conditional Interaction Information
Normalized Multi-Information
Distribution of Estimated Mutual Information, Multi-information and Interaction Information
Canonical Correlation Analysis Measure for LD between Two Genomic Regions
Association Measure between Two Genomic Regions Based on CCA
Relationship between Canonical Correlation and Joint Information
Software Package
Association Studies for Qualitative Traits
Population-based Association Analysis for Common Variants
Introduction
The Hardy-Weinberg Equilibrium
Genetic Models
Odds Ratio
Single Marker Association Analysis
Multi-marker Association Analysis
Population-based Multivariate Association Analysis for Next-generation Sequencing
Multivariate Group Tests
Score Tests and Logistic Regression
Application of Score Tests for Association of Rare Variants
Variance-component Score Statistics and Logistic Mixed Effects Models
Population-based Functional Association Analysis for Next-generation Sequencing
Introduction
Functional Principal Component Analysis for Association Test
Smoothed Functional Principal Component Analysis for Association Test
Software Package
Association Studies for Quantitative Traits
Fixed Effect Model for a Single Trait
Introduction
Genetic Effects
Linear Regression for a Quantitative Trait
Multiple Linear Regression for a Quantitative Trait
Gene-based Quantitative Trait Analysis
Functional Linear Model for a Quantitative Trait
Canonical Correlation Analysis for Gene-based Quantitative Trait Analysis
Kernel Approach to Gene-based Quantitative Trait Analysis
Kernel and RKHS
Covariance Operator and Dependence Measure
Simulations and Real Data Analysis
Power Evaluation
Application to Real Data Examples
Software Package
Multiple Phenotype Association Studies
Pleiotropic Additive and Dominance Effects
Multivariate Marginal Regression
Models
Estimation of Genetic Effects
Test Statistics
Linear Models for Multiple Phenotypes and Multiple Markers
Multivariate Multiple Linear Regression Models
Multivariate Functional Linear Models for Gene-based Genetic Analysis of Multiple Phenotypes
Canonical Correlation Analysis for Gene-based Genetic Pleiotropic Analysis
Multivariate Canonical Correlation Analysis (CCA)
Kernel CCA
Functional CCA
Quadratically Regularized Functional CCA
Dependence Measure and Association Tests of Multiple Traits
Principal Component for Phenotype Dimension Reduction
Principal Component Analysis
Kernel Principal Component Analysis
Quadratically Regularized PCA or Kernel PCA
Other Statistics for Pleiotropic Genetics Analysis
Sum of Squared Score Test
Unified Score-based Association Test (USAT)
Combining Marginal Tests
FPCA-based Kernel Measure Test of Independence
Connection between Statistics
Simulations and Real Data Analysis
Type I Error Rate and Power Evaluation
Application to Real Data Example
Software Package
Family-based Association Analysis
Genetic Similarity and Kinship Coefficients
Kinship Coefficients
Identity Coefficients
Relation between Identity Coefficients and Kinship Coefficient
Estimation of Genetic Relations from the Data
Genetic Covariance between Relatives
Assumptions and Genetic Models
Analysis for Genetic Covariance between Relatives
Mixed Linear Model for a Single Trait
Genetic Random Effect
Mixed Linear Model for Quantitative Trait Association Analysis
Estimating Variance Components
Hypothesis Test in Mixed Linear Models
Mixed Linear Models for Quantitative Trait Analysis with Sequencing Data
Mixed Functional Linear Models for Sequence-based Quantitative Trait Analysis
Mixed Functional Linear Models (Type 1)
Mixed Functional Linear Models (Type 2: Functional Variance Component Models)
Multivariate Mixed Linear Model for Multiple Traits
Multivariate Mixed Linear Model
Maximum Likelihood Estimate of Variance Components
REML Estimate of Variance Components
Heritability
Heritability Estimation for a Single Trait
Heritability Estimation for Multiple Traits
Family-based Association Analysis for Qualitative Trait
The Generalized T Test with Families and Additional Population Structures
Collapsing Method
CMC with Families
The Functional Principal Component Analysis and Smooth Functional Principal Component Analysis with Families
Software Package
Interaction Analysis
Measures of Gene-gene and Gene-environment Interaction for Qualitative Trait
Binary Measure of Gene-gene and Gene-environment Interaction
Disequilibrium Measure of Gene-gene and Gene-environment Interaction
Information Measure of Gene-gene and Gene-environment Interaction
Measure of Interaction between Gene and Continuous Environment
Statistics for Testing Gene-gene and Gene-Environment Interaction for Qualitative Trait with Common Variants
Relative Risk and Odds-ratio-based Statistics for Testing Interaction between Gene and Discrete Environment
Disequilibrium-based Statistics for Testing Gene-gene Interaction
Information-based Statistics for Testing Gene-Gene Interaction
Haplotype-Odds Ratio and Tests for Gene-Gene Interaction
Multiplicative Measure-based Statistics for Testing Interaction between Gene and Continuous Environment
Information Measure-based Statistics for Testing Interaction between Gene and Continuous Environment
Real Example
Statistics for Testing Gene-gene and Gene-Environment Interaction for Qualitative Trait with Next-generation Sequencing Data
Multiple Logistic Regression Model for Gene-Gene Interaction Analysis
Functional Logistic Regression Model for Gene-Gene Interaction Analysis
Statistics for Testing Interaction between Two Genomic Regions
Statistics for Testing Gene-gene and Gene-Environment Interaction for Quantitative Traits
Genetic Models for Epistasis Effects of Quantitative Traits
Regression Model for Interaction Analysis with Quantitative Traits
Functional Regression Model for Interaction Analysis with a Quantitative Trait
Functional Regression Model for Interaction Analysis with Multiple Quantitative Traits
Multivariate and Functional Canonical Correlation as a Unified Framework for Testing Gene-Gene and Gene-Environment Interaction for both Qualitative and Quantitative Traits
Data Structure of CCA for Interaction Analysis
CCA and Functional CCA
Kernel CCA
Software Package
Machine Learning, Low Rank Models and Their Application to Disease Risk Prediction and Precision Medicine
Logistic Regression
Two Class Logistic Regression
Multiclass Logistic Regression
Parameter Estimation
Test Statistics
Network Penalized Two-class Logistic Regression
Network Penalized Multiclass Logistic Regression
Fisher’s Linear Discriminant Analysis
Fisher’s Linear Discriminant Analysis for Two Classes
Multi-class Fisher’s Linear Discriminant Analysis
Connections between Linear Discriminant Analysis, Optimal Scoring and Canonical Correlation Analysis (CCA)
Support Vector Machine
Introduction
Linear Support Vector Machines
Nonlinear SVM
Penalized SVMs
Low Rank Approximation
Quadratically Regularized PCA
Generalized Regularization
Generalized Canonical Correlation Analysis (CCA)
Quadratically Regularized Canonical Correlation Analysis
Sparse Canonical Correlation Analysis
Sparse Canonical Correlation Analysis via a Penalized Matrix Decomposition
Inverse Regression (IR) and Sufficient Dimension Reduction
Sufficient Dimension Reduction (SDR) and Sliced Inverse Regression (SIR)
Sparse SDR
Software Package
Genotype-Phenotype Network Analysis
Undirected Graphs for Genotype Network
Gaussian Graphical Model
Alternating Direction Method of Multipliers for Estimation of Gaussian Graphical Model
Coordinate Descent Algorithm and Graphical Lasso
Multiple Graphical Models
Directed Graphs and Structural Equation Models for Networks
Directed Acyclic Graphs
Linear Structural Equation Models
Estimation Methods
Sparse Linear Structural Equations
Penalized Maximum Likelihood Estimation
Penalized Two Stage Least Square Estimation
Penalized Three Stage Least Square Estimation
Functional Structural Equation Models for Genotype-Phenotype Networks
Functional Structural Equation Models
Group Lasso and ADMM for Parameter Estimation in the Functional Structural Equation Models
Causal Calculus
Effect Decomposition and Estimation
Graphical Tools for Causal Inference in Linear SEMs
Identification and Single-door Criterion
Instrument Variables
Total Effects and Backdoor Criterion
Counterfactuals and Linear SEMs
Simulations and Real Data Analysis
Simulations for Model Evaluation
Application to Real Data Examples
Causal Analysis and Network Biology
Bayesian Networks as a General Framework for Causal Inference
Parameter Estimation and Bayesian Dirichlet Equivalent Uniform Score for Discrete Bayesian Networks
Structural Equations and Score Metrics for Continuous Causal Networks
Multivariate SEMs for Generating Node Score Metrics
Mixed SEMs for Pedigree-based Causal Inference
Bayesian Networks with Discrete and Continuous Variables
Two-class Network Penalized Logistic Regression for Learning Hybrid Bayesian Networks
Multiple Network Penalized Functional Logistic Regression Models for NGS Data
Multi-class Network Penalized Logistic Regression for Learning Hybrid Bayesian Networks
Other Statistical Models for Quantifying Node Score Function
Integer Programming for Causal Structure Learning
Introduction
Integer Linear Programming Formulation of DAG Learning
Cutting Plane for Integer Linear Programming
Branch and Cut Algorithm for Integer Linear Programming
Sink Finding Primal Heuristic Algorithm
Simulations and Real Data Analysis
Simulations
Real Data Analysis
Smoothing Spline Regression for a Single Variable
Smoothing Spline Regression for Multiple Variables
Wearable Computing and Genetic Analysis of Function-valued Traits
Classification of Wearable Biosensor Data
Introduction
Functional Data Analysis for Classification of Time Course Wearable Biosensor Data
Differential Equations for Extracting Features of the Dynamic Process and for Classification of Time Course Data
Deep Learning for Physiological Time Series Data Analysis
Association Studies of Function-Valued Traits
Introduction
Functional Linear Models with both Functional Response and Predictors for Association Analysis of Function-valued Traits
Test Statistics
Null Distribution of Test Statistics
Power
Real Data Analysis
Association Analysis of Multiple Function-valued Traits
Gene-gene Interaction Analysis of Function-Valued Traits
Introduction
Functional Regression Models
Estimation of Interaction Effect Function
Test Statistics
Simulations
Real Data Analysis
Networks
Multilayer Feedforward Pass
Backpropagation Pass
Convolutional Layer
RNA-seq Data Analysis
Normalization Methods on RNA-seq Data Analysis
Gene Expression
RNA Sequencing Expression Profiling
Methods for Normalization
Differential Expression Analysis for RNA-Seq Data
Distribution-based Approach to Differential Expression Analysis
Functional Expansion Approach to Differential Expression Analysis of RNA-Seq Data
Differential Analysis of Allele Specific Expressions with RNA-Seq Data
eQTL and eQTL Epistasis Analysis with RNA-Seq Data
Matrix Factorization
Quadratically Regularized Matrix Factorization and Canonical Correlation Analysis
QRFCCA for eQTL and eQTL Epistasis Analysis of RNA-Seq Data
Real Data Analysis
Gene Co-expression Network and Gene Regulatory Networks
Co-expression Network Construction with RNA-Seq Data by CCA and FCCA
Graphical Gaussian Models
Real Data Applications
Directed Graph and Gene Regulatory Networks
Hierarchical Bayesian Networks for Whole Genome Regulatory Networks
Linear Regulatory Networks
Nonlinear Regulatory Networks
Dynamic Bayesian Network and Longitudinal Expression Data Analysis
Single Cell RNA-Seq Data Analysis, Gene Expression Deconvolution and Genetic Screening
Cell Type Identification
Gene Expression Deconvolution and Cell Type-Specific Expression
Normalization
Variational Methods for the Expectation-Maximization (EM) Algorithm
Variational Methods for Bayesian Learning
Methylation Data Analysis
DNA Methylation Analysis
Epigenome-wide Association Studies (EWAS)
Single-Locus Test
Set-based Methods
Epigenome-wide Causal Studies
Introduction
Additive Functional Model for EWCS
Genome-wide DNA Methylation Quantitative Trait Locus (mQTL) Analysis
Causal Networks for Genetic-Methylation Analysis
Structural Equation Models with Scalar Endogenous Variables and Functional Exogenous Variables
Functional Structural Equation Models with Functional Endogenous Variables and Scalar Exogenous Variables (FSEMS)
Functional Structural Equation Models with both Functional Endogenous Variables and Exogenous Variables (FSEMF)
Imaging and Genomics
Introduction
Image Segmentation
Unsupervised Learning Methods for Image Segmentation
Supervised Deep Learning Methods for Image Segmentation
Two- or Three-Dimensional Functional Principal Component Analysis for Image Data Reduction
Formulation
Integral Equation and Eigenfunctions
Association Analysis of Imaging-Genomic Data
Multivariate Functional Regression Models for Imaging-Genomic Data Analysis
Multivariate Functional Regression Models for Longitudinal Imaging-Genetics Analysis
Quadratically Regularized Functional Canonical Correlation Analysis for Gene-Gene Interaction Detection in Imaging-Genetic Studies
Causal Analysis of Imaging-Genomic Data
Sparse SEMs for Joint Causal Analysis of Structural Imaging and Genomic Data
Sparse Functional Structural Equation Models for Phenotype and Genotype Networks
Conditional Gaussian Graphical Models (CGGMs) for Structural Imaging and Genomic Data Analysis
Time Series SEMs for Integrated Causal Analysis of fMRI and Genomic Data
Models
Reduced Form Equations
Single Equation and Generalized Least Square Estimator
Sparse SEMs and Alternating Direction Method of Multipliers
Causal Machine Learning
From Association Analysis to Integrated Causal Inference
Genome-wide Causal Studies
Mathematical Formulation of Causal Analysis
Basic Causal Assumptions
Linear Additive SEMs with non-Gaussian Noise
Information Geometry Approach
Causal Inference on Discrete Data
Multivariate Causal Inference and Causal Networks
Markov Condition, Markov Equivalence, Faithfulness and Minimality
Multilevel Causal Networks for Integrative Omics and Imaging Data Analysis
Causal Inference with Confounders
Causal Sufficiency
Instrumental Variables
Biography
Momiao Xiong is a professor of Biostatistics at the University of Texas Health Science Center in Houston where he has worked since 1997. He received his PhD in 1993 from the University of Georgia.