- Distinguishes between statistical data mining and machine-learning data mining techniques, leading to better predictive modeling and analysis of big data
- Illustrates the power of machine-learning data mining that starts where statistical data mining stops
- Addresses common problems with more powerful and reliable alternative data-mining solutions than those commonly accepted
- Explores uncommon problems for which there are no universally acceptable solutions and introduces creative and robust solutions
- Discusses everyday statistical concepts to show the hidden assumptions not every statistician/data analyst knows—underlining the importance of having good statistical practice

The second edition of a bestseller, **Statistical and Machine-Learning Data Mining: Techniques for Better Predictive Modeling and Analysis of Big Data** is still the only book, to date, to distinguish between statistical data mining and machine-learning data mining. The first edition, titled *Statistical Modeling and Analysis for Database Marketing: Effective Techniques for Mining Big Data,* contained 17 chapters of innovative and practical statistical data mining techniques. In this second edition, renamed to reflect the increased coverage of machine-learning data mining techniques, the author has completely revised, reorganized, and repositioned the original chapters and produced 14 new chapters of creative and useful machine-learning data mining techniques. In sum, the 31 chapters of simple yet insightful quantitative techniques make this book unique in the field of data mining literature.

The statistical data mining methods effectively consider big data for identifying structures (variables) with the appropriate predictive power in order to yield reliable and robust large-scale statistical models and analyses. In contrast, the author's own GenIQ Model provides machine-learning solutions to common and virtually unapproachable statistical problems. GenIQ makes this possible — its utilitarian data mining features start where statistical data mining stops.

This book contains essays offering detailed background, discussion, and illustration of specific methods for solving the most commonly experienced problems in predictive modeling and analysis of big data. They address each methodology and assign its application to a specific type of problem. To better ground readers, the book provides an in-depth discussion of the basic methodologies of predictive modeling and analysis. While this type of overview has been attempted before, this approach offers a truly nitty-gritty, step-by-step method that both tyros and experts in the field can enjoy playing with.

**Introduction**

The Personal Computer and Statistics

Statistics and Data Analysis

EDA

The EDA Paradigm

EDA Weaknesses

Small and Big Data

Data Mining Paradigm

Statistics and Machine Learning

Statistical Data Mining

References

Introduction

Correlation Coefficient

Scatterplots

Data Mining

Smoothed Scatterplot

General Association Test

Summary

References

Introduction

The Scatterplot

The Smooth Scatterplot

Primer on CHAID

CHAID-Based Data Mining for a Smoother Scatterplot

Summary

References

Appendix

Introduction

Straightness and Symmetry in Data

Data Mining Is a High Concept

The Correlation Coefficient

Scatterplot of (xx3, yy3)

Data Mining the Relationship of (xx3, yy3)

What Is the GP-Based Data Mining Doing to the Data?

Straightening a Handful of Variables and a Dozen of Two Baker’s Dozens of Variables

Summary

References

Introduction

Scales of Measurement

Stem-and-Leaf Display

Box-and-Whiskers Plot

Illustration of the Symmetrizing Ranked Data Method

Summary

References

Principal Component Analysis: A Statistical Data Mining Method for Many-Variable Assessment

Introduction

EDA Reexpression Paradigm

What Is the Big Deal?

PCA Basics

Exemplary Detailed Illustration

Algebraic Properties of PCA

Uncommon Illustration

PCA in the Construction of a Quasi-Interaction Variable

Summary

Introduction

Basics of the Correlation Coefficient

Calculation of the Correlation Coefficient

Rematching

Calculation of the Adjusted Correlation Coefficient

Implication of Rematching

Summary

Introduction

Logistic Regression Model

Case Study

Logits and Logit Plots

The Importance of Straight Data

Reexpressing for Straight

Straight Data for Case Study

Techniques When Bulging Rule Does Not Apply

Reexpressing MOS_OPEN

Assessing the Importance of Variables

Important Variables for Case Study

Relative Importance of the Variables

Best Subset of Variables for Case Study

Visual Indicators of Goodness of Model Predictions

Evaluating the Data Mining Work

Smoothing a Categorical Variable

Additional Data Mining Work for Case Study

Summary

Introduction

Ordinary Regression Model

Mini Case Study

Important Variables for Mini Case Study

Best Subset of Variables for Case Study

Suppressor Variable AGE

Summary

References

Introduction

Background

Frequently Used Variable Selection Methods

Weakness in the Stepwise

Enhanced Variable Selection Method

Exploratory Data Analysis

Summary

References

Introduction

Logistic Regression Model

Database Marketing Response Model Case Study

CHAID

Multivariable CHAID Trees

CHAID Market Segmentation

CHAID Tree Graphs

Summary

Introduction

The Ordinary Regression Model

Four Questions

Important Predictor Variables

P Values and Big Data

Returning to Question 1

Effect of Predictor Variable on Prediction

The Caveat

Returning to Question 2

Ranking Predictor Variables by Effect on Prediction

Returning to Question 3

Returning to Question 4

Summary

References

Introduction

Background

Illustration of the Difference between Reliability and Validity

Illustration of the Relationship between Reliability and Validity

The Average Correlation

Summary

Reference

Introduction

Interaction Variables

Strategy for Modeling with Interaction Variables

Strategy Based on the Notion of a Special Point

Example of a Response Model with an Interaction Variable

CHAID for Uncovering Relationships

Illustration of CHAID for Specifying a Model

An Exploratory Look

Database Implication

Summary

References

Introduction

Binary Logistic Regression

Polychotomous Logistic Regression Model

Model Building with PLR

Market Segmentation Classification Model

Summary

CHAID as a Method for Filling in Missing Values

Introduction

Introduction to the Problem of Missing Data

Missing Data Assumption

CHAID Imputation

Illustration

CHAID Most Likely Category Imputation for a Categorical Variable

Summary

References

Introduction

Some Definitions

Illustration of a Flawed Targeting Effort

Well-Defined Targeting Effort

Predictive Profiles

Continuous Trees

Look-Alike Profiling

Look-Alike Tree Characteristics

Summary

Accuracy for Response Model

Accuracy for Profit Model

Decile Analysis and Cum Lift for Response Model

Decile Analysis and Cum Lift for Profit Model

Precision for Response Model

Precision for Profit Model

Separability for Response and Profit Models

Guidelines for Using Cum Lift, HL/SWMAD, and CV

Summary

Introduction

Traditional Model Validation

Illustration

Three Questions

The Bootstrap

How to Bootstrap

Bootstrap Decile Analysis Validation

Another Question

Bootstrap Assessment of Model Implementation Performance

Summary

References

Introduction

Logistic Regression Model

The Bootstrap Validation Method

Summary

Reference

Introduction

Brief History of the Graph

Star Graph Basics

Star Graphs for Single Variables

Star Graphs for Many Variables Considered Jointly

Profile Curves Method

Illustration

Summary

References

Appendix 2: SAS Code for Star Graphs for Each Decile about the Demographic Variables

Appendix 3: SAS Code for Profile Curves: All Deciles

Introduction

Background

Illustration of Decision Rule

Predictive Contribution Coefficient

Calculation of Predictive Contribution Coefficient

Extra Illustration of Predictive Contribution Coefficient

Summary

Reference

Introduction

Shakespearean Modelogue

Interpretation of the Shakespearean Modelogue

Summary

Reference

Introduction

Background

Objective

A Pithy Summary of the Development of Genetic Programming

The GenIQ Model: A Brief Review of Its Objective and Salient Features

The GenIQ Model: How It Works

Summary

References

Introduction

Data Reuse?

Illustration of Data Reuse

Modified Data Reuse: A GenIQ-Enhanced Regression Model

Summary

Background

Moderating Outliers Instead of Discarding Them

Summary

Introduction

Background

The GenIQ Model Solution to Overfitting

Summary

Introduction

Restatement of Why It Is Important to Straighten

Restatement of Section 4.6 "Data Mining the Relationship of (xx3, yy3)"

Summary

Introduction

What Is Optimization?

What Is Genetic Modeling?

Genetic Modeling: An Illustration

Parameters for Controlling a Genetic Model Run

Genetic Modeling: Strengths and Limitations

Goals of Marketing Modeling

The GenIQ Response Model

The GenIQ Profit Model

Case Study: Response Model

Case Study: Profit Model

Summary

Reference

Background

Weakness in the Variable Selection Methods

Goals of Modeling in Marketing

Variable Selection with GenIQ

Nonlinear Alternative to Logistic Regression Model

Summary

References

Introduction

The Linear Regression Coefficient

The Quasi-Regression Coefficient for Simple Regression Models

Partial Quasi-RC for the Everymodel

Quasi-RC for a Coefficient-Free Model

Summary

Bruce Ratner, DM STAT-1 Consulting