1st Edition

Big Data Analytics: A Practical Guide for Managers

By Kim H. Pries, Robert Dunnigan
Copyright 2015
    576 Pages 58 B/W Illustrations
    by Auerbach Publications

    With this book, managers and decision makers are given the tools to make more informed decisions about big data purchasing initiatives. Big Data Analytics: A Practical Guide for Managers not only supplies descriptions of common tools, but also surveys the various products and vendors that supply the big data market.

    Comparing and contrasting the different types of analysis commonly conducted with big data, this accessible reference presents clear-cut explanations of the general workings of big data tools. Instead of spending time on HOW to install specific packages, it focuses on the reasons WHY readers would install a given package.

    The book provides authoritative guidance on a range of tools, including open source and proprietary systems. It details the strengths and weaknesses of incorporating big data analysis into decision-making and explains how to leverage the strengths while mitigating the weaknesses.

    • Describes the benefits of distributed computing in simple terms
    • Includes substantial vendor/tool material, especially for open source decisions
    • Covers prominent software packages, including Hadoop and Oracle Endeca
    • Examines GIS and machine learning applications
    • Considers privacy and surveillance issues

    The book further explores basic statistical concepts that, when misapplied, can be the source of errors. Time and again, big data is treated as an oracle that discovers results nobody would have imagined. While big data can serve this valuable function, all too often these results are incorrect, yet are still reported unquestioningly. The probability of having erroneous results increases as a larger number of variables are compared unless preventative measures are taken.

    The authors explain these concepts so that managers can ask their analysts and vendors better questions about whether the methods used to reach a conclusion are appropriate. Because science and medicine have been grappling with similar issues in the publication of studies, the authors draw on the efforts of those fields and apply them to big data.

    Introduction
    So What Is Big Data?
    Growing Interest in Decision Making
    What This Book Addresses
    The Conversation about Big Data
    Technological Change as a Driver of Big Data
    The Central Question: So What?
    Our Goals as Authors
    References

    The Mother of Invention’s Triplets: Moore’s Law, the Proliferation of Data, and Data Storage Technology
    Moore’s Law
    Parallel Computing, Between and Within Machines
    Quantum Computing
    Recap of Growth in Computing Power
    Storage, Storage Everywhere
    Grist for the Mill: Data Used and Unused
    Agriculture
    Automotive
    Marketing in the Physical World
    Online Marketing
    Asset Reliability and Efficiency
    Process Tracking and Automation
    Toward a Definition of Big Data
    Putting Big Data in Context
    Key Concepts of Big Data and Their Consequences
    Summary
    References

    Hadoop
    Power through Distribution
         Cost Effectiveness of Hadoop
    Not Every Problem Is a Nail
         Some Technical Aspects
    Troubleshooting Hadoop
    Running Hadoop
    Hadoop File System
         MapReduce
    Pig and Hive
    Installation
    Current Hadoop Ecosystem
    Hadoop Vendors
         Cloudera
    Amazon Web Services (AWS)
    Hortonworks
    IBM
    Intel
    MapR
    Microsoft
         To Run Pig Latin Using Powershell
    Pivotal
    References

    HBase and Other Big Data Databases
    Evolution from Flat File to the Three V’s
         Flat File
         Hierarchical Database
         Network Database
         Relational Database
         Object-Oriented Databases
         Relational-Object Databases
    Transition to Big Data Databases
         What Is Different about HBase?
         What Is Bigtable?
         What Is MapReduce?
         What Are the Various Modalities for Big Data Databases?
    Graph Databases
         How Does a Graph Database Work?
         What Is the Performance of a Graph Database?
    Document Databases
    Key-Value Databases
    Column-Oriented Databases
         HBase
         Apache Accumulo
    References

    Machine Learning
    Machine Learning Basics
    Classifying with Nearest Neighbors
    Naive Bayes
    Support Vector Machines
    Improving Classification with Adaptive Boosting
    Regression
    Logistic Regression
    Tree-Based Regression
    K-Means Clustering
    Apriori Algorithm
    Frequent Pattern-Growth
    Principal Component Analysis (PCA)
    Singular Value Decomposition
    Neural Networks
    Big Data and MapReduce
    Data Exploration
    Spam Filtering
    Ranking
    Predictive Regression
    Text Regression
    Multidimensional Scaling
    Social Graphing
    References

    Statistics
    Statistics, Statistics Everywhere
    Digging into the Data
    Standard Deviation: The Standard Measure of Dispersion
    The Power of Shapes: Distributions
    Distributions: Gaussian Curve
    Distributions: Why Be Normal?
    Distributions: The Long Arm of the Power Law
    The Upshot? Statistics Are Not Bloodless
    Fooling Ourselves: Seeing What We Want to See in the Data
    We Can Learn Much from an Octopus
    Hypothesis Testing: Seeking a Verdict
         Two-Tailed Testing
    Hypothesis Testing: A Broad Field
    Moving on to Specific Hypothesis Tests
    Regression and Correlation
    p Value in Hypothesis Testing: A Successful Gatekeeper?
    Specious Correlations and Overfitting the Data
    A Sample of Common Statistical Software Packages
         Minitab
         SPSS
         R
         SAS
              Big Data Analytics
              Hadoop Integration
         Angoss
         Statistica
              Capabilities
    Summary
    References

    Google
    Big Data Giants
    Google
         Go
         Android
         Google Product Offerings
         Google Analytics
              Advertising and Campaign Performance
              Analysis and Testing
    Facebook
    Ning
    Non-United States Social Media
         Tencent
         Line
         Sina Weibo
         Odnoklassniki
         Vkontakte
         Nimbuzz
    Ranking Network Sites
    Negative Issues with Social Networks
    Amazon
    Some Final Words
    References

    Geographic Information Systems (GIS)
    GIS Implementations
    A GIS Example
    GIS Tools
    GIS Databases
    References

    Discovery
    Faceted Search versus Strict Taxonomy
    First Key Ability: Breaking Down Barriers
    Second Key Ability: Flexible Search and Navigation
    Underlying Technology
    The Upshot
    Summary
    References

    Data Quality
    Know Thy Data and Thyself
    Structured, Unstructured, and Semistructured Data
    Data Inconsistency: An Example from This Book
    The Black Swan and Incomplete Data
    How Data Can Fool Us
         Ambiguous Data
         Aging of Data or Variables
         Missing Variables May Change the Meaning
         Inconsistent Use of Units and Terminology
    Biases
         Sampling Bias
         Publication Bias
         Survivorship Bias
    Data as a Video, Not a Snapshot: Different Viewpoints as a Noise Filter
    What Is My Toolkit for Improving My Data?
         Ishikawa Diagram
         Interrelationship Digraph
         Force Field Analysis
    Data-Centric Methods
         Troubleshooting Queries from Source Data
         Troubleshooting Data Quality beyond the Source System
         Using Our Hidden Resources
    Summary
    References

    Benefits
    Data Serendipity
    Converting Data Dreck to Usefulness
    Sales
    Returned Merchandise
    Security
    Medical
    Travel
         Lodging
         Vehicle
         Meals
    Geographical Information Systems
         New York City
         Chicago CLEARMAP
         Baltimore
         San Francisco
         Los Angeles
         Tucson, Arizona, University of Arizona, and COPLINK
    Social Networking
    Education
         General Educational Data
         Legacy Data
         Grades and Other Indicators
         Testing Results
         Addresses, Phone Numbers, and More
    Concluding Comments
    References

    Concerns
    Part Two: Basic Principles of National Application
         Collection Limitation Principle
         Data Quality Principle
         Purpose Specification Principle
         Use Limitation Principle
         Security Safeguards Principle
         Openness Principle
         Individual Participation Principle
         Accountability Principle
    Logical Fallacies
         Affirming the Consequent
         Denying the Antecedent
         Ludic Fallacy
    Cognitive Biases
         Confirmation Bias
         Notational Bias
         Selection/Sample Bias
         Halo Effect
         Consistency and Hindsight Biases
         Congruence Bias
         Von Restorff Effect
    Data Serendipity
         Converting Data Dreck to Usefulness
         Sales
    Merchandise Returns
    Security
         CompStat
         Medical
    Travel
         Lodging
         Vehicle
         Meals
    Social Networking
    Education
    Making Yourself Harder to Track
         Misinformation
         Disinformation
         Reducing/Eliminating Profiles
              Social Media
              Self Redefinition
              Identity Theft
         Facebook
    Concluding Comments
    References

    Epilogue
         Michael Porter’s Five Forces Model
              Bargaining Power of Customers
              Bargaining Power of Suppliers
              Threat of New Entrants
              Others
    The OODA Loop
    Implementing Big Data
    Nonlinear, Qualitative Thinking
    Closing
    References

    Biography

    Kim H. Pries has four college degrees: a bachelor of arts in history from the University of Texas at El Paso (UTEP), a bachelor of science in metallurgical engineering from UTEP, a master of science in engineering from UTEP, and a master of science in metallurgical engineering and materials science from Carnegie Mellon University.

    Pries worked as a computer systems manager, a software engineer for an electrical utility, and a scientific programmer under a defense contract for Stoneridge, Incorporated (SRI). He has worked as software manager, engineering services manager, reliability section manager, and product integrity and reliability director.

    In addition to his other responsibilities, Pries has provided Six Sigma training for both UTEP and SRI and cost reduction initiatives for SRI. Pries is also a founding faculty member of Practical Project Management. Additionally, in concert with Jon Quigley, Pries was a cofounder and principal with Value Transformation, LLC, a training, testing, cost improvement, and product development consultancy.

    He trained for Introduction to Engineering Design and Computer Science and Software Engineering with Project Lead the Way. He currently teaches biotechnology, computer science and software engineering, and introduction to engineering design at the beautiful Parkland High School in the Ysleta Independent School District of El Paso, Texas.

    Robert Dunnigan is a manager with Janus Consulting Partners and is based in Dallas, Texas. He holds a bachelor of science in psychology and in sociology with an anthropology emphasis from North Dakota State University. He also holds a master of business administration from INSEAD, "the business school for the world," where he attended the Singapore campus.

    As a Peace Corps volunteer, Robert served for over three years in Honduras developing agribusiness opportunities. As a consultant, he later worked on the Afghanistan Small and Medium Enterprise Development project, traveling the country with his Afghan colleagues and friends to seek opportunities to develop a manufacturing sector.

    Robert is an American Society for Quality–certified Six Sigma Black Belt and a Scrum Alliance–certified Scrum Master.