356 Pages
    by Chapman & Hall

    Cybersecurity Analytics is for the cybersecurity student and professional who wants to learn data science techniques critical for tackling cybersecurity challenges, and for the data science student and professional who wants to learn about cybersecurity adaptations. Trying to build a malware detector or a phishing email detector, or just interested in finding patterns in your datasets? This book lets you do it on your own. Numerous examples and dataset links are included so that the reader can "learn by doing." Anyone who has taken a basic college-level calculus course and has some knowledge of probability can easily understand most of the material.

    The book includes chapters on unsupervised learning, semi-supervised learning, supervised learning, text mining, natural language processing, and more. It also includes background on security, statistics, and linear algebra. The website for the book contains a listing of datasets, updates, and other resources for serious practitioners.

    Preface

    Introduction

    What is Data Analytics?

    Data Ingestion

    Data Processing and Cleaning

    Visualization and Exploratory Analysis

    Scatterplots

    Pattern Recognition

    Classification

    Clustering

    Feature Extraction

    Feature Selection

    Random Projections

    Modeling

    Model Specification

    Model Selection and Fitting

    Evaluation

    Strengths and Limitations

    The Curse of Dimensionality

    Security: Basics and Security Analytics

    Basics of Security

    Know Thy Enemy – Attackers and Their Motivations

    Security Goals

    Mechanisms for Ensuring Security Goals

    Confidentiality

    Integrity

    Availability

    Authentication

    Access Control

    Accountability

    Non-repudiation

    Threats, Attacks and Impacts

    Passwords

    Malware

    Spam, Phishing and its Variants

    Intrusions

    Internet Surfing

    System Maintenance and Firewalls

    Other Vulnerabilities

    Protecting Against Attacks

    Applications of Data Science to Security Challenges

    Cybersecurity Datasets

    Data Science Applications

    Passwords

    Malware

    Intrusions

    Spam/Phishing

    Credit Card Fraud/Financial Fraud

    Opinion Spam

    Denial of Service

    Security Analytics and Why Do We Need It

    Statistics

    Probability Density Estimation

    Models

    Poisson

    Uniform

    Normal

    Parameter Estimation

    The Bias-Variance Trade-Off

    The Law of Large Numbers and the Central Limit Theorem

    Confidence Intervals

    Hypothesis Testing

    Bayesian Statistics

    Regression

    Logistic Regression

    Regularization

    Principal Components

    Multidimensional Scaling

    Procrustes

    Nonparametric Statistics

    Time Series

    Data Mining – Unsupervised Learning

    Data Collection

    Types of Data and Operations

    Properties of Datasets

    Data Exploration and Preprocessing

    Data Exploration

    Data Preprocessing/Wrangling

    Data Representation

    Association Rule Mining

    Variations on the Apriori Algorithm

    Clustering

    Partitional Clustering

    Choosing K

    Variations on K-means Algorithm

    Hierarchical Clustering

    Other Clustering Algorithms

    Measuring the Clustering Quality

    Clustering Miscellany: Clusterability, Robustness, Incremental,

    Manifold Discovery

    Spectral Embedding

    Anomaly Detection

    Statistical Methods

    Distance-based Outlier Detection

    kNN based approach

    Density-based Outlier Detection

    Clustering-based Outlier Detection

    One-class learning based Outliers

    Security Applications and Adaptations

    Data Mining for Intrusion Detection

    Malware Detection

    Stepping-stone Detection

    Malware Clustering

    Directed Anomaly Scoring for Spear Phishing Detection

    Concluding Remarks and Further Reading

    Machine Learning – Supervised Learning

    Fundamentals of Supervised Learning

    The Bayes Classifier

    Naïve Bayes

    Nearest Neighbors Classifiers

    Linear Classifiers

    Decision Trees and Random Forests

    Random Forest

    Support Vector Machines

    Semi-Supervised Classification

    Neural Networks and Deep Learning

    Perceptron

    Neural Networks

    Deep Networks

    Topological Data Analysis

    Ensemble Learning

    Majority

    Adaboost

    One-class Learning

    Online Learning

    Adversarial Machine Learning

    Adversarial Examples

    Adversarial Training

    Adversarial Generation

    Beyond Continuous Data

    Evaluation of Machine Learning

    Cost-sensitive Evaluation

    New Metrics for Unbalanced Datasets

    Security Applications and Adaptations

    Intrusion Detection

    Malware Detection

    Spam and Phishing Detection

    For Further Reading

    Text Mining

    Tokenization

    Preprocessing

    Bag-Of-Words

    Vector space model

    Weighting

    Latent Semantic Indexing

    Embedding

    Topic Models: Latent Dirichlet Allocation

    Sentiment Analysis

    Natural Language Processing

    Challenges of NLP

    Basics of Language Study and NLP Techniques

    Text Preprocessing

    Feature Engineering on Text Data

    Morphological, Word and Phrasal Features

    Clausal and Sentence Level Features

    Statistical Features

    Corpus-based Analysis

    Advanced NLP Tasks

    Part of Speech Tagging

    Word Sense Disambiguation

    Language Modeling

    Topic Modeling

    Sequence to Sequence Tasks

    Knowledge Bases and Frameworks

    Natural Language Generation

    Issues with Pipelining

    Security Applications of NLP

    Password Checking

    Email Spam Detection

    Phishing Email Detection

    Malware Detection

    Attack Generation

    Big Data Techniques and Security

    Key terms

    Ingesting the Data

    Persistent Storage

    Computing and Analyzing

    Techniques for Handling Big Data

    Visualizing

    Streaming Data

    Big Data Security

    Implications of Big Data Characteristics on Security and Privacy

    Mechanisms for Big Data Security Goals

    Linear Algebra Basics

    Vectors

    Matrices

    Eigenvectors and Eigenvalues

    The Singular Value Decomposition

    Graphs

    Graph Invariants

    The Laplacian

    Probability

    Probability

    Conditional Probability and Bayes’ Rule

    Base Rate Fallacy

    Expected Values and Moments

    Distribution Functions and Densities

    Models

    Bernoulli and Binomial

    Multinomial

    Uniform

    Bibliography

    Author Index

    Index

    Biography

    Rakesh Verma is a professor of computer science at the University of Houston, where he leads a research group that applies reasoning and data science to cybersecurity challenges. He teaches a course on security analytics that includes some of the material here. Since 2015, he has been co-organizing and editing the proceedings of the ACM International Workshop on Security and Privacy Analytics. He is an editor of Frontiers of Big Data in the Cybersecurity Area, an ACM Distinguished Speaker (2011-2018), and the winner of two Best Paper Awards. He received the Lifetime Mentoring Award from the University of Houston and is a Fulbright Senior Specialist in Computer Science.

    David Marchette is a principal scientist at the Naval Surface Warfare Center, Dahlgren Division, where he leads basic and applied research projects in computational statistics, graph theory, network analysis, pattern recognition, computer intrusion detection, and text analysis. He is a fellow of the American Statistical Association (ASA) and the American Association for the Advancement of Science (AAAS) and an elected member of the International Statistical Institute (ISI).