1st Edition

Data Mining Tools for Malware Detection

    450 Pages, 131 B/W Illustrations
    by Auerbach Publications

    Although the use of data mining for security and malware detection is quickly on the rise, most books on the subject provide high-level theoretical discussions to the near exclusion of the practical aspects. Breaking the mold, Data Mining Tools for Malware Detection provides a step-by-step breakdown of how to develop data mining tools for malware detection. Integrating theory with practical techniques and experimental results, it focuses on malware detection applications for email worms, malicious code, remote exploits, and botnets.

    The authors describe the systems they have designed and developed: email worm detection using data mining, a scalable multi-level feature extraction technique to detect malicious executables, detecting remote exploits using data mining, and flow-based identification of botnet traffic by mining multiple log files. For each of these tools, they detail the system architecture, algorithms, performance results, and limitations.

    • Discusses data mining for emerging applications, including adaptable malware detection, insider threat detection, firewall policy analysis, and real-time data mining
    • Includes four appendices that provide a firm foundation in data management, secure systems, and the semantic web
    • Describes the authors’ tools for stream data mining

    From algorithms to experimental results, this is one of the few books that will be equally valuable to those in industry, government, and academia. Technologists will learn which tools to select for specific applications, managers will learn how to decide whether to proceed with a data mining project, and developers will find innovative alternative designs for a range of applications.

    Introduction
    Trends
    Data Mining and Security Technologies
    Data Mining for Email Worm Detection
    Data Mining for Malicious Code Detection
    Data Mining for Detecting Remote Exploits
    Data Mining for Botnet Detection
    Stream Data Mining 
    Emerging Data Mining Tools for Cyber Security Applications
    Organization of This Book
    Next Steps

    Part I: DATA MINING AND SECURITY
    Introduction to Part I: Data Mining and Security

    Data Mining Techniques
    Introduction
    Overview of Data Mining Tasks and Techniques
    Artificial Neural Network
    Support Vector Machines
    Markov Model
    Association Rule Mining (ARM)
    Multi-class Problem
    2.7.1 One-VS-One
    2.7.2 One-VS-All
    Image Mining
    2.8.1 Feature Selection
    2.8.2 Automatic Image Annotation
    2.8.3 Image Classification
    Summary
    References

    Malware
    Introduction 
    Viruses
    Worms
    Trojan Horses
    Time and Logic Bombs 
    Botnet
    Spyware
    Summary
    References

    Data Mining for Security Applications
    Overview
    Data Mining for Cyber Security
    4.2.1 Overview
    4.2.2 Cyber-terrorism, Insider Threats, and External Attacks
    4.2.3 Malicious Intrusions
    4.2.4 Credit Card Fraud and Identity Theft
    4.2.5 Attacks on Critical Infrastructures
    4.2.6 Data Mining for Cyber Security
    Current Research and Development
    Summary
    References

    Design and Implementation of Data Mining Tools 
    Introduction
    Intrusion Detection
    Web Page Surfing Prediction
    Image Classification 
    Summary and Directions
    References

    Conclusion to Part I

    Part II: DATA MINING FOR EMAIL WORM DETECTION

    Introduction to Part II

    Email Worm Detection
    Introduction
    Architecture 
    Related Work
    Overview of Our Approach 
    Summary
    References

    Design of the Data Mining Tool
    Introduction
    Architecture 
    Feature Description
    7.3.1 Per-Email Features
    7.3.2 Per-Window Features
    Feature Reduction Techniques
    7.4.1 Dimension Reduction
    7.4.2 Two-Phase Feature Selection (TPS)
    7.4.2.1 Phase I
    7.4.2.2 Phase II
    Classification Techniques 
    Summary
    References

    Evaluation and Results
    Introduction
    Dataset 
    Experimental Setup
    Results
    8.4.1 Results from Unreduced Data
    8.4.2 Results from PCA-Reduced Data
    8.4.3 Results from Two-Phase Selection
    Summary
    References

    Conclusion to Part II

    Part III: DATA MINING FOR DETECTING MALICIOUS EXECUTABLES
    Introduction to Part III

    Malicious Executables 
    Introduction
    Architecture 
    Related Work
    Hybrid Feature Retrieval (HFR) Model 
    Summary and Directions
    References

    Design of the Data Mining Tool
    Introduction
    Feature Extraction Using n-Gram Analysis
    10.2.1 Binary n-Gram Feature
    10.2.2 Feature Collection
    10.2.3 Feature Selection
    10.2.4 Assembly n-Gram Feature
    10.2.5 DLL Function Call Feature
    The Hybrid Feature Retrieval Model
    10.3.1 Description of the Model
    10.3.2 The Assembly Feature Retrieval (AFR) Algorithm
    10.3.3 Feature Vector Computation and Classification
    Summary and Directions
    References

    Evaluation and Results
    Introduction
    Experiments
    Dataset
    Experimental Setup
    Results
    11.5.1 Accuracy
    11.5.1.1 Dataset1
    11.5.1.2 Dataset2
    11.5.1.3 Statistical Significance Test
    11.5.1.4 DLL Call Feature
    11.5.2 ROC Curves
    11.5.3 False Positive and False Negative
    11.5.4 Running Time
    11.5.5 Training and Testing with Boosted J48
    Example Run
    Summary and Directions
    References

    Conclusion to Part III

    Part IV: DATA MINING FOR DETECTING REMOTE EXPLOITS

    Introduction to Part IV

    Detecting Remote Exploits
    Introduction
    Architecture
    Related Work
    Overview of Our Approach
    Summary and Directions
    References

    Design of the Data Mining Tool
    Introduction
    DExtor Architecture
    Disassembly
    Feature Extraction
    13.4.1 Useful Instruction Count (UIC)
    13.4.2 Instruction Usage Frequencies (IUF)
    13.4.3 Code vs. Data Length (CDL)
    Combining Features and Compute Combined Feature Vector
    Classification 
    Summary and Directions
    References

    Evaluation and Results
    Introduction
    Dataset
    Experimental Setup
    14.3.1 Parameter Settings
    14.3.2 Baseline Techniques
    Results
    14.4.1 Running Time
    Analysis
    Robustness and Limitations
    14.6.1 Robustness against Obfuscations
    14.6.2 Limitations
    Summary and Directions
    References

    Conclusion to Part IV

    Part V: DATA MINING FOR DETECTING BOTNETS

    Introduction to Part V

    Detecting Botnets
    Introduction
    Botnet Architecture 
    Related Work
    Our Approach
    Summary and Directions
    References

    Design of the Data Mining Tool 
    Introduction
    Architecture
    System Setup
    Data Collection
    Bot Command Categorization
    Feature Extraction
    16.6.1 Packet-level Features
    16.6.2 Flow-level Features
    Log File Correlation
    Classification
    Packet Filtering
    Summary and Directions
    References

    Evaluation and Results
    Introduction
    17.1.1 Baseline Techniques
    17.1.2 Classifiers
    Performance on Different Datasets
    Comparison with Other Techniques
    Further Analysis 
    Summary and Directions
    References

    Conclusion to Part V

    Part VI: STREAM MINING FOR SECURITY APPLICATIONS

    Introduction to Part VI

    Stream Mining
    Introduction 
    Architecture
    Related Work
    Our Approach
    Overview of the Novel Class Detection Algorithm
    Classifiers Used
    Security Applications
    Summary
    References

    Design of the Data Mining Tool 
    Introduction
    Definitions
    Novel Class Detection
    19.3.1 Saving the Inventory of Used Spaces during Training
    19.3.1.1 Clustering
    19.3.1.2 Storing the Cluster Summary Information
    19.3.2 Outlier Detection and Filtering
    19.3.2.1 Filtering
    19.3.2.2 Detecting Novel Class
    Security Applications
    Summary and Directions
    References

    Evaluation and Results
    Introduction
    Datasets
    20.2.1 Synthetic Data with Only Concept-Drift (SynC)
    20.2.2 Synthetic Data with Concept-Drift and Novel Class (SynCN)
    20.2.3 Real Data—KDDCup 99 Network Intrusion Detection
    20.2.4 Real Data—Forest Cover (UCI Repository) 
    Experimental Setup
    20.3.1 Baseline Method
    Performance Study
    20.4.1 Evaluation Approach
    20.4.2 Results
    20.4.3 Running Time
    Summary and Directions
    References

    Conclusion to Part VI

    Part VII: EMERGING APPLICATIONS

    Introduction to Part VII

    Data Mining for Active Defense
    Introduction
    Related Work
    Architecture
    A Data Mining–Based Malware Detection Model
    21.4.1 Our Framework
    21.4.2 Feature Extraction
    21.4.2.1 Binary n-Gram Feature Extraction
    21.4.2.2 Feature Selection
    21.4.2.3 Feature Vector Computation
    21.4.3 Training
    21.4.4 Testing
    Model-Reversing Obfuscations
    21.5.1 Path Selection
    21.5.2 Feature Insertion
    21.5.3 Feature Removal
    Experiments
    Summary and Directions
    References

    Data Mining for Insider Threat Detection
    Introduction
    The Challenges, Related Work, and Our Approach
    Data Mining for Insider Threat Detection
    22.3.1 Our Solution Architecture
    22.3.2 Feature Extraction and Compact Representation
    22.3.3 RDF Repository Architecture
    22.3.4 Data Storage
    22.3.4.1 File Organization
    22.3.4.2 Predicate Split (PS)
    22.3.4.3 Predicate Object Split (POS)
    22.3.5 Answering Queries Using Hadoop MapReduce
    22.3.6 Data Mining Applications
    Comprehensive Framework
    Summary and Directions
    References

    Dependable Real-Time Data Mining
    Introduction
    Issues in Real-Time Data Mining
    Real-Time Data Mining Techniques
    Parallel, Distributed, Real-Time Data Mining
    Dependable Data Mining
    Mining Data Streams
    Summary and Directions
    References

    Firewall Policy Analysis
    Introduction
    Related Work
    Firewall Concepts
    24.3.1 Representation of Rules
    24.3.2 Relationship between Two Rules
    24.3.3 Possible Anomalies between Two Rules
    Anomaly Resolution Algorithms
    24.4.1 Algorithms for Finding and Resolving Anomalies
    24.4.1.1 Illustrative Example
    24.4.2 Algorithms for Merging Rules
    24.4.2.1 Illustrative Example of the Merge Algorithm 
    Summary and Directions
    References

    Conclusion to Part VII

    Summary and Directions
    Overview
    Summary of This Book
    Directions for Data Mining Tools for Malware Detection
    Where Do We Go from Here?

    Appendix A: Data Management Systems: Developments and Trends
    Overview
    Developments in Database Systems
    Status, Vision, and Issues
    Data Management Systems Framework
    Building Information Systems from the Framework
    Relationship between the Texts
    Summary and Directions
    References

    Appendix B: Trustworthy Systems
    Overview
    Secure Systems
    B.2.1 Overview
    B.2.2 Access Control and Other Security Concepts
    B.2.3 Types of Secure Systems
    B.2.4 Secure Operating Systems
    B.2.5 Secure Database Systems
    B.2.6 Secure Networks
    B.2.7 Emerging Trends
    B.2.8 Impact of the Web
    B.2.9 Steps to Building Secure Systems
    Web Security
    Building Trusted Systems from Untrusted Components
    Dependable Systems
    B.5.1 Overview
    B.5.2 Trust Management
    B.5.3 Digital Rights Management

    Biography

    Mehedy Masud is a postdoctoral fellow at the University of Texas at Dallas (UTD), where he earned his PhD in computer science in December 2009. He has published in premier journals and conferences, including IEEE Transactions on Knowledge and Data Engineering and the IEEE Data Mining Conference. He will be appointed as a research assistant professor at UTD in Fall 2012. Masud’s research projects include reactively adaptive malware; data mining for detecting malicious executables, botnets, and remote exploits; and cloud data mining. He has a patent pending on stream mining for novel class detection.

    Latifur Khan is an associate professor in the computer science department at the University of Texas at Dallas, where he has been teaching and conducting research since September 2000. He received his PhD and MS degrees in computer science from the University of Southern California in August 2000 and December 1996, respectively. Khan is (or has been) supported by grants from NASA, the National Science Foundation (NSF), Air Force Office of Scientific Research (AFOSR), Raytheon, NGA, IARPA, Tektronix, Nokia Research Center, Alcatel, and the SUN academic equipment grant program. In addition, Khan is the director of the state-of-the-art DML@UTD, UTD Data Mining/Database Laboratory, which is the primary center of research related to data mining, semantic web, and image/video annotation at the University of Texas at Dallas. Khan has published more than 100 papers, including articles in several IEEE Transactions journals, the Journal of Web Semantics, and the VLDB Journal, and conference proceedings such as IEEE ICDM and PKDD. He is a senior member of IEEE.

    Bhavani Thuraisingham joined the University of Texas at Dallas (UTD) in October 2004 as a professor of computer science and director of the Cyber Security Research Center in the Erik Jonsson School of Engineering and Computer Science and is currently the Louis Beecherl Jr. Distinguished Professor. She is an elected Fellow of three professional organizations: the IEEE (Institute of Electrical and Electronics Engineers), the AAAS (American Association for the Advancement of Science), and the BCS (British Computer Society) for her work in data security. She received the IEEE Computer Society’s prestigious 1997 Technical Achievement Award for "outstanding and innovative contributions to secure data management." Prior to joining UTD, Thuraisingham worked for the MITRE Corporation for 16 years, which included an IPA (Intergovernmental Personnel Act) at the National Science Foundation as Program Director for Data and Applications Security. Her work in information security and information management has resulted in more than 100 journal articles, more than 200 refereed conference papers, more than 90 keynote addresses, and 3 U.S. patents. She is the author of ten books in data management, data mining, and data security.