1st Edition
Data Mining Tools for Malware Detection
Although the use of data mining for security and malware detection is quickly on the rise, most books on the subject provide high-level theoretical discussions to the near exclusion of the practical aspects. Breaking the mold, Data Mining Tools for Malware Detection provides a step-by-step breakdown of how to develop data mining tools for malware detection. Integrating theory with practical techniques and experimental results, it focuses on malware detection applications for email worms, malicious code, remote exploits, and botnets.
The authors describe the systems they have designed and developed: email worm detection using data mining, a scalable multi-level feature extraction technique to detect malicious executables, detecting remote exploits using data mining, and flow-based identification of botnet traffic by mining multiple log files. For each of these tools, they detail the system architecture, algorithms, performance results, and limitations.
- Discusses data mining for emerging applications, including adaptable malware detection, insider threat detection, firewall policy analysis, and real-time data mining
- Includes four appendices that provide a firm foundation in data management, secure systems, and the semantic web
- Describes the authors’ tools for stream data mining
From algorithms to experimental results, this is one of the few books that will be equally valuable to those in industry, government, and academia. It will help technologists decide which tools to select for specific applications, help managers determine whether to proceed with a data mining project, and give developers innovative alternative designs for a range of applications.
Introduction
Trends
Data Mining and Security Technologies
Data Mining for Email Worm Detection
Data Mining for Malicious Code Detection
Data Mining for Detecting Remote Exploits
Data Mining for Botnet Detection
Stream Data Mining
Emerging Data Mining Tools for Cyber Security Applications
Organization of This Book
Next Steps
Part I: DATA MINING AND SECURITY
Introduction to Part I: Data Mining and Security
Data Mining Techniques
Overview of Data Mining Tasks and Techniques
Artificial Neural Network
Support Vector Machines
Markov Model
Association Rule Mining (ARM)
Multi-class Problem
2.7.1 One-VS-One
2.7.2 One-VS-All
Image Mining
2.8.1 Feature Selection
2.8.2 Automatic Image Annotation
2.8.3 Image Classification
Summary
References
Malware
Viruses
Worms
Trojan Horses
Time and Logic Bombs
Botnet
Spyware
Summary
References
Data Mining for Security Applications
Data Mining for Cyber Security
4.2.1 Overview
4.2.2 Cyber-terrorism, Insider Threats, and External Attacks
4.2.3 Malicious Intrusions
4.2.4 Credit Card Fraud and Identity Theft
4.2.5 Attacks on Critical Infrastructures
4.2.6 Data Mining for Cyber Security
Current Research and Development
Summary
References
Design and Implementation of Data Mining Tools
Intrusion Detection
Web Page Surfing Prediction
Image Classification
Summary and Directions
References
Conclusion to Part I
Part II: DATA MINING FOR EMAIL WORM DETECTION
Introduction to Part II
Email Worm Detection
Architecture
Related Work
Overview of Our Approach
Summary
References
Design of the Data Mining Tool
Architecture
Feature Description
7.3.1 Per-Email Features
7.3.2 Per-Window Features
Feature Reduction Techniques
7.4.1 Dimension Reduction
7.4.2 Two-Phase Feature Selection (TPS)
7.4.2.1 Phase I
7.4.2.2 Phase II
Classification Techniques
Summary
References
Evaluation and Results
Dataset
Experimental Setup
Results
8.4.1 Results from Unreduced Data
8.4.2 Results from PCA-Reduced Data
8.4.3 Results from Two-Phase Selection
Summary
References
Conclusion to Part II
Part III: DATA MINING FOR DETECTING MALICIOUS EXECUTABLES
Introduction to Part III
Malicious Executables
Architecture
Related Work
Hybrid Feature Retrieval (HFR) Model
Summary and Directions
References
Design of the Data Mining Tool
Feature Extraction Using n-Gram Analysis
10.2.1 Binary n-Gram Feature
10.2.2 Feature Collection
10.2.3 Feature Selection
10.2.4 Assembly n-Gram Feature
10.2.5 DLL Function Call Feature
The Hybrid Feature Retrieval Model
10.3.1 Description of the Model
10.3.2 The Assembly Feature Retrieval (AFR) Algorithm
10.3.3 Feature Vector Computation and Classification
Summary and Directions
References
Evaluation and Results
Experiments
Dataset
Experimental Setup
Results
11.5.1 Accuracy
11.5.1.1 Dataset1
11.5.1.2 Dataset2
11.5.1.3 Statistical Significance Test
11.5.1.4 DLL Call Feature
11.5.2 ROC Curves
11.5.3 False Positive and False Negative
11.5.4 Running Time
11.5.5 Training and Testing with Boosted J48
Example Run
Summary and Directions
References
Conclusion to Part III
Part IV: DATA MINING FOR DETECTING REMOTE EXPLOITS
Introduction to Part IV
Detecting Remote Exploits
Architecture
Related Work
Overview of Our Approach
Summary and Directions
References
Design of the Data Mining Tool
DExtor Architecture
Disassembly
Feature Extraction
13.4.1 Useful Instruction Count (UIC)
13.4.2 Instruction Usage Frequencies (IUF)
13.4.3 Code vs. Data Length (CDL)
Combining Features and Compute Combined Feature Vector
Classification
Summary and Directions
References
Evaluation and Results
Dataset
Experimental Setup
14.3.1 Parameter Settings
14.3.2 Baseline Techniques
Results
14.4.1 Running Time
Analysis
Robustness and Limitations
14.6.1 Robustness against Obfuscations
14.6.2 Limitations
Summary and Directions
References
Conclusion to Part IV
Part V: DATA MINING FOR DETECTING BOTNETS
Introduction to Part V
Detecting Botnets
Introduction
Botnet Architecture
Related Work
Our Approach
Summary and Directions
References
Design of the Data Mining Tool
Architecture
System Setup
Data Collection
Bot Command Categorization
Feature Extraction
16.6.1 Packet-level Features
16.6.2 Flow-level Features
Log File Correlation
Classification
Packet Filtering
Summary and Directions
References
Evaluation and Results
17.1.1 Baseline Techniques
17.1.2 Classifiers
Performance on Different Datasets
Comparison with Other Techniques
Further Analysis
Summary and Directions
References
Conclusion to Part V
Part VI: STREAM MINING FOR SECURITY APPLICATIONS
Introduction to Part VI
Stream Mining
Architecture
Related Work
Our Approach
Overview of the Novel Class Detection Algorithm
Classifiers Used
Security Applications
Summary
References
Design of the Data Mining Tool
Definitions
Novel Class Detection
19.3.1 Saving the Inventory of Used Spaces during Training
19.3.1.1 Clustering
19.3.1.2 Storing the Cluster Summary Information
19.3.2 Outlier Detection and Filtering
19.3.2.1 Filtering
19.3.2.2 Detecting Novel Class
Security Applications
Summary and Directions
References
Evaluation and Results
Datasets
20.2.1 Synthetic Data with Only Concept-Drift (SynC)
20.2.2 Synthetic Data with Concept-Drift and Novel Class (SynCN)
20.2.3 Real Data—KDDCup 99 Network Intrusion Detection
20.2.4 Real Data—Forest Cover (UCI Repository)
Experimental Setup
20.3.1 Baseline Method
Performance Study
20.4.1 Evaluation Approach
20.4.2 Results
20.4.3 Running Time
Summary and Directions
References
Conclusion for Part VI
Part VII: EMERGING APPLICATIONS
Introduction to Part VII
Data Mining for Active Defense
Related Work
Architecture
A Data Mining–Based Malware Detection Model
21.4.1 Our Framework
21.4.2 Feature Extraction
21.4.2.1 Binary n-Gram Feature Extraction
21.4.2.2 Feature Selection
21.4.2.3 Feature Vector Computation
21.4.3 Training
21.4.4 Testing
Model-Reversing Obfuscations
21.5.1 Path Selection
21.5.2 Feature Insertion
21.5.3 Feature Removal
Experiments
Summary and Directions
References
Data Mining for Insider Threat Detection
The Challenges, Related Work, and Our Approach
Data Mining for Insider Threat Detection
22.3.1 Our Solution Architecture
22.3.2 Feature Extraction and Compact Representation
22.3.3 RDF Repository Architecture
22.3.4 Data Storage
22.3.4.1 File Organization
22.3.4.2 Predicate Split (PS)
22.3.4.3 Predicate Object Split (POS)
22.3.5 Answering Queries Using Hadoop MapReduce
22.3.6 Data Mining Applications
Comprehensive Framework
Summary and Directions
References
Dependable Real-Time Data Mining
Issues in Real-Time Data Mining
Real-Time Data Mining Techniques
Parallel, Distributed, Real-Time Data Mining
Dependable Data Mining
Mining Data Streams
Summary and Directions
References
Firewall Policy Analysis
Related Work
Firewall Concepts
24.3.1 Representation of Rules
24.3.2 Relationship between Two Rules
24.3.3 Possible Anomalies between Two Rules
Anomaly Resolution Algorithms
24.4.1 Algorithms for Finding and Resolving Anomalies
24.4.1.1 Illustrative Example
24.4.2 Algorithms for Merging Rules
24.4.2.1 Illustrative Example of the Merge Algorithm
Summary and Directions
References
Conclusion to Part VII
Summary and Directions
Summary of This Book
Directions for Data Mining Tools for Malware Detection
Where Do We Go from Here?
Appendix A: Data Management Systems: Developments and Trends
Overview
Developments in Database Systems
Status, Vision, and Issues
Data Management Systems Framework
Building Information Systems from the Framework
Relationship between the Texts
Summary and Directions
References
Appendix B: Trustworthy Systems
Secure Systems
B.2.1 Overview
B.2.2 Access Control and Other Security Concepts
B.2.3 Types of Secure Systems
B.2.4 Secure Operating Systems
B.2.5 Secure Database Systems
B.2.6 Secure Networks
B.2.7 Emerging Trends
B.2.8 Impact of the Web
B.2.9 Steps to Building Secure Systems
Web Security
Building Trusted Systems from Untrusted Components
Dependable Systems
B.5.1 Overview
B.5.2 Trust Management
B.5.3 Digital Rights Management
Biography
Mehedy Masud is a postdoctoral fellow at the University of Texas at Dallas (UTD), where he earned his PhD in computer science in December 2009. He has published in premier journals and conferences, including IEEE Transactions on Knowledge and Data Engineering and the IEEE Data Mining Conference. He will be appointed as a research assistant professor at UTD in Fall 2012. Masud’s research projects include reactively adaptive malware, data mining for detecting malicious executables, botnets, and remote exploits, and cloud data mining. He has a patent pending on stream mining for novel class detection.
Latifur Khan is an associate professor in the computer science department at the University of Texas at Dallas, where he has been teaching and conducting research since September 2000. He received his PhD and MS degrees in computer science from the University of Southern California in August 2000 and December 1996, respectively. Khan is (or has been) supported by grants from NASA, the National Science Foundation (NSF), Air Force Office of Scientific Research (AFOSR), Raytheon, NGA, IARPA, Tektronix, Nokia Research Center, Alcatel, and the SUN academic equipment grant program. In addition, Khan is the director of the state-of-the-art DML@UTD, UTD Data Mining/Database Laboratory, which is the primary center of research related to data mining, semantic web, and image/video annotation at the University of Texas at Dallas. Khan has published more than 100 papers, including articles in several IEEE Transactions journals, the Journal of Web Semantics, and the VLDB Journal and conference proceedings such as IEEE ICDM and PKDD. He is a senior member of IEEE.
Bhavani Thuraisingham joined the University of Texas at Dallas (UTD) in October 2004 as a professor of computer science and director of the Cyber Security Research Center in the Erik Jonsson School of Engineering and Computer Science and is currently the Louis Beecherl Jr. Distinguished Professor. She is an elected Fellow of three professional organizations: the IEEE (Institute of Electrical and Electronics Engineers), the AAAS (American Association for the Advancement of Science), and the BCS (British Computer Society) for her work in data security. She received the IEEE Computer Society’s prestigious 1997 Technical Achievement Award for "outstanding and innovative contributions to secure data management." Prior to joining UTD, Thuraisingham worked for the MITRE Corporation for 16 years, which included an IPA (Intergovernmental Personnel Act) at the National Science Foundation as Program Director for Data and Applications Security. Her work in information security and information management has resulted in more than 100 journal articles, more than 200 refereed conference papers, more than 90 keynote addresses, and 3 U.S. patents. She is the author of ten books in data management, data mining, and data security.