1st Edition
Text Mining with Machine Learning: Principles and Techniques
This book provides a perspective on the application of machine learning-based methods to knowledge discovery from natural language texts. Analysing various data sets can reveal conclusions that are not normally evident and that can be used for various purposes and applications. The book explains the principles of time-proven machine learning algorithms applied in text mining, together with step-by-step demonstrations of how to reveal the semantic content of real-world datasets using the popular R language and its implemented machine learning algorithms. The book is aimed not only at IT specialists but also at a wider audience that needs to process big sets of text documents and has basic knowledge of the subject, e.g. e-mail service providers, online shoppers, librarians, etc.
The book starts with an introduction to text-based natural language data processing, its goals, and its problems. It focuses on machine learning, presenting various algorithms with their uses and possibilities, and reviews their strengths and weaknesses. Beginning with the initial data pre-processing, a reader can follow the steps provided in the R language, including how to incorporate various available plug-ins into the resulting software tool. A big advantage is that R also contains many libraries implementing machine learning algorithms, so a reader can concentrate on the principal target without needing to implement the details of the algorithms themselves. To make sense of the results, the book also explains the algorithms, which supports the final evaluation and interpretation of the results. The examples are demonstrated using real-world data from commonly accessible Internet sources.
Preface
Introduction to Text Mining with Machine Learning
Introduction
Relation of Text Mining to Data Mining
The Text Mining Process
Machine Learning for Text Mining
Three Fundamental Learning Directions
Big Data
About This Book
Introduction to R
Installing R
Running R
RStudio
Writing and Executing Commands
Variables and Data Types
Objects in R
Functions
Operators
Vectors
Matrices and Arrays
Lists
Factors
Data Frames
Functions Useful in Machine Learning
Flow Control Structures
Packages
Graphics
Structured Text Representations
Introduction
The Bag-of-Words Model
The Limitations of the Bag-of-Words Model
Document Features
Standardization
Texts in Different Encodings
Language Identification
Tokenization
Sentence Detection
Filtering Stop Words, Common, and Rare Terms
Removing Diacritics
Normalization
Annotation
Calculating the Weights in the Bag-of-Words Model
Common Formats for Storing Structured Data
A Complex Example
Classification
Sample Data
Selected Algorithms
Classifier Quality Measurement
Bayes Classifier
Introduction
Bayes’ Theorem
Optimal Bayes Classifier
Naïve Bayes Classifier
Illustrative Example of Naïve Bayes
Naïve Bayes Classifier in R
Nearest Neighbors
Introduction
Similarity as Distance
Illustrative Example of k-NN
k-NN in R
Decision Trees
Introduction
Entropy Minimization-Based C5 Algorithm
C5 Tree Generator in R
Random Forest
Introduction
Random Forest in R
Adaboost
Introduction
Boosting Principle
Adaboost Principle
Weak Learners
Adaboost in R
Support Vector Machines
Introduction
Support Vector Machines Principles
SVM in R
Deep Learning
Introduction
Artificial Neural Networks
Deep Learning in R
Clustering
Introduction to Clustering
Difficulties of Clustering
Similarity Measures
Types of Clustering Algorithms
Clustering Criterion Functions
Deciding on the Number of Clusters
K-means
K-medoids
Criterion Function Optimization
Agglomerative Hierarchical Clustering
Scatter-Gather Algorithm
Divisive Hierarchical Clustering
Constrained Clustering
Evaluating Clustering Results
Cluster Labeling
A Few Examples
Word Embeddings
Introduction
Determining the Context and Word Similarity
Context Windows
Computing Word Embeddings
Aggregation of Word Vectors
An Example
Feature Selection
Introduction
Feature Selection as State Space Search
Feature Selection Methods
Term Elimination Based on Frequency
Term Strength
Term Contribution
Entropy-based Ranking
Term Variance
An Example
References
Index
Biography
Jan Žižka is a consultant in machine learning and data mining. He has worked as a system programmer, developer of advanced software systems, and researcher. For the last 25 years, he has devoted himself to AI and machine learning, especially text mining. He has been a faculty member at a number of universities and research institutes. He has authored approximately 100 international publications.
František Dařena is an associate professor and the head of the Text Mining and NLP group at the Department of Informatics, Mendel University, Brno. He has published numerous articles in international scientific journals, conference proceedings, and monographs, and is a member of editorial boards of several international journals. His research includes text/data mining, intelligent data processing, and machine learning.
Arnošt Svoboda is an expert programmer. His specialities include programming languages and systems such as R, Assembler, Matlab, PL/1, Cobol, Fortran, Pascal, and others. He started as a system programmer. For the last 20 years, Arnošt has also worked as a teacher and researcher at Masaryk University in Brno. His current interests are machine learning and data mining.