1st Edition

Text Mining with Machine Learning: Principles and Techniques

    368 Pages 10 Color & 68 B/W Illustrations
    by CRC Press

    366 Pages 10 Color & 68 B/W Illustrations
    by CRC Press

    This book provides a perspective on the application of machine learning-based methods to knowledge discovery from natural-language texts. Analysing various data sets reveals conclusions that are not normally evident and that can be used for various purposes and applications. The book explains the principles of time-proven machine learning algorithms applied in text mining, together with step-by-step demonstrations of how to reveal the semantic content of real-world datasets using the popular R language and its implemented machine learning algorithms. The book is aimed not only at IT specialists but also at a wider audience that needs to process large sets of text documents and has basic knowledge of the subject, e.g. e-mail service providers, online shoppers, librarians, etc.



    The book starts with an introduction to text-based natural language data processing, its goals, and its problems. It focuses on machine learning, presenting various algorithms with their uses and possibilities, and reviews their strengths and weaknesses. Beginning with the initial data pre-processing, a reader can follow the steps provided in the R language, including the incorporation of various available plug-ins into the resulting software tool. A big advantage is that R also contains many libraries implementing machine learning algorithms, so readers can concentrate on the principal target without needing to implement the details of the algorithms themselves. To make sense of the results, the book also provides explanations of the algorithms, which support the final evaluation and interpretation of the results. The examples are demonstrated using real-world data from commonly accessible Internet sources.

    Preface

    Introduction to Text Mining with Machine Learning
    Introduction
    Relation of Text Mining to Data Mining
    The Text Mining Process
    Machine Learning for Text Mining
    Three Fundamental Learning Directions
    Big Data
    About This Book

    Introduction to R
    Installing R
    Running R
    RStudio
    Writing and Executing Commands
    Variables and Data Types
    Objects in R
    Functions
    Operators
    Vectors
    Matrices and Arrays
    Lists
    Factors
    Data Frames
    Functions Useful in Machine Learning
    Flow Control Structures
    Packages
    Graphics

    Structured Text Representations
    Introduction
    The Bag-of-Words Model
    The Limitations of the Bag-of-Words Model
    Document Features
    Standardization
    Texts in Different Encodings
    Language Identification
    Tokenization
    Sentence Detection
    Filtering Stop Words, Common, and Rare Terms
    Removing Diacritics
    Normalization
    Annotation
    Calculating the Weights in the Bag-of-Words Model
    Common Formats for Storing Structured Data
    A Complex Example

    Classification
    Sample Data
    Selected Algorithms
    Classifier Quality Measurement

    Bayes Classifier
    Introduction
    Bayes’ Theorem
    Optimal Bayes Classifier
    Naïve Bayes Classifier
    Illustrative Example of Naïve Bayes
    Naïve Bayes Classifier in R

    Nearest Neighbors
    Introduction
    Similarity as Distance
    Illustrative Example of k-NN
    k-NN in R

    Decision Trees
    Introduction
    Entropy Minimization-Based C5 Algorithm
    C5 Tree Generator in R

    Random Forest
    Introduction
    Random Forest in R

    Adaboost
    Introduction
    Boosting Principle
    Adaboost Principle
    Weak Learners
    Adaboost in R

    Support Vector Machines
    Introduction
    Support Vector Machines Principles
    SVM in R

    Deep Learning
    Introduction
    Artificial Neural Networks
    Deep Learning in R

    Clustering
    Introduction to Clustering
    Difficulties of Clustering
    Similarity Measures
    Types of Clustering Algorithms
    Clustering Criterion Functions
    Deciding on the Number of Clusters
    K-means
    K-medoids
    Criterion Function Optimization
    Agglomerative Hierarchical Clustering
    Scatter-Gather Algorithm
    Divisive Hierarchical Clustering
    Constrained Clustering
    Evaluating Clustering Results
    Cluster Labeling
    A Few Examples

    Word Embeddings
    Introduction
    Determining the Context and Word Similarity
    Context Windows
    Computing Word Embeddings
    Aggregation of Word Vectors
    An Example

    Feature Selection
    Introduction
    Feature Selection as State Space Search
    Feature Selection Methods
    Term Elimination Based on Frequency
    Term Strength
    Term Contribution
    Entropy-Based Ranking
    Term Variance
    An Example

    References

    Index

    Biography



    Jan Žižka is a consultant in machine learning and data mining. He has worked as a system programmer, developer of advanced software systems, and researcher. For the last 25 years, he has devoted himself to AI and machine learning, especially text mining. He has been a faculty member at a number of universities and research institutes. He has authored approximately 100 international publications.



    František Dařena is an associate professor and the head of the Text Mining and NLP group at the Department of Informatics, Mendel University, Brno. He has published numerous articles in international scientific journals, conference proceedings, and monographs, and is a member of editorial boards of several international journals. His research includes text/data mining, intelligent data processing, and machine learning.



    Arnošt Svoboda is an expert programmer. His specialities include programming languages and systems such as R, Assembler, Matlab, PL/1, Cobol, Fortran, Pascal, and others. He started as a system programmer. For the last 20 years, Arnošt has also worked as a teacher and researcher at Masaryk University in Brno. His current interests are machine learning and data mining.