240 Pages
    by Chapman & Hall

    "This book is a great way to both start learning data science through the promising Julia language and to become an efficient data scientist." - Professor Charles Bouveyron, INRIA Chair in Data Science, Université Côte d’Azur, Nice, France

    Julia, an open-source programming language, was created to be as easy to use as languages such as R and Python while also being as fast as C and Fortran. An accessible, intuitive, and highly efficient base language with speed that exceeds that of R and Python makes Julia a formidable language for data science. Using well-known data science methods that will motivate the reader, Data Science with Julia will get readers up to speed on key features of the Julia language and illustrate its facilities for data science and machine learning work.

    Features:

    • Covers the core components of Julia as well as packages relevant to the input, manipulation and representation of data.
    • Discusses several important topics in data science including supervised and unsupervised learning.
    • Reviews data visualization using the Gadfly package, which was designed to emulate the very popular ggplot2 package in R. Readers will learn how to make many common plots and how to visualize model results.
    • Presents how to optimize Julia code for performance.
    • Will be an ideal source for people who already know R and want to learn how to use Julia (though no previous knowledge of R or any other programming language is required).

    The advantages of Julia for data science cannot be overstated. Besides speed and ease of use, there are already over 1,900 packages available, and Julia can interface (either directly or through packages) with libraries written in R, Python, Matlab, C, C++ or Fortran. The book is for senior undergraduates, beginning graduate students, or practicing data scientists who want to learn how to use Julia for data science.

    Chapter 1

    Introduction

    DATA SCIENCE

    BIG DATA

    JULIA

    JULIA PACKAGES

    R PACKAGES

    DATASETS

    Overview

    Beer Data

    Coffee Data

    Leptograpsus Crabs Data

    Food Preferences Data

    x Data

    Iris Data

    OUTLINE OF THE CONTENTS OF THIS MONOGRAPH

    Chapter 2

    Core Julia

    VARIABLE NAMES

    TYPES

    Numeric

    Floats

    Strings

    Tuples

    DATA STRUCTURES

    Arrays

    Dictionaries

    CONTROL FLOW

    Compound Expressions

    Conditional Evaluation

    Loops

    Basics

    Loop termination

    Exception Handling

    FUNCTIONS

    Chapter 3

    Working With Data

    DATAFRAMES

    CATEGORICAL DATA

    IO

    USEFUL DATAFRAME FUNCTIONS

    SPLIT-APPLY-COMBINE STRATEGY

    QUERY.JL

    Chapter 4

    Visualizing Data

    GADFLY.JL

    VISUALIZING UNIVARIATE DATA

    DISTRIBUTIONS

    VISUALIZING BIVARIATE DATA

    ERROR BARS

    FACETS

    SAVING PLOTS

    Chapter 5

    Supervised Learning

    INTRODUCTION

    CROSS-VALIDATION

    Overview

    K-Fold Cross-Validation

    K-NEAREST NEIGHBOURS CLASSIFICATION

    CLASSIFICATION AND REGRESSION TREES

    Overview

    Classification Trees

    Regression Trees

    Comments

    BOOTSTRAP

    RANDOM FORESTS

    GRADIENT BOOSTING

    Overview

    Beer Data

    Food Data

    COMMENTS

    Chapter 6

    Unsupervised Learning

    INTRODUCTION

    PRINCIPAL COMPONENTS ANALYSIS

    PROBABILISTIC PRINCIPAL COMPONENTS ANALYSIS

    EM ALGORITHM FOR PPCA

    Background: EM Algorithm

    E-step

    M-step

    Woodbury Identity

    Initialization

    Stopping Rule

    Implementing the EM Algorithm for PPCA

    Comments

    K-MEANS CLUSTERING

    MIXTURE OF PPCAS

    Model

    Parameter Estimation

    Illustrative Example: Coffee Data

    Chapter 7

    R Interoperability

    ACCESSING R DATASETS

    INTERACTING WITH R

    EXAMPLE: CLUSTERING AND DATA REDUCTION FOR THE COFFEE DATA

    Coffee Data

    PGMM Analysis

    VSCC Analysis

    EXAMPLE: FOOD DATA

    Overview

    Random Forests

    Biography

    Paul D. McNicholas is the Canada Research Chair in Computational Statistics at McMaster University, where he is a Professor in the Department of Mathematics and Statistics.

    Peter Tait is a Ph.D. student in the Department of Mathematics and Statistics at McMaster University. Prior to returning to academia, he worked as a data scientist in the software industry, where he gained extensive practical experience.

    "The book is ideal for people who want to learn Julia through machine-learning examples and is especially relevant for R users – Chapter 7 is devoted to interacting with R from within Julia. The book contains a good balance of equations, code, algorithms written from scratch, and use of built-in machine-learning algorithms. Readers can directly use the code, which is available on GitHub, or dive deeper into how the methods work. A nice feature is the inclusion of probabilistic principal components analysis (PPCA) and mixtures of PPCA for unsupervised learning."
    ~The Royal Statistical Society

    ". . . the book is an excellent piece of work that makes a start with Julia very easy and that covers all essential aspects of the language. After making the first steps into the realm of Julia with the help of this book, the reader should be able afterwards to find the own path and to specialize into the more individual aspects of the language that no introductory textbook can cover. The same is true for the data science part. After reading the book, the reader will be able to perform the most common analyses alone and learn other, more specific methods from different sources afterwards."
    ~Daniel Fischer, International Statistical Review