Guides professionals and students through the rapidly growing field of machine learning with hands-on examples in the popular R programming language
Machine learning - a branch of Artificial Intelligence (AI) that enables computers to improve their results and learn new approaches without explicit instructions - allows organizations to reveal patterns in their data and incorporate predictive analytics into their decision-making process. Practical Machine Learning in R provides a hands-on approach to solving business problems with intelligent, self-learning computer algorithms.
Bestselling authors and data analytics experts Fred Nwanganga and Mike Chapple explain what machine learning is, demonstrate its organizational benefits, and provide hands-on examples created in the R programming language. A perfect guide for professionals, self-taught learners, and students in an introductory machine learning course, this reader-friendly book illustrates the numerous real-world business uses of machine learning approaches. Clear and detailed chapters cover data wrangling, R programming with the popular RStudio tool, classification and regression techniques, performance evaluation, and more.
- Explores data management techniques, including data collection, exploration and dimensionality reduction
- Covers unsupervised learning, where readers identify and summarize patterns using approaches such as apriori, eclat and clustering (see the first sketch following this list)
- Describes the principles behind the Nearest Neighbor, Decision Tree and Naive Bayes classification techniques (see the second sketch following this list)
- Explains how to evaluate and choose the right model, as well as how to improve model performance using ensemble methods such as Random Forest and XGBoost (see the third sketch following this list)
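To give a flavor of the hands-on style these bullets describe, here is a minimal sketch of association rule mining, assuming the arules package and its bundled Groceries transaction data rather than the book's own case studies; the support and confidence thresholds are illustrative choices, not values taken from the book.

    library(arules)

    data("Groceries")   # point-of-sale grocery transactions bundled with the package

    # Mine rules that appear in at least 1% of baskets (support) and hold at
    # least 50% of the time (confidence); these thresholds are illustrative.
    rules <- apriori(Groceries, parameter = list(supp = 0.01, conf = 0.5))

    # Review the five rules with the highest lift.
    inspect(head(sort(rules, by = "lift"), 5))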
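A similarly minimal sketch of nearest neighbor classification, assuming the class package and R's built-in iris data; the 75/25 split and k = 5 are illustrative choices.

    library(class)

    set.seed(1234)
    iris_scaled <- as.data.frame(scale(iris[, 1:4]))   # standardize the numeric features
    train_idx   <- sample(nrow(iris), size = floor(0.75 * nrow(iris)))

    train <- iris_scaled[train_idx, ]
    test  <- iris_scaled[-train_idx, ]

    # Classify each held-out flower by the majority vote of its 5 nearest
    # labeled neighbors in the training set.
    predicted <- knn(train = train, test = test, cl = iris$Species[train_idx], k = 5)

    # A simple confusion matrix to gauge predictive accuracy.
    table(predicted, actual = iris$Species[-train_idx])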
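And a minimal sketch of an ensemble method, assuming the randomForest package and the same built-in iris data; 500 trees is simply the package default written out explicitly.

    library(randomForest)

    set.seed(1234)
    rf_model <- randomForest(Species ~ ., data = iris, ntree = 500)

    # Printing the fitted model reports the out-of-bag error estimate and a
    # confusion matrix, giving a quick read on predictive performance.
    print(rf_model)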
Practical Machine Learning in R is a must-have guide for business analysts, data scientists, and other professionals interested in leveraging the power of AI to solve business problems, as well as students and independent learners seeking to enter the field.
Table of Contents
About the Authors vii
About the Technical Editors ix
Acknowledgments xi
Introduction xxi
Part I: Getting Started 1
Chapter 1 What Is Machine Learning? 3
Discovering Knowledge in Data 5
Introducing Algorithms 5
Artificial Intelligence, Machine Learning, and Deep Learning 6
Machine Learning Techniques 7
Supervised Learning 8
Unsupervised Learning 12
Model Selection 14
Classification Techniques 14
Regression Techniques 15
Similarity Learning Techniques 16
Model Evaluation 16
Classification Errors 17
Regression Errors 19
Types of Error 20
Partitioning Datasets 22
Holdout Method 23
Cross-Validation Methods 23
Exercises 24
Chapter 2 Introduction to R and RStudio 25
Welcome to R 26
R and RStudio Components 27
The R Language 27
RStudio 28
RStudio Desktop 28
RStudio Server 29
Exploring the RStudio Environment 29
R Packages 38
The CRAN Repository 38
Installing Packages 38
Loading Packages 39
Package Documentation 40
Writing and Running an R Script 41
Data Types in R 44
Vectors 45
Testing Data Types 47
Converting Data Types 50
Missing Values 51
Exercises 52
Chapter 3 Managing Data 53
The Tidyverse 54
Data Collection 55
Key Considerations 55
Collecting Ground Truth Data 55
Data Relevance 55
Quantity of Data 56
Ethics 56
Importing the Data 56
Reading Comma-Delimited Files 56
Reading Other Delimited Files 60
Data Exploration 60
Describing the Data 61
Instance 61
Feature 61
Dimensionality 62
Sparsity and Density 62
Resolution 62
Descriptive Statistics 63
Visualizing the Data 69
Comparison 69
Relationship 70
Distribution 72
Composition 73
Data Preparation 74
Cleaning the Data 75
Missing Values 75
Noise 79
Outliers 81
Class Imbalance 82
Transforming the Data 84
Normalization 84
Discretization 89
Dummy Coding 89
Reducing the Data 92
Sampling 92
Dimensionality Reduction 99
Exercises 100
Part II: Regression 101
Chapter 4 Linear Regression 103
Bicycle Rentals and Regression 104
Relationships Between Variables 106
Correlation 106
Regression 114
Simple Linear Regression 115
Ordinary Least Squares Method 116
Simple Linear Regression Model 119
Evaluating the Model 120
Residuals 121
Coefficients 121
Diagnostics 122
Multiple Linear Regression 124
The Multiple Linear Regression Model 124
Evaluating the Model 125
Residual Diagnostics 127
Influential Point Analysis 130
Multicollinearity 133
Improving the Model 135
Considering Nonlinear Relationships 135
Considering Categorical Variables 137
Considering Interactions Between Variables 139
Selecting the Important Variables 141
Strengths and Weaknesses 146
Case Study: Predicting Blood Pressure 147
Importing the Data 148
Exploring the Data 149
Fitting the Simple Linear Regression Model 151
Fitting the Multiple Linear Regression Model 152
Exercises 161
Chapter 5 Logistic Regression 165
Prospecting for Potential Donors 166
Classification 169
Logistic Regression 170
Odds Ratio 172
Binomial Logistic Regression Model 176
Dealing with Missing Data 178
Dealing with Outliers 182
Splitting the Data 187
Dealing with Class Imbalance 188
Training a Model 190
Evaluating the Model 190
Coefficients 193
Diagnostics 195
Predictive Accuracy 195
Improving the Model 198
Dealing with Multicollinearity 198
Choosing a Cutoff Value 205
Strengths and Weaknesses 206
Case Study: Income Prediction 207
Importing the Data 208
Exploring and Preparing the Data 208
Training the Model 212
Evaluating the Model 215
Exercises 216
Part III: Classification 221
Chapter 6 k-Nearest Neighbors 223
Detecting Heart Disease 224
k-Nearest Neighbors 226
Finding the Nearest Neighbors 228
Labeling Unlabeled Data 230
Choosing an Appropriate k 231
k-Nearest Neighbors Model 232
Dealing with Missing Data 234
Normalizing the Data 234
Dealing with Categorical Features 235
Splitting the Data 237
Classifying Unlabeled Data 237
Evaluating the Model 238
Improving the Model 239
Strengths and Weaknesses 241
Case Study: Revisiting the Donor Dataset 241
Importing the Data 241
Exploring and Preparing the Data 242
Dealing with Missing Data 243
Normalizing the Data 245
Splitting and Balancing the Data 246
Building the Model 248
Evaluating the Model 248
Exercises 249
Chapter 7 Naïve Bayes 251
Classifying Spam Email 252
Naïve Bayes 253
Probability 254
Joint Probability 255
Conditional Probability 256
Classification with Naïve Bayes 257
Additive Smoothing 261
Naïve Bayes Model 263
Splitting the Data 266
Training a Model 267
Evaluating the Model 267
Strengths and Weaknesses of the Naïve Bayes Classifier 269
Case Study: Revisiting the Heart Disease Detection Problem 269
Importing the Data 270
Exploring and Preparing the Data 270
Building the Model 272
Evaluating the Model 273
Exercises 274
Chapter 8 Decision Trees 277
Predicting Build Permit Decisions 278
Decision Trees 279
Recursive Partitioning 281
Entropy 285
Information Gain 286
Gini Impurity 290
Pruning 290
Building a Classification Tree Model 291
Splitting the Data 294
Training a Model 295
Evaluating the Model 295
Strengths and Weaknesses of the Decision Tree Model 298
Case Study: Revisiting the Income Prediction Problem 299
Importing the Data 300
Exploring and Preparing the Data 300
Building the Model 302
Evaluating the Model 302
Exercises 304
Part IV: Evaluating and Improving Performance 305
Chapter 9 Evaluating Performance 307
Estimating Future Performance 308
Cross-Validation 311
k-Fold Cross-Validation 311
Leave-One-Out Cross-Validation 315
Random Cross-Validation 316
Bootstrap Sampling 318
Beyond Predictive Accuracy 321
Kappa 323
Precision and Recall 326
Sensitivity and Specificity 328
Visualizing Model Performance 332
Receiver Operating Characteristic Curve 333
Area Under the Curve 336
Exercises 339
Chapter 10 Improving Performance 341
Parameter Tuning 342
Automated Parameter Tuning 342
Customized Parameter Tuning 348
Ensemble Methods 354
Bagging 355
Boosting 358
Stacking 361
Exercises 366
Part V: Unsupervised Learning 367
Chapter 11 Discovering Patterns with Association Rules 369
Market Basket Analysis 370
Association Rules 371
Identifying Strong Rules 373
Support 373
Confidence 373
Lift 374
The Apriori Algorithm 374
Discovering Association Rules 376
Generating the Rules 377
Evaluating the Rules 382
Strengths and Weaknesses 386
Case Study: Identifying Grocery Purchase Patterns 386
Importing the Data 387
Exploring and Preparing the Data 387
Generating the Rules 389
Evaluating the Rules 389
Exercises 392
Notes 393
Chapter 12 Grouping Data with Clustering 395
Clustering 396
k-Means Clustering 399
Segmenting Colleges with k-Means Clustering 403
Creating the Clusters 404
Analyzing the Clusters 407
Choosing the Right Number of Clusters 409
The Elbow Method 409
The Average Silhouette Method 411
The Gap Statistic 412
Strengths and Weaknesses of k-Means Clustering 414
Case Study: Segmenting Shopping Mall Customers 415
Exploring and Preparing the Data 415
Clustering the Data 416
Evaluating the Clusters 418
Exercises 420
Notes 420
Index 421