Guides professionals and students through the rapidly growing field of machine learning with hands-on examples in the popular R programming language
Machine learning - a branch of Artificial Intelligence (AI) that enables computers to improve their results and learn new approaches without explicit instructions - allows organizations to reveal patterns in their data and incorporate predictive analytics into their decision-making process. Practical Machine Learning in R provides a hands-on approach to solving business problems with intelligent, self-learning computer algorithms.
Bestselling authors and data analytics experts Fred Nwanganga and Mike Chapple explain what machine learning is, demonstrate its organizational benefits, and provide hands-on examples created in the R programming language. A perfect guide for professionals, self-taught learners, and students in an introductory machine learning course, this reader-friendly book illustrates the numerous real-world business uses of machine learning approaches. Clear and detailed chapters cover data wrangling, R programming with the popular RStudio tool, classification and regression techniques, performance evaluation, and more.
- Explores data management techniques, including data collection, exploration and dimensionality reduction
- Covers unsupervised learning, where readers identify and summarize patterns using approaches such as apriori, eclat and clustering (see the first sketch following this list)
- Describes the principles behind the Nearest Neighbor, Decision Tree and Naive Bayes classification techniques (see the second sketch following this list)
- Explains how to evaluate and choose the right model, as well as how to improve model performance using ensemble methods such as Random Forest and XGBoost (see the third sketch following this list)
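To give a flavor of the hands-on style these bullets describe, here is a minimal sketch of association rule mining, assuming the arules package and its bundled Groceries transaction data rather than the book's own case studies; the support and confidence thresholds are illustrative choices, not values taken from the book.

    library(arules)

    data("Groceries")   # point-of-sale grocery transactions bundled with the package

    # Mine rules that appear in at least 1% of baskets (support) and hold at
    # least 50% of the time (confidence); these thresholds are illustrative.
    rules <- apriori(Groceries, parameter = list(supp = 0.01, conf = 0.5))

    # Review the five rules with the highest lift.
    inspect(head(sort(rules, by = "lift"), 5))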
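A similarly minimal sketch of nearest neighbor classification, assuming the class package and R's built-in iris data; the 75/25 split and k = 5 are illustrative choices.

    library(class)

    set.seed(1234)
    iris_scaled <- as.data.frame(scale(iris[, 1:4]))   # standardize the numeric features
    train_idx   <- sample(nrow(iris), size = floor(0.75 * nrow(iris)))

    train <- iris_scaled[train_idx, ]
    test  <- iris_scaled[-train_idx, ]

    # Classify each held-out flower by the majority vote of its 5 nearest
    # labeled neighbors in the training set.
    predicted <- knn(train = train, test = test, cl = iris$Species[train_idx], k = 5)

    # A simple confusion matrix to gauge predictive accuracy.
    table(predicted, actual = iris$Species[-train_idx])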
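And a minimal sketch of an ensemble method, assuming the randomForest package and the same built-in iris data; 500 trees is simply the package default written out explicitly.

    library(randomForest)

    set.seed(1234)
    rf_model <- randomForest(Species ~ ., data = iris, ntree = 500)

    # Printing the fitted model reports the out-of-bag error estimate and a
    # confusion matrix, giving a quick read on predictive performance.
    print(rf_model)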
Practical Machine Learning in R is a must-have guide for business analysts, data scientists, and other professionals interested in leveraging the power of AI to solve business problems, as well as students and independent learners seeking to enter the field.
Table of Contents
About the Authors vii
About the Technical Editors ix
Acknowledgments xi
Introduction xxi
Part I: Getting Started 1
Chapter 1 What Is Machine Learning? 3
Discovering Knowledge in Data 5
Introducing Algorithms 5
Artificial Intelligence, Machine Learning, and Deep Learning 6
Machine Learning Techniques 7
Supervised Learning 8
Unsupervised Learning 12
Model Selection 14
Classification Techniques 14
Regression Techniques 15
Similarity Learning Techniques 16
Model Evaluation 16
Classification Errors 17
Regression Errors 19
Types of Error 20
Partitioning Datasets 22
Holdout Method 23
Cross-Validation Methods 23
Exercises 24
Chapter 2 Introduction to R and RStudio 25
Welcome to R 26
R and RStudio Components 27
The R Language 27
RStudio 28
RStudio Desktop 28
RStudio Server 29
Exploring the RStudio Environment 29
R Packages 38
The CRAN Repository 38
Installing Packages 38
Loading Packages 39
Package Documentation 40
Writing and Running an R Script 41
Data Types in R 44
Vectors 45
Testing Data Types 47
Converting Data Types 50
Missing Values 51
Exercises 52
Chapter 3 Managing Data 53
The Tidyverse 54
Data Collection 55
Key Considerations 55
Collecting Ground Truth Data 55
Data Relevance 55
Quantity of Data 56
Ethics 56
Importing the Data 56
Reading Comma-Delimited Files 56
Reading Other Delimited Files 60
Data Exploration 60
Describing the Data 61
Instance 61
Feature 61
Dimensionality 62
Sparsity and Density 62
Resolution 62
Descriptive Statistics 63
Visualizing the Data 69
Comparison 69
Relationship 70
Distribution 72
Composition 73
Data Preparation 74
Cleaning the Data 75
Missing Values 75
Noise 79
Outliers 81
Class Imbalance 82
Transforming the Data 84
Normalization 84
Discretization 89
Dummy Coding 89
Reducing the Data 92
Sampling 92
Dimensionality Reduction 99
Exercises 100
Part II: Regression 101
Chapter 4 Linear Regression 103
Bicycle Rentals and Regression 104
Relationships Between Variables 106
Correlation 106
Regression 114
Simple Linear Regression 115
Ordinary Least Squares Method 116
Simple Linear Regression Model 119
Evaluating the Model 120
Residuals 121
Coefficients 121
Diagnostics 122
Multiple Linear Regression 124
The Multiple Linear Regression Model 124
Evaluating the Model 125
Residual Diagnostics 127
Influential Point Analysis 130
Multicollinearity 133
Improving the Model 135
Considering Nonlinear Relationships 135
Considering Categorical Variables 137
Considering Interactions Between Variables 139
Selecting the Important Variables 141
Strengths and Weaknesses 146
Case Study: Predicting Blood Pressure 147
Importing the Data 148
Exploring the Data 149
Fitting the Simple Linear Regression Model 151
Fitting the Multiple Linear Regression Model 152
Exercises 161
Chapter 5 Logistic Regression 165
Prospecting for Potential Donors 166
Classification 169
Logistic Regression 170
Odds Ratio 172
Binomial Logistic Regression Model 176
Dealing with Missing Data 178
Dealing with Outliers 182
Splitting the Data 187
Dealing with Class Imbalance 188
Training a Model 190
Evaluating the Model 190
Coefficients 193
Diagnostics 195
Predictive Accuracy 195
Improving the Model 198
Dealing with Multicollinearity 198
Choosing a Cutoff Value 205
Strengths and Weaknesses 206
Case Study: Income Prediction 207
Importing the Data 208
Exploring and Preparing the Data 208
Training the Model 212
Evaluating the Model 215
Exercises 216
Part III: Classification 221
Chapter 6 k-Nearest Neighbors 223
Detecting Heart Disease 224
k-Nearest Neighbors 226
Finding the Nearest Neighbors 228
Labeling Unlabeled Data 230
Choosing an Appropriate k 231
k-Nearest Neighbors Model 232
Dealing with Missing Data 234
Normalizing the Data 234
Dealing with Categorical Features 235
Splitting the Data 237
Classifying Unlabeled Data 237
Evaluating the Model 238
Improving the Model 239
Strengths and Weaknesses 241
Case Study: Revisiting the Donor Dataset 241
Importing the Data 241
Exploring and Preparing the Data 242
Dealing with Missing Data 243
Normalizing the Data 245
Splitting and Balancing the Data 246
Building the Model 248
Evaluating the Model 248
Exercises 249
Chapter 7 Naïve Bayes 251
Classifying Spam Email 252
Naïve Bayes 253
Probability 254
Joint Probability 255
Conditional Probability 256
Classification with Naïve Bayes 257
Additive Smoothing 261
Naïve Bayes Model 263
Splitting the Data 266
Training a Model 267
Evaluating the Model 267
Strengths and Weaknesses of the Naïve Bayes Classifier 269
Case Study: Revisiting the Heart Disease Detection Problem 269
Importing the Data 270
Exploring and Preparing the Data 270
Building the Model 272
Evaluating the Model 273
Exercises 274
Chapter 8 Decision Trees 277
Predicting Build Permit Decisions 278
Decision Trees 279
Recursive Partitioning 281
Entropy 285
Information Gain 286
Gini Impurity 290
Pruning 290
Building a Classification Tree Model 291
Splitting the Data 294
Training a Model 295
Evaluating the Model 295
Strengths and Weaknesses of the Decision Tree Model 298
Case Study: Revisiting the Income Prediction Problem 299
Importing the Data 300
Exploring and Preparing the Data 300
Building the Model 302
Evaluating the Model 302
Exercises 304
Part IV: Evaluating and Improving Performance 305
Chapter 9 Evaluating Performance 307
Estimating Future Performance 308
Cross-Validation 311
k-Fold Cross-Validation 311
Leave-One-Out Cross-Validation 315
Random Cross-Validation 316
Bootstrap Sampling 318
Beyond Predictive Accuracy 321
Kappa 323
Precision and Recall 326
Sensitivity and Specificity 328
Visualizing Model Performance 332
Receiver Operating Characteristic Curve 333
Area Under the Curve 336
Exercises 339
Chapter 10 Improving Performance 341
Parameter Tuning 342
Automated Parameter Tuning 342
Customized Parameter Tuning 348
Ensemble Methods 354
Bagging 355
Boosting 358
Stacking 361
Exercises 366
Part V: Unsupervised Learning 367
Chapter 11 Discovering Patterns with Association Rules 369
Market Basket Analysis 370
Association Rules 371
Identifying Strong Rules 373
Support 373
Confidence 373
Lift 374
The Apriori Algorithm 374
Discovering Association Rules 376
Generating the Rules 377
Evaluating the Rules 382
Strengths and Weaknesses 386
Case Study: Identifying Grocery Purchase Patterns 386
Importing the Data 387
Exploring and Preparing the Data 387
Generating the Rules 389
Evaluating the Rules 389
Exercises 392
Notes 393
Chapter 12 Grouping Data with Clustering 395
Clustering 396
k-Means Clustering 399
Segmenting Colleges with k-Means Clustering 403
Creating the Clusters 404
Analyzing the Clusters 407
Choosing the Right Number of Clusters 409
The Elbow Method 409
The Average Silhouette Method 411
The Gap Statistic 412
Strengths and Weaknesses of k-Means Clustering 414
Case Study: Segmenting Shopping Mall Customers 415
Exploring and Preparing the Data 415
Clustering the Data 416
Evaluating the Clusters 418
Exercises 420
Notes 420
Index 421