Machine Learning: Hands-On for Developers and Technical Professionals provides hands-on instruction and fully coded working examples for the most common machine learning techniques used by developers and technical professionals. The book contains a breakdown of each ML variant, explaining how it works and how it is used within certain industries, allowing readers to incorporate the presented techniques into their own work as they follow along. A core tenet of machine learning is a strong focus on data preparation, and a full exploration of the various types of learning algorithms illustrates how the proper tools can help any developer extract information and insights from existing data. The book includes a full complement of Instructor's Materials to facilitate use in the classroom, making this resource useful for students and as a professional reference.
At its core, machine learning is a mathematical, algorithm-based technology that forms the basis of historical data mining and modern big data science. Scientific analysis of big data requires a working knowledge of machine learning, which forms predictions based on known properties learned from training data. Machine Learning is an accessible, comprehensive guide for the non-mathematician, providing clear guidance that allows readers to:
- Learn the languages and tools of machine learning, including Hadoop, Mahout, and Weka
- Understand decision trees, Bayesian networks, and artificial neural networks
- Implement association rule, real-time, and batch learning
- Develop a strategic plan for safe, effective, and efficient machine learning
By learning to construct a system that can learn from data, readers can increase their utility across industries. Machine learning sits at the core of deep-dive data analysis and visualization, which is increasingly in demand as companies discover the goldmine hiding in their existing data. For the tech professional involved in data science, Machine Learning: Hands-On for Developers and Technical Professionals provides the skills and techniques required to dig deeper.
Table of Contents
Introduction xxvii
Chapter 1 What is Machine Learning? 1
History of Machine Learning 1
Alan Turing 1
Arthur Samuel 2
Tom M. Mitchell 2
Summary Definition 3
Algorithm Types for Machine Learning 3
Supervised Learning 3
Unsupervised Learning 4
The Human Touch 4
Uses for Machine Learning 4
Software 4
Stock Trading 5
Robotics 6
Medicine and Healthcare 6
Advertising 7
Retail and E-commerce 7
Gaming Analytics 9
The Internet of Things 10
Languages for Machine Learning 10
Python 10
R 11
Matlab 11
Scala 11
Ruby 11
Software Used in This Book 11
Checking the Java Version 12
Weka Toolkit 12
DeepLearning4J 13
Kafka 13
Spark and Hadoop 13
Text Editors and IDEs 13
Data Repositories 14
UC Irvine Machine Learning Repository 14
Kaggle 14
Summary 14
Chapter 2 Planning for Machine Learning 15
The Machine Learning Cycle 15
It All Starts with a Question 16
I Don’t Have Data! 16
Starting Local 17
Transfer Learning 17
Competitions 17
One Solution Fits All? 18
Defining the Process 18
Planning 18
Developing 19
Testing 19
Reporting 19
Refining 19
Production 20
Avoiding Bias 20
Building a Data Team 20
Mathematics and Statistics 20
Programming 21
Graphic Design 21
Domain Knowledge 21
Data Processing 22
Using Your Computer 22
A Cluster of Machines 22
Cloud-Based Services 22
Data Storage 23
Physical Discs 23
Cloud-Based Storage 23
Data Privacy 23
Cultural Norms 24
Generational Expectations 24
The Anonymity of User Data 25
Don’t Cross the “Creepy Line” 25
Data Quality and Cleaning 26
Presence Checks 26
Type Checks 27
Length Checks 27
Range Checks 28
Format Checks 28
The Britney Dilemma 28
What’s in a Country Name? 31
Dates and Times 33
Final Thoughts on Data Cleaning 33
Thinking About Input Data 34
Raw Text 34
Comma-Separated Variables 34
JSON 35
YAML 37
XML 37
Spreadsheets 38
Databases 39
Thinking About Output Data 39
Don’t Be Afraid to Experiment 40
Summary 40
Chapter 3 Data Acquisition Techniques 43
Scraping Data 43
Copy and Paste 44
Google Sheets 46
Using an API 47
Acquiring Weather Data 48
Migrating Data 50
Installing Embulk 51
Using the Quick Run 51
Installing Plugins 52
Migrating Files to Database 53
Bulk Converting CSV to JSON 55
Summary 56
Chapter 4 Statistics, Linear Regression, and Randomness 57
Working with a Basic Dataset 57
Loading and Converting the Dataset 58
Introducing Basic Statistics 59
Minimum and Maximum Values 60
Sum 61
Mean 62
Arithmetic Mean 62
Harmonic Mean 62
Geometric Mean 63
The Relationship Between the Three Averages 63
Mode 65
Median 66
Range 67
Interquartile Ranges 67
Variance 68
Standard Deviation 69
Using Simple Linear Regression 70
Using Your Spreadsheet 70
Writing a Program 73
Embracing Randomness 75
Finding Pi with Random Numbers 76
Using Monte Carlo Pi in Clojure 77
Summary 80
Chapter 5 Working with Decision Trees 81
The Basics of Decision Trees 81
Uses for Decision Trees 81
Advantages of Decision Trees 82
Limitations of Decision Trees 82
Different Algorithm Types 82
How Decision Trees Work 84
Decision Trees in Weka 88
The Requirement 88
Training Data 89
Using Weka to Create a Decision Tree 90
Creating Java Code from the Classification 94
Testing the Classifier Code 99
Thinking About Future Iterations 101
Summary 101
Chapter 6 Clustering 103
What is Clustering? 103
Where is Clustering Used? 104
The Internet 104
Business and Retail 104
Law Enforcement 105
Computing 105
Clustering Models 105
How the K-Means Works 106
Calculating the Number of Clusters in a Dataset 108
K-Means Clustering with Weka 110
Preparing the Data 110
The Workbench Method 111
The Command-Line Method 116
Converting CSV File to ARFF 116
The Coded Method 120
Summary 128
Chapter 7 Association Rules Learning 129
Where is Association Rules Learning Used? 129
Web Usage Mining 130
Beer and Diapers 130
How Association Rules Learning Works 131
Support 133
Confidence 133
Lift 134
Conviction 134
Defining the Process 134
Algorithms 135
Apriori 135
FP-Growth 136
Mining the Baskets - A Walk-Through 136
The Raw Basket Data 136
Using the Weka Application 137
Inspecting the Results 141
Summary 142
Chapter 8 Support Vector Machines 143
What is a Support Vector Machine? 143
Where are Support Vector Machines Used? 144
The Basic Classification Principles 144
Binary and Multiclass Classification 144
Linear Classifiers 146
Confidence 147
Maximizing and Minimizing to Find the Line 147
How Support Vector Machines Approach Classification 148
Using Linear Classification 148
Using Non-Linear Classification 150
Using Support Vector Machines in Weka 151
Installing LibSVM 151
A Classification Walk-Through 152
Implementing LibSVM with Java 158
Summary 164
Chapter 9 Artificial Neural Networks 165
What is a Neural Network? 165
Artificial Neural Network Uses 166
High-Frequency Trading 166
Credit Applications 167
Data Center Management 167
Robotics 167
Medical Monitoring 168
Trusting the Black Box 168
Breaking Down the Artificial Neural Network 169
Perceptrons 169
Activation Functions 170
Multilayer Perceptrons 171
Back Propagation 173
Data Preparation for Artificial Neural Networks 174
Artificial Neural Networks with Weka 175
Generating a Dataset 175
Loading the Data into Weka 177
Configuring the Multilayer Perceptron 178
Training the Network 180
Altering the Network 182
Increasing the Test Data Size 183
Implementing a Neural Network in Java 183
Creating the Project 183
Writing the Code 185
Converting from CSV to ARFF 188
Running the Neural Network 188
Developing Neural Networks with DeepLearning4J 189
Modifying the Data 189
Viewing Maven Dependencies 190
Handling the Training Data 191
Normalizing Data 191
Building the Model 192
Evaluating the Model 193
Saving the Model 193
Building and Executing the Program 194
Summary 195
Chapter 10 Machine Learning with Text Documents 197
Preparing Text for Analysis 198
Apache Tika 198
Cleaning the Text Data 203
Stopwords 205
Stemming 206
N-grams 206
TF/IDF 207
Loading the Documents 207
Calculating the Term Frequency 208
Calculating the Inverse Document Frequency 208
Computing the TF/IDF Score 209
Reviewing the Final Code Listing 209
Word2Vec 211
Loading the Raw Text Data 212
Tokenizing the Strings 212
Creating the Model 212
Evaluating the Model 213
Reviewing the Final Code 214
Basic Sentiment Analysis 216
Loading Positive and Negative Words 216
Loading Sentences 217
Calculating the Sentiment Score 217
Reviewing the Final Code 218
Performing a Test Run 220
Further Development 220
Summary 221
Chapter 11 Machine Learning with Images 223
What is an Image? 223
Introducing Color Depth 224
Images in Machine Learning 225
Basic Classification with Neural Networks 226
Basic Settings 226
Loading the MNIST Images 226
Model Configuration 227
Model Training 228
Model Evaluation 228
Convolutional Neural Networks 228
How CNNs Work 228
CNN Demonstration 231
Downloading the Image Data 231
Basic Setup 232
Handling the Training and Test Data 233
Image Preparation 233
CNN Model Configuration 234
Model Training 236
Model Evaluation 236
Saving the Model 237
Transfer Learning 237
Summary 238
Chapter 12 Machine Learning Streaming with Kafka 239
What You Will Learn in This Chapter 239
From Machine Learning to Machine Learning Engineer 240
From Batch Processing to Streaming Data Processing 241
What is Kafka? 241
How Does It Work? 241
Fault Tolerance 243
Further Reading 243
Installing Kafka 243
Kafka as a Single-Node Cluster 244
Kafka as a Multinode Cluster 245
Topics Management 247
Creating Topics 248
Finding Out Information About Existing Topics 248
Deleting Topics 249
Sending Messages from the Command Line 249
Receiving Messages from the Command Line 250
Kafka Tool UI 250
Writing Your Own Producers and Consumers 251
Producers in Java 251
Consumers in Java 255
Building and Running the Applications 258
The Streaming API 260
Building a Streaming Machine Learning System 262
Planning the System 263
Continuous Training 265
Determining Which Models to Use for Predictions 266
Determining Which Algorithms to Use 268
Simple Linear Regression 271
Neural Network 274
Kafka Topics 281
Creating the Topics 281
Kafka Connect 283
Why Persist the Event Data? 283
The REST API Microservice 285
Processing Commands and Events 287
Finding Kafka Brokers 288
A Command or an Event? 289
Making Predictions 293
Prediction Streaming API 293
Prediction Functions 296
Predicting Linear Regression 298
Predicting the Neural Network Model 299
Running the Project 301
Run MySQL 301
Run Zookeeper 301
Run Kafka 301
Create the Topics 301
Run Kafka Connect 301
Model Builds 302
Run Events Streaming Application 302
Run Prediction Streaming Application 302
Start the API 302
Send JSON Training Data 302
Train a Model 302
Make a Prediction 303
Summary 303
Chapter 13 Apache Spark 305
Spark: A Hadoop Replacement? 305
Java, Scala, or Python? 306
Downloading and Installing Spark 306
A Quick Intro to Spark 306
Starting the Shell 307
Data Sources 307
Testing Spark 308
Spark Monitor 309
Comparing Hadoop MapReduce to Spark 310
Writing Stand-Alone Programs with Spark 313
Spark Programs in Java 313
Spark Program Summary 318
Spark SQL 318
Basic Concepts 318
Wrapping Up SparkSQL 323
Spark Streaming 323
Basic Concepts 323
Creating Your First Spark Stream 324
Spark Streams from Kafka 326
MLlib: The Machine Learning Library 327
Dependencies 328
Decision Trees 328
Clustering 330
Association Rules with FP-Growth 332
Summary 335
Chapter 14 Machine Learning with R 337
Installing R 337
macOS 337
Windows 338
Linux 338
Your First Run 338
Installing RStudio 339
The R Basics 340
Variables and Vectors 340
Matrices 341
Lists 342
Data Frames 343
Installing Packages 344
Loading in Data 345
Plotting Data 347
Simple Statistics 350
Simple Linear Regression 350
Creating the Data 351
The Initial Graph 351
Regression with the Linear Model 351
Making a Prediction 352
Basic Sentiment Analysis 353
Using Functions to Load in Word Lists 353
Writing a Function to Score Sentiment 354
Testing the Function 354
Apriori Association Rules 355
Installing the arules Package 355
Gathering the Training Data 356
Importing the Transaction Data 356
Running the Apriori Algorithm 357
Inspecting the Results 358
Accessing R from Java 358
Installing the rJava Package 358
Creating Your First Java Code in R 359
Calling R from Java Programs 359
Setting Up an Eclipse Project 360
Creating the Java/R Class 361
Running the Example 361
Extending Your R Implementations 363
Connecting to Social Media with R 364
Summary 366
Appendix A Kafka Quick Start 367
Installing Kafka 367
Starting Zookeeper 367
Starting Kafka 368
Creating Topics 368
Listing Topics 369
Describing a Topic 369
Deleting Topics 369
Running a Console Producer 370
Running a Console Consumer 370
Appendix B The Twitter API Developer Application Configuration 371
Appendix C Useful Unix Commands 375
Using Sample Data 375
Showing the Contents: cat, more, and less 376
Example Command 376
Expected Output 376
Filtering Content: grep 377
Example Command for Finding Text 377
Example Output 377
Sorting Data: sort 378
Example Command for Basic Sorting 378
Example Output 378
Finding Unique Occurrences: uniq 380
Showing the Top of a File: head 381
Counting Words: wc 381
Locating Anything: find 382
Combining Commands and Redirecting Output 383
Picking a Text Editor 383
Colon Frenzy: Vi and Vim 383
Nano 384
Emacs 384
Appendix D Further Reading 385
Machine Learning 385
Statistics 386
Big Data and Data Science 386
Visualization 387
Making Decisions 387
Datasets 388
Blogs 388
Useful Websites 389
The Tools of the Trade 389
Index 391