"Turn yourself into a Data Head. You'll become a more valuable employee and make your organization more successful."
Thomas H. Davenport, Research Fellow, Author of Competing on Analytics, Big Data @ Work, and The AI Advantage
You've heard the hype around data - now get the facts.
In Becoming a Data Head: How to Think, Speak, and Understand Data Science, Statistics, and Machine Learning, award-winning data scientists Alex Gutman and Jordan Goldmeier pull back the curtain on data science and give you the language and tools necessary to talk and think critically about it.
You'll learn how to:
- Think statistically and understand the role variation plays in your life and decision making
- Speak intelligently and ask the right questions about the statistics and results you encounter in the workplace
- Understand what's really going on with machine learning, text analytics, deep learning, and artificial intelligence
- Avoid common pitfalls when working with and interpreting data
Becoming a Data Head is a complete guide to data science in the workplace, covering everything from the personalities you'll work with to the math behind the algorithms. The authors have spent years in the data trenches and sought to create a fun, approachable, and eminently readable book. Anyone can become a Data Head - an active participant in data science, statistics, and machine learning. Whether you're a business professional, engineer, executive, or aspiring data scientist, this book is for you.
Table of Contents
Acknowledgments
Foreword
Introduction
Part One Thinking Like a Data Head
Chapter 1 What Is the Problem?
Questions a Data Head Should Ask
Why Is This Problem Important?
Who Does This Problem Affect?
What If We Don’t Have the Right Data?
When Is the Project Over?
What If We Don’t Like the Results?
Understanding Why Data Projects Fail
Customer Perception
Discussion
Working on Problems That Matter
Chapter Summary
Chapter 2 What Is Data?
Data vs. Information
An Example Dataset
Data Types
How Data Is Collected and Structured
Observational vs. Experimental Data
Structured vs. Unstructured Data
Basic Summary Statistics
Chapter Summary
Chapter 3 Prepare to Think Statistically
Ask Questions
There Is Variation in All Things
Scenario: Customer Perception (The Sequel)
Case Study: Kidney-Cancer Rates
Probabilities and Statistics
Probability vs. Intuition
Discovery with Statistics
Chapter Summary
Part Two Speaking Like a Data Head
Chapter 4 Argue with the Data
What Would You Do?
Missing Data Disaster
Tell Me the Data Origin Story
Who Collected the Data?
How Was the Data Collected?
Is the Data Representative?
Is There Sampling Bias?
What Did You Do with Outliers?
What Data Am I Not Seeing?
How Did You Deal with Missing Values?
Can the Data Measure What You Want It to Measure?
Argue with Data of All Sizes
Chapter Summary
Chapter 5 Explore the Data
Exploratory Data Analysis and You
Embracing the Exploratory Mindset
Questions to Guide You
The Setup
Can the Data Answer the Question?
Set Expectations and Use Common Sense
Do the Values Make Intuitive Sense?
Watch Out: Outliers and Missing Values
Did You Discover Any Relationships?
Understanding Correlation
Watch Out: Misinterpreting Correlation
Watch Out: Correlation Does Not Imply Causation
Did You Find New Opportunities in the Data?
Chapter Summary
Chapter 6 Examine the Probabilities
Take a Guess
The Rules of the Game
Notation
Conditional Probability and Independent Events
The Probability of Multiple Events
Two Things That Happen Together
One Thing or the Other
Probability Thought Exercise
Next Steps
Be Careful Assuming Independence
Don’t Fall for the Gambler’s Fallacy
All Probabilities Are Conditional
Don’t Swap Dependencies
Bayes’ Theorem
Ensure the Probabilities Have Meaning
Calibration
Rare Events Can, and Do, Happen
Chapter Summary
Chapter 7 Challenge the Statistics
Quick Lessons on Inference
Give Yourself Some Wiggle Room
More Data, More Evidence
Challenge the Status Quo
Evidence to the Contrary
Balance Decision Errors
The Process of Statistical Inference
The Questions You Should Ask to Challenge the Statistics
What Is the Context for These Statistics?
What Is the Sample Size?
What Are You Testing?
What Is the Null Hypothesis?
Assuming Equivalence
What Is the Significance Level?
How Many Tests Are You Doing?
Can I See the Confidence Intervals?
Is This Practically Significant?
Are You Assuming Causality?
Chapter Summary
Part Three Understanding the Data Scientist’s Toolbox
Chapter 8 Search for Hidden Groups
Unsupervised Learning
Dimensionality Reduction
Creating Composite Features
Principal Component Analysis
Principal Components in Athletic Ability
PCA Summary
Potential Traps
Clustering
k-Means Clustering
Clustering Retail Locations
Potential Traps
Chapter Summary
Chapter 9 Understand the Regression Model
Supervised Learning
Linear Regression: What It Does
Least Squares Regression: Not Just a Clever Name
Linear Regression: What It Gives You
Extending to Many Features
Linear Regression: What Confusion It Causes
Omitted Variables
Multicollinearity
Data Leakage
Extrapolation Failures
Many Relationships Aren’t Linear
Are You Explaining or Predicting?
Regression Performance
Other Regression Models
Chapter Summary
Chapter 10 Understand the Classification Model
Introduction to Classification
What You’ll Learn
Classification Problem Setup
Logistic Regression
Logistic Regression: So What?
Decision Trees
Ensemble Methods
Random Forests
Gradient Boosted Trees
Interpretability of Ensemble Models
Watch Out for Pitfalls
Misapplication of the Problem
Data Leakage
Not Splitting Your Data
Choosing the Right Decision Threshold
Misunderstanding Accuracy
Confusion Matrices
Chapter Summary
Chapter 11 Understand Text Analytics
Expectations of Text Analytics
How Text Becomes Numbers
A Big Bag of Words
N-Grams
Word Embeddings
Topic Modeling
Text Classification
Naïve Bayes
Sentiment Analysis
Practical Considerations When Working with Text
Big Tech Has the Upper Hand
Chapter Summary
Chapter 12 Conceptualize Deep Learning
Neural Networks
How Are Neural Networks Like the Brain?
A Simple Neural Network
How a Neural Network Learns
A Slightly More Complex Neural Network
Applications of Deep Learning
The Benefits of Deep Learning
How Computers “See” Images
Convolutional Neural Networks
Deep Learning on Language and Sequences
Deep Learning in Practice
Do You Have Data?
Is Your Data Structured?
What Will the Network Look Like?
Artificial Intelligence and You
Big Tech Has the Upper Hand
Ethics in Deep Learning
Chapter Summary
Part Four Ensuring Success
Chapter 13 Watch Out for Pitfalls
Biases and Weird Phenomena in Data
Survivorship Bias
Regression to the Mean
Simpson’s Paradox
Confirmation Bias
Effort Bias (aka the “Sunk Cost Fallacy”)
Algorithmic Bias
Uncategorized Bias
The Big List of Pitfalls
Statistical and Machine Learning Pitfalls
Project Pitfalls
Chapter Summary
Chapter 14 Know the People and Personalities
Seven Scenes of Communication Breakdowns
The Postmortem
Storytime
The Telephone Game
Into the Weeds
The Reality Check
The Takeover
The Blowhard
Data Personalities
Data Enthusiasts
Data Cynics
Data Heads
Chapter Summary
Chapter 15 What’s Next?
Index