Want to jump into data science but don't know where to start?
Let's be real, data science is presented as something mystical and unattainable without the most powerful software, hardware, and data expertise. Real data science isn't about technology. It's about how you approach the problem.
In this updated edition of Data Smart: Using Data Science to Transform Information into Insight, award-winning data scientist and bestselling author Jordan Goldmeier shows you how to implement data science problems using Excel while exposing how things work behind the scenes.
Data Smart is your field guide to building statistics, machine learning, and powerful artificial intelligence concepts right inside your spreadsheet.
Inside you'll find:
- Four-color data visualizations that highlight and illustrate the concepts discussed in the book
- Tutorials explaining complicated data science using just Microsoft Excel
- How to take what you've learned and apply it to everyday problems at work and life
- Advice for using formulas, Power Query, and some of Excel's latest features to solve tough data problems
- Smart data science solutions for common business challenges
- Explanations of what algorithms do, how they work, and what you can tweak to take your Excel skills to the next level
Data Smart is a must-read for students, analysts, and managers ready to become data science savvy and share their findings with the world.
Table of Contents
Introduction xix
1 Everything You Ever Needed to Know About Spreadsheets but Were Too Afraid to Ask 1
Some Sample Data 2
Accessing Quick Descriptive Statistics 3
Excel Tables 4
Filtering and Sorting 5
Table Formatting 7
Structured References 7
Adding Table Columns 10
Lookup Formulas 11
VLOOKUP 11
INDEX/MATCH 13
XLOOKUP 15
PivotTables 16
Using Array Formulas 19
Solving Stuff with Solver 20
2 Set It and Forget It: An Introduction to Power Query 27
What Is Power Query? 27
Sample Data 28
Starting Power Query 29
Filtering Rows 32
Removing Columns 33
Find & Replace 34
Close & Load to Table 35
3 Naïve Bayes and the Incredible Lightness of Being an Idiot 39
The World's Fastest Intro to Probability Theory 39
Totaling Conditional Probabilities 40
Joint Probability, the Chain Rule, and Independence 40
What Happens in a Dependent Situation? 41
Bayes Rule 42
Separating the Signal and the Noise 43
Using the Bayes Rule to Create an AI Model 44
High-Level Class Probabilities Are Often Assumed to Be Equal 45
A Couple More Odds and Ends 46
Let's Get This Excel Party Started 47
Cleaning the Data with Power Query 48
Splitting on Spaces: Giving Each Word Its Due 50
Counting Tokens and Calculating Probabilities 55
We Have a Model! Let's Use It 58
4 Cluster Analysis Part 1: Using K-Means to Segment Your Customer Base 65
Dances at Summer Camp 65
Getting Real: K-Means Clustering Subscribers in Email Marketing 70
The Initial Dataset 71
Determining What to Measure 72
Start with Four Clusters 75
Euclidean Distance: Measuring Distances as the Crow Flies 76
Solving for the Cluster Centers 80
Making Sense of the Results 82
Getting the Top Deals by Cluster 83
The Silhouette: A Good Way to Let Different K Values Duke It Out 86
How About Five Clusters? 95
Solving for Five Clusters 96
Getting the Top Deals for All Five Clusters 96
Computing the Silhouette for 5-Means Clustering 99
K-Medians Clustering and Asymmetric Distance Measurements 100
Using K-Medians Clustering 100
Getting a More Appropriate Distance Metric 100
Putting It All in Excel 102
The Top Deals for the 5-Medians Clusters 104
5 Cluster Analysis Part II: Network Graphs and Community Detection 109
What Is a Network Graph? 110
Visualizing a Simple Graph 110
Beyond GiGraph and Adjacency Lists 115
Building a Graph from the Wholesale Wine Data 117
Creating a Cosine Similarity Matrix 118
Producing an R-Neighborhood Graph 121
Introduction to Gephi 123
Creating a Static Adjacency Matrix 124
Bringing in Your R-Neighborhood Adjacency Matrix into Gephi 124
Node Degree 128
Touching the Graph Data 130
How Much Is an Edge Worth? Points and Penalties in Graph Modularity 132
What's a Point, and What's a Penalty? 133
Setting Up the Score Sheet 136
Let's Get Clustering! 138
Split Number 1 138
Split 2: Electric Boogaloo 143
And. . .Split3: Split with a Vengeance 145
Encoding and Analyzing the Communities 146
There and Back Again: A Gephi Tale 151
6 Regression: The Granddaddy of Supervised Artificial Intelligence 157
Predicting Pregnant Customers at RetailMart Using Linear Regression 158
The Feature Set 159
Assembling the Training Data 161
Creating Dummy Variables 163
Let's Bake Our Own Linear Regression 165
Linear Regression Statistics: R-Squared, F-Tests, t-Tests 173
Making Predictions on Some New Data and Measuring Performance 182
Predicting Pregnant Customers at RetailMart Using Logistic Regression 192
First You Need a Link Function 192
Hooking Up the Logistic Function and Reoptimizing 193
Baking an Actual Logistic Regression 196
7 Ensemble Models: A Whole Lot of Bad Pizza 203
Getting Started Using the Data from Chapter 6 203
Bagging: Randomize, Train, Repeat 204
Decision Stump is Another Name for a Weak Learner 204
Doesn't Seem So Weak to Me! 204
You Need More Power! 207
Let's Train It 208
Evaluating the Bagged Model 220
Boosting: If You Get It Wrong, Just Boost and Try Again 223
Training the Model - Every Feature Gets a Shot 224
Evaluating the Boosted Model 231
8 Forecasting: Breathe Easy: You Can't Win 235
The Sword Trade Is Hopping 236
Getting Acquainted with Time-Series Data 236
Starting Slow with Simple Exponential Smoothing 238
Setting Up the Simple Exponential Smoothing Forecast 240
You Might Have a Trend 249
Holt's Trend-Corrected Exponential Smoothing 250
Setting Up Holt's Trend-Corrected Smoothing in a Spreadsheet 252
So Are You Done? Looking at Autocorrelations 258
Multiplicative Holt-Winters Exponential Smoothing 266
Setting the Initial Values for Level, Trend, and Seasonality 268
Getting Rolling on the Forecast 274
And. . .Optimize! 280
Putting a Prediction Interval Around the Forecast 283
Creating a Fan Chart for Effect 287
Forecast Sheets in Excel 289
9 Optimization Modeling: Because That "Fresh-Squeezed" Orange Juice Ain't Gonna Blend Itself 293
Wait Is This Data Science? 294
Starting with a Simple Trade-Off 295
Representing the Problem as a Polytope 296
Solving by Sliding the Level Set 297
The Simplex Method: Rooting Around the Corners 298
Working in Excel 300
Fresh from the Grove to Your Glass with a Pit Stop Through a Blending Model 305
Let's Start with Some Specs 307
Coming Back to Consistency 308
Putting the Data into Excel 309
Setting Up the Problem in Solver 311
Lowering Your Standards 314
Dead Squirrel Removal: the Minimax Formulation 317
If-Then and the "Big M" Constraint 320
Multiplying Variables: Cranking Up the Volume to 11,000 324
Modeling Risk 330
Normally Distributed Data 331
10 Outlier Detection: Just Because They're Odd Doesn't Mean They're Unimportant 339
Outliers Are (Bad?) People, Too 340
The Fascinating Case of Hadlum v Hadlum 340
Tukey's Fences 341
Applying Tukey's Fences in a Spreadsheet 342
The Limitations of This Simple Approach 345
Terrible at Nothing, Bad at Everything 346
Preparing Data for Graphing 347
Creating a Graph 350
Getting the k-Nearest Neighbors 351
Graph Outlier Detection Method 1: Just Use the Indegree 352
Graph Outlier Detection Method 2: Getting Nuanced with k-Distance 355
Graph Outlier Detection Method 3: Local Outlier Factors Are Where It's At 358
11 Moving on From Spreadsheets 363
Getting Up and Running with R 364
A Crash Course in R-ing 366
Show Me the Numbers! Vector Math and Factoring 367
The Best Data Type of Them All: the Dataframe 370
How to Ask for Help in R 371
It Gets Even Better Beyond Base R 372
Doing Some Actual Data Science 374
Reading Data into R 374
Spherical K-Means on Wine Data in Just a Few Lines 375
Building AI Models on the Pregnancy Data 381
Forecasting in R 389
Looking at Outlier Detection 393
12 Conclusion 397
Where Am I? What Just Happened? 397
Before You Go-Go 397
Get to Know the Problem 398
We Need More Translators 398
Beware the Three-Headed Geek-Monster: Tools, Performance, and Mathematical Perfection 399
You Are Not the Most Important Function of Your Organization 401
Get Creative and Keep in Touch! 402
Index 403