Learn how to manage a modern data stack and get the most out of data in your organization!
Thanks to the emergence of new technologies and the explosion of data in recent years, we need new practices for managing and getting value out of data. In the modern, data-driven competitive landscape, the "best guess" approach - reading blog posts here and there and patching together data practices without any real visibility - is no longer going to hack it. The Informed Company provides definitive direction on how best to leverage the modern data stack, including cloud computing, columnar storage, cloud ETL tools, and cloud BI tools. You'll learn how to work with Agile methods and set up processes that are right for your company, so you can use your data as a key weapon for your success. You'll discover best practices for every stage, from querying production databases at a small startup all the way to setting up data marts for different business lines of an enterprise.
In their work at Chartio, authors Fowler and David have learned that most businesspeople are almost completely self-taught when it comes to data. If they are using resources, those resources are outdated, so they're missing out on the latest cloud technologies and advances in data analytics. This book will firm up your understanding of data and bring you into the present with knowledge of what works and what doesn't.
- Discover the data stack strategies that are working for today's successful small, medium, and enterprise companies
- Learn the different Agile stages of data organization, and the right one for your team
- Learn how to maintain Data Lakes and Data Warehouses for effective, accessible data storage
- Gain the knowledge you need to architect Data Warehouses and Data Marts
- Understand your business's level of data sophistication and the steps you can take to "level up" your data
The Informed Company is the definitive data book for anyone who wants to work faster and more nimbly, armed with actionable decision-making data.
Table of Contents
About This Book xiii
Foreword xxi
Introduction xxv
Stage 1 Source (aka Siloed Data) 1
Chapter 1 Starting with Source Data 3
Common Options for Analyzing Source Data 4
Chapter 2 The Need to Replicate Source Data 11
Replicate Sources 12
Create Read-Only Access 14
Chapter 3 Source Data Best Practices 15
Keep a Complexity Wiki Page 15
Snippet Dictionary 16
Use a BI Product 17
Double Check Results 18
Keep Short Dashboards 19
Design Before Building 20
Stage 2 Data Lake (aka Data Combined) 23
Chapter 4 Why Build a Data Lake? 25
What Is a Data Lake? 26
Reasons to Build a Data Lake Summarized 27
Chapter 5 Choosing an Engine for the Data Lake 33
Modern Columnar Warehouse Engines 35
Modern Warehouse Engine Products 38
Database Engines 41
Recommendation 42
Chapter 6 Extract and Load (EL) Data 45
ETL versus ELT 46
EL/ETL Vendors 48
Extract Options 49
Load Options 51
Multiple Schemas 52
Other Extract and Load Routes 53
Chapter 7 Data Lake Security 55
Access in Central Place 56
Permission Tiers 57
Chapter 8 Data Lake Maintenance 59
Why SQL? 60
Data Sources 61
Performance 64
Upgrade Snippets to Views 68
Stage 3 Data Warehouse (aka the Single Source of Truth) 69
Chapter 9 The Power of Layers and Views 75
Make Readable Views 77
Layer Views on Views 78
Start with a Single View 81
Chapter 10 Staging Schemas 83
Orient to the Schemas 84
Pick a Table and Clean It 85
Other Staging Modeling Considerations 98
Building on Top of Staging Schemas 106
Chapter 11 Model Data with dbt 111
Version Control 111
Modularity and Reusability 112
Package Management 112
Organizing Files 113
Macros 113
Incremental Tables 114
Testing 115
Chapter 12 Deploy Modeling Code 119
Branch Using Version Control Software 119
Commit Message 120
Test Locally 120
Code Review 121
Schedule Runs 122
Chapter 13 Implementing the Data Warehouse 123
Manage Dependencies 124
Combine Tables Within Schemas 126
Combine Tables Across Schemas 128
Keep the Grain Consistent 130
Create Business Metrics 131
Keeping Accurate History 133
Chapter 14 Managing Data Access 135
How to Secure Sensitive Data in the Data Warehouse 137
How to Secure Sensitive Data in a BI Tool 140
Chapter 15 Maintaining the Source of Truth 143
Track New Metrics 144
Deprecate Old Metrics 147
Deprecate Old Schemas 149
Resolve Conflicting Numbers 150
Handling Ongoing Requests and Ongoing Feedback 151
Updating Modeling Code 152
Manage Access 153
Tuning to Optimize 156
Code Review All Modeling 157
Maintenance Checklist 158
Stage 4 Data Marts (aka Data Democratized) 161
Chapter 16 Data Mart Implementation 167
Views on the Data Warehouse 167
Segment Tables 168
Access Update 169
Chapter 17 Data Mart Maintenance 171
Educate Team 172
Identify Issues 172
Identify New Needs 176
Help Track Success 176
Chapter 18 Modern versus Traditional Data Stacks: What’s Changed? 177
What’s Changed? 177
Chapter 19 Row- versus Column-Oriented Database 181
Row-Oriented Databases 182
Column-Oriented Databases 184
Summary 190
Chapter 20 Style Guide Example 191
Simplify 192
Clean 194
Naming Conventions 195
Share It 197
Chapter 21 Building an SST Example 199
First Attempt - Same Tables with Prefixes 199
Second Attempt - Operational Schema (Source Agnostic) 205
Third Attempt - Application Separate, Other Sources Smashed 207
Less Planning, More Implementing 209
Acknowledgments and Contributions 211
Index 213