Provides expert guidance and valuable insights on getting the most out of Big Data systems
An array of tools is currently available for managing and processing data. Some are ready-to-go solutions that can be deployed immediately, while others require complex and time-intensive setups. With such a vast range of options, choosing the right tool to build a solution can be complicated, as can determining which tools work well with each other. Designing Big Data Platforms provides clear and authoritative guidance on the critical decisions necessary for successfully deploying, operating, and maintaining Big Data systems.
This highly practical guide helps readers understand how to process large amounts of data with well-known Linux tools and database solutions, use effective techniques to collect and manage data from multiple sources, transform data into meaningful business insights, and much more. Author Yusuf Aytas, a software engineer with extensive Big Data experience, discusses the design of the ideal Big Data platform: one that meets the needs of data analysts, data engineers, data scientists, software engineers, and a spectrum of other stakeholders across an organization. Detailed yet accessible chapters cover key topics such as stream data processing, data analytics, data science, data discovery, and data security. This real-world manual for Big Data technologies:
- Provides up-to-date coverage of the tools currently used in Big Data processing and management
- Offers step-by-step guidance on building a data pipeline, from basic scripting to distributed systems
- Highlights and explains how data is processed at scale
- Includes an introduction to the foundation of a modern data platform
Designing Big Data Platforms: How to Use, Deploy, and Maintain Big Data Systems is a must-have for all professionals working with Big Data, as well as researchers and students in computer science and related fields.
Table of Contents
List of Contributors xvii
Preface xix
Acknowledgments xxi
Acronyms xxiii
Introduction xxv
1 An Introduction: What’s a Modern Big Data Platform 1
1.1 Defining Modern Big Data Platform 1
1.2 Fundamentals of a Modern Big Data Platform 2
1.2.1 Expectations from Data 2
1.2.1.1 Ease of Access 2
1.2.1.2 Security 2
1.2.1.3 Quality 3
1.2.1.4 Extensibility 3
1.2.2 Expectations from Platform 3
1.2.2.1 Storage Layer 4
1.2.2.2 Resource Management 4
1.2.2.3 ETL 5
1.2.2.4 Discovery 6
1.2.2.5 Reporting 7
1.2.2.6 Monitoring 7
1.2.2.7 Testing 8
1.2.2.8 Lifecycle Management 9
2 A Bird’s Eye View on Big Data 11
2.1 A Bit of History 11
2.1.1 Early Uses of Big Data Term 11
2.1.2 A New Era 12
2.1.2.1 Word Count Problem 12
2.1.2.2 Execution Steps 13
2.1.3 An Open-Source Alternative 15
2.1.3.1 Hadoop Distributed File System 15
2.1.3.2 Hadoop MapReduce 17
2.2 What Makes Big Data 20
2.2.1 Volume 20
2.2.2 Velocity 21
2.2.3 Variety 21
2.2.4 Complexity 21
2.3 Components of Big Data Architecture 22
2.3.1 Ingestion 22
2.3.2 Storage 23
2.3.3 Computation 23
2.3.4 Presentation 24
2.4 Making Use of Big Data 24
2.4.1 Querying 24
2.4.2 Reporting 25
2.4.3 Alerting 25
2.4.4 Searching 25
2.4.5 Exploring 25
2.4.6 Mining 25
2.4.7 Modeling 26
3 A Minimal Data Processing and Management System 27
3.1 Problem Definition 27
3.1.1 Online Book Store 27
3.1.2 User Flow Optimization 28
3.2 Processing Large Data with Linux Commands 28
3.2.1 Understand the Data 28
3.2.2 Sample the Data 28
3.2.3 Building the Shell Command 29
3.2.4 Executing the Shell Command 30
3.2.5 Analyzing the Results 31
3.2.6 Reporting the Findings 32
3.2.7 Automating the Process 33
3.2.8 A Brief Review 33
3.3 Processing Large Data with PostgreSQL 34
3.3.1 Data Modeling 34
3.3.2 Copying Data 35
3.3.3 Sharding in PostgreSQL 37
3.3.3.1 Setting Up Foreign Data Wrapper 37
3.3.3.2 Sharding Data over Multiple Nodes 38
3.4 Cost of Big Data 39
4 Big Data Storage 41
4.1 Big Data Storage Patterns 41
4.1.1 Data Lakes 41
4.1.2 Data Warehouses 42
4.1.3 Data Marts 43
4.1.4 Comparison of Storage Patterns 43
4.2 On-Premise Storage Solutions 44
4.2.1 Choosing Hardware 44
4.2.1.1 DataNodes 44
4.2.1.2 NameNodes 45
4.2.1.3 Resource Managers 45
4.2.1.4 Network Equipment 45
4.2.2 Capacity Planning 46
4.2.2.1 Overall Cluster 46
4.2.2.2 Resource Sharing 47
4.2.2.3 Doing the Math 47
4.2.3 Deploying Hadoop Cluster 48
4.2.3.1 Networking 48
4.2.3.2 Operating System 48
4.2.3.3 Management Tools 49
4.2.3.4 Hadoop Ecosystem 49
4.2.3.5 A Humble Deployment 49
4.3 Cloud Storage Solutions 53
4.3.1 Object Storage 54
4.3.2 Data Warehouses 55
4.3.2.1 Columnar Storage 55
4.3.2.2 Provisioned Data Warehouses 56
4.3.2.3 Serverless Data Warehouses 56
4.3.2.4 Virtual Data Warehouses 57
4.3.3 Archiving 58
4.4 Hybrid Storage Solutions 59
4.4.1 Making Use of Object Store 59
4.4.1.1 Additional Capacity 59
4.4.1.2 Batch Processing 59
4.4.1.3 Hot Backup 60
4.4.2 Making Use of Data Warehouse 60
4.4.2.1 Primary Data Warehouse 60
4.4.2.2 Shared Data Mart 61
4.4.3 Making Use of Archiving 61
5 Offline Big Data Processing 63
5.1 Defining Offline Data Processing 63
5.2 MapReduce Technologies 65
5.2.1 Apache Pig 65
5.2.1.1 Pig Latin Overview 66
5.2.1.2 Compilation to MapReduce 66
5.2.2 Apache Hive 67
5.2.2.1 Hive Database 68
5.2.2.2 Hive Architecture 69
5.3 Apache Spark 70
5.3.1 What’s Spark 71
5.3.2 Spark Constructs and Components 71
5.3.2.1 Resilient Distributed Datasets 71
5.3.2.2 Distributed Shared Variables 73
5.3.2.3 Datasets and DataFrames 74
5.3.2.4 Spark Libraries and Connectors 75
5.3.3 Execution Plan 76
5.3.3.1 The Logical Plan 77
5.3.3.2 The Physical Plan 77
5.3.4 Spark Architecture 77
5.3.4.1 Inside of Spark Application 78
5.3.4.2 Outside of Spark Application 79
5.4 Apache Flink 81
5.5 Presto 83
5.5.1 Presto Architecture 83
5.5.2 Presto System Design 84
5.5.2.1 Execution Plan 84
5.5.2.2 Scheduling 86
5.5.2.3 Resource Management 86
5.5.2.4 Fault Tolerance 87
6 Stream Big Data Processing 89
6.1 The Need for Stream Processing 89
6.2 Defining Stream Data Processing 90
6.3 Streams via Message Brokers 92
6.3.1 Apache Kafka 92
6.3.1.1 Apache Samza 93
6.3.1.2 Kafka Streams 98
6.3.2 Apache Pulsar 100
6.3.2.1 Pulsar Functions 102
6.3.3 AMQP-Based Brokers 105
6.4 Streams via Stream Engines 106
6.4.1 Apache Flink 106
6.4.1.1 Flink Architecture 107
6.4.1.2 System Design 109
6.4.2 Apache Storm 111
6.4.2.1 Storm Architecture 114
6.4.2.2 System Design 115
6.4.3 Apache Heron 116
6.4.3.1 Storm Limitations 116
6.4.3.2 Heron Architecture 117
6.4.4 Spark Streaming 118
6.4.4.1 Discretized Streams 119
6.4.4.2 Fault Tolerance 120
7 Data Analytics 121
7.1 Log Collection 121
7.1.1 Apache Flume 122
7.1.2 Fluentd 122
7.1.2.1 Data Pipeline 123
7.1.2.2 Fluent Bit 124
7.1.2.3 Fluentd Deployment 124
7.2 Transferring Big Data Sets 125
7.2.1 Reloading 126
7.2.2 Partition Loading 126
7.2.3 Streaming 127
7.2.4 Timestamping 127
7.2.5 Tools 128
7.2.5.1 Sqoop 128
7.2.5.2 Embulk 128
7.2.5.3 Spark 129
7.2.5.4 Apache Gobblin 130
7.3 Aggregating Big Data Sets 132
7.3.1 Data Cleansing 132
7.3.2 Data Transformation 134
7.3.2.1 Transformation Functions 134
7.3.2.2 Transformation Stages 135
7.3.3 Data Retention 135
7.3.4 Data Reconciliation 136
7.4 Data Pipeline Scheduler 136
7.4.1 Jenkins 137
7.4.2 Azkaban 138
7.4.2.1 Projects 139
7.4.2.2 Execution Modes 139
7.4.3 Airflow 139
7.4.3.1 Task Execution 140
7.4.3.2 Scheduling 141
7.4.3.3 Executor 141
7.4.3.4 Security and Monitoring 142
7.4.4 Cloud 143
7.5 Patterns and Practices 143
7.5.1 Patterns 143
7.5.1.1 Data Centralization 143
7.5.1.2 Single Source of Truth 144
7.5.1.3 Domain-Driven Data Sets 145
7.5.2 Anti-Patterns 146
7.5.2.1 Data Monolith 146
7.5.2.2 Data Swamp 147
7.5.2.3 Technology Pollution 147
7.5.3 Best Practices 148
7.5.3.1 Business-Driven Approach 148
7.5.3.2 Cost of Maintenance 148
7.5.3.3 Avoiding Modeling Mistakes 149
7.5.3.4 Choosing the Right Tool for the Job 150
7.5.4 Detecting Anomalies 150
7.5.4.1 Manual Anomaly Detection 151
7.5.4.2 Automated Anomaly Detection 151
7.6 Exploring Data Visually 152
7.6.1 Metabase 152
7.6.2 Apache Superset 153
8 Data Science 155
8.1 Data Science Applications 155
8.1.1 Recommendation 156
8.1.2 Predictive Analytics 156
8.1.3 Pattern Discovery 157
8.2 Data Science Life Cycle 158
8.2.1 Business Objective 158
8.2.2 Data Understanding 159
8.2.3 Data Ingestion 159
8.2.4 Data Preparation 160
8.2.5 Data Exploration 160
8.2.6 Feature Engineering 161
8.2.7 Modeling 161
8.2.8 Model Evaluation 162
8.2.9 Model Deployment 163
8.2.10 Operationalizing 163
8.3 Data Science Toolbox 164
8.3.1 R 164
8.3.2 Python 165
8.3.3 SQL 167
8.3.4 TensorFlow 167
8.3.4.1 Execution Model 169
8.3.4.2 Architecture 170
8.3.5 Spark MLlib 171
8.4 Productionalizing Data Science 173
8.4.1 Apache PredictionIO 173
8.4.1.1 Architecture Overview 174
8.4.1.2 Machine Learning Templates 174
8.4.2 Seldon 175
8.4.3 MLflow 175
8.4.3.1 MLflow Tracking 175
8.4.3.2 MLflow Projects 176
8.4.3.3 MLflow Models 176
8.4.3.4 MLflow Model Registry 177
8.4.4 Kubeflow 177
8.4.4.1 Kubeflow Pipelines 177
8.4.4.2 Kubeflow Metadata 178
8.4.4.3 Kubeflow Katib 178
9 Data Discovery 179
9.1 Need for Data Discovery 179
9.1.1 Single Source of Metadata 180
9.1.2 Searching 181
9.1.3 Data Lineage 182
9.1.4 Data Ownership 182
9.1.5 Data Metrics 183
9.1.6 Data Grouping 183
9.1.7 Data Clustering 184
9.1.8 Data Classification 184
9.1.9 Data Glossary 185
9.1.10 Data Update Notification 185
9.1.11 Data Presentation 186
9.2 Data Governance 186
9.2.1 Data Governance Overview 186
9.2.1.1 Data Quality 186
9.2.1.2 Metadata 187
9.2.1.3 Data Access 187
9.2.1.4 Data Life Cycle 188
9.2.2 Big Data Governance 188
9.2.2.1 Data Architecture 188
9.2.2.2 Data Source Integration 189
9.3 Data Discovery Tools 189
9.3.1 Metacat 189
9.3.1.1 Data Abstraction and Interoperability 190
9.3.1.2 Metadata Enrichment 190
9.3.1.3 Searching and Indexing 190
9.3.1.4 Update Notifications 191
9.3.2 Amundsen 191
9.3.2.1 Discovery Capabilities 191
9.3.2.2 Integration Points 192
9.3.2.3 Architecture Overview 192
9.3.3 Apache Atlas 193
9.3.3.1 Searching 194
9.3.3.2 Glossary 194
9.3.3.3 Type System 194
9.3.3.4 Lineage 195
9.3.3.5 Notifications 196
9.3.3.6 Bridges and Hooks 196
9.3.3.7 Architecture Overview 196
10 Data Security 199
10.1 Infrastructure Security 199
10.1.1 Computing 200
10.1.1.1 Auditing 200
10.1.1.2 Operating System 200
10.1.1.3 Network 200
10.1.2 Identity and Access Management 201
10.1.2.1 Authentication 201
10.1.2.2 Authorization 201
10.1.3 Data Transfers 202
10.2 Data Privacy 202
10.2.1 Data Encryption 202
10.2.1.1 File System Layer Encryption 203
10.2.1.2 Database Layer Encryption 203
10.2.1.3 Transport Layer Encryption 203
10.2.1.4 Application Layer Encryption 203
10.2.2 Data Anonymization 204
10.2.2.1 k-Anonymity 204
10.2.2.2 l-Diversity 204
10.2.3 Data Perturbation 204
10.3 Law Enforcement 205
10.3.1 PII 205
10.3.1.1 Identifying PII Tables/Columns 205
10.3.1.2 Segregating Tables Containing PII 205
10.3.1.3 Protecting PII Tables via Access Control 206
10.3.1.4 Masking and Anonymizing PII Data 206
10.3.2 Privacy Regulations/Acts 207
10.3.2.1 Collecting Data 207
10.3.2.2 Erasing Data 207
10.4 Data Security Tools 208
10.4.1 Apache Ranger 208
10.4.1.1 Ranger Policies 208
10.4.1.2 Managed Components 209
10.4.1.3 Architecture Overview 211
10.4.2 Apache Sentry 212
10.4.2.1 Managed Components 212
10.4.2.2 Architecture Overview 213
10.4.3 Apache Knox 214
10.4.3.1 Authentication and Authorization 215
10.4.3.2 Supported Hadoop Services 215
10.4.3.3 Client Services 216
10.4.3.4 Architecture Overview 217
10.4.3.5 Audit 218
11 Putting It All Together 219
11.1 Platforms 219
11.1.1 In-House Solutions 220
11.1.1.1 Cloud Provisioning 220
11.1.1.2 On-Premise Provisioning 221
11.1.2 Cloud Providers 221
11.1.2.1 Vendor Lock-in 222
11.1.2.2 Outages 223
11.1.3 Hybrid Solutions 223
11.1.3.1 Kubernetes 224
11.2 Big Data Systems and Tools 224
11.2.1 Storage 224
11.2.1.1 File-Based Storage 225
11.2.1.2 NoSQL 225
11.2.2 Processing 226
11.2.2.1 Batch Processing 226
11.2.2.2 Stream Processing 227
11.2.2.3 Combining Batch and Streaming 227
11.2.3 Model Training 228
11.2.4 A Holistic View 228
11.3 Challenges 229
11.3.1 Growth 229
11.3.2 SLA 230
11.3.3 Versioning 231
11.3.4 Maintenance 232
11.3.5 Deprecation 233
11.3.6 Monitoring 233
11.3.7 Trends 234
11.3.8 Security 235
11.3.9 Testing 235
11.3.10 Organization 236
11.3.11 Talent 237
11.3.12 Budget 238
12 An Ideal Platform 239
12.1 Event Sourcing 240
12.1.1 How It Works 240
12.1.2 Messaging Middleware 241
12.1.3 Why Use Event Sourcing 242
12.2 Kappa Architecture 242
12.2.1 How It Works 243
12.2.2 Limitations 244
12.3 Data Mesh 245
12.3.1 Domain-Driven Design 245
12.3.2 Self-Serving Infrastructure 247
12.3.3 Data as Product Approach 247
12.4 Data Reservoirs 248
12.4.1 Data Organization 248
12.4.2 Data Standards 249
12.4.3 Data Policies 249
12.4.4 Multiple Data Reservoirs 250
12.5 Data Catalog 250
12.5.1 Data Feedback Loop 251
12.5.2 Data Synthesis 252
12.6 Self-Service Platform 252
12.6.1 Data Publishing 253
12.6.2 Data Processing 254
12.6.3 Data Monitoring 254
12.7 Abstraction 254
12.7.1 Abstractions via User Interface 255
12.7.2 Abstractions via Wrappers 256
12.8 Data Guild 256
12.9 Trade-offs 257
12.9.1 Quality vs Efficiency 258
12.9.2 Real-Time vs Offline 258
12.9.3 Performance vs Cost 258
12.9.4 Consistency vs Availability 259
12.9.5 Disruption vs Reliability 259
12.9.6 Build vs Buy 259
12.10 Data Ethics 260
Appendix A Further Systems and Patterns 261
A.1 Lambda Architecture 261
A.1.1 Batch Layer 262
A.1.2 Speed Layer 262
A.1.3 Serving Layer 262
A.2 Apache Cassandra 263
A.2.1 Cassandra Data Modeling 263
A.2.2 Cassandra Architecture 265
A.2.2.1 Cassandra Components 265
A.2.2.2 Storage Engine 267
A.3 Apache Beam 267
A.3.1 Programming Overview 268
A.3.2 Execution Model 270
Appendix B Recipes 271
B.1 Activity Tracking Recipe 271
B.1.1 Problem Statement 271
B.1.2 Design Approach 271
B.1.2.1 Data Ingestion 271
B.1.2.2 Computation 272
B.2 Data Quality Assurance 273
B.2.1 Problem Statement 273
B.2.2 Design Approach 273
B.2.2.1 Ingredients 273
B.2.2.2 Preparation 275
B.2.2.3 The Menu 277
B.3 Estimating Time to Delivery 277
B.3.1 Problem Definition 278
B.3.2 Design Approach 278
B.3.2.1 Streaming Reference Architecture 278
B.4 Incident Response Recipe 283
B.4.1 Problem Definition 283
B.4.2 Design Approach 283
B.4.2.1 Minimizing Disruptions 284
B.4.2.2 Categorizing Disruptions 284
B.4.2.3 Handling Disruptions 284
B.5 Leveraging Spark SQL Metrics 286
B.5.1 Problem Statement 286
B.5.2 Design Approach 286
B.5.2.1 Spark SQL Metrics Pipeline 286
B.5.2.2 Integrating SQL Metrics 288
B.5.2.3 The Menu 289
B.6 Airbnb Price Prediction 289
B.6.1 Problem Statement 290
B.6.2 Design Approach 290
B.6.2.1 Tech Stack 290
B.6.2.2 Data Understanding and Ingestion 291
B.6.2.3 Data Preparation 291
B.6.2.4 Modeling and Evaluation 293
B.6.2.5 Monitor and Iterate 294
Bibliography 295
Index 301