The world of big data analytics grows ever more complex. And while many people can work superficially with specific frameworks, far fewer understand the fundamental principles of large-scale, distributed data processing systems and how they operate. In Foundations of Data Intensive Applications: Large Scale Data Analytics under the Hood, renowned big-data experts and computer scientists Drs. Supun Kamburugamuve and Saliya Ekanayake deliver a practical guide to applying the principles of big data to software development for optimal performance.
The authors discuss foundational components of large-scale data systems and walk readers through the major software design decisions that define performance, application type, and usability. You???ll learn how to recognize problems in your applications resulting in performance and distributed operation issues, diagnose them, and effectively eliminate them by relying on the bedrock big data principles explained within.
Moving beyond individual frameworks and APIs for data processing, this book unlocks the theoretical ideas that operate under the hood of every big data processing system.
Ideal for data scientists, data architects, dev-ops engineers, and developers, Foundations of Data Intensive Applications: Large Scale Data Analytics under the Hood shows readers how to: - Identify the foundations of large-scale, distributed data processing systems - Make major software design decisions that optimize performance - Diagnose performance problems and distributed operation issues - Understand state-of-the-art research in big data - Explain and use the major big data frameworks and understand what underpins them - Use big data analytics in the real world to solve practical problems
Table of Contents
Introduction xxvii
Chapter 1 Data Intensive Applications 1
Anatomy of a Data-Intensive Application 1
A Histogram Example 2
Program 2
Process Management 3
Communication 4
Execution 5
Data Structures 6
Putting It Together 6
Application 6
Resource Management 6
Messaging 7
Data Structures 7
Tasks and Execution 8
Fault Tolerance 8
Remote Execution 8
Parallel Applications 9
Serial Applications 9
Lloyd’s K-Means Algorithm 9
Parallelizing Algorithms 11
Decomposition 11
Task Assignment 12
Orchestration 12
Mapping 13
K-Means
Algorithm 13
Parallel and Distributed Computing 15
Memory Abstractions 16
Shared Memory 16
Distributed Memory 18
Hybrid (Shared + Distributed) Memory 20
Partitioned Global Address Space Memory 21
Application Classes and Frameworks 22
Parallel Interaction Patterns 22
Pleasingly Parallel 23
Dataflow 23
Iterative 23
Irregular 23
Data Abstractions 24
Data-Intensive
Frameworks 24
Components 24
Workflows 25
An Example 25
What Makes It Difficult? 26
Developing Applications 27
Concurrency 27
Data Partitioning 28
Debugging 28
Diverse Environments 28
Computer Networks 29
Synchronization 29
Thread Synchronization 29
Data Synchronization 30
Ordering of Events 31
Faults 31
Consensus 31
Summary 32
References 32
Chapter 2 Data and Storage 35
Storage Systems 35
Storage for Distributed Systems 36
Direct-Attached Storage 37
Storage Area Network 37
Network-Attached Storage 38
DAS or SAN or NAS? 38
Storage Abstractions 39
Block Storage 39
File Systems 40
Object Storage 41
Data Formats 41
XML 42
JSON 43
CSV 44
Apache Parquet 45
Apache Avro 47
Avro Data Definitions (Schema) 48
Code Generation 49
Without Code Generation 49
Avro File 49
Schema Evolution 49
Protocol Buffers, Flat Buffers, and Thrift 50
Data Replication 51
Synchronous and Asynchronous Replication 52
Single-Leader and Multileader Replication 52
Data Locality 53
Disadvantages of Replication 54
Data Partitioning 54
Vertical Partitioning 55
Horizontal Partitioning (Sharding) 55
Hybrid Partitioning 56
Considerations for Partitioning 57
NoSQL Databases 58
Data Models 58
Key-Value Databases 58
Document Databases 59
Wide Column Databases 59
Graph Databases 59
CAP Theorem 60
Message Queuing 61
Message Processing Guarantees 63
Durability of Messages 64
Acknowledgments 64
Storage First Brokers and Transient Brokers 65
Summary 66
References 66
Chapter 3 Computing Resources 69
A Demonstration 71
Computer Clusters 72
Anatomy of a Computer Cluster 73
Data Analytics in Clusters 74
Dedicated Clusters 76
Classic Parallel Systems 76
Big Data Systems 77
Shared Clusters 79
OpenMPI on a Slurm Cluster 79
Spark on a Yarn Cluster 80
Distributed Application Life Cycle 80
Life Cycle Steps 80
Step 1: Preparation of the Job Package 81
Step 2: Resource Acquisition 81
Step 3: Distributing the Application (Job) Artifacts 81
Step 4: Bootstrapping the Distributed Environment 82
Step 5: Monitoring 82
Step 6: Termination 83
Computing Resources 83
Data Centers 83
Physical Machines 85
Network 85
Virtual Machines 87
Containers 87
Processor, Random Access Memory, and Cache 88
Cache 89
Multiple Processors in a Computer 90
Nonuniform Memory Access 90
Uniform Memory Access 91
Hard Disk 92
GPUs 92
Mapping Resources to Applications 92
Cluster Resource Managers 93
Kubernetes 94
Kubernetes Architecture 94
Kubernetes Application Concepts 96
Data-Intensive Applications on Kubernetes 96
Slurm 98
Yarn 99
Job Scheduling 99
Scheduling Policy 101
Objective Functions 101
Throughput and Latency 101
Priorities 102
Lowering Distance Among the Processes 102
Data Locality 102
Completion Deadline 102
Algorithms 103
First in First Out 103
Gang Scheduling 103
List Scheduling 103
Backfill Scheduling 104
Summary 104
References 104
Chapter 4 Data Structures 107
Virtual Memory 108
Paging and TLB 109
Cache 111
The Need for Data Structures 112
Cache and Memory Layout 112
Memory Fragmentation 114
Data Transfer 115
Data Transfer Between Frameworks 115
Cross-Language Data Transfer 115
Object and Text Data 116
Serialization 116
Vectors and Matrices 117
1D Vectors 118
Matrices 118
Row-Major and Column-Major Formats 119
N-Dimensional Arrays/Tensors 122
NumPy 123
Memory Representation 125
K-means with NumPy 126
Sparse Matrices 127
Table 128
Table Formats 129
Column Data Format 129
Row Data Format 130
Apache Arrow 130
Arrow Data Format 131
Primitive Types 131
Variable-Length Data 132
Arrow Serialization 133
Arrow Example 133
Pandas DataFrame 134
Column vs. Row Tables 136
Summary 136
References 136
Chapter 5 Programming Models 139
Introduction 139
Parallel Programming Models 140
Parallel Process Interaction 140
Problem Decomposition 140
Data Structures 140
Data Structures and Operations 141
Data Types 141
Local Operations 143
Distributed Operations 143
Array 144
Tensor 145
Indexing 145
Slicing 146
Broadcasting 146
Table 146
Graph Data 148
Message Passing Model 150
Model 151
Message Passing Frameworks 151
Message Passing Interface 151
Bulk Synchronous Parallel 153
K-Means 154
Distributed Data Model 157
Eager Model 157
Dataflow Model 158
Data Frames, Datasets, and Tables 159
Input and Output 160
Task Graphs (Dataflow Graphs) 160
Model 161
User Program to Task Graph 161
Tasks and Functions 162
Source Task 162
Compute Task 163
Implicit vs. Explicit Parallel Models 163
Remote Execution 163
Components 164
Batch Dataflow 165
Data Abstractions 165
Table Abstraction 165
Matrix/Tensors 165
Functions 166
Source 166
Compute 167
Sink 168
An Example 168
Caching State 169
Evaluation Strategy 170
Lazy Evaluation 171
Eager Evaluation 171
Iterative Computations 172
DOALL Parallel 172
DOACROSS Parallel 172
Pipeline Parallel 173
Task Graph Models for Iterative Computations 173
K-Means Algorithm 174
Streaming Dataflow 176
Data Abstractions 177
Streams 177
Distributed Operations 178
Streaming Functions 178
Sources 178
Compute 179
Sink 179
An Example 179
Windowing 180
Windowing Strategies 181
Operations on Windows 182
Handling Late Events 182
SQL 182
Queries 183
Summary 184
References 184
Chapter 6 Messaging 187
Network Services 188
TCP/IP 188
RDMA 189
Messaging for Data Analytics 189
Anatomy of a Message 190
Data Packing 190
Protocol 191
Message Types 192
Control Messages 192
External Data Sources 192
Data Transfer Messages 192
Distributed Operations 194
How Are They Used? 194
Task Graph 194
Parallel Processes 195
Anatomy of a Distributed Operation 198
Data Abstractions 198
Distributed Operation API 198
Streaming and Batch Operations 199
Streaming Operations 199
Batch Operations 199
Distributed Operations on Arrays 200
Broadcast 200
Reduce and AllReduce 201
Gather and AllGather 202
Scatter 203
AllToAll 204
Optimized Operations 204
Broadcast 205
Reduce 206
AllReduce 206
Gather and AllGather Collective Algorithms 208
Scatter and AllToAll Collective Algorithms 208
Distributed Operations on Tables 209
Shuffle 209
Partitioning Data 211
Handling Large Data 212
Fetch-Based Algorithm (Asynchronous Algorithm) 213
Distributed Synchronization Algorithm 214
GroupBy 214
Aggregate 215
Join 216
Join Algorithms 219
Distributed Joins 221
Performance of Joins 223
More Operations 223
Advanced Topics 224
Data Packing 224
Memory Considerations 224
Message Coalescing 224
Compression 225
Stragglers 225
Nonblocking vs. Blocking Operations 225
Blocking Operations 226
Nonblocking Operations 226
Summary 227
References 227
Chapter 7 Parallel Tasks 229
CPUs 229
Cache 229
False Sharing 230
Vectorization 231
Threads and Processes 234
Concurrency and Parallelism 234
Context Switches and Scheduling 234
Mutual Exclusion 235
User-Level Threads 236
Process Affinity 236
NUMA-Aware Programming 237
Accelerators 237
Task Execution 238
Scheduling 240
Static Scheduling 240
Dynamic Scheduling 240
Loosely Synchronous and Asynchronous Execution 241
Loosely Synchronous Parallel System 242
Asynchronous Parallel System (Fully Distributed) 243
Actor Model 244
Actor 244
Asynchronous Messages 244
Actor Frameworks 245
Execution Models 245
Process Model 246
Thread Model 246
Remote Execution 246
Tasks for Data Analytics 248
SPMD and MPMD Execution 248
Batch Tasks 249
Data Partitions 249
Operations 251
Task Graph Scheduling 253
Threads, CPU Cores, and Partitions 254
Data Locality 255
Execution 257
Streaming Execution 257
State 257
Immutable Data 258
State in Driver 258
Distributed State 259
Streaming Tasks 259
Streams and Data Partitioning 260
Partitions 260
Operations 261
Scheduling 262
Uniform Resources 263
Resource-Aware Scheduling 264
Execution 264
Dynamic Scaling 264
Back Pressure (Flow Control) 265
Rate-Based Flow Control 266
Credit-Based Flow Control 266
State 267
Summary 268
References 268
Chapter 8 Case Studies 271
Apache Hadoop 271
Programming Model 272
Architecture 274
Cluster Resource Management 275
Apache Spark 275
Programming Model 275
RDD API 276
SQL, DataFrames, and DataSets 277
Architecture 278
Resource Managers 278
Task Schedulers 279
Executors 279
Communication Operations 280
Apache Spark Streaming 280
Apache Storm 282
Programming Model 282
Architecture 284
Cluster Resource Managers 285
Communication Operations 286
Kafka Streams 286
Programming Model 286
Architecture 287
PyTorch 288
Programming Model 288
Execution 292
Cylon 295
Programming Model 296
Architecture 296
Execution 297
Communication Operations 298
Rapids cuDF 298
Programming Model 298
Architecture 299
Summary 300
References 300
Chapter 9 Fault Tolerance 303
Dependable Systems and Failures 303
Fault Tolerance is Not Free 304
Dependable Systems 305
Failures 306
Process Failures 306
Network Failures 307
Node Failures 307
Byzantine Faults 307
Failure Models 308
Failure Detection 308
Recovering from Faults 309
Recovery Methods 310
Stateless Programs 310
Batch Systems 311
Streaming Systems 311
Processing Guarantees 311
Role of Cluster Resource Managers 312
Checkpointing 313
State 313
Consistent Global State 313
Uncoordinated Checkpointing 314
Coordinated Checkpointing 315
Chandy-Lamport Algorithm 315
Batch Systems 316
When to Checkpoint? 317
Snapshot Data 318
Streaming Systems 319
Case Study: Apache Storm 319
Message Tracking 320
Failure Recovery 321
Case Study: Apache Flink 321
Checkpointing 322
Failure Recovery 324
Batch Systems 324
Iterative Programs 324
Case Study: Apache Spark 325
RDD Recomputing 326
Checkpointing 326
Recovery from Failures 327
Summary 327
References 327
Chapter 10 Performance and Productivity 329
Performance Metrics 329
System Performance Metrics 330
Parallel Performance Metrics 330
Speedup 330
Strong Scaling 331
Weak Scaling 332
Parallel Efficiency 332
Amdahl’s Law 333
Gustafson’s Law 334
Throughput 334
Latency 335
Benchmarks 336
LINPACK Benchmark 336
NAS Parallel Benchmark 336
BigDataBench 336
TPC Benchmarks 337
HiBench 337
Performance Factors 337
Memory 337
Execution 338
Distributed Operators 338
Disk I/O 339
Garbage Collection 339
Finding Issues 342
Serial Programs 342
Profiling 342
Scaling 343
Strong Scaling 343
Weak Scaling 344
Debugging Distributed Applications 344
Programming Languages 345
C/C++ 346
Java 346
Memory Management 347
Data Structures 348
Interfacing with Python 348
Python 350
C/C++ Code integration 350
Productivity 351
Choice of Frameworks 351
Operating Environment 353
CPUs and GPUs 353
Public Clouds 355
Future of Data-Intensive Applications 358
Summary 358
References 359
Index 361