+353-1-416-8900REST OF WORLD
+44-20-3973-8888REST OF WORLD
1-917-300-0470EAST COAST U.S
1-800-526-8630U.S. (TOLL FREE)

Foundations of Data Intensive Applications. Large Scale Data Analytics under the Hood. Edition No. 1

  • Book

  • 416 Pages
  • November 2021
  • John Wiley and Sons Ltd
  • ID: 5841281
PEEK “UNDER THE HOOD” OF BIG DATA ANALYTICS

The world of big data analytics grows ever more complex. And while many people can work superficially with specific frameworks, far fewer understand the fundamental principles of large-scale, distributed data processing systems and how they operate. In Foundations of Data Intensive Applications: Large Scale Data Analytics under the Hood, renowned big-data experts and computer scientists Drs. Supun Kamburugamuve and Saliya Ekanayake deliver a practical guide to applying the principles of big data to software development for optimal performance.

The authors discuss foundational components of large-scale data systems and walk readers through the major software design decisions that define performance, application type, and usability. You???ll learn how to recognize problems in your applications resulting in performance and distributed operation issues, diagnose them, and effectively eliminate them by relying on the bedrock big data principles explained within.

Moving beyond individual frameworks and APIs for data processing, this book unlocks the theoretical ideas that operate under the hood of every big data processing system.

Ideal for data scientists, data architects, dev-ops engineers, and developers, Foundations of Data Intensive Applications: Large Scale Data Analytics under the Hood shows readers how to: - Identify the foundations of large-scale, distributed data processing systems - Make major software design decisions that optimize performance - Diagnose performance problems and distributed operation issues - Understand state-of-the-art research in big data - Explain and use the major big data frameworks and understand what underpins them - Use big data analytics in the real world to solve practical problems

Table of Contents

Introduction xxvii

Chapter 1 Data Intensive Applications 1

Anatomy of a Data-Intensive Application 1

A Histogram Example 2

Program 2

Process Management 3

Communication 4

Execution 5

Data Structures 6

Putting It Together 6

Application 6

Resource Management 6

Messaging 7

Data Structures 7

Tasks and Execution 8

Fault Tolerance 8

Remote Execution 8

Parallel Applications 9

Serial Applications 9

Lloyd’s K-Means Algorithm 9

Parallelizing Algorithms 11

Decomposition 11

Task Assignment 12

Orchestration 12

Mapping 13

K-Means

Algorithm 13

Parallel and Distributed Computing 15

Memory Abstractions 16

Shared Memory 16

Distributed Memory 18

Hybrid (Shared + Distributed) Memory 20

Partitioned Global Address Space Memory 21

Application Classes and Frameworks 22

Parallel Interaction Patterns 22

Pleasingly Parallel 23

Dataflow 23

Iterative 23

Irregular 23

Data Abstractions 24

Data-Intensive

Frameworks 24

Components 24

Workflows 25

An Example 25

What Makes It Difficult? 26

Developing Applications 27

Concurrency 27

Data Partitioning 28

Debugging 28

Diverse Environments 28

Computer Networks 29

Synchronization 29

Thread Synchronization 29

Data Synchronization 30

Ordering of Events 31

Faults 31

Consensus 31

Summary 32

References 32

Chapter 2 Data and Storage 35

Storage Systems 35

Storage for Distributed Systems 36

Direct-Attached Storage 37

Storage Area Network 37

Network-Attached Storage 38

DAS or SAN or NAS? 38

Storage Abstractions 39

Block Storage 39

File Systems 40

Object Storage 41

Data Formats 41

XML 42

JSON 43

CSV 44

Apache Parquet 45

Apache Avro 47

Avro Data Definitions (Schema) 48

Code Generation 49

Without Code Generation 49

Avro File 49

Schema Evolution 49

Protocol Buffers, Flat Buffers, and Thrift 50

Data Replication 51

Synchronous and Asynchronous Replication 52

Single-Leader and Multileader Replication 52

Data Locality 53

Disadvantages of Replication 54

Data Partitioning 54

Vertical Partitioning 55

Horizontal Partitioning (Sharding) 55

Hybrid Partitioning 56

Considerations for Partitioning 57

NoSQL Databases 58

Data Models 58

Key-Value Databases 58

Document Databases 59

Wide Column Databases 59

Graph Databases 59

CAP Theorem 60

Message Queuing 61

Message Processing Guarantees 63

Durability of Messages 64

Acknowledgments 64

Storage First Brokers and Transient Brokers 65

Summary 66

References 66

Chapter 3 Computing Resources 69

A Demonstration 71

Computer Clusters 72

Anatomy of a Computer Cluster 73

Data Analytics in Clusters 74

Dedicated Clusters 76

Classic Parallel Systems 76

Big Data Systems 77

Shared Clusters 79

OpenMPI on a Slurm Cluster 79

Spark on a Yarn Cluster 80

Distributed Application Life Cycle 80

Life Cycle Steps 80

Step 1: Preparation of the Job Package 81

Step 2: Resource Acquisition 81

Step 3: Distributing the Application (Job) Artifacts 81

Step 4: Bootstrapping the Distributed Environment 82

Step 5: Monitoring 82

Step 6: Termination 83

Computing Resources 83

Data Centers 83

Physical Machines 85

Network 85

Virtual Machines 87

Containers 87

Processor, Random Access Memory, and Cache 88

Cache 89

Multiple Processors in a Computer 90

Nonuniform Memory Access 90

Uniform Memory Access 91

Hard Disk 92

GPUs 92

Mapping Resources to Applications 92

Cluster Resource Managers 93

Kubernetes 94

Kubernetes Architecture 94

Kubernetes Application Concepts 96

Data-Intensive Applications on Kubernetes 96

Slurm 98

Yarn 99

Job Scheduling 99

Scheduling Policy 101

Objective Functions 101

Throughput and Latency 101

Priorities 102

Lowering Distance Among the Processes 102

Data Locality 102

Completion Deadline 102

Algorithms 103

First in First Out 103

Gang Scheduling 103

List Scheduling 103

Backfill Scheduling 104

Summary 104

References 104

Chapter 4 Data Structures 107

Virtual Memory 108

Paging and TLB 109

Cache 111

The Need for Data Structures 112

Cache and Memory Layout 112

Memory Fragmentation 114

Data Transfer 115

Data Transfer Between Frameworks 115

Cross-Language Data Transfer 115

Object and Text Data 116

Serialization 116

Vectors and Matrices 117

1D Vectors 118

Matrices 118

Row-Major and Column-Major Formats 119

N-Dimensional Arrays/Tensors 122

NumPy 123

Memory Representation 125

K-means with NumPy 126

Sparse Matrices 127

Table 128

Table Formats 129

Column Data Format 129

Row Data Format 130

Apache Arrow 130

Arrow Data Format 131

Primitive Types 131

Variable-Length Data 132

Arrow Serialization 133

Arrow Example 133

Pandas DataFrame 134

Column vs. Row Tables 136

Summary 136

References 136

Chapter 5 Programming Models 139

Introduction 139

Parallel Programming Models 140

Parallel Process Interaction 140

Problem Decomposition 140

Data Structures 140

Data Structures and Operations 141

Data Types 141

Local Operations 143

Distributed Operations 143

Array 144

Tensor 145

Indexing 145

Slicing 146

Broadcasting 146

Table 146

Graph Data 148

Message Passing Model 150

Model 151

Message Passing Frameworks 151

Message Passing Interface 151

Bulk Synchronous Parallel 153

K-Means 154

Distributed Data Model 157

Eager Model 157

Dataflow Model 158

Data Frames, Datasets, and Tables 159

Input and Output 160

Task Graphs (Dataflow Graphs) 160

Model 161

User Program to Task Graph 161

Tasks and Functions 162

Source Task 162

Compute Task 163

Implicit vs. Explicit Parallel Models 163

Remote Execution 163

Components 164

Batch Dataflow 165

Data Abstractions 165

Table Abstraction 165

Matrix/Tensors 165

Functions 166

Source 166

Compute 167

Sink 168

An Example 168

Caching State 169

Evaluation Strategy 170

Lazy Evaluation 171

Eager Evaluation 171

Iterative Computations 172

DOALL Parallel 172

DOACROSS Parallel 172

Pipeline Parallel 173

Task Graph Models for Iterative Computations 173

K-Means Algorithm 174

Streaming Dataflow 176

Data Abstractions 177

Streams 177

Distributed Operations 178

Streaming Functions 178

Sources 178

Compute 179

Sink 179

An Example 179

Windowing 180

Windowing Strategies 181

Operations on Windows 182

Handling Late Events 182

SQL 182

Queries 183

Summary 184

References 184

Chapter 6 Messaging 187

Network Services 188

TCP/IP 188

RDMA 189

Messaging for Data Analytics 189

Anatomy of a Message 190

Data Packing 190

Protocol 191

Message Types 192

Control Messages 192

External Data Sources 192

Data Transfer Messages 192

Distributed Operations 194

How Are They Used? 194

Task Graph 194

Parallel Processes 195

Anatomy of a Distributed Operation 198

Data Abstractions 198

Distributed Operation API 198

Streaming and Batch Operations 199

Streaming Operations 199

Batch Operations 199

Distributed Operations on Arrays 200

Broadcast 200

Reduce and AllReduce 201

Gather and AllGather 202

Scatter 203

AllToAll 204

Optimized Operations 204

Broadcast 205

Reduce 206

AllReduce 206

Gather and AllGather Collective Algorithms 208

Scatter and AllToAll Collective Algorithms 208

Distributed Operations on Tables 209

Shuffle 209

Partitioning Data 211

Handling Large Data 212

Fetch-Based Algorithm (Asynchronous Algorithm) 213

Distributed Synchronization Algorithm 214

GroupBy 214

Aggregate 215

Join 216

Join Algorithms 219

Distributed Joins 221

Performance of Joins 223

More Operations 223

Advanced Topics 224

Data Packing 224

Memory Considerations 224

Message Coalescing 224

Compression 225

Stragglers 225

Nonblocking vs. Blocking Operations 225

Blocking Operations 226

Nonblocking Operations 226

Summary 227

References 227

Chapter 7 Parallel Tasks 229

CPUs 229

Cache 229

False Sharing 230

Vectorization 231

Threads and Processes 234

Concurrency and Parallelism 234

Context Switches and Scheduling 234

Mutual Exclusion 235

User-Level Threads 236

Process Affinity 236

NUMA-Aware Programming 237

Accelerators 237

Task Execution 238

Scheduling 240

Static Scheduling 240

Dynamic Scheduling 240

Loosely Synchronous and Asynchronous Execution 241

Loosely Synchronous Parallel System 242

Asynchronous Parallel System (Fully Distributed) 243

Actor Model 244

Actor 244

Asynchronous Messages 244

Actor Frameworks 245

Execution Models 245

Process Model 246

Thread Model 246

Remote Execution 246

Tasks for Data Analytics 248

SPMD and MPMD Execution 248

Batch Tasks 249

Data Partitions 249

Operations 251

Task Graph Scheduling 253

Threads, CPU Cores, and Partitions 254

Data Locality 255

Execution 257

Streaming Execution 257

State 257

Immutable Data 258

State in Driver 258

Distributed State 259

Streaming Tasks 259

Streams and Data Partitioning 260

Partitions 260

Operations 261

Scheduling 262

Uniform Resources 263

Resource-Aware Scheduling 264

Execution 264

Dynamic Scaling 264

Back Pressure (Flow Control) 265

Rate-Based Flow Control 266

Credit-Based Flow Control 266

State 267

Summary 268

References 268

Chapter 8 Case Studies 271

Apache Hadoop 271

Programming Model 272

Architecture 274

Cluster Resource Management 275

Apache Spark 275

Programming Model 275

RDD API 276

SQL, DataFrames, and DataSets 277

Architecture 278

Resource Managers 278

Task Schedulers 279

Executors 279

Communication Operations 280

Apache Spark Streaming 280

Apache Storm 282

Programming Model 282

Architecture 284

Cluster Resource Managers 285

Communication Operations 286

Kafka Streams 286

Programming Model 286

Architecture 287

PyTorch 288

Programming Model 288

Execution 292

Cylon 295

Programming Model 296

Architecture 296

Execution 297

Communication Operations 298

Rapids cuDF 298

Programming Model 298

Architecture 299

Summary 300

References 300

Chapter 9 Fault Tolerance 303

Dependable Systems and Failures 303

Fault Tolerance is Not Free 304

Dependable Systems 305

Failures 306

Process Failures 306

Network Failures 307

Node Failures 307

Byzantine Faults 307

Failure Models 308

Failure Detection 308

Recovering from Faults 309

Recovery Methods 310

Stateless Programs 310

Batch Systems 311

Streaming Systems 311

Processing Guarantees 311

Role of Cluster Resource Managers 312

Checkpointing 313

State 313

Consistent Global State 313

Uncoordinated Checkpointing 314

Coordinated Checkpointing 315

Chandy-Lamport Algorithm 315

Batch Systems 316

When to Checkpoint? 317

Snapshot Data 318

Streaming Systems 319

Case Study: Apache Storm 319

Message Tracking 320

Failure Recovery 321

Case Study: Apache Flink 321

Checkpointing 322

Failure Recovery 324

Batch Systems 324

Iterative Programs 324

Case Study: Apache Spark 325

RDD Recomputing 326

Checkpointing 326

Recovery from Failures 327

Summary 327

References 327

Chapter 10 Performance and Productivity 329

Performance Metrics 329

System Performance Metrics 330

Parallel Performance Metrics 330

Speedup 330

Strong Scaling 331

Weak Scaling 332

Parallel Efficiency 332

Amdahl’s Law 333

Gustafson’s Law 334

Throughput 334

Latency 335

Benchmarks 336

LINPACK Benchmark 336

NAS Parallel Benchmark 336

BigDataBench 336

TPC Benchmarks 337

HiBench 337

Performance Factors 337

Memory 337

Execution 338

Distributed Operators 338

Disk I/O 339

Garbage Collection 339

Finding Issues 342

Serial Programs 342

Profiling 342

Scaling 343

Strong Scaling 343

Weak Scaling 344

Debugging Distributed Applications 344

Programming Languages 345

C/C++ 346

Java 346

Memory Management 347

Data Structures 348

Interfacing with Python 348

Python 350

C/C++ Code integration 350

Productivity 351

Choice of Frameworks 351

Operating Environment 353

CPUs and GPUs 353

Public Clouds 355

Future of Data-Intensive Applications 358

Summary 358

References 359

Index 361

Authors

Supun Kamburugamuve Saliya Ekanayake