Prepare for the Azure Data Engineer certification - and an exciting new career in analytics - with this must-have study aid
In the MCA Microsoft Certified Associate Azure Data Engineer Study Guide: Exam DP-203, accomplished data engineer and tech educator Benjamin Perkins delivers a hands-on, practical guide to preparing for the challenging Azure Data Engineer certification and for a new career in an exciting and growing field of tech.
In the book, you’ll explore all the objectives covered on the DP-203 exam while learning the job roles and responsibilities of a newly minted Azure data engineer. Whether you’re integrating, transforming, or consolidating data from various structured and unstructured data systems into a structure suitable for building analytics solutions, Sybex’s easy-to-use study aids and tools will get you up to speed quickly and efficiently.
This Study Guide also offers:
- Career-ready advice for anyone hoping to ace their first data engineering job interview and excel on their first day in the field
- Indispensable tips and tricks to familiarize yourself with the DP-203 exam structure and help reduce test anxiety
- Complimentary access to Sybex’s expansive online study tools, accessible across multiple devices, with hundreds of bonus practice questions, electronic flashcards, and a searchable, digital glossary of key terms
A one-of-a-kind study aid designed to help you get straight to the crucial material you need to succeed on the exam and on the job, the MCA Microsoft Certified Associate Azure Data Engineer Study Guide: Exam DP-203 belongs on the bookshelves of anyone hoping to increase their data analytics skills, advance their data engineering career with an in-demand certification, or make a career change into a popular new area of tech.
Table of Contents
Introduction xxvii
Part I Azure Data Engineer Certification and Azure Products 1
Chapter 1 Gaining the Azure Data Engineer Associate Certification 3
The Journey to Certification 7
How to Pass Exam DP-203 8
Understanding the Exam Expectations and Requirements 9
Use Azure Daily 17
Read Azure Articles to Stay Current 17
Have an Understanding of All Azure Products 20
Azure Product Name Recognition 21
Azure Data Analytics 23
Azure Synapse Analytics 23
Azure Databricks 26
Azure HDInsight 28
Azure Analysis Services 30
Azure Data Factory 31
Azure Event Hubs 33
Azure Stream Analytics 34
Other Products 35
Azure Storage Products 36
Azure Data Lake Storage 37
Azure Storage 40
Other Products 42
Azure Databases 43
Azure Cosmos DB 43
Azure SQL Server Products 46
Additional Azure Databases 46
Other Products 47
Azure Security 48
Azure Active Directory 48
Role-Based Access Control 51
Attribute-Based Access Control 53
Azure Key Vault 53
Other Products 55
Azure Networking 56
Virtual Networks 56
Other Products 59
Azure Compute 59
Azure Virtual Machines 59
Azure Virtual Machine Scale Sets 60
Azure App Service Web Apps 60
Azure Functions 60
Azure Batch 60
Azure Management and Governance 60
Azure Monitor 61
Azure Purview 61
Azure Policy 62
Azure Blueprints (Preview) 62
Azure Lighthouse 62
Azure Cost Management and Billing 62
Other Products 63
Summary 64
Exam Essentials 64
Review Questions 66
Chapter 2 CREATE DATABASE dbName; GO 69
The Brainjammer 70
A Historical Look at Data 71
Variety 73
Velocity 74
Volume 74
Data Locations 74
Data File Formats 75
Data Structures, Types, and Concepts 83
Data Structures 83
Data Types and Management 92
Data Concepts 95
Data Programming and Querying for Data Engineers 125
Data Programming 126
Querying Data 143
Understanding Big Data Processing 169
Big Data Stages 169
ETL, ELT, ELTL 174
Analytics Types 175
Big Data Layers 176
Summary 177
Exam Essentials 177
Review Questions 179
Part II Design and Implement Data Storage 181
Chapter 3 Data Sources and Ingestion 183
Where Does Data Come From? 185
Design a Data Storage Structure 189
Design an Azure Data Lake Solution 190
Recommended File Types for Storage 198
Recommended File Types for Analytical Queries 199
Design for Efficient Querying 200
Design for Data Pruning 203
Design a Folder Structure That Represents the Levels of Data Transformation 203
Design a Distribution Strategy 205
Design a Data Archiving Solution 206
Design a Partition Strategy 207
Design a Partition Strategy for Files 209
Design a Partition Strategy for Analytical Workloads 210
Design a Partition Strategy for Efficiency and Performance 211
Design a Partition Strategy for Azure Synapse Analytics 211
Identify When Partitioning Is Needed in Azure Data Lake Storage Gen2 212
Design the Serving/Data Exploration Layer 213
Design Star Schemas 214
Design Slowly Changing Dimensions 215
Design a Dimensional Hierarchy 219
Design a Solution for Temporal Data 220
Design for Incremental Loading 222
Design Analytical Stores 223
Design Metastores in Azure Synapse Analytics and Azure Databricks 224
The Ingestion of Data into a Pipeline 228
Azure Synapse Analytics 228
Azure Data Factory 268
Azure Databricks 275
Event Hubs and IoT Hub 301
Azure Stream Analytics 303
Apache Kafka for HDInsight 314
Migrating and Moving Data 316
Summary 317
Exam Essentials 317
Review Questions 319
Chapter 4 The Storage of Data 321
Implement Physical Data Storage Structures 322
Implement Compression 322
Implement Partitioning 325
Implement Sharding 328
Implement Different Table Geometries with Azure Synapse Analytics Pools 329
Implement Data Redundancy 331
Implement Distributions 341
Implement Data Archiving 342
Azure Synapse Analytics Develop Hub 346
Implement Logical Data Structures 360
Build a Temporal Data Solution 361
Build a Slowly Changing Dimension 365
Build a Logical Folder Structure 368
Build External Tables 369
Implement File and Folder Structures for Efficient Querying and Data Pruning 372
Implement a Partition Strategy 375
Implement a Partition Strategy for Files 376
Implement a Partition Strategy for Analytical Workloads 377
Implement a Partition Strategy for Streaming Workloads 378
Implement a Partition Strategy for Azure Synapse Analytics 378
Design and Implement the Data Exploration Layer 379
Deliver Data in a Relational Star Schema 379
Deliver Data in Parquet Files 385
Maintain Metadata 386
Implement a Dimensional Hierarchy 386
Create and Execute Queries by Using a Compute Solution That Leverages SQL Serverless and Spark Cluster 388
Recommend Azure Synapse Analytics Database Templates 389
Implement Azure Synapse Analytics Database Templates 389
Additional Data Storage Topics 390
Storing Raw Data in Azure Databricks for Transformation 390
Storing Data Using Azure HDInsight 392
Storing Prepared, Trained, and Modeled Data 393
Summary 394
Exam Essentials 395
Review Questions 396
Part III Develop Data Processing 399
Chapter 5 Transform, Manage, and Prepare Data 401
Ingest and Transform Data 402
Transform Data Using Azure Synapse Pipelines 404
Transform Data Using Azure Data Factory 410
Transform Data Using Apache Spark 414
Transform Data Using Transact-SQL 429
Transform Data Using Stream Analytics 431
Cleanse Data 433
Split Data 435
Shred JSON 439
Encode and Decode Data 445
Configure Error Handling for the Transformation 450
Normalize and Denormalize Values 451
Transform Data by Using Scala 461
Perform Exploratory Data Analysis 463
Transformation and Data Management Concepts 473
Transformation 473
Data Management 480
Azure Databricks 481
Data Modeling and Usage 485
Data Modeling with Machine Learning 486
Usage 494
Summary 500
Exam Essentials 500
Review Questions 502
Chapter 6 Create and Manage Batch Processing and Pipelines 505
Design and Develop a Batch Processing Solution 507
Design a Batch Processing Solution 510
Develop Batch Processing Solutions 512
Create Data Pipelines 538
Handle Duplicate Data 560
Handle Missing Data 569
Handle Late-Arriving Data 571
Upsert Data 572
Configure the Batch Size 578
Configure Batch Retention 581
Design and Develop Slowly Changing Dimensions 582
Design and Implement Incremental Data Loads 583
Integrate Jupyter/IPython Notebooks into a Data Pipeline 590
Revert Data to a Previous State 591
Handle Security and Compliance Requirements 592
Design and Create Tests for Data Pipelines 593
Scale Resources 593
Design and Configure Exception Handling 593
Debug Spark Jobs Using the Spark UI 594
Implement Azure Synapse Link and Query the Replicated Data 594
Use PolyBase to Load Data to a SQL Pool 595
Read from and Write to a Delta Table 595
Manage Batches and Pipelines 596
Trigger Batches 597
Schedule Data Pipelines 597
Validate Batch Loads 598
Implement Version Control for Pipeline Artifacts 604
Manage Data Pipelines 607
Manage Spark Jobs in a Pipeline 609
Handle Failed Batch Loads 610
Summary 610
Exam Essentials 611
Review Questions 612
Chapter 7 Design and Implement a Data Stream Processing Solution 615
Develop a Stream Processing Solution 617
Design a Stream Processing Solution 618
Create a Stream Processing Solution 630
Process Time Series Data 657
Design and Create Windowed Aggregates 658
Process Data Within One Partition 661
Process Data Across Partitions 663
Upsert Data 665
Handle Schema Drift 674
Configure Checkpoints/Watermarking During Processing 680
Replay Archived Stream Data 685
Design and Create Tests for Data Pipelines 688
Monitor for Performance and Functional Regressions 689
Optimize Pipelines for Analytical or Transactional Purposes 689
Scale Resources 690
Design and Configure Exception Handling 691
Handle Interruptions 694
Ingest and Transform Data 694
Transform Data Using Azure Stream Analytics 694
Monitor Data Storage and Data Processing 695
Monitor Stream Processing 695
Summary 695
Exam Essentials 696
Review Questions 697
Part IV Secure, Monitor, and Optimize Data Storage and Data Processing 699
Chapter 8 Keeping Data Safe and Secure 701
Design Security for Data Policies and Standards 702
Design a Data Auditing Strategy 711
Design a Data Retention Policy 716
Design for Data Privacy 717
Design to Purge Data Based on Business Requirements 719
Design Data Encryption for Data at Rest and in Transit 719
Design Row-Level and Column-Level Security 722
Design a Data Masking Strategy 723
Design Access Control for Azure Data Lake Storage Gen2 724
Implement Data Security 730
Implement a Data Auditing Strategy 731
Manage Sensitive Information 739
Implement a Data Retention Policy 745
Encrypt Data at Rest and in Motion 748
Implement Row-Level and Column-Level Security 749
Implement Data Masking 753
Manage Identities, Keys, and Secrets Across Different Data Platform Technologies 755
Implement Access Control for Azure Data Lake Storage Gen2 765
Implement Secure Endpoints (Private and Public) 772
Implement Resource Tokens in Azure Databricks 778
Load a DataFrame with Sensitive Information 779
Write Encrypted Data to Tables or Parquet Files 780
Develop a Batch Processing Solution 781
Handle Security and Compliance Requirements 782
Design and Implement the Data Exploration Layer 784
Browse and Search Metadata in Microsoft Purview Data Catalog 784
Push New or Updated Data Lineage to Microsoft Purview 785
Summary 786
Exam Essentials 787
Review Questions 789
Chapter 9 Monitoring Azure Data Storage and Processing 791
Monitoring Data Storage and Data Processing 793
Implement Logging Used by Azure Monitor 793
Configure Monitoring Services 799
Understand Custom Logging Options 821
Measure Query Performance 822
Monitor Data Pipeline Performance 823
Monitor Cluster Performance 824
Measure Performance of Data Movement 824
Interpret Azure Monitor Metrics and Logs 825
Monitor and Update Statistics about Data Across a System 828
Schedule and Monitor Pipeline Tests 830
Interpret a Spark Directed Acyclic Graph 830
Monitor Stream Processing 832
Implement a Pipeline Alert Strategy 832
Develop a Batch Processing Solution 832
Design and Create Tests for Data Pipelines 832
Develop a Stream Processing Solution 837
Monitor for Performance and Functional Regressions 837
Design and Create Tests for Data Pipelines 838
Azure Monitoring Overview 841
Azure Batch 841
Azure Key Vault 842
Azure SQL 843
Summary 844
Exam Essentials 844
Review Questions 846
Chapter 10 Troubleshoot Data Storage Processing 849
Optimize and Troubleshoot Data Storage and Data Processing 851
Optimize Resource Management 854
Compact Small Files 857
Handle Skew in Data 859
Handle Data Spill 860
Find Shuffling in a Pipeline 862
Tune Shuffle Partitions 864
Tune Queries by Using Indexers 869
Tune Queries by Using Cache 876
Optimize Pipelines for Analytical or Transactional Purposes 877
Optimize Pipeline for Descriptive versus Analytical Workloads 886
Troubleshoot a Failed Spark Job 888
Troubleshoot a Failed Pipeline Run 890
Rewrite User-Defined Functions 899
Design and Develop a Batch Processing Solution 901
Design and Configure Exception Handling 902
Debug Spark Jobs by Using the Spark UI 902
Scale Resources 902
Monitor Batches and Pipelines 904
Handle Failed Batch Loads 904
Design and Develop a Stream Processing Solution 905
Optimize Pipelines for Analytical or Transactional Purposes 905
Handle Interruptions 906
Scale Resources 908
Summary 909
Exam Essentials 910
Review Questions 912
Appendix Answers to Review Questions 915
Chapter 1: Gaining the Azure Data Engineer Associate Certification 916
Chapter 2: CREATE DATABASE dbName; GO 916
Chapter 3: Data Sources and Ingestion 917
Chapter 4: The Storage of Data 918
Chapter 5: Transform, Manage, and Prepare Data 918
Chapter 6: Create and Manage Batch Processing and Pipelines 919
Chapter 7: Design and Implement a Data Stream Processing Solution 920
Chapter 8: Keeping Data Safe and Secure 921
Chapter 9: Monitoring Azure Data Storage and Processing 921
Chapter 10: Troubleshoot Data Storage Processing 922
Index 925