In Data Analytics in the AWS Cloud: Building a Data Platform for BI and Predictive Analytics on AWS, accomplished software engineer and data architect Joe Minichino delivers an expert blueprint for storing, processing, and analyzing data on the Amazon Web Services cloud platform. In the book, you’ll explore every relevant aspect of data analytics, from data engineering to analysis, business intelligence, DevOps, and MLOps, as you discover how to integrate machine learning predictions with analytics engines and visualization tools.
You’ll also find:
- Real-world use cases of AWS architectures that demystify the applications of data analytics
- Accessible introductions to data acquisition, importation, storage, visualization, and reporting
- Expert insights into serverless data engineering and how to use it to reduce overhead and costs, improve stability, and simplify maintenance
A can't-miss resource for data architects, analysts, engineers, and technical professionals, Data Analytics in the AWS Cloud will also earn a place on the bookshelves of business leaders seeking a better understanding of data analytics on the AWS cloud platform.
Table of Contents
Introduction
Chapter 1 AWS Data Lakes and Analytics Technology Overview
Why AWS?
What Does a Data Lake Look Like in AWS?
Analytics on AWS
Skills Required to Build and Maintain an AWS Analytics Pipeline
Chapter 2 The Path to Analytics: Setting Up a Data and Analytics Team
The Data Vision
Support
DA Team Roles
Early Stage Roles
Team Lead
Data Architect
Data Engineer
Data Analyst
Maturity Stage Roles
Data Scientist
Cloud Engineer
Business Intelligence (BI) Developer
Machine Learning Engineer
Business Analyst
Niche Roles
Analytics Flow at a Process Level
Workflow Methodology
The DA Team Mantra: “Automate Everything”
Analytics Models in the Wild: Centralized, Distributed, Center of Excellence
Centralized
Distributed
Center of Excellence
Summary
Chapter 3 Working on AWS
Accessing AWS
Everything Is a Resource
S3: An Important Exception
IAM: Policies, Roles, and Users
Policies
Identity-Based Policies
Resource-Based Policies
Roles
Users and User Groups
Summarizing IAM
Working with the Web Console
The AWS Command-Line Interface
Installing AWS cli
Linux Installation
macOS Installation
Windows
Configuring AWS cli
A Note on Region
Setting Individual Parameters
Using Profiles and Configuration Files
Final Notes on Configuration
Using the AWS cli
Using Skeletons and File Inputs
Cleaning Up!
Infrastructure-as-Code: CloudFormation and Terraform
CloudFormation
CloudFormation Stacks
CloudFormation Template Anatomy
CloudFormation Changesets
Getting Stack Information
Cleaning Up Again
CloudFormation Conclusions
Terraform
Coding Style
Modularity
Limitations
Terraform vs. CloudFormation
Infrastructure-as-Code: CDK, Pulumi, Cloudcraft, and Other Solutions
AWS CDK
Pulumi
Cloudcraft
Infrastructure Management Conclusions
Chapter 4 Serverless Computing and Data Engineering
Serverless vs. Fully Managed
AWS Serverless Technologies
AWS Lambda
Pricing Model
Laser Focus on Code
The Lambda Paradigm Shift
Virtually Infinite Scalability
Geographical Distribution
A Lambda Hello World
Lambda Configuration
Runtime
Container-Based Lambdas
Architectures
Memory
Networking
Execution Role
Environment Variables
AWS EventBridge
AWS Fargate
AWS DynamoDB
AWS SNS
Amazon SQS
AWS CloudWatch
Amazon QuickSight
AWS Step Functions
Amazon API Gateway
Amazon Cognito
AWS Serverless Application Model (SAM)
Ephemeral Infrastructure
AWS SAM Installation
Configuration
Creating Your First AWS SAM Project
Application Structure
SAM Resource Types
SAM Lambda Template
!! Recursive Lambda Invocation !!
Function Metadata
Outputs
Implicitly Generated Resources
Other Template Sections
Lambda Code
Building Your First SAM Application
Testing the AWS SAM Application Locally
Deployment
Cleaning Up
Summary
Chapter 5 Data Ingestion
AWS Data Lake Architecture
Serverless Data Lake Architecture Structure
Ingestion
Storage and Processing
Cataloging, Governance, and Search
Security and Monitoring
Consumption
Sample Processing Architecture: Cataloging Images into DynamoDB
Use Case Description
SAM Application Creation
S3-Triggered Lambda
Adding DynamoDB
Lambda Execution Context
Inserting into DynamoDB
Cleaning Up
Serverless Ingestion
AWS Fargate
AWS Lambda
Example Architecture: Fargate-Based Periodic Batch Import
The Basic Importer
ECS CLI
AWS Copilot cli
Clean Up
AWS Kinesis Ingestion
Example Architecture: Two-Pronged Delivery
Fully Managed Ingestion with AppFlow
Operational Data Ingestion with Database Migration Service
DMS Concepts
DMS Instance
DMS Endpoints
DMS Tasks
Summary of the Workflow
Common Use of DMS
Example Architecture: DMS to S3
DMS Instance
DMS Endpoints
DMS Task
Summary
Chapter 6 Processing Data
Phases of Data Preparation
What Is ETL? Why Should I Care?
ETL Job vs. Streaming Job
Overview of ETL in AWS
ETL with AWS Glue
ETL with Lambda Functions
ETL with Hadoop/EMR
Other Ways to Perform ETL
ETL Job Design Concepts
Source Identification
Destination Identification
Mappings
Validation
Filter
Join, Denormalization, Relationalization
AWS Glue for ETL
Really, It’s Just Spark
Visual
Spark Script Editor
Python Shell Script Editor
Jupyter Notebook
Connectors
Creating Connections
Creating Connections with the Web Console
Creating Connections with the AWS cli
Creating ETL Jobs with AWS Glue Visual Editor
ETL Example: Format Switch from Raw (JSON) to Cleaned (Parquet)
Job Bookmarks
Transformations
Apply Mapping
Filter
Other Available Transforms
Run the Edited Job
Visual Editor with Source and Target Conclusions
Creating ETL Jobs with AWS Glue Visual Editor (without Source and Target)
Creating ETL Jobs with the Spark Script Editor
Developing ETL Jobs with AWS Glue Notebooks
What Is a Notebook?
Notebook Structure
Step 1: Load Code into a DynamicFrame
Step 2: Apply Field Mapping
Step 3: Apply the Filter
Step 4: Write to S3 in Parquet Format
Example: Joining and Denormalizing Data from Two S3 Locations
Conclusions for Manually Authored Jobs with Notebooks
Creating ETL Jobs with AWS Glue Interactive Sessions
It’s Magic
Development Workflow
Streaming Jobs
Differences with a Standard ETL Job
Streaming Sources
Example: Process Kinesis Streams with a Streaming Job
Streaming ETL Jobs Conclusions
Summary
Chapter 7 Cataloging, Governance, and Search
Cataloging with AWS Glue
AWS Glue and the AWS Glue Data Catalog
Glue Databases and Tables
Databases
The Idea of Schema-on-Read
Tables
Create Table Manually
Creating a Table from an Existing Schema
Creating a Table with a Crawler
Summary on Databases and Tables
Crawlers
Updating or Not Updating?
Running the Crawler
Creating a Crawler from the AWS CLI
Retrieving Table Information from the CLI
Classifiers
Classifier Example
Crawlers and Classifiers Summary
Search with Amazon Athena: The Heart of Analytics in AWS
A Bit of History
Interface Overview
Creating Tables Manually
Athena Data Types
Complex Types
Running a Query
Connecting with JDBC and ODBC
Query Stats
Recent Queries and Saved Queries
The Power of Partitions
Athena Pricing Model
Automatic Naming
Athena Query Output
Athena Peculiarities (SQL and Not)
Computed Fields Gotcha and WITH Statement Workaround
Lowercase!
Query Explain
Deduplicating Records
Working with JSON, Flattening, and Unnesting
Athena Views
Create Table as Select (CTAS)
Saving Queries and Reusing Saved Queries
Running Parameterized Queries
Athena Federated Queries
Athena Lambda Connectors
Note on Connection Errors
Performing Federated Queries
Creating a View from a Federated Query
Governing: Athena Workgroups, Lake Formation, and More
Athena Workgroups
Fine-Grained Athena Access with IAM
Recap of Athena-Based Governance
AWS Lake Formation
Registering a Location in Lake Formation
Creating a Database in Lake Formation
Assigning Permissions in Lake Formation
LF-Tags and Permissions in Lake Formation
Data Filters
Governance Conclusions
Summary
Chapter 8 Data Consumption: BI, Visualization, and Reporting
QuickSight
Signing Up for QuickSight
Standard Plan
Enterprise Plan
Users and User Groups
Managing Users and Groups
Managing QuickSight
Users and Groups
Your Subscriptions
SPICE Capacity
Account Settings
Security and Permissions
VPC Connections
Mobile Settings
Domains and Embedding
Single Sign-On
Data Sources and Datasets
Creating an Athena Data Source
Creating Other Data Sources
Creating a Data Source from the AWS cli
Creating a Dataset from a Table
Creating a Dataset from a SQL Query
Duplicating Datasets
Note on Creating Datasets
QuickSight Favorites, Recent, and Folders
SPICE
Manage SPICE Capacity
Refresh Schedule
QuickSight Data Editor
QuickSight Data Types
Change Data Types
Calculated Fields
Joining Data
Excluding Fields
Filtering Data
Removing Data
Geospatial Hierarchies and Adding Fields to Hierarchies
Unsupported Format Dates
Visualizing Data: QuickSight Analysis
Adding a Title and a Description to Your Analysis
Renaming the Sheet
Your First Visual with AutoGraph
Field Wells
Visuals Types
Saving and Autosaving
A First Example: Pie Chart
Renaming a Visual
Filtering Data
Adding Drill-Downs
Parameters
Actions
Insights
ML-Powered Insights
Sharing an Analysis
Dashboards
Dashboard Layouts and Themes
Publishing a Dashboard
Embedding Visuals and Dashboards
Data Consumption: Not Only Dashboards
Summary
Chapter 9 Machine Learning at Scale
Machine Learning and Artificial Intelligence
What Are ML/AI Use Cases?
Types of ML Models
Overview of ML/AI AWS Solutions
Amazon SageMaker
SageMaker Domains
Adding a User to the Domain
SageMaker Studio
SageMaker Example Notebook
Step 1: Prerequisites and Preprocessing
Step 2: Data Ingestion
Step 3: Data Inspection
Step 4: Data Conversion
Step 5: Upload Training Data
Step 6: Train the Model
Step 7: Set Up Hosting and Deploy the Model
Step 8: Validate the Model
Step 9: Use the Model
Inference
Real Time
Asynchronous
Serverless
Batch Transform
Data Wrangler
SageMaker Canvas
Summary
Appendix Example Data Architectures in AWS
Modern Data Lake Architecture
ETL in a Lake House
Consuming Data in the Lake House
The Modern Data Lake Architecture
Batch Processing
Stream Processing
Architecture Design Recommendations
Automate Everything
Build on Events
Performance = Cost Savings
AWS Glue Catalog and Athena-Centric Workflow
Design Flexible
Pick Your Battles
Parquet
Summary
Index