In Large Language Model-Based Solutions: How to Deliver Value with Cost-Effective Generative AI Applications, Shreyas Subramanian, Principal Data Scientist at Amazon Web Services, delivers a practical guide for developers and data scientists who wish to build and deploy cost-effective large language model (LLM)-based solutions. In the book, you'll find coverage of a wide range of key topics, including model selection, pre- and post-processing of data, prompt engineering, and instruction fine-tuning.
The author sheds light on techniques for optimizing inference, such as model quantization and pruning, as well as affordable architectures for typical generative AI (GenAI) applications, including search systems, agent assists, and autonomous agents. You'll also find:
- Effective strategies to address the challenge of the high computational cost associated with LLMs
- Assistance with the complexities of building and deploying affordable generative AI apps, including tuning and inference techniques
- Selection criteria for choosing a model, with particular consideration given to compact, nimble, and domain-specific models
Perfect for developers and data scientists interested in deploying foundation models, as well as business leaders planning to scale out their use of GenAI, Large Language Model-Based Solutions will also benefit project leaders and managers, technical support staff, and administrators with an interest or stake in the subject.
Table of Contents
Introduction xix
Chapter 1: Introduction 1
Overview of GenAI Applications and Large Language Models 1
The Rise of Large Language Models 1
Neural Networks, Transformers, and Beyond 2
GenAI vs. LLMs: What’s the Difference? 5
The Three-Layer GenAI Application Stack 6
The Infrastructure Layer 6
The Model Layer 7
The Application Layer 8
Paths to Productionizing GenAI Applications 9
Sample LLM-Powered Chat Application 11
The Importance of Cost Optimization 12
Cost Assessment of the Model Inference Component 12
Cost Assessment of the Vector Database Component 19
Benchmarking Setup and Results 20
Other Factors to Consider 23
Cost Assessment of the Large Language Model Component 24
Summary 27
Chapter 2: Tuning Techniques for Cost Optimization 29
Fine-Tuning and Customizability 29
Basic Scaling Laws You Should Know 30
Parameter-Efficient Fine-Tuning Methods 32
Adapters Under the Hood 33
Prompt Tuning 34
Prefix Tuning 36
P-tuning 39
IA3 40
Low-Rank Adaptation 44
Cost and Performance Implications of PEFT Methods 46
Summary 48
Chapter 3: Inference Techniques for Cost Optimization 49
Introduction to Inference Techniques 49
Prompt Engineering 50
Impact of Prompt Engineering on Cost 50
Estimating Costs for Other Models 52
Clear and Direct Prompts 53
Adding Qualifying Words for Brief Responses 53
Breaking Down the Request 54
Example of Using Claude for PII Removal 55
Conclusion 59
Providing Context 59
Examples of Providing Context 60
RAG and Long Context Models 60
Recent Work Comparing RAG with Long Context Models 61
Conclusion 62
Context and Model Limitations 62
Indicating a Desired Format 63
Example of Formatted Extraction with Claude 63
Trade-Off Between Verbosity and Clarity 66
Caching with Vector Stores 66
What Is a Vector Store? 66
How to Implement Caching Using Vector Stores 66
Conclusion 69
Chains for Long Documents 69
What Is Chaining? 69
Implementing Chains 69
Example Use Case 70
Common Components 70
Tools That Implement Chains 72
Comparing Results 76
Conclusion 76
Summarization 77
Summarization in the Context of Cost and Performance 77
Efficiency in Data Processing 77
Cost-Effective Storage 77
Enhanced Downstream Applications 77
Improved Cache Utilization 77
Summarization as a Preprocessing Step 77
Enhanced User Experience 77
Conclusion 77
Batch Prompting for Efficient Inference 78
Batch Inference 78
Experimental Results 80
Using the accelerate Library 81
Using the DeepSpeed Library 81
Batch Prompting 82
Example of Using Batch Prompting 83
Model Optimization Methods 83
Quantization 83
Code Example 84
Recent Advancements: GPTQ 85
Parameter-Efficient Fine-Tuning Methods 85
Recap of PEFT Methods 85
Code Example 86
Cost and Performance Implications 87
Summary 88
References 88
Chapter 4: Model Selection and Alternatives 89
Introduction to Model Selection 89
Motivating Example: The Tale of Two Models 89
The Role of Compact and Nimble Models 90
Examples of Successful Smaller Models 91
Quantization for Powerful but Smaller Models 91
Text Generation with Mistral 7B 93
Zephyr 7B and Aligned Smaller Models 94
CogVLM for Language-Vision Multimodality 95
Prometheus for Fine-Grained Text Evaluation 96
Orca 2 and Teaching Smaller Models to Reason 98
Breaking Traditional Scaling Laws with Gemini and Phi 99
Phi 1, 1.5, and 2 Models 100
Gemini Models 102
Domain-Specific Models 104
Step 1 - Training Your Own Tokenizer 105
Step 2 - Training Your Own Domain-Specific Model 107
More References for Fine-Tuning 114
Evaluating Domain-Specific Models vs. Generic Models 115
The Power of Prompting with General-Purpose Models 120
Summary 122
Chapter 5: Infrastructure and Deployment Tuning Strategies 123
Introduction to Tuning Strategies 123
Hardware Utilization and Batch Tuning 124
Memory Occupancy 126
Strategies to Fit Larger Models in Memory 128
KV Caching 130
PagedAttention 131
How Does PagedAttention Work? 131
Comparisons, Limitations, and Cost Considerations 131
AlpaServe 133
How Does AlpaServe Work? 133
Impact of Batching 134
Cost and Performance Considerations 134
S3: Scheduling Sequences with Speculation 134
How Does S3 Work? 135
Performance and Cost 135
Streaming LLMs with Attention Sinks 136
Fixed to Sliding Window Attention 137
Extending the Context Length 137
Working with Infinite Length Context 137
How Does StreamingLLM Work? 138
Performance and Results 139
Cost Considerations 139
Batch Size Tuning 140
Frameworks for Deployment Configuration Testing 141
Cloud-Native Inference Frameworks 142
Deep Dive into Serving Stack Choices 142
Batching Options 143
Options in DJL Serving 144
High-Level Guidance for Selecting Serving Parameters 146
Automatically Finding Good Inference Configurations 146
Creating a Generic Template 148
Defining an HPO Space 149
Searching the Space for Optimal Configurations 151
Results of Inference HPO 153
Inference Acceleration Tools 155
TensorRT and GPU Acceleration Tools 156
CPU Acceleration Tools 156
Monitoring and Observability 157
LLMOps and Monitoring 157
Why Is Monitoring Important for LLMs? 159
Monitoring and Updating Guardrails 160
Summary 161
Conclusion 163
Index 181