Large Language Model-Based Solutions: How to Deliver Value with Cost-Effective Generative AI Applications. Edition No. 1

  • Book

  • 224 Pages
  • April 2024
  • John Wiley and Sons Ltd
  • ID: 5912482
Learn to build cost-effective apps using Large Language Models

In Large Language Model-Based Solutions: How to Deliver Value with Cost-Effective Generative AI Applications, Shreyas Subramanian, Principal Data Scientist at Amazon Web Services, delivers a practical guide for developers and data scientists who wish to build and deploy cost-effective large language model (LLM)-based solutions. In the book, you'll find coverage of a wide range of key topics, including model selection, pre- and post-processing of data, prompt engineering, and instruction fine-tuning.

The author sheds light on techniques for optimizing inference, like model quantization and pruning, as well as different and affordable architectures for typical generative AI (GenAI) applications, including search systems, agent assists, and autonomous agents. You'll also find:

  • Effective strategies to address the challenge of the high computational cost associated with LLMs
  • Assistance with the complexities of building and deploying affordable generative AI apps, including tuning and inference techniques
  • Selection criteria for choosing a model, with particular consideration given to compact, nimble, and domain-specific models
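To give a flavor of the inference-optimization techniques covered, here is a minimal sketch of loading a compact model with 8-bit weight quantization, assuming the Hugging Face transformers, accelerate, and bitsandbytes libraries are installed; the model choice is illustrative, not prescribed by the book:

    # Minimal sketch: load a causal LM with 8-bit quantized weights to reduce
    # memory footprint and inference cost. Assumes transformers, accelerate,
    # and bitsandbytes are installed; the model name is an illustrative example.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "mistralai/Mistral-7B-v0.1"  # illustrative compact model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        load_in_8bit=True,   # quantize weights to 8-bit at load time
        device_map="auto",   # place layers across available devices
    )

    prompt = "Summarize the benefits of model quantization in one sentence."
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))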

Perfect for developers and data scientists interested in deploying foundational models, or business leaders planning to scale out their use of GenAI, Large Language Model-Based Solutions will also benefit project leaders and managers, technical support staff, and administrators with an interest or stake in the subject.

Table of Contents

Introduction xix

Chapter 1: Introduction 1

Overview of GenAI Applications and Large Language Models 1

The Rise of Large Language Models 1

Neural Networks, Transformers, and Beyond 2

GenAI vs. LLMs: What’s the Difference? 5

The Three-Layer GenAI Application Stack 6

The Infrastructure Layer 6

The Model Layer 7

The Application Layer 8

Paths to Productionizing GenAI Applications 9

Sample LLM-Powered Chat Application 11

The Importance of Cost Optimization 12

Cost Assessment of the Model Inference Component 12

Cost Assessment of the Vector Database Component 19

Benchmarking Setup and Results 20

Other Factors to Consider 23

Cost Assessment of the Large Language Model Component 24

Summary 27

Chapter 2: Tuning Techniques for Cost Optimization 29

Fine-Tuning and Customizability 29

Basic Scaling Laws You Should Know 30

Parameter-Efficient Fine-Tuning Methods 32

Adapters Under the Hood 33

Prompt Tuning 34

Prefix Tuning 36

P-tuning 39

IA3 40

Low-Rank Adaptation 44

Cost and Performance Implications of PEFT Methods 46

Summary 48

Chapter 3: Inference Techniques for Cost Optimization 49

Introduction to Inference Techniques 49

Prompt Engineering 50

Impact of Prompt Engineering on Cost 50

Estimating Costs for Other Models 52

Clear and Direct Prompts 53

Adding Qualifying Words for Brief Responses 53

Breaking Down the Request 54

Example of Using Claude for PII Removal 55

Conclusion 59

Providing Context 59

Examples of Providing Context 60

RAG and Long Context Models 60

Recent Work Comparing RAG with Long Context Models 61

Conclusion 62

Context and Model Limitations 62

Indicating a Desired Format 63

Example of Formatted Extraction with Claude 63

Trade-Off Between Verbosity and Clarity 66

Caching with Vector Stores 66

What Is a Vector Store? 66

How to Implement Caching Using Vector Stores 66

Conclusion 69

Chains for Long Documents 69

What Is Chaining? 69

Implementing Chains 69

Example Use Case 70

Common Components 70

Tools That Implement Chains 72

Comparing Results 76

Conclusion 76

Summarization 77

Summarization in the Context of Cost and Performance 77

Efficiency in Data Processing 77

Cost-Effective Storage 77

Enhanced Downstream Applications 77

Improved Cache Utilization 77

Summarization as a Preprocessing Step 77

Enhanced User Experience 77

Conclusion 77

Batch Prompting for Efficient Inference 78

Batch Inference 78

Experimental Results 80

Using the accelerate Library 81

Using the DeepSpeed Library 81

Batch Prompting 82

Example of Using Batch Prompting 83

Model Optimization Methods 83

Quantization 83

Code Example 84

Recent Advancements: GPTQ 85

Parameter-Efficient Fine-Tuning Methods 85

Recap of PEFT Methods 85

Code Example 86

Cost and Performance Implications 87

Summary 88

References 88

Chapter 4: Model Selection and Alternatives 89

Introduction to Model Selection 89

Motivating Example: The Tale of Two Models 89

The Role of Compact and Nimble Models 90

Examples of Successful Smaller Models 91

Quantization for Powerful but Smaller Models 91

Text Generation with Mistral 7B 93

Zephyr 7B and Aligned Smaller Models 94

CogVLM for Language-Vision Multimodality 95

Prometheus for Fine-Grained Text Evaluation 96

Orca 2 and Teaching Smaller Models to Reason 98

Breaking Traditional Scaling Laws with Gemini and Phi 99

Phi 1, 1.5, and 2 B Models 100

Gemini Models 102

Domain-Specific Models 104

Step 1 - Training Your Own Tokenizer 105

Step 2 - Training Your Own Domain-Specific Model 107

More References for Fine-Tuning 114

Evaluating Domain-Specific Models vs. Generic Models 115

The Power of Prompting with General-Purpose Models 120

Summary 122

Chapter 5: Infrastructure and Deployment Tuning Strategies 123

Introduction to Tuning Strategies 123

Hardware Utilization and Batch Tuning 124

Memory Occupancy 126

Strategies to Fit Larger Models in Memory 128

KV Caching 130

PagedAttention 131

How Does PagedAttention Work? 131

Comparisons, Limitations, and Cost Considerations 131

AlphaServe 133

How Does AlphaServe Work? 133

Impact of Batching 134

Cost and Performance Considerations 134

S3: Scheduling Sequences with Speculation 134

How Does S3 Work? 135

Performance and Cost 135

Streaming LLMs with Attention Sinks 136

Fixed to Sliding Window Attention 137

Extending the Context Length 137

Working with Infinite Length Context 137

How Does StreamingLLM Work? 138

Performance and Results 139

Cost Considerations 139

Batch Size Tuning 140

Frameworks for Deployment Configuration Testing 141

Cloud-Native Inference Frameworks 142

Deep Dive into Serving Stack Choices 142

Batching Options 143

Options in DJL Serving 144

High-Level Guidance for Selecting Serving Parameters 146

Automatically Finding Good Inference Configurations 146

Creating a Generic Template 148

Defining an HPO Space 149

Searching the Space for Optimal Configurations 151

Results of Inference HPO 153

Inference Acceleration Tools 155

TensorRT and GPU Acceleration Tools 156

CPU Acceleration Tools 156

Monitoring and Observability 157

LLMOps and Monitoring 157

Why Is Monitoring Important for LLMs? 159

Monitoring and Updating Guardrails 160

Summary 161

Conclusion 163

Index 181

Authors

Shreyas Subramanian, AWS (Amazon Web Services, Inc.)