The first and only book to systematically address methodologies and processes of leveraging non-traditional information sources in the context of investing and risk management
Harnessing non-traditional data sources to generate alpha, analyze markets, and forecast risk is a subject of intense interest for financial professionals. A growing number of regularly held conferences on alternative data are being established, complemented by an upsurge in new papers on the subject. Alternative data is steadily being incorporated by conventional institutional investors and risk managers throughout the financial world. Yet methodologies for analyzing and extracting value from alternative data, along with guidance on how to source data and integrate data flows within existing systems, are currently not treated in the literature. Filling this significant gap in knowledge, The Book of Alternative Data is the first and only book to offer a coherent, systematic treatment of the subject.
This groundbreaking volume provides readers with a roadmap for navigating the complexities of an array of alternative data sources, and delivers the appropriate techniques to analyze them. The authors - leading experts in financial modeling, machine learning, and quantitative research and analytics - employ a step-by-step approach to guide readers through the dense jungle of generated data. A first-of-its-kind treatment of alternative data types, sources, and methodologies, this innovative book:
- Provides an integrated modeling approach to extract value from multiple types of datasets
- Treats the processes needed to make alternative data signals operational
- Helps investors and risk managers rethink how they engage with alternative datasets
- Features practical case studies across many different financial markets, along with real-world techniques
- Describes how to avoid potential pitfalls and missteps in starting the alternative data journey
- Explains how to integrate information from different datasets to maximize informational value
The Book of Alternative Data is an indispensable resource for anyone wishing to analyze or monetize different non-traditional datasets, including Chief Investment Officers, Chief Risk Officers, risk professionals, investment professionals, traders, economists, and machine learning developers and users.
Table of Contents
Preface xv
Acknowledgments xvii
Part 1 Introduction and Theory 1
1 Alternative Data: The Lay of the Land 3
1.1 Introduction 3
1.2 What is “Alternative Data”? 5
1.3 Segmentation of Alternative Data 7
1.4 The Many Vs of Big Data 9
1.5 Why Alternative Data? 11
1.6 Who is Using Alternative Data? 15
1.7 Capacity of a Strategy and Alternative Data 16
1.8 Alternative Data Dimensions 19
1.9 Who Are the Alternative Data Vendors? 23
1.10 Usage of Alternative Datasets on the Buy Side 24
1.11 Conclusion 26
2 The Value of Alternative Data 27
2.1 Introduction 27
2.2 The Decay of Investment Value 27
2.3 Data Markets 29
2.4 The Monetary Value of Data (Part I) 31
2.4.1 Cost Value 34
2.4.2 Market Value 34
2.4.3 Economic Value 35
2.5 Evaluating (Alternative) Data Strategies with and without Backtesting 35
2.5.1 Systematic Investors 36
2.5.2 Discretionary Investors 38
2.5.3 Risk Managers 39
2.6 The Monetary Value of Data (Part II) 39
2.6.1 The Buyer’s Perspective 40
2.6.2 The Seller’s Perspective 41
2.7 The Advantages of Maturing Alternative Datasets 45
2.8 Summary 46
3 Alternative Data Risks and Challenges 47
3.1 Legal Aspects of Data 47
3.2 Risks of Using Alternative Data 50
3.3 Challenges of Using Alternative Data 51
3.3.1 Entity Matching 52
3.3.2 Missing Data 54
3.3.3 Structuring the Data 55
3.3.4 Treatment of Outliers 56
3.4 Aggregating the Data 57
3.5 Summary 58
4 Machine Learning Techniques 59
4.1 Introduction 59
4.2 Machine Learning: Definitions and Techniques 60
4.2.1 Bias, Variance, and Noise 60
4.2.2 Cross-Validation 61
4.2.3 Introducing Machine Learning 62
4.2.4 Popular Supervised Machine Learning Techniques 64
4.2.5 Clustering-Based Unsupervised Machine Learning Techniques 70
4.2.6 Other Unsupervised Machine Learning Techniques 71
4.2.7 Machine Learning Libraries 71
4.2.8 Neural Networks and Deep Learning 72
4.2.9 Gaussian Processes 80
4.3 Which Technique to Choose? 82
4.4 Assumptions and Limitations of the Machine Learning Techniques 84
4.4.1 Causality 84
4.4.2 Non-stationarity 85
4.4.3 Restricted Information Set 86
4.4.4 The Algorithm Choice 86
4.5 Structuring Images 87
4.5.1 Features and Feature Detection Algorithms 87
4.5.2 Deep Learning and CNNs for Image Classification 89
4.5.3 Augmenting Satellite Image Data with Other Datasets 90
4.5.4 Imaging Tools 91
4.6 Natural Language Processing (NLP) 91
4.6.1 What is Natural Language Processing (NLP)? 91
4.6.2 Normalization 93
4.6.3 Creating Word Embeddings: Bag-of-Words 94
4.6.4 Creating Word Embeddings: Word2vec and Beyond 94
4.6.5 Sentiment Analysis and NLP Tasks as Classification Problems 96
4.6.6 Topic Modeling 96
4.6.7 Various Challenges in NLP 97
4.6.8 Different Languages and Different Texts 98
4.6.9 Speech in NLP 99
4.6.10 NLP Tools 100
4.7 Summary 102
5 The Processes behind the Use of Alternative Data 105
5.1 Introduction 105
5.2 Steps in the Alternative Data Journey 106
5.2.1 Step 1. Set up a Vision and Strategy 106
5.2.2 Step 2. Identify the Appropriate Datasets 107
5.2.3 Step 3. Perform Due Diligence on Vendors 108
5.2.4 Step 4. Pre-assess Risks 109
5.2.5 Step 5. Pre-assess the Existence of Signals 109
5.2.6 Step 6. Data Onboarding 110
5.2.7 Step 7. Data Preprocessing 110
5.2.8 Step 8. Signal Extraction 111
5.2.9 Step 9. Implementation (or Deployment in Production) 112
5.2.10 Maintenance Process 113
5.3 Structuring Teams to Use Alternative Data 114
5.4 Data Vendors 116
5.5 Summary 118
6 Factor Investing 119
6.1 Introduction 119
6.1.1 The CAPM 119
6.2 Factor Models 120
6.2.1 The Arbitrage Pricing Theory 122
6.2.2 The Fama-French 3-Factor Model 123
6.2.3 The Carhart Model 124
6.2.4 Other Approaches (Data Mining) 125
6.3 The Difference between Cross-Sectional and Time Series Trading Approaches 126
6.4 Why Factor Investing? 126
6.5 Smart Beta Indices Using Alternative Data Inputs 127
6.6 ESG Factors 128
6.7 Direct and Indirect Prediction 129
6.8 Summary 132
Part 2 Practical Applications 133
7 Missing Data: Background 135
7.1 Introduction 135
7.2 Missing Data Classification 136
7.2.1 Missing Data Treatments 137
7.3 Literature Overview of Missing Data Treatments 139
7.3.1 Luengo et al. (2012) 139
7.3.2 Garcia-Laencina et al. (2010) 143
7.3.3 Grzymala-Busse et al. (2000) 146
7.3.4 Zou et al. (2005) 147
7.3.5 Jerez et al. (2010) 147
7.3.6 Farhangfar et al. (2008) 148
7.3.7 Kang et al. (2013) 149
7.4 Summary 149
8 Missing Data: Case Studies 151
8.1 Introduction 151
8.2 Case Study: Imputing Missing Values in Multivariate Credit Default Swap Time Series 152
8.2.1 Missing Data Classification 153
8.2.2 Imputation Metrics 154
8.2.3 CDS Data and Test Data Generation 154
8.2.4 Multiple Imputation Methods 157
8.2.5 Deterministic and EOF-Based Techniques 160
8.2.6 Results 164
8.3 Case Study: Satellite Images 173
8.4 Summary 176
8.5 Appendix: General Description of the MICE Procedure 178
8.6 Appendix: Software Libraries Used in This Chapter 179
9 Outliers (Anomalies) 181
9.1 Introduction 181
9.2 Outliers Definition, Classification, and Approaches to Detection 182
9.3 Temporal Structure 183
9.4 Global Versus Local Outliers, Point Anomalies, and Micro-Clusters 184
9.5 Outlier Detection Problem Setup 184
9.6 Comparative Evaluation of Outlier Detection Algorithms 185
9.7 Approaches to Outlier Explanation 189
9.7.1 Micenkova et al. 189
9.7.2 Duan et al. 191
9.7.3 Angiulli et al. 192
9.8 Case Study: Outlier Detection on Fed Communications Index 194
9.9 Summary 201
9.10 Appendix 202
9.10.1 Model-Based Techniques 202
9.10.2 Distance-Based Techniques 202
9.10.3 Density-Based Techniques 203
9.10.4 Heuristics-Based Approaches 203
10 Automotive Fundamental Data 205
10.1 Introduction 205
10.2 Data 206
10.3 Approach 1: Indirect Approach 211
10.3.1 The Steps Followed 212
10.3.2 Stage 1 213
10.4 Approach 2: Direct Approach 223
10.4.1 The Data 223
10.4.2 Factor Generation 224
10.4.3 Factor Performance 225
10.4.4 Detailed Factor Results 229
10.5 Gaussian Processes Example 238
10.6 Summary 239
10.7 Appendix 240
10.7.1 List of Companies 240
10.7.2 Description of Financial Statement Items 241
10.7.3 Ratios Used 242
10.7.4 IHS Markit Data Features 243
10.7.5 Reporting Delays by Country 244
11 Surveys and Crowdsourced Data 245
11.1 Introduction 245
11.2 Survey Data as Alternative Data 245
11.3 The Data 247
11.4 The Product 247
11.5 Case Studies 249
11.5.1 Case Study: Company Event Study (Pooled Survey) 249
11.5.2 Case Study: Oil and Gas Production (Q&A Survey) 252
11.6 Some Technical Considerations on Surveys 254
11.7 Crowdsourcing Analyst Estimates Survey 255
11.8 Alpha Capture Data 256
11.9 Summary 256
11.10 Appendix 256
12 Purchasing Managers’ Index 259
12.1 Introduction 259
12.2 PMI Performance 261
12.3 Nowcasting GDP Growth 262
12.4 Impacts on Financial Markets 263
12.5 Summary 266
13 Satellite Imagery and Aerial Photography 267
13.1 Introduction 267
13.2 Forecasting US Export Growth 269
13.3 Car Counts and Earnings Per Share for Retailers 271
13.4 Measuring Chinese PMI Manufacturing with Satellite Data 277
13.5 Summary 280
14 Location Data 283
14.1 Introduction 283
14.2 Shipping Data to Track Crude Oil Supplies 283
14.3 Mobile Phone Location Data to Understand Retail Activity 287
14.3.1 Trading REIT ETF Using Mobile Phone Location Data 288
14.3.2 Estimating Earnings per Share with Mobile Phone Location Data 291
14.4 Taxi Ride Data and New York Fed Meetings 295
14.5 Corporate Jet Location Data and M&A 296
14.6 Summary 298
15 Text, Web, Social Media, and News 299
15.1 Introduction 299
15.2 Collecting Web Data 299
15.3 Social Media 300
15.3.1 Hedonometer Index 302
15.3.2 Using Twitter Data to Help Forecast US Change in Nonfarm Payrolls 305
15.3.3 Twitter Data to Forecast Stock Market Reaction to FOMC 308
15.3.4 Liquidity and Sentiment from Social Media 309
15.4 News 309
15.4.1 Machine-Readable News to Trade FX and Understand FX Volatility 310
15.4.2 Federal Reserve Communications and US Treasury Yields 316
15.5 Other Web Sources 320
15.5.1 Measuring Consumer Price Inflation 321
15.6 Summary 322
16 Investor Attention 323
16.1 Introduction 323
16.2 Readership of Payrolls to Measure Investor Attention 323
16.3 Google Trends Data to Measure Market Themes 325
16.4 Investopedia Search Data to Measure Investor Anxiety 328
16.5 Using Wikipedia to Understand Price Action in Cryptocurrencies 330
16.6 Online Attention for Countries to Inform EMFX Trading 330
16.7 Summary 333
17 Consumer Transactions 335
17.1 Introduction 335
17.2 Credit and Debit Card Transaction Data 336
17.3 Consumer Receipts 337
17.4 Summary 340
18 Government, Industrial, and Corporate Data 341
18.1 Introduction 341
18.2 Using Innovation Measures to Trade Equities 342
18.3 Quantifying Currency Crisis Risk 344
18.4 Modeling Central Bank Intervention in Currency Markets 346
18.5 Summary 348
19 Market Data 351
19.1 Introduction 351
19.2 Relationship between Institutional FX Flow Data and FX Spot 351
19.3 Understanding Liquidity Using High-Frequency FX Data 355
19.4 Summary 357
20 Alternative Data in Private Markets 359
20.1 Introduction 359
20.2 Defining Private Equity and Venture Capital Firms 360
20.3 Private Equity Datasets 362
20.4 Understanding the Performance of Private Firms 363
20.5 Summary 364
Conclusions 365
Some Last Words 365
References 367
About the Authors 373
Index 375