Upgrade your programming language to handle high-frequency data more effectively
Machine Learning and Big Data with KDB+/Q offers quants, programmers and algorithmic traders a practical entry into the powerful but non-intuitive kdb+ database and q programming language. Ideally suited to the speed and volume of high-frequency financial data at sell- and buy-side institutions, these tools have become the de facto standard in this domain; this book provides the foundational knowledge practitioners need to work effectively with this rapidly evolving approach to analytical trading.
The discussion follows the natural progression of real-world strategy development, allowing hands-on learning in a familiar sphere and illustrating the contrast in efficiency and capability between the q language and other programming approaches. Rather than an all-encompassing “bible”-type reference, this book focuses on real-world practicality to help you get up to speed and become productive with the language quickly.
- Understand why kdb+/q is the ideal solution for high-frequency data
- Delve into the “meat” of q programming to solve practical economic problems
- Perform everyday operations including basic regressions, cointegration, volatility estimation, modelling and more (see the short q sketch after this list)
- Learn advanced techniques, from market impact and microstructure analyses to machine learning methods including neural networks
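For a taste of that vector style, here is a minimal q sketch using a hypothetical table and synthetic data for illustration only (it does not draw on the book's .quantQ. library): it computes a per-symbol VWAP over a million simulated trades and then an ordinary-least-squares fit, each in a single line.

q)n:1000000                                            / a million synthetic trades
q)trade:([] time:asc "t"$n?86400000; sym:n?`AAPL`MSFT`IBM; price:n?100f; size:1+n?1000)
q)select vwap:size wavg price by sym from trade        / per-symbol VWAP in one line of qSQL
q)x:n?1f; y:3+2*x                                      / exact linear relationship, no noise
q)first enlist[y] lsq (n#1f;x)                         / OLS via the built-in lsq: recovers 3 2f

Both queries lean on built-in vector primitives (wavg, lsq) rather than explicit loops, which is precisely the contrast in efficiency and capability the book develops.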
The kdb+ database and its underlying programming language q offer unprecedented speed and capability. As trading algorithms and financial models grow ever more complex, alongside the markets they seek to predict, they encompass an ever-larger swath of data: more variables, more metrics, more responsiveness and altogether more “moving parts.”
Traditional programming languages increasingly fail to accommodate the growing speed and volume of data, and lack the flexibility that cutting-edge financial modelling demands. Machine Learning and Big Data with KDB+/Q opens up the technology and flattens the learning curve, helping you quickly adopt a more effective set of tools.
Table of Contents
Preface xvii
About the Authors xxiii
Part One Language Fundamentals
Chapter 1 Fundamentals of the q Programming Language 3
1.1 The (Not So Very) First Steps in q 3
1.2 Atoms and Lists 5
1.2.1 Casting Types 11
1.3 Basic Language Constructs 14
1.3.1 Assigning, Equality and Matching 14
1.3.2 Arithmetic Operations and Right-to-Left Evaluation: Introduction to q Philosophy 17
1.4 Basic Operators 19
1.5 Difference between Strings and Symbols 31
1.5.1 Enumeration 31
1.6 Matrices and Basic Linear Algebra in q 33
1.7 Launching the Session: Additional Options 35
1.8 Summary and How-To’s 38
Chapter 2 Dictionaries and Tables: The q Fundamentals 41
2.1 Dictionary 41
2.2 Table 44
2.3 The Truth about Tables 48
2.4 Keyed Tables are Dictionaries 50
2.5 From a Vector Language to an Algebraic Language 51
Chapter 3 Functions 57
3.1 Namespace 59
3.1.0.1 .quantQ. Namespace 60
3.2 The Six Adverbs 60
3.2.1 Each 60
3.2.1.1 Each 61
3.2.1.2 Each-left \: 61
3.2.1.3 Each-right /: 62
3.2.1.4 Cross Product /: \: 62
3.2.1.5 Each-both ' 63
3.2.2 Each-prior ': 66
3.2.3 Compose (') 67
3.2.4 Over and Fold / 67
3.2.5 Scan 68
3.2.5.1 EMA: The Exponential Moving Average 69
3.2.6 Converge 70
3.2.6.1 Converge-repeat 70
3.2.6.2 Converge-iterate 71
3.3 Apply 72
3.3.1 @ (apply) 72
3.3.2 . (apply) 73
3.4 Protected Evaluations 75
3.5 Vector Operations 76
3.5.1 Aggregators 76
3.5.1.1 Simple Aggregators 76
3.5.1.2 Weighted Aggregators 77
3.5.2 Uniform Functions 77
3.5.2.1 Running Functions 77
3.5.2.2 Window Functions 78
3.6 Convention for User-Defined Functions 79
Chapter 4 Editors and Other Tools 81
4.1 Console 81
4.2 Jupyter Notebook 82
4.3 GUIs 84
4.3.1 qStudio 85
4.3.2 Q Insight Pad 88
4.4 IDEs: IntelliJ IDEA 90
4.5 Conclusion 92
Chapter 5 Debugging q Code 93
5.1 Introduction to Making It Wrong: Errors 93
5.1.1 Syntax Errors 94
5.1.2 Runtime Errors 94
5.1.2.1 The Type Error 95
5.1.2.2 Other Errors 98
5.2 Debugging the Code 100
5.3 Debugging Server-Side 102
Part Two Data Operations
Chapter 6 Splayed and Partitioned Tables 107
6.1 Introduction 107
6.2 Saving a Table as a Single Binary File 108
6.3 Splayed Tables 110
6.4 Partitioned Tables 113
6.5 Conclusion 119
Chapter 7 Joins 121
7.1 Comma Operator 121
7.2 Join Functions 125
7.2.1 ij 125
7.2.2 ej 126
7.2.3 lj 126
7.2.4 pj 127
7.2.5 upsert 128
7.2.6 uj 129
7.2.7 aj 131
7.2.8 aj0 134
7.2.8.1 The Next Valid Join 135
7.2.9 asof 138
7.2.10 wj 140
7.3 Advanced Example: Running TWAP 144
Chapter 8 Parallelisation 151
8.1 Parallel Vector Operations 152
8.2 Parallelisation over Processes 155
8.3 Map-Reduce 155
8.4 Advanced Topic: Parallel File/Directory Access 158
Chapter 9 Data Cleaning and Filtering 161
9.1 Predicate Filtering 161
9.1.1 The Where Clause 161
9.1.2 Aggregation Filtering 163
9.2 Data Cleaning, Normalising and APIs 163
Chapter 10 Parse Trees 165
10.1 Definition 166
10.1.1 Evaluation 166
10.1.2 Parse Tree Creation 170
10.1.3 Read-Only Evaluation 170
10.2 Functional Queries 171
10.2.1 Functional Select 174
10.2.2 Functional Exec 178
10.2.3 Functional Update 179
10.2.4 Functional Delete 180
Chapter 11 A Few Use Cases 181
11.1 Rolling VWAP 181
11.1.1 N Tick VWAP 181
11.1.2 Time Window VWAP 182
11.2 Weighted Mid for N Levels of an Order Book 183
11.3 Consecutive Runs of a Rule 185
11.4 Real-Time Signals and Alerts 186
Part Three Data Science
Chapter 12 Basic Overview of Statistics 191
12.1 Histogram 191
12.2 First Moments 196
12.3 Hypothesis Testing 198
12.3.1 Normal p-values 198
12.3.2 Correlation 201
12.3.2.1 Implementation 202
12.3.3 t-test: One Sample 202
12.3.3.1 Implementation 204
12.3.4 t-test: Two Samples 204
12.3.4.1 Implementation 205
12.3.5 Sign Test 206
12.3.5.1 Implementation of the Test 208
12.3.5.2 Median Test 211
12.3.6 Wilcoxon Signed-Rank Test 212
12.3.7 Rank Correlation and Somers’ D 214
12.3.7.1 Implementation 216
12.3.8 Multiple Hypothesis Testing 221
12.3.8.1 Bonferroni Correction 224
12.3.8.2 Šidák’s Correction 224
12.3.8.3 Holm’s Method 225
12.3.8.4 Example 226
Chapter 13 Linear Regression 229
13.1 Linear Regression 230
13.2 Ordinary Least Squares 231
13.3 The Geometric Representation of Linear Regression 233
13.3.1 Moore-Penrose Pseudoinverse 235
13.3.2 Adding Intercept 237
13.4 Implementation of the OLS 240
13.5 Significance of Parameters 243
13.6 How Good is the Fit: R² 244
13.6.1 Adjusted R-squared 247
13.7 Relationship with Maximum Likelihood Estimation and AIC with Small Sample Correction 248
13.8 Estimation Suite 252
13.9 Comparing Two Nested Models: Towards a Stopping Rule 254
13.9.1 Comparing Two General Models 256
13.10 In-/Out-of-Sample Operations 257
13.11 Cross-validation 262
13.12 Conclusion 264
Chapter 14 Time Series Econometrics 265
14.1 Autoregressive and Moving Average Processes 265
14.1.1 Introduction 265
14.1.2 AR(p) Process 266
14.1.2.1 Simulation 266
14.1.2.2 Estimation of AR(p) Parameters 268
14.1.2.3 Least Squares Method 268
14.1.2.4 Example 269
14.1.2.5 Maximum Likelihood Estimator 269
14.1.2.6 Yule-Walker Technique 269
14.1.3 MA(q) Process 271
14.1.3.1 Estimation of MA(q) Parameters 272
14.1.3.2 Simulation 272
14.1.3.3 Example 273
14.1.4 ARMA(p, q) Process 273
14.1.4.1 Invertibility of the ARMA(p, q) Process 274
14.1.4.2 Hannan-Rissanen Algorithm: Two-Step Regression Estimation 274
14.1.4.3 Yule-Walker Estimation 274
14.1.4.4 Maximum Likelihood Estimation 275
14.1.4.5 Simulation 275
14.1.4.6 Forecast 276
14.1.5 ARIMA(p, d, q) Process 276
14.1.6 Code 276
14.1.6.1 Simulation 277
14.1.6.2 Estimation 278
14.1.6.3 Forecast 282
14.2 Stationarity and Granger Causality 285
14.2.1 Stationarity 285
14.2.2 Test of Stationarity: Dickey-Fuller and Augmented Dickey-Fuller Tests 286
14.2.3 Granger Causality 286
14.3 Vector Autoregression 287
14.3.1 VAR(p) Process 288
14.3.1.1 Notation 288
14.3.1.2 Estimator 288
14.3.1.3 Example 289
14.3.1.4 Code 293
14.3.2 VARX(p, q) Process 297
14.3.2.1 Estimator 297
14.3.2.2 Code 298
Chapter 15 Fourier Transform 301
15.1 Complex Numbers 301
15.1.1 Properties of Complex Numbers 302
15.2 Discrete Fourier Transform 308
15.3 Addendum: Quaternions 314
15.4 Addendum: Fractals 321
Chapter 16 Eigensystem and PCA 325
16.1 Theory 325
16.2 Algorithms 327
16.2.1 QR Decomposition 328
16.2.2 QR Algorithm for Eigenvalues 330
16.2.3 Inverse Iteration 331
16.3 Implementation of Eigensystem Calculation 332
16.3.1 QR Decomposition 333
16.3.2 Inverse Iteration 337
16.4 The Data Matrix and the Principal Component Analysis 341
16.4.1 The Data Matrix 341
16.4.2 PCA: The First Principal Component 344
16.4.3 Second Principal Component 345
16.4.4 Terminology and Explained Variance 347
16.4.5 Dimensionality Reduction 349
16.4.6 PCA Regression (PCR) 350
16.5 Implementation of PCA 351
16.6 Appendix: Determinant 354
16.6.1 Theory 354
16.6.2 Techniques to Calculate a Determinant 355
16.6.3 Implementation of the Determinant 356
Chapter 17 Outlier Detection 359
17.1 Local Outlier Factor 360
Chapter 18 Simulating Asset Prices 369
18.1 Stochastic Volatility Process with Price Jumps 369
18.2 Towards the Numerical Example 371
18.2.1 Numerical Preliminaries 371
18.2.2 Implementing Stochastic Volatility Process with Jumps 374
18.3 Conclusion 378
Part Four Machine Learning
Chapter 19 Basic Principles of Machine Learning 381
19.1 Non-Numeric Features and Normalisation 381
19.1.1 Non-Numeric Features 381
19.1.1.1 Ordinal Features 382
19.1.1.2 Categorical Features 383
19.1.2 Normalisation 383
19.1.2.1 Normal Score 384
19.1.2.2 Range Scaling 385
19.2 Iteration: Constructing Machine Learning Algorithms 386
19.2.1 Iteration 386
19.2.2 Constructing Machine Learning Algorithms 389
Chapter 20 Linear Regression with Regularisation 391
20.1 Bias-Variance Trade-off 392
20.2 Regularisation 393
20.3 Ridge Regression 394
20.4 Implementation of the Ridge Regression 396
20.4.1 Optimisation of the Regularisation Parameter 401
20.5 Lasso Regression 403
20.6 Implementation of the Lasso Regression 405
Chapter 21 Nearest Neighbours 419
21.1 k-Nearest Neighbours Classifier 419
21.2 Prototype Clustering 423
21.3 Feature Selection: Local Nearest Neighbours Approach 429
21.3.1 Implementation 430
Chapter 22 Neural Networks 437
22.1 Theoretical Introduction 437
22.1.1 Calibration 440
22.1.1.1 Backpropagation 441
22.1.2 The Learning Rate Parameter 443
22.1.3 Initialisation 443
22.1.4 Overfitting 444
22.1.5 Dimension of the Hidden Layer(s) 444
22.2 Implementation of Neural Networks 445
22.2.1 Multivariate Encoder 445
22.2.2 Neurons 446
22.2.3 Training the Neural Network 448
22.3 Examples 451
22.3.1 Binary Classification 451
22.3.2 M-class Classification 454
22.3.3 Regression 457
22.4 Possible Suggestions 463
Chapter 23 AdaBoost with Stumps 465
23.1 Boosting 465
23.2 Decision Stumps 466
23.3 AdaBoost 467
23.4 Implementation of AdaBoost 468
23.5 Recommendation for Readers 474
Chapter 24 Trees 477
24.1 Introduction to Trees 477
24.2 Regression Trees 479
24.2.1 Cost-Complexity Pruning 481
24.3 Classification Tree 482
24.4 Miscellaneous 484
24.5 Implementation of Trees 485
Chapter 25 Forests 495
25.1 Bootstrap 495
25.2 Bagging 498
25.2.1 Out-of-Bag 499
25.3 Implementation 500
25.3.1 Prediction 503
25.3.2 Feature Selection 505
Chapter 26 Unsupervised Machine Learning: The Apriori Algorithm 509
26.1 Apriori Algorithm 510
26.2 Implementation of the Apriori Algorithm 511
Chapter 27 Processing Information 523
27.1 Information Retrieval 523
27.1.1 Corpus: Leonardo da Vinci 523
27.1.2 Frequency Counting 524
27.1.3 tf-idf 528
27.2 Information as Features 532
27.2.1 Sample: Simulated Proteins 533
27.2.2 Kernels and Metrics for Proteins 535
27.2.3 Implementation of Inner Products and Nearest Neighbours Principles 535
27.2.4 Further Topics 539
Chapter 28 Towards AI: Monte Carlo Tree Search 541
28.1 Multi-Armed Bandit Problem 541
28.1.1 Analytic Solutions 543
28.1.2 Greedy Algorithms 543
28.1.3 Confidence-Based Algorithms 544
28.1.4 Bayesian Algorithms 546
28.1.5 Online Gradient Descent Algorithms 547
28.1.6 Implementation of Some Learning Algorithms 547
28.2 Monte Carlo Tree Search 558
28.2.1 Selection Step 561
28.2.2 Expansion Step 562
28.2.3 Simulation Step 563
28.2.4 Back Propagation Step 563
28.2.5 Finishing the Algorithm 563
28.2.6 Remarks and Extensions 564
28.3 Monte Carlo Tree Search Implementation: Tic-tac-toe 565
28.3.1 Random Games 566
28.3.2 Towards the MCTS 570
28.3.3 Case Study 579
28.4 Monte Carlo Tree Search: Additional Comments 579
28.4.1 Policy and Value Networks 579
28.4.2 Reinforcement Learning 581
Chapter 29 Econophysics: The Agent-Based Computational Models 583
29.1 Agent-Based Modelling 584
29.1.1 Agent-Based Models in Society 584
29.1.2 Agent-Based Models in Finance 586
29.2 Ising Agent-Based Model for Financial Markets 587
29.2.1 Ising Model in Physics 587
29.2.2 Ising Model of Interacting Agents 587
29.2.3 Numerical Implementation 588
29.3 Conclusion 592
Chapter 30 Epilogue: Art 595
Bibliography 601
Index 607