Introduces professionals and scientists to statistics and machine learning using the programming language R
Written by and for practitioners, this book provides an overall introduction to R, focusing on tools and methods commonly used in data science, and placing emphasis on practice and business use. It covers a wide range of topics in a single volume, including big data, databases, statistical machine learning, data wrangling, data visualization, and the reporting of results. The topics covered are all important for someone with a science or math background who wants to quickly learn several practical technologies in order to enter or transition to the growing field of data science.
The Big R-Book for Professionals: From Data Science to Learning Machines and Reporting with R includes nine parts, starting with an introduction to the subject and followed by an overview of R and elements of statistics. The third part revolves around data, while the fourth focuses on data wrangling. Part 5 teaches readers about exploring data. In Part 6 we learn to build models; Part 7 introduces the reader to the reality in companies; Part 8 covers reports and interactive applications; and finally, Part 9 introduces the reader to big data and performance computing. It also includes some helpful appendices.
- Provides a practical guide for non-experts with a focus on business users
- Contains a unique combination of topics including an introduction to R, machine learning, mathematical models, data wrangling, and reporting
- Uses a practical tone and integrates multiple topics in a coherent framework
- Demystifies the hype around machine learning and AI by enabling readers to understand the provided models and program them in R
- Shows readers how to visualize results in static and interactive reports
- Supplementary materials include PDF slides based on the book’s content, as well as all the extracted R code, and are available to everyone on a Wiley Book Companion Site
The Big R-Book is an excellent guide for science, technology, engineering, or mathematics students who wish to make a successful transition from the academic world to the professional one. It will also appeal to young data scientists, quantitative analysts, and analytics professionals, as well as those who make mathematical models.
Table of Contents
Foreword xxv
About the Author xxvii
Acknowledgements xxix
Preface xxxi
About the Companion Site xxxv
I Introduction 1
1 The Big Picture with Kondratiev and Kardashev 3
2 The Scientific Method and Data 7
3 Conventions 11
II Starting with R and Elements of Statistics 19
4 The Basics of R 21
4.1 Getting Started with R 23
4.2 Variables 26
4.3 Data Types 28
4.3.1 The Elementary Types 28
4.3.2 Vectors 29
4.3.3 Accessing Data from a Vector 29
4.3.4 Matrices 32
4.3.5 Arrays 38
4.3.6 Lists 41
4.3.7 Factors 45
4.3.8 Data Frames 49
4.3.9 Strings or the Character-type 54
4.4 Operators 57
4.4.1 Arithmetic Operators 57
4.4.2 Relational Operators 57
4.4.3 Logical Operators 58
4.4.4 Assignment Operators 59
4.4.5 Other Operators 61
4.5 Flow Control Statements 63
4.5.1 Choices 63
4.5.2 Loops 65
4.6 Functions 69
4.6.1 Built-in Functions 69
4.6.2 Help with Functions 69
4.6.3 User-defined Functions 70
4.6.4 Changing Functions 70
4.6.5 Creating Functions with Default Arguments 71
4.7 Packages 72
4.7.1 Discovering Packages in R 72
4.7.2 Managing Packages in R 73
4.8 Selected Data Interfaces 75
4.8.1 CSV Files 75
4.8.2 Excel Files 79
4.8.3 Databases 79
5 Lexical Scoping and Environments 81
5.1 Environments in R 81
5.2 Lexical Scoping in R 83
6 The Implementation of OO 87
6.1 Base Types 89
6.2 S3 Objects 91
6.2.1 Creating S3 Objects 94
6.2.2 Creating Generic Methods 96
6.2.3 Method Dispatch 97
6.2.4 Group Generic Functions 98
6.3 S4 Objects 100
6.3.1 Creating S4 Objects 100
6.3.2 Using S4 Objects 101
6.3.3 Validation of Input 105
6.3.4 Constructor functions 107
6.3.5 The Data slot 108
6.3.6 Recognising Objects, Generic Functions, and Methods 108
6.3.7 Creating S4 Generics 110
6.3.8 Method Dispatch 111
6.4 The Reference Class, refclass, RC or R5 Model 113
6.4.1 Creating RC Objects 113
6.4.2 Important Methods and Attributes 117
6.5 Conclusions about the OO Implementation 119
7 Tidy R with the Tidyverse 121
7.1 The Philosophy of the Tidyverse 121
7.2 Packages in the Tidyverse 124
7.2.1 The Core Tidyverse 124
7.2.2 The Non-core Tidyverse 125
7.3 Working with the Tidyverse 127
7.3.1 Tibbles 127
7.3.2 Piping with R 132
7.3.3 Attention Points When Using the Pipe 133
7.3.4 Advanced Piping 134
7.3.5 Conclusion 137
8 Elements of Descriptive Statistics 139
8.1 Measures of Central Tendency 139
8.1.1 Mean 139
8.1.2 The Median 142
8.1.3 The Mode 143
8.2 Measures of Variation or Spread 145
8.3 Measures of Covariation 147
8.3.1 The Pearson Correlation 147
8.3.2 The Spearman Correlation 148
8.3.3 Chi-square Tests 149
8.4 Distributions 150
8.4.1 Normal Distribution 150
8.4.2 Binomial Distribution 153
8.5 Creating an Overview of Data Characteristics 155
9 Visualisation Methods 159
9.1 Scatterplots 161
9.2 Line Graphs 163
9.3 Pie Charts 165
9.4 Bar Charts 167
9.5 Boxplots 171
9.6 Violin Plots 173
9.7 Histograms 176
9.8 Plotting Functions 179
9.9 Maps and Contour Plots 180
9.10 Heat-maps 181
9.11 Text Mining 184
9.11.1 Word Clouds 184
9.11.2 Word Associations 188
9.12 Colours in R 191
10 Time Series Analysis 197
10.1 Time Series in R 197
10.1.1 The Basics of Time Series in R 197
10.2 Forecasting 200
10.2.1 Moving Average 200
10.2.2 Seasonal Decomposition 206
11 Further Reading 211
III Data Import 213
12 A Short History of Modern Database Systems 215
13 RDBMS 219
14 SQL 223
14.1 Designing the Database 223
14.2 Building the Database Structure 226
14.2.1 Installing a RDBMS 226
14.2.2 Creating the Database 228
14.2.3 Creating the Tables and Relations 229
14.3 Adding Data to the Database 235
14.4 Querying the Database 239
14.4.1 The Basic Select Query 239
14.4.2 More Complex Queries 240
14.5 Modifying the Database Structure 244
14.6 Selected Features of SQL 249
14.6.1 Changing Data 249
14.6.2 Functions in SQL 249
15 Connecting R to an SQL Database 253
IV Data Wrangling 257
16 Anonymous Data 261
17 Data Wrangling in the tidyverse 265
17.1 Importing the Data 266
17.1.1 Importing from an SQL RDBMS 266
17.1.2 Importing Flat Files in the Tidyverse 267
17.2 Tidy Data 275
17.3 Tidying Up Data with tidyr 277
17.3.1 Splitting Tables 278
17.3.2 Convert Headers to Data 281
17.3.3 Spreading One Column Over Many 284
17.3.4 Split One Column into Many 285
17.3.5 Merge Multiple Columns Into One 286
17.3.6 Wrong Data 287
17.4 SQL-like Functionality via dplyr 288
17.4.1 Selecting Columns 288
17.4.2 Filtering Rows 289
17.4.3 Joining 290
17.4.4 Mutating Data 293
17.4.5 Set Operations 296
17.5 String Manipulation in the tidyverse 299
17.5.1 Basic String Manipulation 300
17.5.2 Pattern Matching with Regular Expressions 302
17.6 Dates with lubridate 314
17.6.1 ISO 8601 Format 315
17.6.2 Time-zones 317
17.6.3 Extract Date and Time Components 318
17.6.4 Calculating with Date-times 319
17.7 Factors with Forcats 325
18 Dealing with Missing Data 333
18.1 Reasons for Data to be Missing 334
18.2 Methods to Handle Missing Data 336
18.2.1 Alternative Solutions to Missing Data 336
18.2.2 Predictive Mean Matching (PMM) 338
18.3 R Packages to Deal with Missing Data 339
18.3.1 mice 339
18.3.2 missForest 340
18.3.3 Hmisc 341
19 Data Binning 343
19.1 What is Binning and Why Use It 343
19.2 Tuning the Binning Procedure 347
19.3 More Complex Cases: Matrix Binning 352
19.4 Weight of Evidence and Information Value 359
19.4.1 Weight of Evidence (WOE) 359
19.4.2 Information Value (IV) 359
19.4.3 WOE and IV in R 359
20 Factoring Analysis and Principal Components 363
20.1 Principal Components Analysis (PCA) 364
20.2 Factor Analysis 368
V Modelling 373
21 Regression Models 375
21.1 Linear Regression 375
21.2 Multiple Linear Regression 379
21.2.1 Poisson Regression 379
21.2.2 Non-linear Regression 381
21.3 Performance of Regression Models 384
21.3.1 Mean Square Error (MSE) 384
21.3.2 R-Squared 384
21.3.3 Mean Average Deviation (MAD) 386
22 Classification Models 387
22.1 Logistic Regression 388
22.2 Performance of Binary Classification Models 390
22.2.1 The Confusion Matrix and Related Measures 391
22.2.2 ROC 393
22.2.3 The AUC 396
22.2.4 The Gini Coefficient 397
22.2.5 Kolmogorov-Smirnov (KS) for Logistic Regression 398
22.2.6 Finding an Optimal Cut-off 399
23 Learning Machines 405
23.1 Decision Tree 407
23.1.1 Essential Background 407
23.1.2 Important Considerations 412
23.1.3 Growing Trees with the Package rpart 414
23.1.4 Evaluating the Performance of a Decision Tree 424
23.2 Random Forest 428
23.3 Artificial Neural Networks (ANNs) 434
23.3.1 The Basics of ANNs in R 434
23.3.2 Neural Networks in R 436
23.3.3 The Work-flow for Fitting a NN 438
23.3.4 Cross Validate the NN 444
23.4 Support Vector Machine 447
23.4.1 Fitting a SVM in R 447
23.4.2 Optimizing the SVM 449
23.5 Unsupervised Learning and Clustering 450
23.5.1 k-Means Clustering 450
23.5.2 Visualizing Clusters in Three Dimensions 462
23.5.3 Fuzzy Clustering 464
23.5.4 Hierarchical Clustering 466
23.5.5 Other Clustering Methods 468
24 Towards a Tidy Modelling Cycle with modelr 469
24.1 Adding Predictions 470
24.2 Adding Residuals 471
24.3 Bootstrapping Data 472
24.4 Other Functions of modelr 474
25 Model Validation 475
25.1 Model Quality Measures 476
25.2 Predictions and Residuals 477
25.3 Bootstrapping 479
25.3.1 Bootstrapping in Base R 479
25.3.2 Bootstrapping in the tidyverse with modelr 481
25.4 Cross-Validation 483
25.4.1 Elementary Cross Validation 483
25.4.2 Monte Carlo Cross Validation 486
25.4.3 k-Fold Cross Validation 488
25.4.4 Comparing Cross Validation Methods 489
25.5 Validation in a Broader Perspective 492
26 Labs 495
26.1 Financial Analysis with quantmod 495
26.1.1 The Basics of quantmod 495
26.1.2 Types of Data Available in quantmod 496
26.1.3 Plotting with quantmod 497
26.1.4 The quantmod Data Structure 500
26.1.5 Support Functions Supplied by quantmod 502
26.1.6 Financial Modelling in quantmod 504
27 Multi Criteria Decision Analysis (MCDA) 511
27.1 What and Why 511
27.2 General Work-flow 513
27.3 Identify the Issue at Hand: Steps 1 and 2 516
27.4 Step 3: the Decision Matrix 518
27.4.1 Construct a Decision Matrix 518
27.4.2 Normalize the Decision Matrix 520
27.5 Step 4: Delete Inefficient and Unacceptable Alternatives 521
27.5.1 Unacceptable Alternatives 521
27.5.2 Dominance - Inefficient Alternatives 521
27.6 Plotting Preference Relationships 524
27.7 Step 5: MCDA Methods 526
27.7.1 Examples of Non-compensatory Methods 526
27.7.2 The Weighted Sum Method (WSM) 527
27.7.3 Weighted Product Method (WPM) 530
27.7.4 ELECTRE 530
27.7.5 PROMethEE 540
27.7.6 PCA (Gaia) 553
27.7.7 Outranking Methods 557
27.7.8 Goal Programming 558
27.8 Summary MCDA 561
VI Introduction to Companies 563
28 Financial Accounting (FA) 567
28.1 The Statements of Accounts 568
28.1.1 Income Statement 568
28.1.2 Net Income: The P&L statement 568
28.1.3 Balance Sheet 569
28.2 The Value Chain 571
28.3 Further Terminology 573
28.4 Selected Financial Ratios 575
29 Management Accounting 583
29.1 Introduction 583
29.1.1 Definition of Management Accounting (MA) 583
29.1.2 Management Information Systems (MIS) 584
29.2 Selected Methods in MA 585
29.2.1 Cost Accounting 585
29.2.2 Selected Cost Types 587
29.3 Selected Use Cases of MA 590
29.3.1 Balanced Scorecard 590
29.3.2 Key Performance Indicators (KPIs) 591
30 Asset Valuation Basics 597
30.1 Time Value of Money 598
30.1.1 Interest Basics 598
30.1.2 Specific Interest Rate Concepts 598
30.1.3 Discounting 600
30.2 Cash 601
30.3 Bonds 602
30.3.1 Features of a Bond 602
30.3.2 Valuation of Bonds 604
30.3.3 Duration 606
30.4 The Capital Asset Pricing Model (CAPM) 610
30.4.1 The CAPM Framework 610
30.4.2 The CAPM and Risk 612
30.4.3 Limitations and Shortcomings of the CAPM 612
30.5 Equities 614
30.5.1 Definition 614
30.5.2 Short History 614
30.5.3 Valuation of Equities 615
30.5.4 Absolute Value Models 616
30.5.5 Relative Value Models 625
30.5.6 Selection of Valuation Methods 630
30.5.7 Pitfalls in Company Valuation 631
30.6 Forwards and Futures 638
30.7 Options 640
30.7.1 Definitions 640
30.7.2 Commercial Aspects 642
30.7.3 Short History 643
30.7.4 Valuation of Options at Maturity 644
30.7.5 The Black and Scholes Model 649
30.7.6 The Binomial Model 654
30.7.7 Dependencies of the Option Price 660
30.7.8 The Greeks 664
30.7.9 Delta Hedging 665
30.7.10 Linear Option Strategies 667
30.7.11 Integrated Option Strategies 674
30.7.12 Exotic Options 678
30.7.13 Capital Protected Structures 680
VII Reporting 683
31 A Grammar of Graphics with ggplot2 687
31.1 The Basics of ggplot2 688
31.2 Over-plotting 692
31.3 Case Study for ggplot2 696
32 R Markdown 699
33 knitr and LaTeX 703
34 An Automated Development Cycle 707
35 Writing and Communication Skills 709
36 Interactive Apps 713
36.1 Shiny 715
36.2 Browser Born Data Visualization 719
36.2.1 HTML-widgets 719
36.2.2 Interactive Maps with leaflet 720
36.2.3 Interactive Data Visualisation with ggvis 721
36.2.4 googleVis 723
36.3 Dashboards 725
36.3.1 The Business Case: a Diversity Dashboard 726
36.3.2 A Dashboard with flexdashboard 731
36.3.3 A Dashboard with shinydashboard 737
VIII Bigger and Faster R 741
37 Parallel Computing 743
37.1 Combine foreach and doParallel 745
37.2 Distribute Calculations over LAN with Snow 748
37.3 Using the GPU 752
37.3.1 Getting Started with gpuR 754
37.3.2 On the Importance of Memory use 757
37.3.3 Conclusions for GPU Programming 759
38 R and Big Data 761
38.1 Use a Powerful Server 763
38.1.1 Use R on a Server 763
38.1.2 Let the Database Server do the Heavy Lifting 763
38.2 Using more Memory than we have RAM 765
39 Parallelism for Big Data 767
39.1 Apache Hadoop 769
39.2 Apache Spark 771
39.2.1 Installing Spark 771
39.2.2 Running Spark 773
39.2.3 SparkR 776
39.2.4 sparklyr 788
39.2.5 SparkR or sparklyr 791
40 The Need for Speed 793
40.1 Benchmarking 794
40.2 Optimize Code 797
40.2.1 Avoid Repeating the Same 797
40.2.2 Use Vectorisation where Appropriate 797
40.2.3 Pre-allocating Memory 799
40.2.4 Use the Fastest Function 800
40.2.5 Use the Fastest Package 801
40.2.6 Be Mindful about Details 802
40.2.7 Compile Functions 804
40.2.8 Use C or C++ Code in R 806
40.2.9 Using a C++ Source File in R 809
40.2.10 Call Compiled C++ Functions in R 811
40.3 Profiling Code 812
40.3.1 The Package profr 813
40.3.2 The Package proftools 813
40.4 Optimize Your Computer 817
IX Appendices 819
A Create your own R Package 821
A.1 Creating the Package in the R Console 823
A.2 Update the Package Description 825
A.3 Documenting the Functions 826
A.4 Loading the Package 827
A.5 Further Steps 828
B Levels of Measurement 829
B.1 Nominal Scale 829
B.2 Ordinal Scale 830
B.3 Interval Scale 831
B.4 Ratio Scale 832
C Trademark Notices 833
C.1 General Trademark Notices 834
C.2 R-Related Notices 835
C.2.1 Crediting Developers of R Packages 835
C.2.2 The R-packages used in this Book 835
D Code Not Shown in the Body of the Book 839
E Answers to Selected Questions 845
Bibliography 859
Nomenclature 869
Index 881