The Big R-Book. From Data Science to Learning Machines and Big Data. Edition No. 1


Book
928 Pages
December 2020
John Wiley and Sons Ltd
ID: 5838405

Introduces professionals and scientists to statistics and machine learning using the programming language R

Written by and for practitioners, this book provides an overall introduction to R, focusing on tools and methods commonly used in data science, and placing emphasis on practice and business use. It covers a wide range of topics in a single volume, including big data, databases, statistical machine learning, data wrangling, data visualization, and the reporting of results. The topics covered are all important for someone with a science or math background who wants to quickly learn several practical technologies in order to enter or transition to the growing field of data science.

The Big R-Book for Professionals: From Data Science to Learning Machines and Reporting with R includes nine parts, starting with an introduction to the subject and followed by an overview of R and elements of statistics. The third part revolves around data, while the fourth focuses on data wrangling. Part 5 teaches readers about exploring data. Part 6 covers building models, Part 7 introduces the reader to the reality in companies, Part 8 covers reports and interactive applications, and Part 9 introduces the reader to big data and performance computing. The book also includes some helpful appendices.

Provides a practical guide for non-experts with a focus on business users
Contains a unique combination of topics including an introduction to R, machine learning, mathematical models, data wrangling, and reporting
Uses a practical tone and integrates multiple topics in a coherent framework
Demystifies the hype around machine learning and AI by enabling readers to understand the provided models and program them in R
Shows readers how to visualize results in static and interactive reports
Supplementary materials include PDF slides based on the book’s content, as well as all the extracted R code, and are available to everyone on a Wiley Book Companion Site

The Big R-Book is an excellent guide for science, technology, engineering, or mathematics students who wish to make a successful transition from the academic world to the professional. It will also appeal to all young data scientists, quantitative analysts, and analytics professionals, as well as those who make mathematical models.

Foreword xxv

About the Author xxvii

Acknowledgements xxix

Preface xxxi

About the Companion Site xxxv

I Introduction 1

1 The Big Picture with Kondratiev and Kardashev 3

2 The Scientific Method and Data 7

3 Conventions 11

II Starting with R and Elements of Statistics 19

4 The Basics of R 21

4.1 Getting Started with R 23

4.2 Variables 26

4.3 Data Types 28

4.3.1 The Elementary Types 28

4.3.2 Vectors 29

4.3.3 Accessing Data from a Vector 29

4.3.4 Matrices 32

4.3.5 Arrays 38

4.3.6 Lists 41

4.3.7 Factors 45

4.3.8 Data Frames 49

4.3.9 Strings or the Character-type 54

4.4 Operators 57

4.4.1 Arithmetic Operators 57

4.4.2 Relational Operators 57

4.4.3 Logical Operators 58

4.4.4 Assignment Operators 59

4.4.5 Other Operators 61

4.5 Flow Control Statements 63

4.5.1 Choices 63

4.5.2 Loops 65

4.6 Functions 69

4.6.1 Built-in Functions 69

4.6.2 Help with Functions 69

4.6.3 User-defined Functions 70

4.6.4 Changing Functions 70

4.6.5 Creating Function with Default Arguments 71

4.7 Packages 72

4.7.1 Discovering Packages in R 72

4.7.2 Managing Packages in R 73

4.8 Selected Data Interfaces 75

4.8.1 CSV Files 75

4.8.2 Excel Files 79

4.8.3 Databases 79

5 Lexical Scoping and Environments 81

5.1 Environments in R 81

5.2 Lexical Scoping in R 83

6 The Implementation of OO 87

6.1 Base Types 89

6.2 S3 Objects 91

6.2.1 Creating S3 Objects 94

6.2.2 Creating Generic Methods 96

6.2.3 Method Dispatch 97

6.2.4 Group Generic Functions 98

6.3 S4 Objects 100

6.3.1 Creating S4 Objects 100

6.3.2 Using S4 Objects 101

6.3.3 Validation of Input 105

6.3.4 Constructor functions 107

6.3.5 The Data slot 108

6.3.6 Recognising Objects, Generic Functions, and Methods 108

6.3.7 Creating S4 Generics 110

6.3.8 Method Dispatch 111

6.4 The Reference Class, refclass, RC or R5 Model 113

6.4.1 Creating RC Objects 113

6.4.2 Important Methods and Attributes 117

6.5 Conclusions about the OO Implementation 119

7 Tidy R with the Tidyverse 121

7.1 The Philosophy of the Tidyverse 121

7.2 Packages in the Tidyverse 124

7.2.1 The Core Tidyverse 124

7.2.2 The Non-core Tidyverse 125

7.3 Working with the Tidyverse 127

7.3.1 Tibbles 127

7.3.2 Piping with R 132

7.3.3 Attention Points When Using the Pipe 133

7.3.4 Advanced Piping 134

7.3.5 Conclusion 137

8 Elements of Descriptive Statistics 139

8.1 Measures of Central Tendency 139

8.1.1 Mean 139

8.1.2 The Median 142

8.1.3 The Mode 143

8.2 Measures of Variation or Spread 145

8.3 Measures of Covariation 147

8.3.1 The Pearson Correlation 147

8.3.2 The Spearman Correlation 148

8.3.3 Chi-square Tests 149

8.4 Distributions 150

8.4.1 Normal Distribution 150

8.4.2 Binomial Distribution 153

8.5 Creating an Overview of Data Characteristics 155

9 Visualisation Methods 159

9.1 Scatterplots 161

9.2 Line Graphs 163

9.3 Pie Charts 165

9.4 Bar Charts 167

9.5 Boxplots 171

9.6 Violin Plots 173

9.7 Histograms 176

9.8 Plotting Functions 179

9.9 Maps and Contour Plots 180

9.10 Heat-maps 181

9.11 Text Mining 184

9.11.1 Word Clouds 184

9.11.2 Word Associations 188

9.12 Colours in R 191

10 Time Series Analysis 197

10.1 Time Series in R 197

10.1.1 The Basics of Time Series in R 197

10.2 Forecasting 200

10.2.1 Moving Average 200

10.2.2 Seasonal Decomposition 206

11 Further Reading 211

III Data Import 213

12 A Short History of Modern Database Systems 215

13 RDBMS 219

14 SQL 223

14.1 Designing the Database 223

14.2 Building the Database Structure 226

14.2.1 Installing a RDBMS 226

14.2.2 Creating the Database 228

14.2.3 Creating the Tables and Relations 229

14.3 Adding Data to the Database 235

14.4 Querying the Database 239

14.4.1 The Basic Select Query 239

14.4.2 More Complex Queries 240

14.5 Modifying the Database Structure 244

14.6 Selected Features of SQL 249

14.6.1 Changing Data 249

14.6.2 Functions in SQL 249

15 Connecting R to an SQL Database 253

IV Data Wrangling 257

16 Anonymous Data 261

17 Data Wrangling in the tidyverse 265

17.1 Importing the Data 266

17.1.1 Importing from an SQL RDBMS 266

17.1.2 Importing Flat Files in the Tidyverse 267

17.2 Tidy Data 275

17.3 Tidying Up Data with tidyr 277

17.3.1 Splitting Tables 278

17.3.2 Convert Headers to Data 281

17.3.3 Spreading One Column Over Many 284

17.3.4 Split One Column into Many 285

17.3.5 Merge Multiple Columns Into One 286

17.3.6 Wrong Data 287

17.4 SQL-like Functionality via dplyr 288

17.4.1 Selecting Columns 288

17.4.2 Filtering Rows 289

17.4.3 Joining 290

17.4.4 Mutating Data 293

17.4.5 Set Operations 296

17.5 String Manipulation in the tidyverse 299

17.5.1 Basic String Manipulation 300

17.5.2 Pattern Matching with Regular Expressions 302

17.6 Dates with lubridate 314

17.6.1 ISO 8601 Format 315

17.6.2 Time-zones 317

17.6.3 Extract Date and Time Components 318

17.6.4 Calculating with Date-times 319

17.7 Factors with Forcats 325

18 Dealing with Missing Data 333

18.1 Reasons for Data to be Missing 334

18.2 Methods to Handle Missing Data 336

18.2.1 Alternative Solutions to Missing Data 336

18.2.2 Predictive Mean Matching (PMM) 338

18.3 R Packages to Deal with Missing Data 339

18.3.1 mice 339

18.3.2 missForest 340

18.3.3 Hmisc 341

19 Data Binning 343

19.1 What is Binning and Why Use It 343

19.2 Tuning the Binning Procedure 347

19.3 More Complex Cases: Matrix Binning 352

19.4 Weight of Evidence and Information Value 359

19.4.1 Weight of Evidence (WOE) 359

19.4.2 Information Value (IV) 359

19.4.3 WOE and IV in R 359

20 Factoring Analysis and Principal Components 363

20.1 Principal Components Analysis (PCA) 364

20.2 Factor Analysis 368

V Modelling 373

21 Regression Models 375

21.1 Linear Regression 375

21.2 Multiple Linear Regression 379

21.2.1 Poisson Regression 379

21.2.2 Non-linear Regression 381

21.3 Performance of Regression Models 384

21.3.1 Mean Square Error (MSE) 384

21.3.2 R-Squared 384

21.3.3 Mean Average Deviation (MAD) 386

22 Classification Models 387

22.1 Logistic Regression 388

22.2 Performance of Binary Classification Models 390

22.2.1 The Confusion Matrix and Related Measures 391

22.2.2 ROC 393

22.2.3 The AUC 396

22.2.4 The Gini Coefficient 397

22.2.5 Kolmogorov-Smirnov (KS) for Logistic Regression 398

22.2.6 Finding an Optimal Cut-off 399

23 Learning Machines 405

23.1 Decision Tree 407

23.1.1 Essential Background 407

23.1.2 Important Considerations 412

23.1.3 Growing Trees with the Package rpart 414

23.1.4 Evaluating the Performance of a Decision Tree 424

23.2 Random Forest 428

23.3 Artificial Neural Networks (ANNs) 434

23.3.1 The Basics of ANNs in R 434

23.3.2 Neural Networks in R 436

23.3.3 The Work-flow for Fitting a NN 438

23.3.4 Cross Validate the NN 444

23.4 Support Vector Machine 447

23.4.1 Fitting a SVM in R 447

23.4.2 Optimizing the SVM 449

23.5 Unsupervised Learning and Clustering 450

23.5.1 k-Means Clustering 450

23.5.2 Visualizing Clusters in Three Dimensions 462

23.5.3 Fuzzy Clustering 464

23.5.4 Hierarchical Clustering 466

23.5.5 Other Clustering Methods 468

24 Towards a Tidy Modelling Cycle with modelr 469

24.1 Adding Predictions 470

24.2 Adding Residuals 471

24.3 Bootstrapping Data 472

24.4 Other Functions of modelr 474

25 Model Validation 475

25.1 Model Quality Measures 476

25.2 Predictions and Residuals 477

25.3 Bootstrapping 479

25.3.1 Bootstrapping in Base R 479

25.3.2 Bootstrapping in the tidyverse with modelr 481

25.4 Cross-Validation 483

25.4.1 Elementary Cross Validation 483

25.4.2 Monte Carlo Cross Validation 486

25.4.3 k-Fold Cross Validation 488

25.4.4 Comparing Cross Validation Methods 489

25.5 Validation in a Broader Perspective 492

26 Labs 495

26.1 Financial Analysis with quantmod 495

26.1.1 The Basics of quantmod 495

26.1.2 Types of Data Available in quantmod 496

26.1.3 Plotting with quantmod 497

26.1.4 The quantmod Data Structure 500

26.1.5 Support Functions Supplied by quantmod 502

26.1.6 Financial Modelling in quantmod 504

27 Multi Criteria Decision Analysis (MCDA) 511

27.1 What and Why 511

27.2 General Work-flow 513

27.3 Identify the Issue at Hand: Steps 1 and 2 516

27.4 Step 3: the Decision Matrix 518

27.4.1 Construct a Decision Matrix 518

27.4.2 Normalize the Decision Matrix 520

27.5 Step 4: Delete Inefficient and Unacceptable Alternatives 521

27.5.1 Unacceptable Alternatives 521

27.5.2 Dominance - Inefficient Alternatives 521

27.6 Plotting Preference Relationships 524

27.7 Step 5: MCDA Methods 526

27.7.1 Examples of Non-compensatory Methods 526

27.7.2 The Weighted Sum Method (WSM) 527

27.7.3 Weighted Product Method (WPM) 530

27.7.4 ELECTRE 530

27.7.5 PROMethEE 540

27.7.6 PCA (Gaia) 553

27.7.7 Outranking Methods 557

27.7.8 Goal Programming 558

27.8 Summary MCDA 561

VI Introduction to Companies 563

28 Financial Accounting (FA) 567

28.1 The Statements of Accounts 568

28.1.1 Income Statement 568

28.1.2 Net Income: The P&L statement 568

28.1.3 Balance Sheet 569

28.2 The Value Chain 571

28.3 Further Terminology 573

28.4 Selected Financial Ratios 575

29 Management Accounting 583

29.1 Introduction 583

29.1.1 Definition of Management Accounting (MA) 583

29.1.2 Management Information Systems (MIS) 584

29.2 Selected Methods in MA 585

29.2.1 Cost Accounting 585

29.2.2 Selected Cost Types 587

29.3 Selected Use Cases of MA 590

29.3.1 Balanced Scorecard 590

29.3.2 Key Performance Indicators (KPIs) 591

30 Asset Valuation Basics 597

30.1 Time Value of Money 598

30.1.1 Interest Basics 598

30.1.2 Specific Interest Rate Concepts 598

30.1.3 Discounting 600

30.2 Cash 601

30.3 Bonds 602

30.3.1 Features of a Bond 602

30.3.2 Valuation of Bonds 604

30.3.3 Duration 606

30.4 The Capital Asset Pricing Model (CAPM) 610

30.4.1 The CAPM Framework 610

30.4.2 The CAPM and Risk 612

30.4.3 Limitations and Shortcomings of the CAPM 612

30.5 Equities 614

30.5.1 Definition 614

30.5.2 Short History 614

30.5.3 Valuation of Equities 615

30.5.4 Absolute Value Models 616

30.5.5 Relative Value Models 625

30.5.6 Selection of Valuation Methods 630

30.5.7 Pitfalls in Company Valuation 631

30.6 Forwards and Futures 638

30.7 Options 640

30.7.1 Definitions 640

30.7.2 Commercial Aspects 642

30.7.3 Short History 643

30.7.4 Valuation of Options at Maturity 644

30.7.5 The Black and Scholes Model 649

30.7.6 The Binomial Model 654

30.7.7 Dependencies of the Option Price 660

30.7.8 The Greeks 664

30.7.9 Delta Hedging 665

30.7.10 Linear Option Strategies 667

30.7.11 Integrated Option Strategies 674

30.7.12 Exotic Options 678

30.7.13 Capital Protected Structures 680

VII Reporting 683

31 A Grammar of Graphics with ggplot2 687

31.1 The Basics of ggplot2 688

31.2 Over-plotting 692

31.3 Case Study for ggplot2 696

32 R Markdown 699

33 knitr and LaTeX 703

34 An Automated Development Cycle 707

35 Writing and Communication Skills 709

36 Interactive Apps 713

36.1 Shiny 715

36.2 Browser Born Data Visualization 719

36.2.1 HTML-widgets 719

36.2.2 Interactive Maps with leaflet 720

36.2.3 Interactive Data Visualisation with ggvis 721

36.2.4 googleVis 723

36.3 Dashboards 725

36.3.1 The Business Case: a Diversity Dashboard 726

36.3.2 A Dashboard with flexdashboard 731

36.3.3 A Dashboard with shinydashboard 737

VIII Bigger and Faster R 741

37 Parallel Computing 743

37.1 Combine foreach and doParallel 745

37.2 Distribute Calculations over LAN with Snow 748

37.3 Using the GPU 752

37.3.1 Getting Started with gpuR 754

37.3.2 On the Importance of Memory use 757

37.3.3 Conclusions for GPU Programming 759

38 R and Big Data 761

38.1 Use a Powerful Server 763

38.1.1 Use R on a Server 763

38.1.2 Let the Database Server do the Heavy Lifting 763

38.2 Using more Memory than we have RAM 765

39 Parallelism for Big Data 767

39.1 Apache Hadoop 769

39.2 Apache Spark 771

39.2.1 Installing Spark 771

39.2.2 Running Spark 773

39.2.3 SparkR 776

39.2.4 sparklyr 788

39.2.5 SparkR or sparklyr 791

40 The Need for Speed 793

40.1 Benchmarking 794

40.2 Optimize Code 797

40.2.1 Avoid Repeating the Same 797

40.2.2 Use Vectorisation where Appropriate 797

40.2.3 Pre-allocating Memory 799

40.2.4 Use the Fastest Function 800

40.2.5 Use the Fastest Package 801

40.2.6 Be Mindful about Details 802

40.2.7 Compile Functions 804

40.2.8 Use C or C++ Code in R 806

40.2.9 Using a C++ Source File in R 809

40.2.10 Call Compiled C++ Functions in R 811

40.3 Profiling Code 812

40.3.1 The Package profr 813

40.3.2 The Package proftools 813

40.4 Optimize Your Computer 817

IX Appendices 819

A Create your own R Package 821

A.1 Creating the Package in the R Console 823

A.2 Update the Package Description 825

A.3 Documenting the Functions 826

A.4 Loading the Package 827

A.5 Further Steps 828

B Levels of Measurement 829

B.1 Nominal Scale 829

B.2 Ordinal Scale 830

B.3 Interval Scale 831

B.4 Ratio Scale 832

C Trademark Notices 833

C.1 General Trademark Notices 834

C.2 R-Related Notices 835

C.2.1 Crediting Developers of R Packages 835

C.2.2 The R-packages used in this Book 835

D Code Not Shown in the Body of the Book 839

E Answers to Selected Questions 845

Bibliography 859

Nomenclature 869

Index 881

Authors

Philippe J. S. De Brouwer

Related Products

Practical Machine Learning in R. Edition No. 1

R-ticulate. A Beginner's Guide to Data Analysis for Natural Scientists. Edition No. 1

Immunoinformatics of Cancers. Practical Machine Learning Approaches Using R

Data Science, Analytics and Machine Learning with R

Fundamentals of Mathematics in Medical Research: Theory and Cases