R-ticulate: A Beginner's Guide to Data Analysis for Natural Scientists. Edition No. 1


Book
224 Pages
June 2024
John Wiley and Sons Ltd
ID: 5948882

An accessible learning resource that develops data analysis skills for natural science students in an efficient style using the R programming language

R-ticulate: A Beginner’s Guide to Data Analysis for Natural Scientists is a compact, example-based, user-friendly statistics textbook. It avoids unnecessary frills and instead offers engaging, relatable examples, practical tips, online exercises, resources, and references to extensions, all at a level that follows contemporary curricula taught in large parts of the world.

The content structure is unique in that statistical skills are introduced at the same time as software (programming) skills in R. In the authors’ experience, this is by far the best way to teach.

Readers of this introductory text will find:

- Explanations of statistical concepts in simple, easy-to-understand language
- A variety of approaches to problem solving using both base R and tidyverse
- Boxes dedicated to specific topics and margin text that summarizes key points
- A clearly outlined schedule organized into 12 chapters corresponding to the 12 semester weeks of most universities

While at its core a traditional printed book, R-ticulate: A Beginner’s Guide to Data Analysis for Natural Scientists comes with a wealth of online teaching material, making it an ideal and efficient reference for students who wish to gain a thorough understanding of the subject, as well as for instructors teaching related courses.

Foreword ix

Preface xi

About the Companion Website xiii

1 Hypotheses, Variables, Data 1

1.1 Occam’s Razor 2

1.2 Scientific Hypotheses 2

1.3 The Choice of a Software 3

1.3.1 First Steps in R 3

1.4 Variables 5

1.4.1 Variable Names and Values 5

1.4.2 Types of Variables 10

1.4.3 Predictor and Response Variables 11

1.5 Data Processing and Data Formats 12

1.5.1 The Long vs. the Wide Format 12

1.5.2 Choice of Variable, Dataset, and File Names 12

1.5.3 Adding, Removing, and Subsetting Variables and Data Frames 14

1.5.4 Aggregating Data 17

1.5.5 Working with Time and Strings 19

2 Measuring Variation 23

2.1 What Is Variation? 23

2.2 Treatment vs. Control 23

2.3 Systematic and Unsystematic Variation 24

2.4 The Signal-to-Noise Ratio 25

2.5 Measuring Variation Graphically 26

2.6 Measuring Variation Using Metrics 27

2.7 The Standard Error 29

2.8 Population vs. Sample 31

3 Distributions and Probabilities 35

3.1 Probability Distributions 35

3.2 Finding the Best Fitting Distribution for Sample Data 37

3.2.1 Graphical Tools 37

3.2.2 Goodness-of-Fit Tests 39

3.3 Quantiles 42

3.4 Probabilities 44

3.4.1 Density Functions (dnorm, dbinom, …) 44

3.4.2 Probability Distribution Functions (pnorm, pbinom, …) 46

3.4.3 Quantile Functions (qnorm, qbinom, …) 48

3.4.4 Random Sampling Functions (rnorm, rbinom, …) 49

3.5 The Normal Distribution 50

3.6 Central Limit Theorem 50

3.7 Test Statistics 52

3.7.1 Null and Alternative Hypotheses 53

3.7.2 The Alpha Threshold and Significance Levels 54

3.7.3 Type I and Type II Errors 54

References 56

4 Replication and Randomisation 57

4.1 Replication 57

4.2 Statistical Independence 60

4.3 Randomisation 61

4.4 Randomisation in R 64

4.5 Spatial Replication and Randomisation in Observational Studies 65

5 Two-Sample and One-Sample Tests 67

5.1 The t-Statistic 67

5.2 Two Sample Tests: Comparing Two Groups 67

5.2.1 Student’s t-Test 67

5.2.1.1 Testing for Normality 68

5.2.1.2 What to Write in a Report or Paper and How to Visualise the Results of a t-Test 74

5.2.1.3 Two-Tailed vs. One-Tailed t-Tests 75

5.2.2 Rank-Based Two-Sample Tests 77

5.3 One-Sample Tests 78

5.4 Power Analyses and Sample Size Determination 79

6 Communicating Quantitative Information Using Visuals 83

6.1 The Fundamentals of Scientific Plotting 84

6.2 Scatter Plots 85

6.3 Line Plots 87

6.4 Box Plots and Bar Plots 89

6.5 Multipanel Plots and Plotting Regions 91

6.6 Adding Text, Formulae, and Colour 92

6.7 Interaction Plots 94

6.8 Images, Colour Contour Plots, and 3D Plots 94

6.8.1 Adding Images to Plots 94

6.8.2 Colour Contour Plots 96

References 101

7 Working with Categorical Data 103

7.1 Tabling and Visualising Categorical Data 103

7.2 Contingency Tables 105

7.3 The Chi-squared Test 106

7.4 Decision Trees 108

7.5 Optimising Decision Trees 111

References 113

8 Working with Continuous Data 115

8.1 Covariance 115

8.2 Correlation Coefficient 116

8.3 Transformations 118

8.4 Plotting Correlations 120

8.5 Correlation Tests 122

References 124

9 Linear Regression 125

9.1 Basics and Simple Linear Regression 125

9.1.1 Making Sense of the summary Output for Regression Models Fitted with lm 128

9.1.2 Model Diagnostics 131

9.1.3 Model Predictions and Visualisation 135

9.1.4 What to Write in a Report or Paper? 137

9.1.4.1 Material and Methods 137

9.1.4.2 Results 137

9.1.5 Dealing with Variance Heterogeneity 137

9.2 Multiple Linear Regression 140

9.2.1 Multicollinearity in Multiple Regression Models 143

9.2.2 Testing Interactions Among Predictors 147

9.2.3 Model Selection and Comparison 148

9.2.4 Variable Importance 151

9.2.5 Visualising Multiple Linear Regression Results 151

References 154

10 One or More Categorical Predictors - Analysis of Variance 155

10.1 Comparing Groups 155

10.2 Comparing Groups Numerically 155

10.3 One-way ANOVA Using R 161

10.4 Checking for the Model Assumptions 162

10.5 Post Hoc Comparisons 162

10.6 Two-way ANOVA and Interactions 165

10.7 What If the Model Assumptions Are Violated? 166

Reference 168

11 Analysis of Covariance (ANCOVA) 169

11.1 Interpreting ANCOVA Results 171

11.2 Post Hoc Test for ANCOVA 176

References 177

12 Some of What Lies Ahead 179

12.1 Generalised Linear Models 179

12.2 Nonlinear Regression 185

12.2.1 Initial Parameter Estimates (Starting Values) 187

12.2.2 Nonlinear Model Fitting and Visualisation 187

12.3 Generalised Additive Models 189

12.4 Modern Approaches to Dealing with Heteroscedasticity 191

12.4.1 Variance Modelling Using Generalised Least-squares Estimation 193

12.4.2 Robust, Heteroscedasticity-Consistent Covariance Matrix Estimation 195

References 198

Index 201

Authors

Martin Bader: Saarland University, Germany; Waikato University, New Zealand; University of Basel, Switzerland.

Sebastian Leuzinger: James Cook University, Australia; University of Neuchatel, Switzerland; University of Basel, Switzerland.
