The False Discovery Rate. Its Meaning, Interpretation and Application in Data Science. Edition No. 1. Statistics in Practice


Book
288 Pages
December 2024
John Wiley and Sons Ltd
ID: 5973619

The False Discovery Rate

An essential tool for statisticians and data scientists seeking to interpret the vast troves of data that increasingly power our world

First developed in the 1990s, the False Discovery Rate (FDR) describes the expected proportion of 'discoveries' (rejected null hypotheses) that are in fact false positives when many hypotheses are tested. It has since become an essential tool for interpreting large datasets. In recent years, as datasets have become ever larger, and as the importance of ‘big data’ to scientific research has grown, the significance of the FDR has grown correspondingly.

The False Discovery Rate provides an analysis of the FDR’s value as a tool, including why it should generally be preferred to the Bonferroni correction and other methods of accounting for multiplicity. It offers a systematic overview of the FDR, its core claims, and its applications.

Readers of The False Discovery Rate will also find:

- Case studies throughout, rooted in real and simulated data sets
- Detailed discussion of topics including representation of the FDR on a Q-Q plot, consequences of non-monotonicity, and many more
- Wide-ranging analysis suited for a broad readership

The False Discovery Rate is ideal for Statistics and Data Science courses, and short courses associated with conferences. It is also useful as supplementary reading in courses in other disciplines that require the statistical interpretation of ‘big data’. The book will also be of great value to statisticians and researchers looking to learn more about the FDR.

STATISTICS IN PRACTICE

A series of practical books outlining the use of statistical techniques in a wide range of application areas:

- HUMAN AND BIOLOGICAL SCIENCES
- EARTH AND ENVIRONMENTAL SCIENCES
- INDUSTRY, COMMERCE AND FINANCE

Preface and Acknowledgement

About the Companion Website

1 Introduction

1.1 A Brief History of Multiple Testing

1.2 Outline of the Book

1.3 Summary

References

2 The Meaning of the False Discovery Rate (FDR)

2.1 True Hypothesis Versus Conclusion from Evidence: The Confusion Matrix

2.2 The Meaning of the p-Value

2.3 The Meaning of the FDR: Its Relationship to the Confusion Matrix and the p-Value

2.4 Control of the FDR While Minimising False-Negative Results: The Benjamini-Hochberg (BH) Criterion

2.5 Graphical Illustration of the Benjamini-Hochberg FDR Criterion

2.6 Use of the Q-Q Plot in Other Contexts

2.7 Alternatives to the BH Criterion

2.8 Consequences of Correlations Among the Hypotheses Tested

2.9 The FDR in a Non-Statistical Context: A Diagnostic Test

2.10 Summary

References

3 Graphical Presentation of the FDR

3.1 Presentation of the Q-Q Plot on the -log10(p) Scale

3.2 Association of the BH-FDR with Individual p-Values

3.3 Distinctive Plotting Symbols for Plotting of BH-FDR Values

3.4 Non-Monotonicity of the BH-FDR: Detection of Correlation Among p-Values from the -log10-Transformed Q-Q Plot

3.5 Summary

Reference

4 Application of the FDR to Multiple Hypothesis Testing in Real-World Data

4.1 Collation of Gene-Expression Data from the Plant-Genetics Model Organism Arabidopsis thaliana

4.2 Hypotheses Concerning Multiple Response Variables in the Analysis of a Balanced Experimental Design

4.3 Partitioning of Model Terms in a Balanced Experimental Design: Hypotheses Concerning Individual Terms

4.4 Comparison of the Results of Multiple Testing for Contrasting Subsets of Response Variables

4.5 Representation of the FDR on a Volcano Plot: Selection of Hypotheses for Further Investigation

4.6 Summary

References

5 Alternative Approaches to the Multiple-Testing Problem

5.1 An FDR Is Not a p-Value: The Formal Distinction

5.2 Retaining the p-Value Conceptual Basis: The Šidák and Bonferroni ‘Corrections’

5.3 Multiple Testing of Pairwise Comparisons Among Groups of Samples

5.4 Repeated Testing in Interim Analyses Before Study Completion: Alpha Spending

5.5 Is Control of the Family-Wise Error Rate (FWER) a Desirable Goal?

5.6 Holm’s Method: A Generalisation of the Bonferroni Correction

5.7 Summary

References

6 The FDR in the Context of Bayesian Statistics

6.1 The Bayesian Interpretation of the BH-FDR

6.2 Numerical Equivalence Between a One-Sided p-Value and a Posterior Probability

6.3 Does the Bayesian Interpretation of a p-Value Offer a Solution to the Multiplicity Problem?

6.4 Numerical Equivalence of p-Value, Posterior Probability and BH-FDR: The Prosecutor’s Answer to the Accusation of Fallacy?

6.5 Summary

References

7 Alternative Specifications of the FDR

7.1 The Local and Non-Local FDR (LFDR and NFDR)

7.2 Direct Estimation of the LFDR

7.3 Estimation of the NFDR

7.4 Estimation of the LFDR from the NFDR: Re-Ranking Approach

7.5 Estimation of the LFDR from the NFDR: Power Parameter Approach

7.6 Review of Methods for Estimation of the LFDR

7.7 Summary

References

8 The FDR in Relation to an ‘Uninteresting’ Rather Than a Null Hypothesis

8.1 The Vulnerability of the FDR to Mis-Specification of the Statistical Model

8.2 ‘Uninteresting’ and ‘Interesting’ Distributions of Test Results Versus Distributions on H0 and H1: A Defence Against Model Mis-Specification

8.3 An ‘Uninteresting’ Distribution to Account for Unrecognised Pseudoreplication: Fewer Discoveries Announced

8.4 Unrecognised Pseudoreplication: Results When Real Effects Are Present

8.5 An ‘Uninteresting’ Distribution to Account for Unrecognised Balanced-Block Effects: More Discoveries Announced

8.6 The Relative Merits of the Correct Model and an ‘Uninteresting’ Distribution as a Basis for Testing

8.7 Summary

References

9 Supplementation of p-Values with an Auxiliary Covariate: The Conditional FDR (cFDR)

9.1 Extension of the Relationship Between FDR and p to Take Account of an Additional Relevant Variable q

9.2 Method for the Evaluation of the cFDR from Data

9.3 Application of the cFDR to Non-Genetic Data

9.4 Summary

References

10 An FDR-Based Analogue of the Confidence Interval: The False Coverage Rate (FCR)

10.1 The Concept of the Coverage of a Confidence Interval

10.2 A Confidence Interval Based on the FDR-Determined Significance Threshold

10.3 Numerical Illustration of the FCR

10.4 Summary

References

11 The FDR as a Criterion for Sample Size Calculations

11.1 Review of the Standard Methods for Power and Sample Size Calculations

11.2 Connection of Statistical-Power-Related Concepts to FDR-Related Concepts

11.3 Sample Size Required to Achieve a Specified FDR

11.4 Significance Threshold (α) Required to Achieve a Specified FDR When Sample Size Is Fixed

11.5 Summary

References

Index

Authors

N. W. Galwey, GlaxoSmithKline, UK.

