Provides a foundation in classical parametric methods of regression and classification essential for pursuing advanced topics in predictive analytics and statistical learning.
This book covers a broad range of topics in parametric regression and classification including multiple regression, logistic regression (binary and multinomial), discriminant analysis, Bayesian classification, generalized linear models and Cox regression for survival data. The book also gives brief introductions to some modern computer-intensive methods such as classification and regression trees (CART), neural networks and support vector machines.
The book is organized so that it can be used both by advanced undergraduate or master's students with applied interests and by doctoral students who also want to learn the underlying theory. This is done by devoting the main body of each chapter to basic statistical methodology illustrated by real data examples. Derivations, proofs and extensions are relegated to the Technical Notes section of each chapter. Exercises are also divided into theoretical and applied. Answers to selected exercises are provided. A solution manual is available to instructors who adopt the text.
Data sets of moderate to large sizes are used in examples and exercises. They come from a variety of disciplines including business (finance, marketing and sales), economics, education, engineering and sciences (biological, health, physical and social). All data sets are available at the book’s web site. The open source software R is used for all data analyses. R code and output are provided for most examples. The R code is also available at the book’s web site.
Predictive Analytics: Parametric Models for Regression and Classification Using R is ideal for a one-semester upper-level undergraduate and/or beginning-level graduate course in regression for students in business, economics, finance, marketing, engineering, and computer science. It is also an excellent resource for practitioners in these fields.
Table of Contents
Preface xiii
Acknowledgments xv
Abbreviations xvii
About the companion website xxi
1 Introduction 1
1.1 Supervised versus unsupervised learning 2
1.2 Parametric versus nonparametric models 3
1.3 Types of data 4
1.4 Overview of parametric predictive analytics 5
2 Simple linear regression and correlation 7
2.1 Fitting a straight line 9
2.1.1 Least squares (LS) method 9
2.1.2 Linearizing transformations 11
2.1.3 Fitted values and residuals 13
2.1.4 Assessing goodness of fit 14
2.2 Statistical inferences for simple linear regression 17
2.2.1 Simple linear regression model 17
2.2.2 Inferences on β0 and β1 18
2.2.3 Analysis of variance for simple linear regression 19
2.2.4 Pure error versus model error 20
2.2.5 Prediction of future observations 21
2.3 Correlation analysis 24
2.3.1 Bivariate normal distribution 26
2.3.2 Inferences on correlation coefficient 27
2.4 Modern extensions 28
2.5 Technical notes 29
2.5.1 Derivation of the LS estimators 29
2.5.2 Sums of squares 30
2.5.3 Distribution of the LS estimators 30
2.5.4 Prediction interval 32
Exercises 32
3 Multiple linear regression: basics 37
3.1 Multiple linear regression model 39
3.1.1 Model in scalar notation 39
3.1.2 Model in matrix notation 40
3.2 Fitting a multiple regression model 41
3.2.1 Least squares (LS) method 41
3.2.2 Interpretation of regression coefficients 45
3.2.3 Fitted values and residuals 45
3.2.4 Measures of goodness of fit 47
3.2.5 Linearizing transformations 48
3.3 Statistical inferences for multiple regression 49
3.3.1 Analysis of variance for multiple regression 49
3.3.2 Inferences on regression coefficients 51
3.3.3 Confidence ellipsoid for the β vector 52
3.3.4 Extra sum of squares method 54
3.3.5 Prediction of future observations 59
3.4 Weighted and generalized least squares 60
3.4.1 Weighted least squares 60
3.4.2 Generalized least squares 62
3.4.3 Statistical inference on GLS estimator 63
3.5 Partial correlation coefficients 63
3.5.1 Test of significance of partial correlation coefficient 65
3.6 Special topics 66
3.6.1 Dummy variables 66
3.6.2 Interactions 69
3.6.3 Standardized regression 74
3.7 Modern extensions 75
3.7.1 Regression trees 76
3.7.2 Neural nets 78
3.8 Technical notes 81
3.8.1 Derivation of the LS estimators 81
3.8.2 Distribution of the LS estimators 81
3.8.3 Gauss-Markov theorem 82
3.8.4 Properties of fitted values and residuals 83
3.8.5 Geometric interpretation of least squares 83
3.8.6 Confidence ellipsoid for β 85
3.8.7 Population partial correlation coefficient 85
Exercises 86
4 Multiple linear regression: model diagnostics 95
4.1 Model assumptions and distribution of residuals 95
4.2 Checking normality 96
4.3 Checking homoscedasticity 98
4.3.1 Variance stabilizing transformations 99
4.3.2 Box-Cox transformation 100
4.4 Detecting outliers 103
4.5 Checking model misspecification 106
4.6 Checking independence 108
4.6.1 Runs test 109
4.6.2 Durbin-Watson test 109
4.7 Checking influential observations 110
4.7.1 Leverage 111
4.7.2 Cook’s distance 111
4.8 Checking multicollinearity 114
4.8.1 Multicollinearity: causes and consequences 114
4.8.2 Multicollinearity diagnostics 115
Exercises 119
5 Multiple linear regression: shrinkage and dimension reduction methods 127
5.1 Ridge regression 128
5.1.1 Ridge problem 128
5.1.2 Choice of λ 129
5.2 Lasso regression 132
5.2.1 Lasso problem 132
5.3 Principal components analysis and regression 135
5.3.1 Principal components analysis (PCA) 135
5.3.2 Principal components regression (PCR) 142
5.4 Partial least squares (PLS) 146
5.4.1 PLS1 algorithm 147
5.5 Technical notes 154
5.5.1 Properties of ridge estimator 154
5.5.2 Derivation of principal components 155
Exercises 156
6 Multiple linear regression: variable selection and model building 159
6.1 Best subset selection 160
6.1.1 Model selection criteria 160
6.2 Stepwise regression 165
6.3 Model building 174
6.4 Technical notes 175
6.4.1 Derivation of the Cp statistic 175
Exercises 177
7 Logistic regression and classification 181
7.1 Simple logistic regression 183
7.1.1 Model 183
7.1.2 Parameter estimation 185
7.1.3 Inferences on parameters 189
7.2 Multiple logistic regression 190
7.2.1 Model and inference 190
7.3 Likelihood ratio (LR) test 194
7.3.1 Deviance 195
7.3.2 Akaike information criterion (AIC) 197
7.3.3 Model selection and diagnostics 197
7.4 Binary classification using logistic regression 201
7.4.1 Measures of correct classification 201
7.4.2 Receiver operating characteristic (ROC) curve 204
7.5 Polytomous logistic regression 207
7.5.1 Nominal logistic regression 208
7.5.2 Ordinal logistic regression 212
7.6 Modern extensions 215
7.6.1 Classification trees 215
7.6.2 Support vector machines 218
7.7 Technical notes 222
Exercises 224
8 Discriminant analysis 233
8.1 Linear discriminant analysis based on Mahalanobis distance 234
8.1.1 Mahalanobis distance 234
8.1.2 Bayesian classification 235
8.2 Fisher’s linear discriminant function 239
8.2.1 Two groups 239
8.2.2 Multiple groups 241
8.3 Naive Bayes 243
8.4 Technical notes 244
8.4.1 Calculation of pooled sample covariance matrix 244
8.4.2 Derivation of Fisher’s linear discriminant functions 245
8.4.3 Bayes rule 247
Exercises 247
9 Generalized linear models 251
9.1 Exponential family and link function 251
9.1.1 Exponential family 251
9.1.2 Link function 254
9.2 Estimation of parameters of GLM 255
9.2.1 Maximum likelihood estimation 255
9.2.2 Iteratively reweighted least squares (IRWLS) algorithm 256
9.3 Deviance and AIC 258
9.4 Poisson regression 263
9.4.1 Poisson regression for rates 266
9.5 Gamma regression 269
9.6 Technical notes 273
9.6.1 Mean and variance of the exponential family of distributions 273
9.6.2 MLE of β and its evaluation using the IRWLS algorithm 274
Exercises 277
10 Survival analysis 281
10.1 Hazard rate and survival distribution 282
10.2 Kaplan-Meier estimator 283
10.3 Logrank test 286
10.4 Cox’s proportional hazards model 289
10.4.1 Estimation 290
10.4.2 Examples 291
10.4.3 Time-dependent covariates 295
10.5 Technical notes 300
10.5.1 ML estimation of the Cox proportional hazards model 300
Exercises 301
Appendix A Primer on matrix algebra and multivariate distributions 305
A.1 Review of matrix algebra 305
A.2 Review of multivariate distributions 307
A.3 Multivariate normal distribution 309
Appendix B Primer on maximum likelihood estimation 311
B.1 Maximum likelihood estimation 311
B.2 Large sample inference on MLEs 313
B.3 Newton-Raphson and Fisher scoring algorithms 315
B.4 Technical notes 317
Appendix C Projects 319
C.1 Project 1 321
C.2 Project 2 322
C.3 Project 3 324
Appendix D Statistical tables 327
References 339
Answers to selected exercises 343
Index 355