Table of Contents
List of Contributors xix
Series Preface xxiii
Preface xxv
1 Intelligent Data Analysis: Black Box Versus White Box Modeling 1
Sarthak Gupta, Siddhant Bagga, and Deepak Kumar Sharma
1.1 Introduction 1
1.1.1 Intelligent Data Analysis 1
1.1.2 Applications of IDA and Machine Learning 2
1.1.3 White Box Models Versus Black Box Models 2
1.1.4 Model Interpretability 3
1.2 Interpretation of White Box Models 3
1.2.1 Linear Regression 3
1.2.2 Decision Tree 5
1.3 Interpretation of Black Box Models 7
1.3.1 Partial Dependence Plot 7
1.3.2 Individual Conditional Expectation 9
1.3.3 Accumulated Local Effects 9
1.3.4 Global Surrogate Models 12
1.3.5 Local Interpretable Model-Agnostic Explanations 12
1.3.6 Feature Importance 12
1.4 Issues and Further Challenges 13
1.5 Summary 13
References 14
2 Data: Its Nature and Modern Data Analytical Tools 17
Ravinder Ahuja, Shikhar Asthana, Ayush Ahuja, and Manu Agarwal
2.1 Introduction 17
2.2 Data Types and Various File Formats 18
2.2.1 Structured Data 18
2.2.2 Semi-Structured Data 20
2.2.3 Unstructured Data 20
2.2.4 Need for File Formats 21
2.2.5 Various Types of File Formats 22
2.2.5.1 Comma Separated Values (CSV) 22
2.2.5.2 ZIP 22
2.2.5.3 Plain Text (txt) 23
2.2.5.4 JSON 23
2.2.5.5 XML 23
2.2.5.6 Image Files 24
2.2.5.7 HTML 24
2.3 Overview of Big Data 25
2.3.1 Sources of Big Data 27
2.3.1.1 Media 27
2.3.1.2 The Web 27
2.3.1.3 Cloud 27
2.3.1.4 Internet of Things 27
2.3.1.5 Databases 27
2.3.1.6 Archives 28
2.3.2 Big Data Analytics 28
2.3.2.1 Descriptive Analytics 28
2.3.2.2 Predictive Analytics 28
2.3.2.3 Prescriptive Analytics 29
2.4 Data Analytics Phases 29
2.5 Data Analytical Tools 30
2.5.1 Microsoft Excel 30
2.5.2 Apache Spark 33
2.5.3 OpenRefine 34
2.5.4 R Programming 35
2.5.4.1 Advantages of R 36
2.5.4.2 Disadvantages of R 36
2.5.5 Tableau 36
2.5.5.1 How Tableau Works 36
2.5.5.2 Tableau Feature 37
2.5.5.3 Advantages 37
2.5.5.4 Disadvantages 37
2.5.6 Hadoop 37
2.5.6.1 Basic Components of Hadoop 38
2.5.6.2 Benefits 38
2.6 Database Management System for Big Data Analytics 38
2.6.1 Hadoop Distributed File System 38
2.6.2 NoSQL 38
2.6.2.1 Categories of NoSQL 39
2.7 Challenges in Big Data Analytics 39
2.7.1 Storage of Data 40
2.7.2 Synchronization of Data 40
2.7.3 Security of Data 40
2.7.4 Fewer Professionals 40
2.8 Conclusion 40
References 41
3 Statistical Methods for Intelligent Data Analysis: Introduction and Various Concepts 43
Shubham Kumaram, Samarth Chugh, and Deepak Kumar Sharma
3.1 Introduction 43
3.2 Probability 43
3.2.1 Definitions 43
3.2.1.1 Random Experiments 43
3.2.1.2 Probability 44
3.2.1.3 Probability Axioms 44
3.2.1.4 Conditional Probability 44
3.2.1.5 Independence 44
3.2.1.6 Random Variable 44
3.2.1.7 Probability Distribution 45
3.2.1.8 Expectation 45
3.2.1.9 Variance and Standard Deviation 45
3.2.2 Bayes’ Rule 45
3.3 Descriptive Statistics 46
3.3.1 Picture Representation 46
3.3.1.1 Frequency Distribution 46
3.3.1.2 Simple Frequency Distribution 46
3.3.1.3 Grouped Frequency Distribution 46
3.3.1.4 Stem and Leaf Display 46
3.3.1.5 Histogram and Bar Chart 47
3.3.2 Measures of Central Tendency 47
3.3.2.1 Mean 47
3.3.2.2 Median 47
3.3.2.3 Mode 47
3.3.3 Measures of Variability 48
3.3.3.1 Range 48
3.3.3.2 Box Plot 48
3.3.3.3 Variance and Standard Deviation 48
3.3.4 Skewness and Kurtosis 48
3.4 Inferential Statistics 49
3.4.1 Frequentist Inference 49
3.4.1.1 Point Estimation 50
3.4.1.2 Interval Estimation 50
3.4.2 Hypothesis Testing 51
3.4.3 Statistical Significance 51
3.5 Statistical Methods 52
3.5.1 Regression 52
3.5.1.1 Linear Model 52
3.5.1.2 Nonlinear Models 52
3.5.1.3 Generalized Linear Models 53
3.5.1.4 Analysis of Variance 53
3.5.1.5 Multivariate Analysis of Variance 55
3.5.1.6 Log-Linear Models 55
3.5.1.7 Logistic Regression 56
3.5.1.8 Random Effects Model 56
3.5.1.9 Overdispersion 57
3.5.1.10 Hierarchical Models 57
3.5.2 Analysis of Survival Data 57
3.5.3 Principal Component Analysis 58
3.6 Errors 59
3.6.1 Error in Regression 60
3.6.2 Error in Classification 61
3.7 Conclusion 61
References 61
4 Intelligent Data Analysis with Data Mining: Theory and Applications 63
Shivam Bachhety, Ramneek Singhal, and Rachna Jain
Objective 63
4.1 Introduction to Data Mining 63
4.1.1 Importance of Intelligent Data Analytics in Business 64
4.1.2 Importance of Intelligent Data Analytics in Health Care 65
4.2 Data and Knowledge 65
4.3 Discovering Knowledge in Data Mining 66
4.3.1 Process Mining 67
4.3.2 Process of Knowledge Discovery 67
4.4 Data Analysis and Data Mining 69
4.5 Data Mining: Issues 69
4.6 Data Mining: Systems and Query Language 71
4.6.1 Data Mining Systems 71
4.6.2 Data Mining Query Language 72
4.7 Data Mining Methods 73
4.7.1 Classification 74
4.7.2 Cluster Analysis 75
4.7.3 Association 75
4.7.4 Decision Tree Induction 76
4.8 Data Exploration 77
4.9 Data Visualization 80
4.10 Probability Concepts for Intelligent Data Analysis (IDA) 83
Reference 83
5 Intelligent Data Analysis: Deep Learning and Visualization 85
Than D. Le and Huy V. Pham
5.1 Introduction 85
5.2 Deep Learning and Visualization 86
5.2.1 Linear and Logistic Regression and Visualization 86
5.2.2 CNN Architecture 89
5.2.2.1 Vanishing Gradient Problem 90
5.2.2.2 Convolutional Neural Networks (CNNs) 91
5.2.3 Reinforcement Learning 91
5.2.4 Inception and ResNet Networks 93
5.2.5 Softmax 94
5.3 Data Processing and Visualization 97
5.3.1 Regularization for Deep Learning and Visualization 98
5.3.1.1 Regularization for Linear Regression 98
5.4 Experiments and Results 102
5.4.1 Mask RCNN Based on Object Detection and Segmentation 102
5.4.2 Deep Matrix Factorization 108
5.4.2.1 Network Visualization 108
5.4.3 Deep Learning and Reinforcement Learning 111
5.5 Conclusion 112
References 113
6 A Systematic Review on the Evolution of Dental Caries Detection Methods and Its Significance in Data Analysis Perspective 115
Soma Datta, Nabendu Chaki, and Biswajit Modak
6.1 Introduction 115
6.1.1 Analysis of Dental Caries 115
6.2 Different Caries Lesion Detection Methods and Data Characterization 119
6.2.1 Point Detection Method 120
6.2.2 Visible Light Property Method 121
6.2.3 Radiographs 121
6.2.4 Light-Emitting Devices 123
6.2.5 Optical Coherence Tomography (OCT) 125
6.2.6 Software Tools 125
6.3 Technical Challenges with the Existing Methods 126
6.3.1 Challenges in Data Analysis Perspective 127
6.4 Result Analysis 129
6.5 Conclusion 129
Acknowledgment 131
References 131
7 Intelligent Data Analysis Using Hadoop Cluster-Inspired MapReduce Framework and Association Rule Mining on Educational Domain 137
Pratiyush Guleria and Manu Sood
7.1 Introduction 137
7.1.1 Research Areas of IDA 138
7.1.2 The Need for IDA in Education 139
7.2 Learning Analytics in Education 139
7.2.1 Role of Web-Enabled and Mobile Computing in Education 141
7.2.2 Benefits of Learning Analytics 142
7.2.3 Future Research Directions of IDA 142
7.3 Motivation 142
7.4 Literature Review 143
7.4.1 Association Rule Mining and Big Data 143
7.5 Intelligent Data Analytical Tools 145
7.6 Intelligent Data Analytics Using MapReduce Framework in an Educational Domain 149
7.6.1 Data Description 149
7.6.2 Objective 150
7.6.3 Proposed Methodology 150
7.6.3.1 Stage 1 MapReduce Algorithm 150
7.6.3.2 Stage 2 Apriori Algorithm 150
7.7 Results 151
7.8 Conclusion and Future Scope 153
References 153
8 Influence of Green Space on Global Air Quality Monitoring: Data Analysis Using K-Means Clustering Algorithm 157
Gihan S. Pathirana and Malka N. Halgamuge
8.1 Introduction 157
8.2 Material and Methods 159
8.2.1 Data Collection 159
8.2.2 Data Inclusion Criteria 159
8.2.3 Data Preprocessing 159
8.2.4 Data Analysis 161
8.3 Results 161
8.4 Quantitative Analysis 163
8.4.1 K-Means Clustering 163
8.4.2 Level of Difference of Green Area 167
8.5 Discussion 167
8.6 Conclusion 169
References 170
9 IDA with Space Technology and Geographic Information System 173
Bright Keswani, Tarini Ch. Mishra, Ambarish G. Mohapatra, Poonam Keswani, Priyatosh Sahu, and Anish Kumar Sarangi
9.1 Introduction 173
9.1.1 Real-Time in Space 176
9.1.2 Generating Programming Triggers 178
9.1.3 Analytical Architecture 178
9.1.4 Remote Sensing Big Data Acquisition Unit (RSDU) 180
9.1.5 Data Processing Unit 180
9.1.6 Data Analysis and Decision Unit 181
9.1.7 Analysis 181
9.1.8 Incorporating Machine Learning and Artificial Intelligence 181
9.1.8.1 Methodologies Applicable 182
9.1.8.2 Support Vector Machines (SVM) and Cross-Validation 182
9.1.8.3 Massively Parallel Computing and I/O 183
9.1.8.4 Data Architecture and Governance 183
9.1.9 Real-Time Spacecraft Detection 185
9.1.9.1 Active Phased Array 186
9.1.9.2 Relay Communication 186
9.1.9.3 Low-Latency Random Access 186
9.1.9.4 Channel Modeling and Prediction 186
9.2 Geospatial Techniques 187
9.2.1 The Big-GIS 187
9.2.2 Technologies Applied 187
9.2.2.1 Internet of Things and Sensor Web 188
9.2.2.2 Cloud Computing 188
9.2.2.3 Stream Processing 188
9.2.2.4 Big Data Analytics 188
9.2.2.5 Coordinated Observation 188
9.2.2.6 Big Geospatial Data Management 189
9.2.2.7 Parallel Geocomputation Framework 189
9.2.3 Data Collection Using GIS 189
9.2.3.1 NoSQL Databases 190
9.2.3.2 Parallel Processing 190
9.2.3.3 Knowledge Discovery and Intelligent Service 190
9.2.3.4 Data Analysis 191
9.3 Comparative Analysis 192
9.4 Conclusion 192
References 194
10 Application of Intelligent Data Analysis in Intelligent Transportation System Using IoT 199
Rakesh Roshan and Om Prakash Rishi
10.1 Introduction to Intelligent Transportation System (ITS) 199
10.1.1 Working of Intelligent Transportation System 201
10.1.2 Services of Intelligent Transportation System 201
10.1.3 Advantages of Intelligent Transportation System 203
10.2 Issues and Challenges of Intelligent Transportation System (ITS) 204
10.2.1 Communication Technology Used Currently in ITS 205
10.2.2 Challenges in the Implementation of ITS 206
10.2.3 Opportunity for Popularity of Automated/Autonomous/Self-Driving Car or Vehicle 207
10.3 Intelligent Data Analysis Makes an IoT-Based Transportation System Intelligent 208
10.3.1 Introduction to Intelligent Data Analysis 208
10.3.2 How IDA Makes IoT-Based Transportation Systems Intelligent 210
10.3.2.1 Traffic Management Through IoT and Intelligent Data Analysis 210
10.3.2.2 Tracking of Multiple Vehicles 211
10.4 Intelligent Data Analysis for Security in Intelligent Transportation System 212
10.5 Tools to Support IDA in an Intelligent Transportation System 215
References 217
11 Applying Big Data Analytics on Motor Vehicle Collision Predictions in New York City 219
Dhanushka Abeyratne and Malka N. Halgamuge
11.1 Introduction 219
11.1.1 Overview of Big Data Analytics on Motor Vehicle Collision Predictions 219
11.2 Materials and Methods 220
11.2.1 Collection of Raw Data 220
11.2.2 Data Inclusion Criteria 220
11.2.3 Data Preprocessing 220
11.2.4 Data Analysis 221
11.3 Classification Algorithms and K-Fold Validation Using Data Set Obtained from NYPD (2012-2017) 223
11.3.1 Classification Algorithms 223
11.3.1.1 k-Fold Cross-Validation 223
11.3.2 Statistical Analysis 225
11.4 Results 225
11.4.1 Measured Processing Time and Accuracy of Each Classifier 225
11.4.2 Measured p-Value in Each Vehicle Group Using K-Means Clustering/One-Way ANOVA 227
11.4.3 Identified High Collision Concentration Locations of Each Vehicle Group 229
11.4.4 Measured Different Criteria for Further Analysis of NYPD Data Set (2012-2017) 229
11.5 Discussion 233
11.6 Conclusion 237
References 238
12 A Smart and Promising Neurological Disorder Diagnostic System: An Amalgamation of Big Data, IoT, and Emerging Computing Techniques 241
Prableen Kaur and Manik Sharma
12.1 Introduction 241
12.1.1 Difference Between Neurological and Psychological Disorders 241
12.2 Statistics of Neurological Disorders 243
12.3 Emerging Computing Techniques 244
12.3.1 Internet of Things 244
12.3.2 Big Data 245
12.3.3 Soft Computing Techniques 245
12.4 Related Works and Publication Trends of Articles 249
12.5 The Need for Neurological Disorders Diagnostic System 251
12.5.1 Design of Smart and Intelligent Neurological Disorders Diagnostic System 251
12.6 Conclusion 259
References 260
13 Comments-Based Analysis of a Bug Report Collection System and Its Applications 265
Arvinder Kaur and Shubhra Goyal
13.1 Introduction 265
13.2 Background 267
13.2.1 Issue Tracking System 267
13.2.2 Bug Report Statistics 267
13.3 Related Work 268
13.3.1 Data Extraction Process 268
13.3.2 Applications of Bug Report Comments 270
13.3.2.1 Bug Summarization 270
13.3.2.2 Emotion Mining 271
13.4 Data Collection Process 272
13.4.1 Steps of Data Extraction 273
13.4.2 Block Diagram for Data Extraction 274
13.4.3 Reports Generated 274
13.4.3.1 Bug Attribute Report 274
13.4.3.2 Long Description Report 275
13.4.3.3 Bug Comments Reports 275
13.4.3.4 Error Report 275
13.5 Analysis of Bug Reports 275
13.5.1 Research Question 1: Is the Performance of Software Affected by Open Bugs that are Critical in Nature? 275
13.5.2 Research Question 2: How Can Test Leads Improve the Performance of Software Systems? 277
13.5.3 Research Question 3: Which Are the Most Error-Prone Areas that Can Cause System Failure? 277
13.5.4 Research Question 4: Which Are the Most Frequent Words and Keywords to Predict Most Critical Bugs? 279
13.5.5 Research Question 5: What is the Importance of Frequent Words Mined from Bug Reports? 281
13.6 Threats to Validity 284
13.7 Conclusion 284
References 286
14 Sarcasm Detection Algorithms Based on Sentiment Strength 289
Pragya Katyayan and Nisheeth Joshi
14.1 Introduction 289
14.2 Literature Survey 291
14.3 Experiment 294
14.3.1 Data Collection 294
14.3.2 Finding SentiStrengths 294
14.3.3 Proposed Algorithm 295
14.3.4 Explanation of the Algorithms 297
14.3.5 Classification 300
14.3.5.1 Explanation 300
14.3.6 Evaluation 302
14.4 Results and Evaluation 303
14.5 Conclusion 305
References 305
15 SNAP: Social Network Analysis Using Predictive Modeling 307
Samridhi Seth and Rahul Johari
15.1 Introduction 307
15.1.1 Types of Predictive Analytics Models 307
15.1.2 Predictive Analytics Techniques 308
15.1.2.1 Regression Techniques 308
15.1.2.2 Machine Learning Techniques 308
15.2 Literature Survey 309
15.3 Comparative Study 313
15.4 Simulation and Analysis 313
15.4.1 Few Analyses Made on the Data Set Are Given Below 314
15.4.1.1 Duration of Each Contact Was Found 314
15.4.1.2 Total Number of Contacts of Source Node with Destination Node Was Found for all Nodes 314
15.4.1.3 Total Duration of Contact of Source Node with Each Node Was Found 315
15.4.1.4 Mobility Pattern Describes Direction of Contact and Relation Between Number of Contacts and Duration of Contact 315
15.4.1.5 Unidirectional Contact, that is, Only 1 Node is Contacting Second Node but Vice Versa is Not There 317
15.4.1.6 Graphical Representation for the Duration of Contacts with Each Node is Given Below 317
15.4.1.7 Rank and Percentile for Number of Contacts with Each Node 320
15.4.1.8 Data Set is Described for Three Days Where Time is Calculated in Seconds. Data Set can be Divided Into Three Days. Some of the Analyses Conducted on the Data Set Day Wise Are Given Below 326
15.5 Conclusion and Future Work 329
References 329
16 Intelligent Data Analysis for Medical Applications 333
Moolchand Sharma, Vikas Chaudhary, Prerna Sharma, and R. S. Bhatia
16.1 Introduction 333
16.1.1 IDA (Intelligent Data Analysis) 335
16.1.1.1 Elicitation of Background Knowledge 337
16.1.2 Medical Applications 337
16.2 IDA Needs in Medical Applications 338
16.2.1 Public Health 339
16.2.2 Electronic Health Record 339
16.2.3 Patient Profile Analytics 339
16.2.3.1 Patient’s Profile 339
16.3 IDA Methods Classifications 339
16.3.1 Data Abstraction 339
16.3.2 Data Mining Method 340
16.3.3 Temporal Data Mining 341
16.4 Intelligent Decision Support System in Medical Applications 341
16.4.1 Need for Intelligent Decision System (IDS) 342
16.4.2 Understanding Intelligent Decision Support: Some Definitions 342
16.4.3 Advantages/Disadvantages of IDS 344
16.5 Conclusion 345
References 345
17 Bruxism Detection Using Single-Channel C4-A1 on Human Sleep S2 Stage Recording 347
Md Belal Bin Heyat, Dakun Lai, Faijan Akhtar, Mohd Ammar Bin Hayat, Shafan Azad, Shadab Azad, and Shajan Azad
17.1 Introduction 347
17.1.1 Side Effect of Poor Snooze 348
17.2 History of Sleep Disorder 349
17.2.1 Classification of Sleep Disorder 349
17.2.2 Sleep Stages of the Human 351
17.3 Electroencephalogram Signal 351
17.3.1 Electroencephalogram Generation 351
17.3.1.1 Classification of Electroencephalogram Signal 352
17.4 EEG Data Measurement Technique 352
17.4.1 10-20 Electrode Positioning System 352
17.4.1.1 Procedure of Electrode Placement 353
17.5 Literature Review 354
17.6 Subjects and Methodology 354
17.6.1 Data Collection 354
17.6.2 Low Pass Filter 355
17.6.3 Hanning Window 355
17.6.4 Welch Method 356
17.7 Data Analysis of the Bruxism and Normal Data Using EEG Signal 356
17.8 Result 358
17.9 Conclusions 361
Acknowledgments 363
References 364
18 Handwriting Analysis for Early Detection of Alzheimer’s Disease 369
Rajib Saha, Anirban Mukherjee, Aniruddha Sadhukhan, Anisha Roy, and Manashi De
18.1 Introduction and Background 369
18.2 Proposed Work and Methodology 376
18.3 Results and Discussions 379
18.3.1 Character Segmentation 380
18.4 Conclusion 384
References 385
Index 387