Intelligent Data Analysis. From Data Gathering to Data Comprehension. Edition No. 1. The Wiley Series in Intelligent Signal and Data Processing

  • Book

  • 432 Pages
  • June 2020
  • John Wiley and Sons Ltd
  • ID: 5837958
This book focuses on methods and tools for intelligent data analysis, aimed at narrowing the increasing gap between data gathering and data comprehension. It also emphasizes solving problems that result from automated data collection, such as analysis of computer-based patient records, data warehousing tools, intelligent alarming, and effective and efficient monitoring. The book describes the different approaches to Intelligent Data Analysis from a practical point of view: solving common real-life problems with data analysis tools.

Table of Contents

List of Contributors xix

Series Preface xxiii

Preface xxv

1 Intelligent Data Analysis: Black Box Versus White Box Modeling 1
Sarthak Gupta, Siddhant Bagga, and Deepak Kumar Sharma

1.1 Introduction 1

1.1.1 Intelligent Data Analysis 1

1.1.2 Applications of IDA and Machine Learning 2

1.1.3 White Box Models Versus Black Box Models 2

1.1.4 Model Interpretability 3

1.2 Interpretation of White Box Models 3

1.2.1 Linear Regression 3

1.2.2 Decision Tree 5

1.3 Interpretation of Black Box Models 7

1.3.1 Partial Dependence Plot 7

1.3.2 Individual Conditional Expectation 9

1.3.3 Accumulated Local Effects 9

1.3.4 Global Surrogate Models 12

1.3.5 Local Interpretable Model-Agnostic Explanations 12

1.3.6 Feature Importance 12

1.4 Issues and Further Challenges 13

1.5 Summary 13

References 14

2 Data: Its Nature and Modern Data Analytical Tools 17
Ravinder Ahuja, Shikhar Asthana, Ayush Ahuja, and Manu Agarwal

2.1 Introduction 17

2.2 Data Types and Various File Formats 18

2.2.1 Structured Data 18

2.2.2 Semi-Structured Data 20

2.2.3 Unstructured Data 20

2.2.4 Need for File Formats 21

2.2.5 Various Types of File Formats 22

2.2.5.1 Comma Separated Values (CSV) 22

2.2.5.2 ZIP 22

2.2.5.3 Plain Text (txt) 23

2.2.5.4 JSON 23

2.2.5.5 XML 23

2.2.5.6 Image Files 24

2.2.5.7 HTML 24

2.3 Overview of Big Data 25

2.3.1 Sources of Big Data 27

2.3.1.1 Media 27

2.3.1.2 The Web 27

2.3.1.3 Cloud 27

2.3.1.4 Internet of Things 27

2.3.1.5 Databases 27

2.3.1.6 Archives 28

2.3.2 Big Data Analytics 28

2.3.2.1 Descriptive Analytics 28

2.3.2.2 Predictive Analytics 28

2.3.2.3 Prescriptive Analytics 29

2.4 Data Analytics Phases 29

2.5 Data Analytical Tools 30

2.5.1 Microsoft Excel 30

2.5.2 Apache Spark 33

2.5.3 Open Refine 34

2.5.4 R Programming 35

2.5.4.1 Advantages of R 36

2.5.4.2 Disadvantages of R 36

2.5.5 Tableau 36

2.5.5.1 How Tableau Works 36

2.5.5.2 Tableau Feature 37

2.5.5.3 Advantages 37

2.5.5.4 Disadvantages 37

2.5.6 Hadoop 37

2.5.6.1 Basic Components of Hadoop 38

2.5.6.2 Benefits 38

2.6 Database Management System for Big Data Analytics 38

2.6.1 Hadoop Distributed File System 38

2.6.2 NoSQL 38

2.6.2.1 Categories of NoSQL 39

2.7 Challenges in Big Data Analytics 39

2.7.1 Storage of Data 40

2.7.2 Synchronization of Data 40

2.7.3 Security of Data 40

2.7.4 Fewer Professionals 40

2.8 Conclusion 40

References 41

3 Statistical Methods for Intelligent Data Analysis: Introduction and Various Concepts 43
Shubham Kumaram, Samarth Chugh, and Deepak Kumar Sharma

3.1 Introduction 43

3.2 Probability 43

3.2.1 Definitions 43

3.2.1.1 Random Experiments 43

3.2.1.2 Probability 44

3.2.1.3 Probability Axioms 44

3.2.1.4 Conditional Probability 44

3.2.1.5 Independence 44

3.2.1.6 Random Variable 44

3.2.1.7 Probability Distribution 45

3.2.1.8 Expectation 45

3.2.1.9 Variance and Standard Deviation 45

3.2.2 Bayes’ Rule 45

3.3 Descriptive Statistics 46

3.3.1 Picture Representation 46

3.3.1.1 Frequency Distribution 46

3.3.1.2 Simple Frequency Distribution 46

3.3.1.3 Grouped Frequency Distribution 46

3.3.1.4 Stem and Leaf Display 46

3.3.1.5 Histogram and Bar Chart 47

3.3.2 Measures of Central Tendency 47

3.3.2.1 Mean 47

3.3.2.2 Median 47

3.3.2.3 Mode 47

3.3.3 Measures of Variability 48

3.3.3.1 Range 48

3.3.3.2 Box Plot 48

3.3.3.3 Variance and Standard Deviation 48

3.3.4 Skewness and Kurtosis 48

3.4 Inferential Statistics 49

3.4.1 Frequentist Inference 49

3.4.1.1 Point Estimation 50

3.4.1.2 Interval Estimation 50

3.4.2 Hypothesis Testing 51

3.4.3 Statistical Significance 51

3.5 Statistical Methods 52

3.5.1 Regression 52

3.5.1.1 Linear Model 52

3.5.1.2 Nonlinear Models 52

3.5.1.3 Generalized Linear Models 53

3.5.1.4 Analysis of Variance 53

3.5.1.5 Multivariate Analysis of Variance 55

3.5.1.6 Log-Linear Models 55

3.5.1.7 Logistic Regression 56

3.5.1.8 Random Effects Model 56

3.5.1.9 Overdispersion 57

3.5.1.10 Hierarchical Models 57

3.5.2 Analysis of Survival Data 57

3.5.3 Principal Component Analysis 58

3.6 Errors 59

3.6.1 Error in Regression 60

3.6.2 Error in Classification 61

3.7 Conclusion 61

References 61

4 Intelligent Data Analysis with Data Mining: Theory and Applications 63
Shivam Bachhety, Ramneek Singhal, and Rachna Jain

Objective 63

4.1 Introduction to Data Mining 63

4.1.1 Importance of Intelligent Data Analytics in Business 64

4.1.2 Importance of Intelligent Data Analytics in Health Care 65

4.2 Data and Knowledge 65

4.3 Discovering Knowledge in Data Mining 66

4.3.1 Process Mining 67

4.3.2 Process of Knowledge Discovery 67

4.4 Data Analysis and Data Mining 69

4.5 Data Mining: Issues 69

4.6 Data Mining: Systems and Query Language 71

4.6.1 Data Mining Systems 71

4.6.2 Data Mining Query Language 72

4.7 Data Mining Methods 73

4.7.1 Classification 74

4.7.2 Cluster Analysis 75

4.7.3 Association 75

4.7.4 Decision Tree Induction 76

4.8 Data Exploration 77

4.9 Data Visualization 80

4.10 Probability Concepts for Intelligent Data Analysis (IDA) 83

Reference 83

5 Intelligent Data Analysis: Deep Learning and Visualization 85
Than D. Le and Huy V. Pham

5.1 Introduction 85

5.2 Deep Learning and Visualization 86

5.2.1 Linear and Logistic Regression and Visualization 86

5.2.2 CNN Architecture 89

5.2.2.1 Vanishing Gradient Problem 90

5.2.2.2 Convolutional Neural Networks (CNNs) 91

5.2.3 Reinforcement Learning 91

5.2.4 Inception and ResNet Networks 93

5.2.5 Softmax 94

5.3 Data Processing and Visualization 97

5.3.1 Regularization for Deep Learning and Visualization 98

5.3.1.1 Regularization for Linear Regression 98

5.4 Experiments and Results 102

5.4.1 Mask RCNN Based on Object Detection and Segmentation 102

5.4.2 Deep Matrix Factorization 108

5.4.2.1 Network Visualization 108

5.4.3 Deep Learning and Reinforcement Learning 111

5.5 Conclusion 112

References 113

6 A Systematic Review on the Evolution of Dental Caries Detection Methods and Its Significance in Data Analysis Perspective 115
Soma Datta, Nabendu Chaki, and Biswajit Modak

6.1 Introduction 115

6.1.1 Analysis of Dental Caries 115

6.2 Different Caries Lesion Detection Methods and Data Characterization 119

6.2.1 Point Detection Method 120

6.2.2 Visible Light Property Method 121

6.2.3 Radiographs 121

6.2.4 Light-Emitting Devices 123

6.2.5 Optical Coherent Tomography (OCT) 125

6.2.6 Software Tools 125

6.3 Technical Challenges with the Existing Methods 126

6.3.1 Challenges in Data Analysis Perspective 127

6.4 Result Analysis 129

6.5 Conclusion 129

Acknowledgment 131

References 131

7 Intelligent Data Analysis Using Hadoop Cluster - Inspired MapReduce Framework and Association Rule Mining on Educational Domain 137
Pratiyush Guleria and Manu Sood

7.1 Introduction 137

7.1.1 Research Areas of IDA 138

7.1.2 The Need for IDA in Education 139

7.2 Learning Analytics in Education 139

7.2.1 Role of Web-Enabled and Mobile Computing in Education 141

7.2.2 Benefits of Learning Analytics 142

7.2.3 Future Research Directions of IDA 142

7.3 Motivation 142

7.4 Literature Review 143

7.4.1 Association Rule Mining and Big Data 143

7.5 Intelligent Data Analytical Tools 145

7.6 Intelligent Data Analytics Using MapReduce Framework in an Educational Domain 149

7.6.1 Data Description 149

7.6.2 Objective 150

7.6.3 Proposed Methodology 150

7.6.3.1 Stage 1 Map Reduce Algorithm 150

7.6.3.2 Stage 2 Apriori Algorithm 150

7.7 Results 151

7.8 Conclusion and Future Scope 153

References 153

8 Influence of Green Space on Global Air Quality Monitoring: Data Analysis Using K-Means Clustering Algorithm 157
Gihan S. Pathirana and Malka N. Halgamuge

8.1 Introduction 157

8.2 Material and Methods 159

8.2.1 Data Collection 159

8.2.2 Data Inclusion Criteria 159

8.2.3 Data Preprocessing 159

8.2.4 Data Analysis 161

8.3 Results 161

8.4 Quantitative Analysis 163

8.4.1 K-Means Clustering 163

8.4.2 Level of Difference of Green Area 167

8.5 Discussion 167

8.6 Conclusion 169

References 170

9 IDA with Space Technology and Geographic Information System 173
Bright Keswani, Tarini Ch. Mishra, Ambarish G. Mohapatra, Poonam Keswani, Priyatosh Sahu, and Anish Kumar Sarangi

9.1 Introduction 173

9.1.1 Real-Time in Space 176

9.1.2 Generating Programming Triggers 178

9.1.3 Analytical Architecture 178

9.1.4 Remote Sensing Big Data Acquisition Unit (RSDU) 180

9.1.5 Data Processing Unit 180

9.1.6 Data Analysis and Decision Unit 181

9.1.7 Analysis 181

9.1.8 Incorporating Machine Learning and Artificial Intelligence 181

9.1.8.1 Methodologies Applicable 182

9.1.8.2 Support Vector Machines (SVM) and Cross-Validation 182

9.1.8.3 Massively Parallel Computing and I/O 183

9.1.8.4 Data Architecture and Governance 183

9.1.9 Real-Time Spacecraft Detection 185

9.1.9.1 Active Phased Array 186

9.1.9.2 Relay Communication 186

9.1.9.3 Low-Latency Random Access 186

9.1.9.4 Channel Modeling and Prediction 186

9.2 Geospatial Techniques 187

9.2.1 The Big-GIS 187

9.2.2 Technologies Applied 187

9.2.2.1 Internet of Things and Sensor Web 188

9.2.2.2 Cloud Computing 188

9.2.2.3 Stream Processing 188

9.2.2.4 Big Data Analytics 188

9.2.2.5 Coordinated Observation 188

9.2.2.6 Big Geospatial Data Management 189

9.2.2.7 Parallel Geocomputation Framework 189

9.2.3 Data Collection Using GIS 189

9.2.3.1 NoSQL Databases 190

9.2.3.2 Parallel Processing 190

9.2.3.3 Knowledge Discovery and Intelligent Service 190

9.2.3.4 Data Analysis 191

9.3 Comparative Analysis 192

9.4 Conclusion 192

References 194

10 Application of Intelligent Data Analysis in Intelligent Transportation System Using IoT 199
Rakesh Roshan and Om Prakash Rishi

10.1 Introduction to Intelligent Transportation System (ITS) 199

10.1.1 Working of Intelligent Transportation System 201

10.1.2 Services of Intelligent Transportation System 201

10.1.3 Advantages of Intelligent Transportation System 203

10.2 Issues and Challenges of Intelligent Transportation System (ITS) 204

10.2.1 Communication Technology Used Currently in ITS 205

10.2.2 Challenges in the Implementation of ITS 206

10.2.3 Opportunity for Popularity of Automated/Autonomous/Self-Driving Car or Vehicle 207

10.3 Intelligent Data Analysis Makes an IoT-Based Transportation System Intelligent 208

10.3.1 Introduction to Intelligent Data Analysis 208

10.3.2 How IDA Makes IoT-Based Transportation Systems Intelligent 210

10.3.2.1 Traffic Management Through IoT and Intelligent Data Analysis 210

10.3.2.2 Tracking of Multiple Vehicles 211

10.4 Intelligent Data Analysis for Security in Intelligent Transportation System 212

10.5 Tools to Support IDA in an Intelligent Transportation System 215

References 217

11 Applying Big Data Analytics on Motor Vehicle Collision Predictions in New York City 219
Dhanushka Abeyratne and Malka N. Halgamuge

11.1 Introduction 219

11.1.1 Overview of Big Data Analytics on Motor Vehicle Collision Predictions 219

11.2 Materials and Methods 220

11.2.1 Collection of Raw Data 220

11.2.2 Data Inclusion Criteria 220

11.2.3 Data Preprocessing 220

11.2.4 Data Analysis 221

11.3 Classification Algorithms and K-Fold Validation Using Data Set Obtained from NYPD (2012-2017) 223

11.3.1 Classification Algorithms 223

11.3.1.1 k-Fold Cross-Validation 223

11.3.2 Statistical Analysis 225

11.4 Results 225

11.4.1 Measured Processing Time and Accuracy of Each Classifier 225

11.4.2 Measured p-Value in Each Vehicle Group Using K-Means Clustering/One-Way ANOVA 227

11.4.3 Identified High Collision Concentration Locations of Each Vehicle Group 229

11.4.4 Measured Different Criteria for Further Analysis of NYPD Data Set (2012-2017) 229

11.5 Discussion 233

11.6 Conclusion 237

References 238

12 A Smart and Promising Neurological Disorder Diagnostic System: An Amalgamation of Big Data, IoT, and Emerging Computing Techniques 241
Prableen Kaur and Manik Sharma

12.1 Introduction 241

12.1.1 Difference Between Neurological and Psychological Disorders 241

12.2 Statistics of Neurological Disorders 243

12.3 Emerging Computing Techniques 244

12.3.1 Internet of Things 244

12.3.2 Big Data 245

12.3.3 Soft Computing Techniques 245

12.4 Related Works and Publication Trends of Articles 249

12.5 The Need for Neurological Disorders Diagnostic System 251

12.5.1 Design of Smart and Intelligent Neurological Disorders Diagnostic System 251

12.6 Conclusion 259

References 260

13 Comments-Based Analysis of a Bug Report Collection System and Its Applications 265
Arvinder Kaur and Shubhra Goyal

13.1 Introduction 265

13.2 Background 267

13.2.1 Issue Tracking System 267

13.2.2 Bug Report Statistics 267

13.3 Related Work 268

13.3.1 Data Extraction Process 268

13.3.2 Applications of Bug Report Comments 270

13.3.2.1 Bug Summarization 270

13.3.2.2 Emotion Mining 271

13.4 Data Collection Process 272

13.4.1 Steps of Data Extraction 273

13.4.2 Block Diagram for Data Extraction 274

13.4.3 Reports Generated 274

13.4.3.1 Bug Attribute Report 274

13.4.3.2 Long Description Report 275

13.4.3.3 Bug Comments Reports 275

13.4.3.4 Error Report 275

13.5 Analysis of Bug Reports 275

13.5.1 Research Question 1: Is the Performance of Software Affected by Open Bugs that are Critical in Nature? 275

13.5.2 Research Question 2: How Can Test Leads Improve the Performance of Software Systems? 277

13.5.3 Research Question 3: Which Are the Most Error-Prone Areas that Can Cause System Failure? 277

13.5.4 Research Question 4: Which Are the Most Frequent Words and Keywords to Predict Most Critical Bugs? 279

13.5.5 Research Question 5: What is the Importance of Frequent Words Mined from Bug Reports? 281

13.6 Threats to Validity 284

13.7 Conclusion 284

References 286

14 Sarcasm Detection Algorithms Based on Sentiment Strength 289
Pragya Katyayan and Nisheeth Joshi

14.1 Introduction 289

14.2 Literature Survey 291

14.3 Experiment 294

14.3.1 Data Collection 294

14.3.2 Finding SentiStrengths 294

14.3.3 Proposed Algorithm 295

14.3.4 Explanation of the Algorithms 297

14.3.5 Classification 300

14.3.5.1 Explanation 300

14.3.6 Evaluation 302

14.4 Results and Evaluation 303

14.5 Conclusion 305

References 305

15 SNAP: Social Network Analysis Using Predictive Modeling 307
Samridhi Seth and Rahul Johari

15.1 Introduction 307

15.1.1 Types of Predictive Analytics Models 307

15.1.2 Predictive Analytics Techniques 308

15.1.2.1 Regression Techniques 308

15.1.2.2 Machine Learning Techniques 308

15.2 Literature Survey 309

15.3 Comparative Study 313

15.4 Simulation and Analysis 313

15.4.1 Few Analyses Made on the Data Set Are Given Below 314

15.4.1.1 Duration of Each Contact Was Found 314

15.4.1.2 Total Number of Contacts of Source Node with Destination Node Was Found for all Nodes 314

15.4.1.3 Total Duration of Contact of Source Node with Each Node Was Found 315

15.4.1.4 Mobility Pattern Describes Direction of Contact and Relation Between Number of Contacts and Duration of Contact 315

15.4.1.5 Unidirectional Contact, that is, Only 1 Node is Contacting Second Node but Vice Versa is Not There 317

15.4.1.6 Graphical Representation for the Duration of Contacts with Each Node is Given below 317

15.4.1.7 Rank and Percentile for Number of Contacts with Each Node 320

15.4.1.8 Data Set is Described for Three Days Where Time is Calculated in Seconds. Data Set can be Divided Into Three Days. Some of the Analyses Conducted on the Data set Day Wise Are Given Below 326

15.5 Conclusion and Future Work 329

References 329

16 Intelligent Data Analysis for Medical Applications 333
Moolchand Sharma, Vikas Chaudhary, Prerna Sharma, and R. S. Bhatia

16.1 Introduction 333

16.1.1 IDA (Intelligent Data Analysis) 335

16.1.1.1 Elicitation of Background Knowledge 337

16.1.2 Medical Applications 337

16.2 IDA Needs in Medical Applications 338

16.2.1 Public Health 339

16.2.2 Electronic Health Record 339

16.2.3 Patient Profile Analytics 339

16.2.3.1 Patient’s Profile 339

16.3 IDA Methods Classifications 339

16.3.1 Data Abstraction 339

16.3.2 Data Mining Method 340

16.3.3 Temporal Data Mining 341

16.4 Intelligent Decision Support System in Medical Applications 341

16.4.1 Need for Intelligent Decision System (IDS) 342

16.4.2 Understanding Intelligent Decision Support: Some Definitions 342

16.4.3 Advantages/Disadvantages of IDS 344

16.5 Conclusion 345

References 345

17 Bruxism Detection Using Single-Channel C4-A1 on Human Sleep S2 Stage Recording 347
Md Belal Bin Heyat, Dakun Lai, Faijan Akhtar, Mohd Ammar Bin Hayat, Shafan Azad, Shadab Azad, and Shajan Azad

17.1 Introduction 347

17.1.1 Side Effect of Poor Snooze 348

17.2 History of Sleep Disorder 349

17.2.1 Classification of Sleep Disorder 349

17.2.2 Sleep Stages of the Human 351

17.3 Electroencephalogram Signal 351

17.3.1 Electroencephalogram Generation 351

17.3.1.1 Classification of Electroencephalogram Signal 352

17.4 EEG Data Measurement Technique 352

17.4.1 10-20 Electrode Positioning System 352

17.4.1.1 Procedure of Electrode Placement 353

17.5 Literature Review 354

17.6 Subjects and Methodology 354

17.6.1 Data Collection 354

17.6.2 Low Pass Filter 355

17.6.3 Hanning Window 355

17.6.4 Welch Method 356

17.7 Data Analysis of the Bruxism and Normal Data Using EEG Signal 356

17.8 Result 358

17.9 Conclusions 361

Acknowledgments 363

References 364

18 Handwriting Analysis for Early Detection of Alzheimer’s Disease 369
Rajib Saha, Anirban Mukherjee, Aniruddha Sadhukhan, Anisha Roy, and Manashi De

18.1 Introduction and Background 369

18.2 Proposed Work and Methodology 376

18.3 Results and Discussions 379

18.3.1 Character Segmentation 380

18.4 Conclusion 384

References 385

Index 387

Authors

  • Deepak Gupta, Dr. APJ Abdul Kalam Technical University, Lucknow, India
  • Siddhartha Bhattacharyya, CHRIST (Deemed to be University), Bengaluru, India
  • Ashish Khanna, National Institute of Technology, Kurukshetra, India
  • Kalpna Sagar, Guru Gobind Singh Indraprastha University, Delhi, India