DATA WRANGLING
Written and edited by some of the world's top experts in the field, this exciting new volume presents state-of-the-art research and the latest technological breakthroughs in data wrangling, covering its theoretical concepts, practical applications, and tools for solving everyday problems.
Data wrangling is the process of cleaning and unifying messy and complex data sets for easy access and analysis. This process typically includes manually converting and mapping data from one raw form into another format to allow for more convenient consumption and organization of the data. Data wrangling is increasingly common at today’s top firms.
Data cleaning focuses on removing inaccurate data from a data set, whereas data wrangling focuses on transforming the data's format, typically by converting "raw" data into a format more suitable for use. Data wrangling is a necessary component of any business. Data wrangling solutions, including applications such as Datameer, Infogix, Paxata, Talend, Tamr, TMMData, and Trifacta, are specifically designed and architected to handle diverse, complex data at any scale.
This book synthesizes the processes of data wrangling into a comprehensive overview, with a strong focus on recent and rapidly evolving agile analytic processes in data-driven enterprises, which businesses and other organizations can use to solve their everyday problems. For the veteran engineer, scientist, or other industry professional, this book is a must-have for any library.
Table of Contents
1 Basic Principles of Data Wrangling 1
Akshay Singh, Surender Singh and Jyotsna Rathee
1.1 Introduction 2
1.2 Data Workflow Structure 4
1.3 Raw Data Stage 4
1.3.1 Data Input 5
1.3.2 Output Actions at Raw Data Stage 6
1.3.3 Structure 6
1.3.4 Granularity 7
1.3.5 Accuracy 7
1.3.6 Temporality 8
1.3.7 Scope 8
1.4 Refined Stage 9
1.4.1 Data Design and Preparation 9
1.4.2 Structure Issues 9
1.4.3 Granularity Issues 10
1.4.4 Accuracy Issues 10
1.4.5 Scope Issues 11
1.4.6 Output Actions at Refined Stage 11
1.5 Produced Stage 12
1.5.1 Data Optimization 13
1.5.2 Output Actions at Produced Stage 13
1.6 Steps of Data Wrangling 14
1.7 Do’s for Data Wrangling 16
1.8 Tools for Data Wrangling 16
References 17
2 Skills and Responsibilities of Data Wrangler 19
Prabhjot Kaur, Anupama Kaushik and Aditya Kapoor
2.1 Introduction 20
2.2 Role as an Administrator (Data and Database) 21
2.3 Skills Required 22
2.3.1 Technical Skills 22
2.3.1.1 Python 22
2.3.1.2 R Programming Language 25
2.3.1.3 SQL 26
2.3.1.4 MATLAB 27
2.3.1.5 Scala 27
2.3.1.6 Excel 28
2.3.1.7 Tableau 28
2.3.1.8 Power BI 29
2.3.2 Soft Skills 31
2.3.2.1 Presentation Skills 31
2.3.2.2 Storytelling 32
2.3.2.3 Business Insights 32
2.3.2.4 Writing/Publishing Skills 32
2.3.2.5 Listening 33
2.3.2.6 Stop and Think 33
2.3.2.7 Soft Issues 33
2.4 Responsibilities as Database Administrator 34
2.4.1 Software Installation and Maintenance 34
2.4.2 Data Extraction, Transformation, and Loading 34
2.4.3 Data Handling 35
2.4.4 Data Security 35
2.4.5 Data Authentication 35
2.4.6 Data Backup and Recovery 35
2.4.7 Security and Performance Monitoring 36
2.4.8 Effective Use of Human Resource 36
2.4.9 Capacity Planning 36
2.4.10 Troubleshooting 36
2.4.11 Database Tuning 36
2.5 Concerns for a DBA 37
2.6 Data Mishandling and Its Consequences 39
2.6.1 Phases of Data Breaching 40
2.6.2 Data Breach Laws 41
2.6.3 Best Practices for Enterprises 41
2.7 The Long-Term Consequences: Loss of Trust and Diminished Reputation 42
2.8 Solution to the Problem 42
2.9 Case Studies 42
2.9.1 UBER Case Study 42
2.9.1.1 Role of Analytics and Business Intelligence in Optimization 44
2.9.1.2 Mapping Applications for City Ops Teams 46
2.9.1.3 Marketplace Forecasting 47
2.9.1.4 Learnings from Data 48
2.9.2 PepsiCo Case Study 48
2.9.2.1 Searching for a Single Source of Truth 49
2.9.2.2 Finding the Right Solution for Better Data 49
2.9.2.3 Enabling Powerful Results with Self-Service Analytics 50
2.10 Conclusion 50
References 50
3 Data Wrangling Dynamics 53
Simarjit Kaur, Anju Bala and Anupam Garg
3.1 Introduction 53
3.2 Related Work 54
3.3 Challenges: Data Wrangling 55
3.4 Data Wrangling Architecture 56
3.4.1 Data Sources 57
3.4.2 Auxiliary Data 57
3.4.3 Data Extraction 58
3.4.4 Data Wrangling 58
3.4.4.1 Data Accessing 58
3.4.4.2 Data Structuring 58
3.4.4.3 Data Cleaning 58
3.4.4.4 Data Enriching 59
3.4.4.5 Data Validation 59
3.4.4.6 Data Publication 59
3.5 Data Wrangling Tools 59
3.5.1 Excel 59
3.5.2 Altair Monarch 60
3.5.3 Anzo 60
3.5.4 Tabula 61
3.5.5 Trifacta 61
3.5.6 Datameer 63
3.5.7 Paxata 63
3.5.8 Talend 65
3.6 Data Wrangling Application Areas 65
3.7 Future Directions and Conclusion 67
References 68
4 Essentials of Data Wrangling 71
Menal Dahiya, Nikita Malik and Sakshi Rana
4.1 Introduction 71
4.2 Holistic Workflow Framework for Data Projects 72
4.2.1 Raw Stage 73
4.2.2 Refined Stage 74
4.2.3 Production Stage 74
4.3 The Actions in Holistic Workflow Framework 74
4.3.1 Raw Data Stage Actions 74
4.3.1.1 Data Ingestion 75
4.3.1.2 Creating Metadata 75
4.3.2 Refined Data Stage Actions 76
4.3.3 Production Data Stage Actions 77
4.4 Transformation Tasks Involved in Data Wrangling 78
4.4.1 Structuring 78
4.4.2 Enriching 78
4.4.3 Cleansing 79
4.5 Description of Two Types of Core Profiling 79
4.5.1 Individual Values Profiling 80
4.5.1.1 Syntactic 80
4.5.1.2 Semantic 80
4.5.2 Set-Based Profiling 80
4.6 Case Study 80
4.6.1 Importing Required Libraries 81
4.6.2 Changing the Order of the Columns in the Dataset 82
4.6.3 To Display the DataFrame (Top 10 Rows) and Verify that the Columns are in Order 82
4.6.4 To Display the DataFrame (Bottom 10 Rows) and Verify that the Columns Are in Order 83
4.6.5 Generate the Statistical Summary of the DataFrame for All the Columns 83
4.7 Quantitative Analysis 84
4.7.1 Maximum Number of Fires on Any Given Day 84
4.7.2 Total Number of Fires for the Entire Duration for Every State 85
4.7.3 Summary Statistics 86
4.8 Graphical Representation 86
4.8.1 Line Graph 86
4.8.2 Pie Chart 86
4.8.3 Bar Graph 87
4.9 Conclusion 89
References 90
5 Data Leakage and Data Wrangling in Machine Learning for Medical Treatment 91
P.T. Jamuna Devi and B.R. Kavitha
5.1 Introduction 91
5.2 Data Wrangling and Data Leakage 93
5.3 Data Wrangling Stages 94
5.3.1 Discovery 94
5.3.2 Structuring 95
5.3.3 Cleaning 95
5.3.4 Improving 95
5.3.5 Validating 95
5.3.6 Publishing 95
5.4 Significance of Data Wrangling 96
5.5 Data Wrangling Examples 96
5.6 Data Wrangling Tools for Python 96
5.7 Data Wrangling Tools and Methods 99
5.8 Use of Data Preprocessing 100
5.9 Use of Data Wrangling 101
5.10 Data Wrangling in Machine Learning 104
5.11 Enhancement of Express Analytics Using Data Wrangling Process 106
5.12 Conclusion 106
References 106
6 Importance of Data Wrangling in Industry 4.0 109
Rachna Jain, Geetika Dhand, Kavita Sheoran and Nisha Aggarwal
6.1 Introduction 110
6.1.1 Data Wrangling Entails 110
6.2 Steps in Data Wrangling 111
6.2.1 Obstacles Surrounding Data Wrangling 113
6.3 Data Wrangling Goals 114
6.4 Tools and Techniques of Data Wrangling 115
6.4.1 Basic Data Munging Tools 115
6.4.2 Data Wrangling in Python 115
6.4.3 Data Wrangling in R 116
6.5 Ways for Effective Data Wrangling 116
6.5.1 Ways to Enhance Data Wrangling Pace 117
6.6 Future Directions 119
References 120
7 Managing Data Structure in R 123
Mittal Desai and Chetan Dudhagara
7.1 Introduction to Data Structure 123
7.2 Homogeneous Data Structures 125
7.2.1 Vector 125
7.2.2 Factor 131
7.2.3 Matrix 132
7.2.4 Array 136
7.3 Heterogeneous Data Structures 138
7.3.1 List 139
7.3.2 Dataframe 144
References 146
8 Dimension Reduction Techniques in Distributional Semantics: An Application Specific Review 147
Pooja Kherwa, Jyoti Khurana, Rahul Budhraj, Sakshi Gill, Shreyansh Sharma and Sonia Rathee
8.1 Introduction 148
8.2 Application Based Literature Review 150
8.3 Dimensionality Reduction Techniques 158
8.3.1 Principal Component Analysis 158
8.3.2 Linear Discriminant Analysis 161
8.3.2.1 Two-Class LDA 162
8.3.2.2 Three-Class LDA 162
8.3.3 Kernel Principal Component Analysis 165
8.3.4 Locally Linear Embedding 169
8.3.5 Independent Component Analysis 171
8.3.6 Isometric Mapping (Isomap) 172
8.3.7 Self-Organising Maps 173
8.3.8 Singular Value Decomposition 174
8.3.9 Factor Analysis 175
8.3.10 Auto-Encoders 176
8.4 Experimental Analysis 178
8.4.1 Datasets Used 178
8.4.2 Techniques Used 178
8.4.3 Classifiers Used 179
8.4.4 Observations 179
8.4.5 Results Analysis Red-Wine Quality Dataset 179
8.5 Conclusion 182
References 182
9 Big Data Analytics in Real Time for Enterprise Applications to Produce Useful Intelligence 187
Prashant Vats and Siddhartha Sankar Biswas
9.1 Introduction 188
9.2 The Internet of Things and Big Data Correlation 190
9.3 Design, Structure, and Techniques for Big Data Technology 191
9.4 Aspiration for Meaningful Analyses and Big Data Visualization Tools 193
9.4.1 From Information to Guidance 194
9.4.2 The Transition from Information Management to Valuation Offerings 195
9.5 Big Data Applications in the Commercial Surroundings 196
9.5.1 IoT and Data Science Applications in the Production Industry 197
9.5.1.1 Devices That Are Interlinked 199
9.5.1.2 Data Transformation 199
9.5.2 Predictive Analysis for Corporate Enterprise Applications in the Industrial Sector 204
9.6 Big Data Insights’ Constraints 207
9.6.1 Technological Developments 207
9.6.2 Representation of Data 207
9.6.3 Data That Is Fragmented and Imprecise 208
9.6.4 Extensibility 208
9.6.5 Implementation in Real Time Scenarios 208
9.7 Conclusion 209
References 210
10 Generative Adversarial Networks: A Comprehensive Review 213
Jyoti Arora, Meena Tushir, Pooja Kherwa and Sonia Rathee
List of Abbreviations 213
10.1 Introduction 214
10.2 Background 215
10.2.1 Supervised vs Unsupervised Learning 215
10.2.2 Generative Modeling vs Discriminative Modeling 216
10.3 Anatomy of a GAN 217
10.4 Types of GANs 218
10.4.1 Conditional GAN (CGAN) 218
10.4.2 Deep Convolutional GAN (DCGAN) 220
10.4.3 Wasserstein GAN (WGAN) 221
10.4.4 Stack GAN 222
10.4.5 Least Square GAN (LSGANs) 222
10.4.6 Information Maximizing GAN (INFOGAN) 223
10.5 Shortcomings of GANs 224
10.6 Areas of Application 226
10.6.1 Image 226
10.6.2 Video 226
10.6.3 Artwork 227
10.6.4 Music 227
10.6.5 Medicine 227
10.6.6 Security 227
10.7 Conclusion 228
References 228
11 Analysis of Machine Learning Frameworks Used in Image Processing: A Review 235
Gurpreet Kaur and Kamaljit Singh Saini
11.1 Introduction 235
11.2 Types of ML Algorithms 236
11.2.1 Supervised Learning 236
11.2.2 Unsupervised Learning 237
11.2.3 Reinforcement Learning 238
11.3 Applications of Machine Learning Techniques 238
11.3.1 Personal Assistants 238
11.3.2 Predictions 238
11.3.3 Social Media 240
11.3.4 Fraud Detection 240
11.3.5 Google Translator 242
11.3.6 Product Recommendations 242
11.3.7 Video Surveillance 243
11.4 Solution to a Problem Using ML 243
11.4.1 Classification Algorithms 243
11.4.2 Anomaly Detection Algorithm 244
11.4.3 Regression Algorithm 244
11.4.4 Clustering Algorithms 245
11.4.5 Reinforcement Algorithms 245
11.5 ML in Image Processing 246
11.5.1 Frameworks and Libraries Used for ML Image Processing 246
11.6 Conclusion 248
References 248
12 Use and Application of Artificial Intelligence in Accounting and Finance: Benefits and Challenges 251
Ram Singh, Rohit Bansal and Niranjanamurthy M.
12.1 Introduction 252
12.1.1 Artificial Intelligence in Accounting and Finance Sector 252
12.2 Uses of AI in Accounting & Finance Sector 254
12.2.1 Pay and Receive Processing 254
12.2.2 Supplier Onboarding and Procurement 255
12.2.3 Audits 255
12.2.4 Monthly, Quarterly Cash Flows, and Expense Management 255
12.2.5 AI Chatbots 255
12.3 Applications of AI in Accounting and Finance Sector 256
12.3.1 AI in Personal Finance 257
12.3.2 AI in Consumer Finance 257
12.3.3 AI in Corporate Finance 257
12.4 Benefits and Advantages of AI in Accounting and Finance 258
12.4.1 Changing the Human Mindset 259
12.4.2 Machines Imitate the Human Brain 260
12.4.3 Fighting Misrepresentation 260
12.4.4 AI Machines Make Accounting Tasks Easier 260
12.4.5 Invisible Accounting 261
12.4.6 Build Trust through Better Financial Protection and Control 261
12.4.7 Active Insights Help Drive Better Decisions 261
12.4.8 Fraud Protection, Auditing, and Compliance 262
12.4.9 Machines as Financial Guardians 263
12.4.10 Intelligent Investments 264
12.4.11 Consider the “Runaway Effect” 264
12.4.12 Artificial Control and Effective Fiduciaries 264
12.4.13 Accounting Automation Avenues and Investment Management 265
12.5 Challenges of AI Application in Accounting and Finance 265
12.5.1 Data Quality and Management 267
12.5.2 Cyber and Data Privacy 267
12.5.3 Legal Risks, Liability, and Culture Transformation 267
12.5.4 Practical Challenges 268
12.5.5 Limits of Machine Learning and AI 269
12.5.6 Roles and Skills 269
12.5.7 Institutional Issues 270
12.6 Suggestions and Recommendation 271
12.7 Conclusion and Future Scope of the Study 272
References 272
13 Obstacle Avoidance Simulation and Real-Time Lane Detection for AI-Based Self-Driving Car 275
B. Eshwar, Harshaditya Sheoran, Shivansh Pathak and Meena Rao
13.1 Introduction 275
13.1.1 Environment Overview 277
13.1.1.1 Simulation Overview 277
13.1.1.2 Agent Overview 278
13.1.1.3 Brain Overview 279
13.1.2 Algorithm Used 279
13.1.2.1 Markov Decision Process (MDP) 279
13.1.2.2 Adding a Living Penalty 280
13.1.2.3 Implementing a Neural Network 280
13.2 Simulations and Results 281
13.2.1 Self-Driving Car Simulation 281
13.2.2 Real-Time Lane Detection and Obstacle Avoidance 283
13.2.3 About the Model 283
13.2.4 Preprocessing the Image/Frame 285
13.3 Conclusion 286
References 287
14 Impact of Suppliers Network on SCM of Indian Auto Industry: A Case of Maruti Suzuki India Limited 289
Ruchika Pharswan, Ashish Negi and Tridib Basak
14.1 Introduction 290
14.2 Literature Review 292
14.2.1 Prior Pandemic Automobile Industry/COVID-19 Thump on the Automobile Sector 294
14.2.2 Maruti Suzuki India Limited (MSIL) During COVID-19 and Other Players in the Automobile Industry and How MSIL Prevailed 296
14.3 Methodology 297
14.4 Findings 298
14.4.1 Worldwide Economic Impact of the Epidemic 298
14.4.2 Effect on Global Automobile Industry 298
14.4.3 Effect on Indian Automobile Industry 301
14.4.4 Automobile Industry Scenario That Can Be Expected Post COVID-19 Recovery 306
14.5 Discussion 306
14.5.1 Competitive Dimensions 306
14.5.2 MSIL Strategies 307
14.5.3 MSIL Operations and Supply Chain Management 308
14.5.4 MSIL Suppliers Network 309
14.5.5 MSIL Manufacturing 310
14.5.5 MSIL Distributors Network 311
14.5.6 MSIL Logistics Management 312
14.6 Conclusion 312
References 312
About the Editors 315
Index 317