What is Data Cleansing | What is Data Cleaning
This page covers the definition of data cleansing (also called data cleaning), its use cases, and its challenges.
Data Cleansing Definition
Data cleansing is the process of converting source data that contains errors, duplicates and
inconsistencies into clean data. It is one of the methods used in data analytics.
Real-world data is dirty, as depicted in figure-1 above.
• Incomplete data comes from data values that were not available at the
time of recording, or from human/hardware/software errors.
• Noisy data comes from data transmission errors, faulty
equipment, human or computer errors etc.
• Duplicate data arises when the same entity is recorded in different data sources.
Dirty data has the following issues.
Incomplete: lacking attribute values
Example: occupation = " "
Noisy: containing errors (e.g. spelling mistakes,
phonetic and typing errors, transpositions,
multiple values in a field meant to hold a single value etc.)
Example: Salary = " -10 "
Inconsistent: containing discrepancies in codes or names
(synonyms and nicknames, prefix and suffix variations,
abbreviations, truncation and initials)
Example#1: Age = "42", Birthday = "03/07/1997 "
Example#2: ratings previously recorded as "1, 2, 3" are now recorded as "A, B, C"
Example#3: discrepancies between approximate duplicate records, as explained below.
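The three categories above can be illustrated with a small check over one record. This is only a minimal sketch: the function name `find_issues`, the field names, the rule that a negative salary is noise, and the assumption that birthdays are written as DD/MM/YYYY are all illustrative choices, not part of any standard tool.

```python
from datetime import date, datetime


def find_issues(record, today):
    """Return (category, message) pairs for one record (a dict of strings).

    Hypothetical helper: field names and rules are assumptions for illustration.
    """
    issues = []

    # Incomplete: the attribute value is missing or blank.
    if not record.get("occupation", "").strip():
        issues.append(("incomplete", "occupation is empty"))

    # Noisy: the value is not a number or is outside a plausible range.
    try:
        if float(record.get("salary", "").strip()) < 0:
            issues.append(("noisy", "salary is negative"))
    except ValueError:
        issues.append(("noisy", "salary is not a number"))

    # Inconsistent: the stored age disagrees with the age derived from the birthday.
    try:
        # Assumed date format: DD/MM/YYYY.
        born = datetime.strptime(record.get("birthday", "").strip(), "%d/%m/%Y").date()
        age = int(record.get("age", "").strip())
    except ValueError:
        issues.append(("noisy", "age or birthday cannot be parsed"))
    else:
        had_birthday = (today.month, today.day) >= (born.month, born.day)
        derived = today.year - born.year - (0 if had_birthday else 1)
        if derived != age:
            issues.append(("inconsistent", "age does not match birthday"))

    return issues


# The example values from the list above, combined into one record.
example = {"occupation": " ", "salary": " -10 ", "age": "42", "birthday": "03/07/1997 "}
for category, message in find_issues(example, date(2024, 1, 1)):
    print(category, "-", message)
```

The reference date is passed in as `today` so that the age check gives the same result whenever the sketch is run.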
➨To address data quality problems, data analytics uses several methods, one of which is data cleansing or data cleaning. The other methods include data quality checking, data normalization, data standardization, data analysis, data deduplication etc.
➨Data cleansing performs many functions to improve the quality of dirty data. One of these functions is "string matching", which finds the same entity in two different datasets (i.e. tables), as shown in figure-3.
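As a rough illustration of string matching across two tables, the sketch below compares names using `difflib.SequenceMatcher` from the Python standard library. The helper names, the sample names, and the 0.85 similarity threshold are assumptions chosen for the example. Real matching usually needs a threshold tuned to the data.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lower-case, replace punctuation with spaces, and collapse whitespace."""
    kept = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(kept.split())


def similarity(a, b):
    """Similarity ratio in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_entities(left, right, threshold=0.85):
    """Pair each name in `left` with its most similar name in `right`.

    A pair is kept only if its similarity reaches `threshold` (an assumed value).
    """
    matches = []
    for a in left:
        best = max(right, key=lambda b: similarity(a, b), default=None)
        if best is not None and similarity(a, best) >= threshold:
            matches.append((a, best))
    return matches


# Hypothetical name columns from two tables.
table_a = ["Jon Smith", "Acme Corp."]
table_b = ["John Smith", "ACME Corp", "Zeta Ltd"]
print(match_entities(table_a, table_b))
```

Normalizing before comparing means that differences in case and punctuation alone do not lower the score. Only genuine spelling differences such as "Jon" versus "John" do.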
Data Cleansing Use cases
Following are the use cases of data cleansing in data analytics.
• MDM-Master Data Management
• CRM-Customer Relationship Management
• DWH-Data Warehousing
• BI-Business Intelligence
Typical examples include inventory levels, banking risks, IT overhead, incorrect KPIs and poor publicity.
Data Cleaning or Data Cleansing Challenges
Following are the challenges to handle while performing data cleansing tasks.
➨How to define data quality?
• This is done through a data profiling task.
➨Semantic complexity
• Only domain experts can evaluate whether a value is correct.
• The data set and the expected result decide which techniques to use.
Much fine-tuning is needed to achieve the desired result.
➨Computational complexity
• Duplicate detection is quadratic in nature, since in the naive approach every record is compared with every other record.
➨Evaluation is difficult as there is no defined gold standard.
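The quadratic cost of duplicate detection noted above can be seen by counting comparisons: n records give n(n-1)/2 pairs. The sketch below assumes the naive approach of comparing every pair. The function names and the `is_duplicate` predicate are hypothetical placeholders for whatever comparison rule is used.

```python
from itertools import combinations


def count_pairwise_comparisons(n):
    """Number of unordered record pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def find_duplicate_pairs(records, is_duplicate):
    """Naive duplicate detection: test every unordered pair of records."""
    return [(a, b) for a, b in combinations(records, 2) if is_duplicate(a, b)]


# Ten times more records means roughly a hundred times more comparisons.
for n in (100, 1000, 10000):
    print(n, "records ->", count_pairwise_comparisons(n), "comparisons")

# Hypothetical rule: two names are duplicates if they differ only in case.
names = ["alice", "Alice", "bob"]
print(find_duplicate_pairs(names, lambda a, b: a.lower() == b.lower()))
```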