What is Data Scrubbing?
Data scrubbing refers to the procedure of modifying or removing incomplete, incorrect, inaccurately formatted, or duplicate data in a database. The key objective of data scrubbing is to make the data more accurate and consistent.
Data scrubbing is a vital strategy for keeping databases accurate. It is especially important in data-intensive industries such as telecommunications, insurance, banking, and retail. Data scrubbing systematically evaluates data for flaws or mistakes with the help of look-up tables, rules, and algorithms.
As per Wikipedia, data scrubbing is an error-correction technique that uses a background task to periodically inspect main memory or storage for errors, and then corrects detected errors using redundant data in the form of checksums or copies of the data. Data scrubbing reduces the likelihood that single correctable errors will accumulate, which in turn reduces the risk of uncorrectable errors. (That definition describes scrubbing at the memory and storage level; in the database context, the term centers on the data cleansing described above.)
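To make the storage-level sense of the term concrete, here is a minimal Python sketch of a background-style scrub pass. The whole setup is hypothetical: each block is stored as a (data, checksum) pair next to a mirrored replica, and a corrupted primary block is repaired from the replica whenever the replica's own checksum still matches.

import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def scrub(primary: list, replica: list) -> int:
    """Scan every block; repair a corrupted primary block from its replica."""
    repaired = 0
    for i, (data, stored_sum) in enumerate(primary):
        if checksum(data) != stored_sum:              # corruption detected
            good_data, good_sum = replica[i]
            if checksum(good_data) == good_sum:       # replica is still intact
                primary[i] = (good_data, good_sum)
                repaired += 1
    return repaired

# The second primary block has silently flipped a character ("w0rld").
blocks = [(b"hello", checksum(b"hello")), (b"w0rld", checksum(b"world"))]
mirror = [(b"hello", checksum(b"hello")), (b"world", checksum(b"world"))]
print(scrub(blocks, mirror))  # prints 1: one block was repaired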
Explaining Data Scrubbing:
Database errors are common, and may originate from the following:
· Human errors during data entry
· Database merging
· Absence of industry-wide or company-specific data standards
· Aged systems that contain obsolete data
In the past, data scrubbing was performed manually. This not only increased the time required to complete the process, but also made the process much more expensive and prone to errors. This led to the creation of effective data scrubbing tools, which systematically evaluate data for flaws that could not be identified in a manual cleaning process.
Generally, a data scrubbing tool consists of solutions designed to rectify specific kinds of mistakes, such as locating duplicate records or replacing missing ZIP codes. Merging erroneous or corrupt data is the most complicated issue; it is often described as the "dirty data" problem because it costs organizations millions of dollars every year. The problem is growing as business environments become more complex, with more systems and more data. Data scrubbing tools help organizations tackle these issues by identifying and wiping out data flaws, as the short example below shows.
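As an illustration of the two specific tasks just mentioned, here is a small pandas sketch that drops duplicate records and fills in missing ZIP codes from a look-up table. The customer table and the city-to-ZIP mapping are invented for the example; a real scrubbing tool would apply the same idea at scale.

import pandas as pd

customers = pd.DataFrame({
    "name": ["Ann Lee", "Ann Lee", "Bob Roy"],
    "city": ["Austin",  "Austin",  "Dallas"],
    "zip":  ["73301",   "73301",   None],
})

# 1) Locate and remove exact duplicate records.
customers = customers.drop_duplicates()

# 2) Replace missing ZIP codes using a look-up table keyed by city.
zip_lookup = {"Austin": "73301", "Dallas": "75201"}
customers["zip"] = customers["zip"].fillna(customers["city"].map(zip_lookup))
print(customers)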
Data cleaning, also called data cleansing or scrubbing, deals with detecting and removing errors and inconsistencies from data in order to improve data quality. Data quality problems are present in single data collections, such as files and databases, e.g., due to misspellings during data entry, missing information, or other invalid data. When multiple data sources need to be integrated, e.g., in data warehouses, federated database systems, or global web-based information systems, the need for data cleaning increases significantly, because the sources often contain redundant data in different representations. In order to provide access to accurate and consistent data, consolidating the different data representations and eliminating duplicate information become necessary.
Data warehouses require and provide extensive support for data cleaning. They load and continuously refresh huge amounts of data from a variety of sources, so the probability that some of the sources contain “dirty data” is high. Furthermore, data warehouses are used for decision making, so the correctness of their data is vital to avoid wrong conclusions. For instance, duplicated or missing information will produce incorrect or misleading statistics (“garbage in, garbage out”). Due to the wide range of possible data inconsistencies and the sheer data volume, data cleaning is considered one of the biggest problems in data warehousing.
[Figure: ETL process of a data warehouse]
Data cleaning approaches
In general, data cleaning involves several phases:
Data analysis: In order to detect which kinds of errors and inconsistencies are to be removed, a detailed data analysis is required. In addition to a manual inspection of the data or data samples, analysis programs should be used to gain metadata about the data properties and to detect data quality problems.
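A sketch of what such an analysis program might compute, assuming the data already sits in a pandas DataFrame. The columns and the two rule-based checks at the end are illustrative; dedicated profiling tools gather far richer metadata.

import pandas as pd

df = pd.DataFrame({
    "id":    [1, 2, 2, 4],
    "email": ["a@x.com", None, "b@x", "c@x.com"],
})

# Metadata about the data properties: type, null count, distinct count per column.
profile = pd.DataFrame({
    "dtype":    df.dtypes.astype(str),
    "nulls":    df.isna().sum(),
    "distinct": df.nunique(),
})
print(profile)

# Simple rule-based checks that flag likely quality problems.
print("duplicate ids:", df["id"].duplicated().sum())
print("malformed emails:", (~df["email"].str.contains(r"@.*\.", na=False)).sum())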
Definition of transformation workflow and mapping rules: Depending on the number of data sources, their degree of heterogeneity, and the “dirtiness” of the data, a large number of data transformation and cleaning steps may have to be executed. Sometimes a schema translation is used to map sources to a common data model; for data warehouses, a relational representation is typically used. Early data cleaning steps can correct single-source instance problems and prepare the data for integration. Later steps deal with schema/data integration and with cleaning multi-source instance problems, e.g., duplicates. For data warehousing, the control and data flow for these transformation and cleaning steps should be specified within a workflow that defines the ETL process. The schema-related data transformations as well as the cleaning steps should be specified by a declarative query and mapping language as far as possible, to enable automatic generation of the transformation code. In addition, it should be possible to invoke user-written cleaning code and special-purpose tools during a data transformation workflow. The transformation steps may request user feedback on data instances for which they have no built-in cleaning logic.
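The following sketch imitates such a declarative workflow in miniature: each cleaning step is a named rule mapped to a column, and rows the rules cannot fully clean are routed to manual review, mirroring the user-feedback point above. The rules and the data are invented for illustration.

import pandas as pd

df = pd.DataFrame({"phone": [" 555-0100 ", "555-0101", None],
                   "state": ["tx", "TX", "Texas"]})

# Declarative-style mapping: column -> cleaning rule.
rules = {
    "phone": lambda s: s.str.strip().str.replace("-", "", regex=False),
    "state": lambda s: s.str.upper().replace({"TEXAS": "TX"}),
}

for column, rule in rules.items():
    df[column] = rule(df[column])

# Instances with no applicable cleaning logic go to manual review.
needs_review = df[df["phone"].isna()]
print(df, needs_review, sep="\n\n")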
Verification: The correctness and effectiveness of a transformation workflow and the transformation definitions should be tested and evaluated, e.g., on a sample or copy of the source data, to improve the definitions if necessary. Multiple iterations of the analysis, design, and verification steps may be needed, e.g., since some errors only become apparent after applying some transformations.
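A minimal verification sketch along those lines: run the transformation on a small sample first and assert simple invariants before applying it to the full source data. The clean_phone function stands in for whichever transformation definition is under test.

import pandas as pd

def clean_phone(s: pd.Series) -> pd.Series:
    return s.str.strip().str.replace("-", "", regex=False)

sample = pd.Series([" 555-0100 ", "555-0101"])
result = clean_phone(sample)

# Invariants: no rows lost, and every cleaned value is purely numeric.
assert len(result) == len(sample)
assert result.str.isdigit().all()
print("sample verification passed")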
Transformation: Execution of the transformation steps, either by running the ETL workflow for loading and refreshing a data warehouse or while answering queries over multiple sources.
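In code, the execution phase amounts to running the composed workflow end to end. This toy ETL run assumes the earlier cleaning steps have been packaged as functions; the file name and the steps themselves are illustrative.

import pandas as pd

def extract() -> pd.DataFrame:
    # Stand-in for reading from operational source systems.
    return pd.DataFrame({"name": [" ann ", "BOB"], "zip": ["73301", None]})

def transform(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df["name"] = df["name"].str.strip().str.title()
    return df.dropna(subset=["zip"])   # drop rows the rules could not repair

def load(df: pd.DataFrame) -> None:
    df.to_csv("warehouse_customers.csv", index=False)  # stand-in for a DWH load

load(transform(extract()))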
Back-flow of cleaned data: After (single-source) errors are removed, the cleaned data should also replace the dirty data in the original sources, so that legacy applications get the improved data as well and the cleaning work does not have to be redone for future data extractions. For data warehousing, the cleaned data is available from the data staging area.
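A back-flow sketch, using an in-memory SQLite table as a stand-in for the original source system: corrected values produced in the staging area are written back so the source itself is cleaned as well.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, zip TEXT)")
conn.execute("INSERT INTO customers VALUES (1, NULL), (2, '75201')")

# Corrections produced in the staging area: id -> cleaned ZIP code.
cleaned = {1: "73301"}
conn.executemany("UPDATE customers SET zip = ? WHERE id = ?",
                 [(zip_code, row_id) for row_id, zip_code in cleaned.items()])
print(conn.execute("SELECT * FROM customers").fetchall())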