
Thursday, March 19, 2015

Data Scrubbing




What is Data Scrubbing?

Data scrubbing refers to the procedure of modifying or removing incomplete, incorrect, inaccurately formatted, or repeated data in a database. The key objective of data scrubbing is to make the data more accurate and consistent.

Data scrubbing is a vital strategy for ensuring that databases remain accurate. It is especially important in data-intensive industries, including telecommunications, insurance, banking and retailing. Data scrubbing systematically evaluates data for flaws or mistakes with the help of look-up tables, rules and algorithms.

According to Wikipedia, data scrubbing is also an error-correction technique in which a background task periodically inspects main memory or storage for errors and then corrects the detected errors using redundant data, in the form of checksums or copies of the data. Data scrubbing reduces the likelihood that single correctable errors will accumulate, which in turn reduces the risk of uncorrectable errors.

Explaining Data Scrubbing:
Database errors are common and may originate from the following:
· Human errors during data entry
· Database merging
· Absence of industry-wide or company-specific data standards
· Aged systems that contain obsolete data

In the past, data scrubbing was performed manually. This not only increased the time required to complete the process, but also made the process much more expensive and prone to errors. This led to the creation of effective data scrubbing tools, which systematically evaluate data for flaws that could not be identified in a manual cleaning process.

Generally, a data scrubbing tool bundles solutions for rectifying specific kinds of mistakes, such as locating duplicate records or filling in missing ZIP codes. The most complicated issue is merging erroneous or corrupt data, often described as the "dirty data" problem, which costs organizations millions of dollars every year. The problem is growing as business environments become more complex, with more systems and more data. Powerful data scrubbing tools help organizations tackle these issues by systematically identifying and removing such flaws.
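For illustration, here is a minimal sketch in Python with pandas of the two specific fixes mentioned above: locating duplicate records and filling in missing ZIP codes. The table, the column names and the city-to-ZIP look-up are hypothetical, not taken from any particular tool.

```python
import pandas as pd

# Hypothetical customer table with duplicate rows and missing ZIP codes.
customers = pd.DataFrame({
    "name": ["Ann Lee", "Ann Lee", "Bob Roy", "Cara Fox"],
    "city": ["Boston", "Boston", "Austin", "Boston"],
    "zip":  ["02101",  None,     "73301",  None],
})

# Locate duplicate records: same name and city counted as one customer.
dupes = customers[customers.duplicated(subset=["name", "city"], keep=False)]
print("Possible duplicates:\n", dupes)

# Replace missing ZIP codes from a (hypothetical) city -> ZIP look-up table.
city_zip = {"Boston": "02101", "Austin": "73301"}
customers["zip"] = customers["zip"].fillna(customers["city"].map(city_zip))

# Keep one row per customer after the gaps are filled.
cleaned = customers.drop_duplicates(subset=["name", "city"], keep="first")
print(cleaned)
```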

Data cleaning, also called data cleansing or scrubbing, deals with detecting and removing errors and inconsistencies from data in order to improve the quality of data. Data quality problems are present in single data collections, such as files and databases, e.g., due to misspellings during data entry, missing information or other invalid data.

When multiple data sources need to be integrated, e.g., in data warehouses, federated database systems or global web-based information systems, the need for data cleaning increases significantly. This is because the sources often contain redundant data in different representations. In order to provide access to accurate and consistent data, consolidation of different data representations and elimination of duplicate information become necessary.

Data warehouses require and provide extensive support for data cleaning. They load and continuously refresh huge amounts of data from a variety of sources, so the probability that some of the sources contain “dirty data” is high. Furthermore, data warehouses are used for decision making, so the correctness of their data is vital to avoid wrong conclusions. For instance, duplicated or missing information will produce incorrect or misleading statistics (“garbage in, garbage out”). Due to the wide range of possible data inconsistencies and the sheer data volume, data cleaning is considered to be one of the biggest problems in data warehousing.
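As a small, hypothetical illustration of consolidating different representations before eliminating duplicates, the sketch below normalizes names and dates from two imaginary source extracts (crm and shop) and then drops the resulting duplicate row; all names and formats are assumptions.

```python
import pandas as pd

# Two hypothetical source extracts describing the same customer,
# but with different name casing and date formats.
crm = pd.DataFrame({"name": ["ANN LEE"], "signup": ["2015-03-19"]})
shop = pd.DataFrame({"name": ["Ann Lee"], "signup": ["19/03/2015"]})

def normalize(df, date_format):
    out = df.copy()
    out["name"] = out["name"].str.title().str.strip()                  # one name representation
    out["signup"] = pd.to_datetime(out["signup"], format=date_format)  # one date type
    return out

# Consolidate into a single representation, then eliminate duplicate information.
combined = pd.concat([normalize(crm, "%Y-%m-%d"),
                      normalize(shop, "%d/%m/%Y")], ignore_index=True)
combined = combined.drop_duplicates(subset=["name", "signup"])
print(combined)
```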

[Figure: ETL process of a data warehouse (DWH)]



Data cleaning approaches

In general, data cleaning involves several phases:

Data analysis: In order to detect which kinds of errors and inconsistencies are to be removed, a detailed data analysis is required. In addition to a manual inspection of the data or data samples, analysis programs should be used to gain metadata about the data properties and detect data quality problems.
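A minimal profiling sketch along these lines, in Python with pandas, might look like the following; the file name orders.csv and the columns order_id and quantity are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical source extract to be profiled before cleaning.
df = pd.read_csv("orders.csv")          # file name is an assumption

# Metadata about the data properties: types, missing values, distinct values.
profile = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "nulls": df.isna().sum(),
    "distinct": df.nunique(),
})
print(profile)

# Simple data quality checks: duplicate keys and out-of-range values
# (order_id and quantity are illustrative column names).
print("duplicate order_id rows:", df.duplicated(subset=["order_id"]).sum())
print("negative quantities:   ", (df["quantity"] < 0).sum())
```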

Definition of transformation workflow and mapping rules: Depending on the number of data sources, their degree of heterogeneity and the “dirtiness” of the data, a large number of data transformation and cleaning steps may have to be executed. Sometimes, a schema translation is used to map sources to a common data model; for data warehouses, typically a relational representation is used. Early data cleaning steps can correct single-source instance problems and prepare the data for integration. Later steps deal with schema/data integration and cleaning multi-source instance problems, e.g., duplicates. For data warehousing, the control and data flow for these transformation and cleaning steps should be specified within a workflow that defines the ETL process. The schema-related data transformations as well as the cleaning steps should be specified by a declarative query and mapping language as far as possible, to enable automatic generation of the transformation code. In addition, it should be possible to invoke user-written cleaning code and special-purpose tools during a data transformation workflow. The transformation steps may request user feedback on data instances for which they have no built-in cleaning logic.
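As a rough sketch of declaring the cleaning workflow as data rather than hard-coded logic, with a hook for user-written cleaning code, something like the following could be used; the step names, the country mapping and the run_workflow driver are illustrative and not part of any specific ETL tool.

```python
import pandas as pd

def trim_strings(df):
    # Single-source instance cleaning: strip stray whitespace from text columns.
    for col in df.select_dtypes(include="object"):
        df[col] = df[col].str.strip()
    return df

def standardize_country(df):
    # Hypothetical mapping rule: map source-specific codes to a common model.
    df["country"] = df["country"].replace({"UK": "GB", "Deutschland": "DE"})
    return df

def user_supplied_fix(df):
    # Hook for user-written cleaning code or special-purpose tools.
    return df

# The workflow is declared as an ordered list of steps, not hard-coded logic.
CLEANING_WORKFLOW = [trim_strings, standardize_country, user_supplied_fix]

def run_workflow(df, steps=CLEANING_WORKFLOW):
    for step in steps:
        df = step(df)
    return df
```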

Verification: The correctness and effectiveness of a transformation workflow and the transformation definitions should be tested and evaluated, e.g., on a sample or copy of the source data, to improve the definitions if necessary. Multiple iterations of the analysis, design and verification steps may be needed, e.g., since some errors only become apparent after applying some transformations.
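Continuing the hypothetical sketch above, verification could be as simple as running the workflow on a sample of the raw data (df) and checking a few quality metrics before and after; the sample size and the accepted country codes are assumptions.

```python
# Verify the workflow on a sample of the source data before full ETL runs.
sample = df.sample(n=1000, random_state=0)   # df: raw source extract (assumed)
cleaned = run_workflow(sample.copy())

# Compare simple quality metrics before and after the transformations.
assert cleaned["country"].dropna().isin(["GB", "DE", "US"]).all(), \
    "unmapped country codes remain"
assert cleaned.isna().sum().sum() <= sample.isna().sum().sum(), \
    "cleaning introduced new gaps"
```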

Transformation: Execution of the transformation steps either by running the ETL workflow for loading and refreshing a data warehouse or during answering queries on multiple sources.
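For the data warehouse case, executing the transformation steps might look roughly like this sketch, which extracts a hypothetical source file, applies run_workflow from above and loads the result into a warehouse table via SQLAlchemy; the connection string and table name are placeholders.

```python
import pandas as pd
from sqlalchemy import create_engine   # assumed target: a relational warehouse

# Extract, clean with the workflow above, then load into the warehouse table.
raw = pd.read_csv("orders.csv")                               # hypothetical source extract
cleaned = run_workflow(raw)
engine = create_engine("postgresql://user:pw@dwh-host/dwh")   # placeholder DSN
cleaned.to_sql("orders", engine, if_exists="append", index=False)
```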

Back-flow of cleaned data: After (single-source) errors are removed, the cleaned data should also replace the dirty data in the original sources, so that legacy applications get the improved data too and the cleaning work does not have to be redone for future data extractions. For data warehousing, the cleaned data is available from the data staging area.
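A back-flow step could, for example, write the corrected values back to the source keyed on the primary key. The sketch below assumes a cleaned DataFrame of corrected customer rows with a customer_id key and an illustrative customers table; the connection string and all identifiers are placeholders.

```python
from sqlalchemy import create_engine, text

# Write corrected values back to the original source so that legacy
# applications also see the cleaned data. Table, column and key names
# are illustrative; "cleaned" is assumed to hold corrected customer rows.
source = create_engine("postgresql://user:pw@source-host/crm")  # placeholder DSN

with source.begin() as conn:            # one transaction for the whole back-flow
    for _, row in cleaned.iterrows():
        conn.execute(
            text("UPDATE customers SET zip = :zip WHERE customer_id = :cid"),
            {"zip": row["zip"], "cid": row["customer_id"]},
        )
```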
