The objective of a data migration is to ensure that correct data is available in the new environment. In the intelligent, data-driven organization, data is an asset; it has a high value. If data is incorrect, it leads to wrong decisions, errors in process execution, and errors in communication. This can have a major impact on the organization’s image or simply be costly through remedial actions or lost revenue.
In a data migration, data quality is critical. Will the new system work if there are errors in the data? What if the data in the current system, the source for the migration, is already polluted? Often the source system is dated and the data has become polluted over the years. Users of the current system have learned to work around these imperfections, but after migration that tolerance cannot be assumed in the new environment.
Data cleansing requires a lot of effort, and this can get in the way of a smooth migration. In this article, we show how we address data quality, data cleansing and enrichment, so that the data is correct and up to date after migration.
Measuring
To address data quality, we need to identify errors or potential errors in the data. We recognize four types of checks. Syntax or input checks determine whether a field contains an acceptable value: for example, a date within a certain range. Business rules mainly concern combinations of fields: if a contract is to be canceled, the end date must be filled in. Referential checks concern links between objects: a loan requires linked collateral or securities. Pattern validation concerns the logical sequence of events, often involving timestamps: children are not older than their parents, and an application cannot be completed before it was submitted.
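As a minimal sketch, the four check types can each be expressed as a simple predicate on a record. The record layout and field names below (start_date, cancel_requested, collateral_ids, and so on) are illustrative assumptions, not a prescribed data model:

```python
from datetime import date

# Illustrative record: a loan contract as a plain dict; all field names are assumptions.

def syntax_check(record) -> bool:
    """Input check: the start date must lie within an acceptable range."""
    return date(1900, 1, 1) <= record["start_date"] <= date.today()

def business_rule_check(record) -> bool:
    """Business rule: a contract to be canceled must have an end date."""
    return not record["cancel_requested"] or record["end_date"] is not None

def referential_check(record, collateral_index) -> bool:
    """Referential check: every linked collateral id must actually exist."""
    return all(cid in collateral_index for cid in record["collateral_ids"])

def pattern_check(record) -> bool:
    """Pattern validation: an application cannot be completed before submission."""
    return record["submitted_at"] <= record["completed_at"]
```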
In addition to incorrect or missing data, duplicate entries are a common problem; everyone recognizes the double-entered relations in a CRM system. There are methods to detect these, for example fuzzy matching. During data analysis, we look for possible errors in the data; after that, a cleansing plan is made.
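As a sketch of such fuzzy matching, the Python standard library’s SequenceMatcher can score the similarity of relation names; the 0.8 threshold below is an illustrative choice that would need tuning in practice:

```python
from difflib import SequenceMatcher
from itertools import combinations

def similarity(a: str, b: str) -> float:
    """Similarity score in [0, 1] between two normalized names."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def candidate_duplicates(names, threshold=0.8):
    """Flag every pair of names whose similarity reaches the threshold."""
    return [(x, y) for x, y in combinations(names, 2)
            if similarity(x, y) >= threshold]

# Flags ("Jan Jansen", "J. Jansen") as a possible duplicate for manual review.
print(candidate_duplicates(["Jan Jansen", "J. Jansen", "Piet de Vries"]))
```

Flagged pairs are candidates only; an employee still decides which records really describe the same relation.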
Cleansing
Obviously, in order to cleanse data, we must have the correct data. Sometimes this is available internally within the organization, for example in documents or in other systems. Sometimes the correct data must be sought externally, for example in independent reference files; the Dutch postal code table is a well-known example.
Cleansing can be manual or automated. We recognize four variants. Looking up: the correct data is available internally but must be looked up manually, after which an employee enters the correct values. Finding out: the correct data is not available internally and must be found out, for example by calling the customer. Updating: the correct values are available internally and can be extracted automatically to be used during migration. Enriching: external data sources are used and the correct data is added to the migration environment. Updating or enriching missing or erroneous data is done during the migration itself: the correct data is added to the migration environment as an additional data source, as the sketch below illustrates. This way, the audit trail remains intact.
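A minimal sketch of the enriching variant, assuming a hypothetical excerpt of an external postal code table. The point is that corrections live in a separate data source and the original record is never overwritten, which is what keeps the audit trail intact:

```python
# Hypothetical excerpt of an external reference file (the postal code table).
postcode_table = {
    "1012JS": ("Amsterdam", "Dam"),
    "3011AD": ("Rotterdam", "Botersloot"),
}

def enrich_address(record: dict) -> dict:
    """Return an enriched copy of the record; the source row stays untouched."""
    enriched = dict(record)
    key = (record.get("postcode") or "").replace(" ", "").upper()
    match = postcode_table.get(key)
    if match:
        enriched["city"], enriched["street"] = match
    return enriched

source_row = {"postcode": "1012 js", "city": None, "street": None}
print(enrich_address(source_row))  # city and street filled from the reference table
```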
Test data
We migrate only data that is correct. If data is incorrect, we filter out the data object in question and, if necessary, the related objects as well. The migrated data is then always correct and consistent. Our experience is that data pollution is always worse than expected and that cleansing takes time. At the start of the migration project, sometimes only a little data gets through all the filters. For testing purposes, however, it is desirable to have sufficient converted data at an early stage. For this reason, in the beginning we choose to detect incorrect data but not to filter it; if necessary, incorrect data is temporarily replaced by dummy data. This way, test data for the target system is available at a very early stage, as the sketch below shows.
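The following sketch contrasts the two modes, assuming checks expressed as field-level predicates; the dummy values and the mode switch are illustrative, not part of any specific tooling:

```python
# Illustrative dummy values per field; chosen to be recognizable as placeholders.
DUMMIES = {"email": "dummy@example.invalid", "end_date": None}

def migrate(records, checks, mode="dummy"):
    """mode="dummy": replace bad values so test data exists early.
    mode="filter": pass only fully correct objects, reject the rest."""
    migrated, rejected = [], []
    for rec in records:
        failed = [field for field, check in checks.items() if not check(rec)]
        if not failed:
            migrated.append(rec)
        elif mode == "dummy":
            patched = dict(rec)
            for field in failed:
                patched[field] = DUMMIES.get(field)
            migrated.append(patched)  # usable for testing, flagged for cleansing
        else:
            rejected.append(rec)
    return migrated, rejected
```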
Often the migration project and the realization or configuration of the target system run in parallel, so the migration project has to deal with a moving target. We provide continuous integration from the source to a test environment of the target system, and we continuously monitor both the progress of cleansing and the functioning of the target system with migrated data.
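As a sketch of one such continuously repeated run, the stage functions below are assumptions about how the steps could be wired together; they do not refer to any specific tool:

```python
def run_pipeline(extract, validate, load, report):
    """One continuous-integration cycle from source to the target test environment."""
    records = extract()                 # read from the source system
    valid, invalid = validate(records)  # apply all quality checks
    load(valid)                         # refresh the target test environment
    report(total=len(records), clean=len(valid), polluted=len(invalid))
```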
Continuous improvement
With every data migration, data quality is an issue. It needs attention at that time to make the data migration successful. But for many organizations, data quality is always an issue. It may not always be perceived as urgent; after all, the costs often remain hidden. However, it does lead to employee frustration, customer dissatisfaction, unnecessary risks and missed opportunities. Data quality needs more attention: continuous monitoring and improvement.
The way we address data quality during a data migration project can be applied at any time. Using alerts, we check periodically or continuously whether the data meets all criteria, and we monitor the trends. Is quality improving? At what points does it degrade, and what is the cause? Such an analysis helps to implement process improvements or system modifications that prevent errors, as the sketch below illustrates.
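A minimal sketch of such periodic monitoring, assuming the same field-level checks as above; the 5% alert threshold and the in-memory history are illustrative simplifications:

```python
from datetime import date

history = []  # one entry per monitoring run; a real setup would persist this

def monitor(records, checks, alert_threshold=0.05):
    """Record the error rate per check and raise an alert on degradation."""
    rates = {name: sum(not check(r) for r in records) / len(records)
             for name, check in checks.items()}
    history.append({"run_date": date.today(), **rates})
    for name, rate in rates.items():
        if rate > alert_threshold:
            print(f"ALERT: {name} error rate {rate:.1%} exceeds threshold")
    return rates
```

Comparing successive entries in the history makes trends visible, which is what turns one-off cleansing into continuous improvement.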
Improving data quality is necessary for the success of a data migration, but it also ensures acceptance of the new system and proper execution of the business process. All the efforts required to improve data quality pay off quickly.
This is the fourth in a series of articles on data migration. Previously published: Data migration: Towards a correctly functioning new information system, Data migration: Big bang or dripping data? and Data migration: From data mapping to dataflow. For more information about the Sopra Steria Solution for Data Migration, please contact Henk Zwaan, management consultant at the Data Engineering & Data Migration practice: henk.zwaan@ordina.nl.