Data quality: how to implement a good data quality strategy?
Data quality is one of the many challenges every organization must address as data grows exponentially. Setting aside the problems of data storage and protection, the most important issues are the following:
- data analysis: organizations need to be able to analyze data to transform it into useful, actionable information, improve operations and make informed decisions,
- data quality, an essential prerequisite for analysis: data integrity must be ensured in order to guarantee accurate, relevant and appropriate results.
What is data quality? 🤔
Data quality is a set of metrics that enable you to judge the relevance and usability of your data. Dealing with data quality means being able to measure the accuracy, completeness, integrity and timeliness of your data (a small measurement sketch follows the list below):
- accuracy means that the data is correct and consistent,
- completeness means that no expected values are missing,
- integrity means that data is protected against unauthorized modification, deletion or addition,
- timeliness means that the data is up to date.
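To make these metrics concrete, here is a minimal sketch in plain Python that computes two of them, completeness and timeliness, over a handful of records. The records, field names and one-year freshness threshold are invented for the example.

```python
from datetime import datetime, timedelta, timezone

# Invented customer records; the field names are illustrative only.
records = [
    {"name": "Emma Dupont", "zip": "75011",
     "updated_at": datetime(2024, 5, 2, tzinfo=timezone.utc)},
    {"name": "Jean Martin", "zip": None,
     "updated_at": datetime(2021, 1, 15, tzinfo=timezone.utc)},
    {"name": None, "zip": "69002",
     "updated_at": datetime(2024, 6, 1, tzinfo=timezone.utc)},
]

def completeness(records, fields):
    """Share of non-missing values across the given fields."""
    total = len(records) * len(fields)
    filled = sum(1 for r in records for f in fields if r.get(f) is not None)
    return filled / total

def timeliness(records, max_age):
    """Share of records updated within the allowed age window."""
    now = datetime.now(timezone.utc)
    return sum(1 for r in records if now - r["updated_at"] <= max_age) / len(records)

print(f"completeness: {completeness(records, ['name', 'zip']):.0%}")  # 4 of 6 values filled
print(f"timeliness:   {timeliness(records, timedelta(days=365)):.0%}")
```

Accuracy and integrity are harder to score automatically: the former requires a trusted reference to compare against, and the latter is enforced by access controls and audit trails rather than computed from the data itself.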
In many organizations today, data is produced at high speed and on a massive scale, making it difficult to manage and control. Data can be:
- incomplete, incorrect or even aberrant,
- recorded in different formats and in different storage systems, which complicates its interpretation.
To remedy these difficulties, implementing a data quality policy is a major undertaking. High-quality data is the key to informed decision-making in every sector and discipline. Data quality processes are essential to confidence and accuracy, in terms of both the quantity of information gathered and its reliability.
The more efficiently your data is collected, checked, corrected and harmonized, the better your conclusions will be, and the more relevant your decisions will be.
It is therefore essential to determine how to control and improve data quality, in order to put in place the governance rules needed to guarantee this quality in the long term.
For a more detailed introduction to these fundamentals and to find out how to apply them concretely in your business, read our dedicated article: 👉 What is Data Quality? Everything you need to know to master data quality in the workplace
This article also explores the strategic and operational challenges of Data Quality, while highlighting practical approaches and tools to support you in your approach.
Why is data quality a business issue?
Data quality is in fact a recurring problem, for three main reasons:
- Human input regularly creates new inconsistencies or duplicates (in CRM, ERP or HR software, for example). Some of these errors can be avoided by advanced data-entry checks, such as immediate verification of a town name or zip code. Not all of them can, however, especially errors of consistency between information entered in different fields; our customers mostly discover this type of error when migrating data to a new tool.
- In the IoT field, for example, sensors are not immune to failure: they can emit outliers or behave erratically in the interval between two measurements.
- In machine learning, a predictive model may well have been trained on high-quality data, but putting it into production means confronting it with data it has never seen. If the quality of the input data declines over time (missing values, outliers), the accuracy of the predictions, which is by nature very sensitive to data quality, drops significantly, and the model may end up producing almost anything. Running AI in production therefore requires continuous monitoring of data quality, as the sketch below illustrates.
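As an illustration of what such monitoring might look like, here is a minimal sketch in plain Python. It profiles a numeric feature at training time, then flags a production batch whose missing-value or out-of-range rates drift beyond a tolerance; the 3-sigma bounds, the 5% tolerance and the sensor values are assumptions for the example, not a production recipe.

```python
import statistics

def feature_profile(values):
    """Summarize a numeric feature: missing rate and a plausible value range."""
    present = [v for v in values if v is not None]
    mean, stdev = statistics.mean(present), statistics.pstdev(present)
    return {
        "missing_rate": 1 - len(present) / len(values),
        "low": mean - 3 * stdev,   # simple 3-sigma bounds
        "high": mean + 3 * stdev,
    }

def check_batch(values, profile, tolerance=0.05):
    """Flag drift when missing or out-of-range rates exceed the training baseline."""
    present = [v for v in values if v is not None]
    missing_rate = 1 - len(present) / len(values)
    outliers = sum(1 for v in present if not profile["low"] <= v <= profile["high"])
    alerts = []
    if missing_rate > profile["missing_rate"] + tolerance:
        alerts.append(f"missing rate {missing_rate:.0%} above training baseline")
    if outliers / len(present) > tolerance:
        alerts.append(f"{outliers / len(present):.0%} of values outside expected range")
    return alerts

training = [21.0, 22.5, 20.8, 23.1, 21.7, 22.0, 20.5, None]    # readings seen at training time
production = [21.2, None, None, 95.0, 22.1, None, 20.9, 21.5]  # degraded live feed
print(check_batch(production, feature_profile(training)) or "no drift detected")
```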
Data quality: how to detect input errors?
The first step in a data quality control process is error detection, to correct incomplete, incorrect or aberrant data.
The main sources of data anomalies
Data errors, however marginal, can have a huge impact on business decisions when those decisions are based on:
- dashboards built from data of inadequate quality, possibly containing duplicates (e.g. duplicates in a customer database are a major obstacle to identifying your best customers, since there is no Single Customer View),
- or, more technically, predictive models (neural networks, random forests, logistic regression), which are by their very nature extremely sensitive to inaccurate or incomplete data during the learning phase.
Data anomalies can come from a wide variety of sources: erroneous or illegible manual entries, transmission failures, conversion problems, incomplete or unsuitable processes, and so on. It is important to be able to identify the sources and types of errors, so that we can understand, prevent and correct them.
Implementing regular, automated quality control rules then ensures that errors are spotted and can be corrected before they affect decision-making.
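By way of illustration, such rules can start very simply. The sketch below (plain Python; the rules and field names are invented for the example) runs a small rule set over incoming records and reports every violation before the data reaches a dashboard or a model.

```python
import re

# Illustrative rules only; a real deployment would manage them centrally
# and run them automatically on every new batch.
RULES = {
    "zip must be 5 digits": lambda r: r["zip"] is None or re.fullmatch(r"\d{5}", r["zip"]),
    "email must contain @": lambda r: r["email"] is None or "@" in r["email"],
    "name is required":     lambda r: bool(r["name"]),
}

def run_quality_checks(records):
    """Return every (record index, failed rule) pair."""
    return [(i, rule) for i, r in enumerate(records)
            for rule, check in RULES.items() if not check(r)]

records = [
    {"name": "Emma Dupont", "zip": "75011", "email": "emma@example.com"},
    {"name": "", "zip": "7501A", "email": "broken-address"},
]
for index, rule in run_quality_checks(records):
    print(f"record {index}: failed '{rule}'")
```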
Working on data quality means recognizing that humans are not the only influence on it: data entry errors can also result from what is known as "bad encoding", i.e. poor transcription.
Detecting input errors can be tricky, especially when you are dealing with duplicates or "near-duplicates". When a single letter is mistyped (a typo), the error is extremely difficult, if not impossible, to detect with tools such as Excel or even SQL.
To improve data quality, we need to be in a certain frame of mind: recognizing that these errors can exist, even if we don't see them at first. 😇
Detecting data quality problems with specialized functions
To move from the "blind" to the "seeing" stage, you can use solutions featuring artificial intelligence functions such as fuzzy logic. This technique detects input errors when two values are close without being identical: we call these "near-duplicates". Fuzzy logic makes it possible to compare names of people that have been entered differently (a small illustration follows the list), such as:
- 'Emma Dupont' and 'Emma Dupond',
- 'Emma Dupond' and 'Emma née Dupond' (the word 'née' has been added),
- 'Malaurie', 'Malorie' or even 'Mallorie'.
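To give a rough idea of how such a comparison works (dedicated engines use far more sophisticated techniques), Python's standard difflib can already score the similarity between two spellings; the 0.75 threshold below is an arbitrary choice for the example.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity ratio between 0 (completely different) and 1 (identical)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

pairs = [
    ("Emma Dupont", "Emma Dupond"),
    ("Emma Dupond", "Emma née Dupond"),
    ("Malaurie", "Mallorie"),
]
for a, b in pairs:
    score = similarity(a, b)
    verdict = "possible near-duplicate" if score >= 0.75 else "distinct"
    print(f"{a!r} vs {b!r}: {score:.2f} -> {verdict}")
```

Real-world matching also has to cope with swapped first and last names, accents, abbreviations and phonetic variants, which is where dedicated fuzzy-matching engines earn their keep.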
Traditional tools, such as Excel, are ill-suited to identifying 'matching' data. By using more advanced solutions, based on artificial intelligence, it is possible to:
- detect anomalies much more effectively, correct them, normalize textual data and deduplicate, thereby improving data quality,
- automate these detection and correction operations and integrate them into data pipelines.
If the very first step is awareness, i.e. admitting that your data contains anomalies, the next is admitting that these anomalies have a cost for the organization. And the real cost of data quality problems can be difficult to assess.
At Tale of Data, we propose breaking these costs down along two dimensions, which makes it easier to measure the impact of data quality problems on your business:
- hidden costs versus direct costs (direct in the sense of visible),
- operational costs versus strategic costs.
Here's a matrix to illustrate what we mean:
[Matrix: hidden vs. direct (visible) costs, crossed with operational vs. strategic impact]
"Businesses need to take pragmatic, targeted steps to improve their enterprise data quality if they want to accelerate the digital transformation of their organization," says Gartner, whose recent study estimates that poor data quality costs companies an average of $12.9 million a year.
But detecting anomalies isn't the only issue at stake in data quality: working with heterogeneous data is also a challenge that needs to be met.
How can heterogeneous data be processed to improve data quality?
Managing heterogeneous data has become increasingly necessary with the explosion of data and the proliferation of data sources within organizations.
Data is rarely analyzed on its own. To analyze it, it is often necessary to combine it with other data, to group it together or enrich it.
To process heterogeneous data, make it more coherent and consistent, and thus facilitate its combined use, two steps are necessary:
- Identify the sources: first identify all your data sources and their respective formats. It is not the most exciting step, but it is the one that will dictate the success of your heterogeneous data quality project.
- Harmonize the format: this step involves creating a common format for all data, wherever it comes from. Choosing this format can be tricky, but it is crucial: it is what allows all your data to be interpreted by a single computer system. Without this harmonization, it is impossible to link data together, which is a major problem for tasks such as improving the quality of product catalog data. It is therefore essential to transform, or "normalize", your data according to the standard you have chosen.
To illustrate harmonization, let's take the example of a company collecting product references from different suppliers. Harmonizing data means using the same format for each type of information.
You may have to decide on questions such as these (a normalization sketch follows the list):
- How many characters should a product reference contain: 8, 12 or more?
- Will the character string consist exclusively of digits, exclusively of letters, or a mix of the two?
- Will the beginning of the string carry a particular meaning, with the first two letters indicating the country of manufacture, a warehouse code or a supplier code?
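As a sketch of what such normalization might look like, suppose the chosen standard is a two-letter country code followed by eight digits (an invented convention for this example). Supplier references in looser formats are then coerced into it:

```python
import re

# Assumed house standard for the example: CC-NNNNNNNN
# (two-letter country of manufacture, then eight digits), e.g. "FR-00012345".
TARGET = re.compile(r"[A-Z]{2}-\d{8}")

def normalize_reference(raw, default_country="FR"):
    """Coerce a supplier reference into the chosen house format."""
    ref = raw.strip().upper().replace(" ", "")
    if TARGET.fullmatch(ref):          # already compliant
        return ref
    m = re.fullmatch(r"([A-Z]{2})-?(\d{1,8})", ref)
    if m:                              # country prefix present, digits unpadded
        return f"{m.group(1)}-{int(m.group(2)):08d}"
    if ref.isdigit() and len(ref) <= 8:
        return f"{default_country}-{int(ref):08d}"  # bare digits: assume default country
    raise ValueError(f"cannot normalize reference: {raw!r}")

for raw in ["FR-00012345", "fr 12345", "12345", "DE-7"]:
    print(f"{raw!r:15} -> {normalize_reference(raw)}")
```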
When handling heterogeneous data sources (i.e. from different "silos"), you need to create correspondence tables and "repositories of repositories". Fuzzy logic is indispensable for matching two representations of the same entity. For example, in the case of a product database, the solution you use must be capable of automatically matching (with a confidence coefficient if possible) the following two products, as the sketch after the list illustrates:
- HUAWEI MediaPad M5 10.8
- HUAWEI M5 10.8"
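Here is one simple way such a match could be scored, blending token overlap with raw string similarity into a confidence coefficient. The weights and the formula are illustrative assumptions, not the method a real matching engine uses.

```python
from difflib import SequenceMatcher

def tokens(label):
    """Lower-case tokens, stripped of punctuation such as the inch mark."""
    return {t.strip('"') for t in label.lower().split()}

def match_confidence(a, b):
    """Blend token overlap (Jaccard) with raw string similarity."""
    ta, tb = tokens(a), tokens(b)
    overlap = len(ta & tb) / len(ta | tb)
    raw = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return round(0.7 * overlap + 0.3 * raw, 2)

a, b = "HUAWEI MediaPad M5 10.8", 'HUAWEI M5 10.8"'
print(f"{a!r} ~ {b!r}: confidence {match_confidence(a, b)}")
```

A pipeline would typically accept matches above a high threshold automatically and route intermediate scores to a human for review.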
Heterogeneous data processing is therefore essential for exploiting the wealth of corporate data and building bridges between information from different sources.
A data quality policy with the right metrics to assess accuracy and completeness is necessary, but not sufficient. To guarantee lasting quality, and therefore a sustainable policy, it is essential to succeed in the industrialization stage. There can be no data quality policy without automating the quality control of incomplete or incorrect data from different storage systems.