Estimated reading time: 4 minutes
Data profiling is the process of generating statistics about a data set so that readers of those metrics can understand the quality of the data.
Usually this is one of the many functions of a data analyst.
Many organisations have data quality issues, and the ability to identify and fix them proactively helps avoid many customer and operational problems.
As a result, it can help to identify errors in data that may:
- Feed into reports.
- Reduce the effectiveness of machine learning outputs.
- Have a regulatory impact on reports submitted and how their effectiveness is measured.
- Irritate customers who receive communications containing incorrect data.
- Cause batch processes to fail, reducing the effectiveness of automated tasks.
To understand how to implement an effective data profiling process, it is essential to identify where in the data the issues may occur:
- Data entry by a human.
- Imported data that has not been cleansed.
- Third-party systems feeding you data that contains errors.
- Company takeovers, where integrated data contains errors.
The amount of data now collected and stored in big data systems needs a process to manage and capture errors.
So what are the different ways to profile data?
To ensure a high level of data quality, you would look at some of the following techniques (a short code sketch of these checks follows the list):
- Completeness – Does the data available represent a complete picture of the data that should be present?
- Conformity – Is the data conforming to the correct structure as would be expected when you observe it?
- Consistency – If you have the same data in two different systems, are the values the same?
- Accuracy – Is the data present accurate? Inaccurate data can invalidate any decisions made on the back of it, which could have knock-on effects.
- Uniqueness – If there are properties of the data that should be unique, does the data set show that?
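Most of these checks can be expressed as a few lines of analysis code. The sketch below is a minimal illustration using Python and pandas; the file names and column names (customers.csv, crm_extract.csv, customer_id, dob, email) are assumptions for illustration rather than a prescribed schema, and accuracy is left out because it usually needs a trusted reference source rather than a simple rule.

```python
import pandas as pd

# Hypothetical customer extract; file and column names are assumptions.
df = pd.read_csv("customers.csv")

# Completeness: what fraction of each column is actually populated?
completeness = df.notna().mean()

# Conformity: do values match the structure we expect (e.g. an email pattern)?
email_ok = df["email"].str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", na=False)
conformity = email_ok.mean()

# Uniqueness: are supposedly unique keys actually unique?
duplicate_ids = df["customer_id"].duplicated().sum()

# Consistency: does the same customer carry the same DOB in a second system?
other = pd.read_csv("crm_extract.csv")  # hypothetical second system
merged = df.merge(other, on="customer_id", suffixes=("_a", "_b"))
dob_mismatches = (merged["dob_a"] != merged["dob_b"]).sum()

print(completeness, conformity, duplicate_ids, dob_mismatches, sep="\n")
```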
When should data profiling take place?
This will depend on the organisation and the process that relies on it.
We will outline some different scenarios that may influence how to approach this.
1. Straight-through processing – If you are looking to automate, there will be a need to ensure that no automated process fails.
As a result, there will be a need to check the data before it feeds a new system. Some steps that could be implemented include (see the sketch after this list):
- Scan the data source for known data issues.
- Apply logic to fix any data issues found.
- Feed the data to its destination once all corrections have been made.
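As a rough illustration of the scan, fix and feed steps above, here is a minimal sketch in Python with pandas. The rules, column names, and the load_to_destination() stub are all assumptions standing in for whatever your own pipeline uses.

```python
import pandas as pd

def scan_for_known_issues(df: pd.DataFrame) -> pd.DataFrame:
    """Return the rows that fail the known data quality rules."""
    missing_dob = df["dob"].isna()
    bad_email = ~df["email"].str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", na=False)
    return df[missing_dob | bad_email]

def apply_fixes(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the corrections we know how to make automatically."""
    fixed = df.copy()
    fixed["email"] = fixed["email"].str.strip().str.lower()
    return fixed

def load_to_destination(df: pd.DataFrame) -> None:
    """Stand-in for whatever loads data into the destination system."""
    df.to_csv("destination_feed.csv", index=False)

def feed(df: pd.DataFrame) -> None:
    """Only pass the data on once all known corrections have been made."""
    if not scan_for_known_issues(df).empty:
        df = apply_fixes(df)
    if not scan_for_known_issues(df).empty:
        # A new kind of error: stop here rather than let the automated process fail.
        raise ValueError("Unresolved data issues - manual review required")
    load_to_destination(df)
```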
Problems that may occur with this:
- New errors – how do you handle them? Do you let them occur, fix them, and then update the logic so they are caught in the future?
- This leads to fixes being required in the destination system, which leads to more downstream fixing of data.
- You can't control data with errors coming in; you need to report on it and validate the updates that are required.
2. Batch processing – In this scenario, there is a delay in feeding the data, as the data has to be available to feed into the destination system.
As with the automated process, there is some level of automation, but there is more control around when the data is provided, and it can be paused or rerun. Some of the steps that can be implemented include (a sketch of a simple quality report follows this list):
- Scan the data and provide a report on its quality. Fix the data if errors are found, then upload.
- Allow the data to load, and then, using a report, fix it in the downstream system.
- Work with the providers of the data to improve the quality of the data received.
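To make the "scan and report" step concrete, here is a minimal sketch of a batch quality report in Python with pandas. The file name, columns, and the 95% completeness threshold are assumptions for illustration only.

```python
import pandas as pd

def quality_report(df: pd.DataFrame) -> pd.DataFrame:
    """Summarise simple per-column quality metrics before a batch load."""
    return pd.DataFrame({
        "populated_pct": df.notna().mean() * 100,
        "distinct_values": df.nunique(),
    })

batch = pd.read_csv("daily_extract.csv")  # hypothetical batch file
report = quality_report(batch)
print(report)

# Because the batch can be paused, hold it back if quality is clearly too low.
if (report["populated_pct"] < 95).any():
    raise SystemExit("Batch held for review: a column is under 95% populated")
```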
Scenarios where data profiling can be applied
| Measurement | Scenario example | Impact |
| --- | --- | --- |
| Completeness – Does the data available represent a complete picture of the data that should be present? | DOB not populated | Can't be used as part of security checks when discussing the account with the customer, and values that depend on the DOB may be miscalculated. |
| Conformity – Is the data conforming to the correct structure as would be expected when you observe it? | Email address incorrect | Emails to customers bounce back; follow-up is needed to correct them, and the customer does not get proper communication. |
| Consistency – If you have the same data in two different systems, are the values the same? | Data stored on different systems needs to be exactly the same. | The customer could be sent different versions of the same data. |
| Accuracy – There will be a need to ensure that the data present is accurate; inaccurate data can invalidate decisions made on the back of it, with knock-on effects. | Inaccurate data leads to incorrect decisions. | Sending out communications to the wrong set of customers, who don't expect or need the information. |
| Uniqueness – If there are properties of the data that should be unique, does the data set show that? | The same data is populated for different sets of independent customers. | No visibility of the customer and their actual correct data; incorrect information is processed for them. Financial and reputational risk could also be a problem. |
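As one concrete example from the last row of the table, a uniqueness check can be as simple as counting how many different customers share contact details that ought to be unique. The sketch below is illustrative only; the file and column names are assumptions.

```python
import pandas as pd

df = pd.read_csv("customers.csv")  # hypothetical customer extract

# Count how many distinct customer IDs share the same email and phone number.
shared = (
    df.groupby(["email", "phone"])["customer_id"]
      .nunique()
      .reset_index(name="customer_count")
)

# Any row with more than one customer points at a potential uniqueness issue.
suspects = shared[shared["customer_count"] > 1]
print(suspects)
```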