What is data profiling and its benefits?

Estimated reading time: 4 minutes

Data profiling is the process of generating statistics about a data set so that anyone reading those metrics can understand the quality of that data.

Many organisations have data quality issues, and the ability to identify and fix them proactively helps with many customer and operational problems.

As a result, it can help to identify errors in data that may:

  • Feed into reports.
  • Reduce the effectiveness of machine learning outputs.
  • Have a regulatory impact on reports submitted and how their effectiveness is measured.
  • Irritate customers who receive communications containing incorrect data.
  • Cause batch processes to fail, reducing the effectiveness of automated tasks.

To implement an effective data profiling process, it is essential to identify where in the data the issues may occur:

  • Data entered manually by a human.
  • Imported data that has not been cleansed.
  • Third-party systems feeding you data that contains errors.
  • Company takeovers, where integrated data contains errors.

The amount of data now collected and stored in big data systems needs a process to manage and capture errors.

So what are the different ways to profile data?

To ensure a high level of data quality, you would look at some of the following techniques, illustrated with a short code sketch after this list:

  • Completeness – Does the data available represent a complete picture of the data that should be present?
  • Conformity – Is the data conforming to the correct structure as would be expected when you observe it?
  • Consistency – If you have the same data in two different systems, do they hold the same values?
  • Accuracy – Is the data present accurate? Inaccurate data can make any decisions based on it incorrect, which could have knock-on effects.
  • Uniqueness – If there are properties of data that are unique, does the data set reflect that?
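As a rough illustration of how a few of these checks might look in practice, here is a minimal sketch using Python and pandas; the sample data, column names (customer_id, dob, email) and rules are made up for the example.

```python
import pandas as pd

# Sample customer data with some deliberate quality issues.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "dob": ["1980-05-01", None, "1975-11-23", "1990-06-15"],
    "email": ["a@example.com", "bad-email", "c@example.com", None],
})

# Completeness: what proportion of each column is populated?
print("Completeness:\n", df.notna().mean())

# Conformity: does each email match a simple expected pattern?
conforms = df["email"].str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", na=False)
print("Emails conforming to the expected pattern:", conforms.sum(), "of", len(df))

# Uniqueness: customer_id should identify exactly one row each.
print("Duplicate customer_id values:", df["customer_id"].duplicated().sum())
```

In a real process, the same checks would run against the full data set and feed into a quality report rather than being printed to the screen.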

When should data profiling take place?

This will depend on the organisation and the process that relies on it.

We will outline some different scenarios that may influence how to approach this.

1. Straight-through processing – If you are looking to automate, there will be a need to ensure that no automated process fails.

As a result, there will be a need to check the data before it feeds a new system. Some steps that could be implemented include (sketched in code after this list):

  • Scan the data source for known data issues.
  • Apply logic to fix any data issues found.
  • Feed the data to its destination once all corrections have been made.
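Here is a minimal sketch of that scan, fix and feed pattern, assuming the data sits in a pandas DataFrame; the file names, column names and fix rules below are placeholders for illustration.

```python
import pandas as pd

def scan_for_known_issues(df: pd.DataFrame) -> pd.Series:
    """Flag rows with known issues, e.g. a missing date of birth or a blank email."""
    return df["dob"].isna() | df["email"].fillna("").str.strip().eq("")

def apply_fixes(df: pd.DataFrame) -> pd.DataFrame:
    """Apply simple, pre-agreed corrections; anything unresolved is reported instead."""
    fixed = df.copy()
    fixed["email"] = fixed["email"].str.strip().str.lower()
    return fixed

# Placeholder file names: scan the source, fix what can be fixed, then feed the destination.
df = pd.read_csv("source_extract.csv")
df = apply_fixes(df)
issues = scan_for_known_issues(df)
if issues.any():
    print("Unresolved issues in rows:", df.index[issues].tolist())
else:
    df.to_csv("destination_feed.csv", index=False)
```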

Problems that may occur with this:

  • New errors – how do you handle them? Do you let them occur, fix them, and update the logic so they are caught in the future?
  • This leads to fixes being required in the destination system, which leads to more downstream fixing of data.
  • You can't control data with errors coming in; you need to report on it and validate the updates that are required.

2. Batch processing – In this scenario, there is a delay in feeding the data, as the data has to be available to feed into the destination system.

As with the automated process, there is some level of automation, but there is more control around when the data is provided, and it can be paused or rerun. Some of the steps that can be implemented include (a quality-report sketch follows this list):

  • Scan the data and provide a report on its quality. Fix the data if errors are found, then upload.
  • Allow the data to load, and then using a report, fix it in a downstream system.
  • Work with the providers of the data to improve the data quality of the data received.
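As an illustration of the first step, here is a minimal sketch of a column-level quality report with pandas; the file name and the 95% threshold are assumptions for the example.

```python
import pandas as pd

def quality_report(df: pd.DataFrame) -> pd.DataFrame:
    """Summarise basic quality metrics per column before the batch is loaded."""
    return pd.DataFrame({
        "populated_pct": (df.notna().mean() * 100).round(1),
        "distinct_values": df.nunique(),
        "blank_strings": df.apply(
            lambda col: col.astype(str).str.strip().eq("").sum()
        ),
    })

batch = pd.read_csv("incoming_batch.csv")   # placeholder batch file
report = quality_report(batch)
print(report)

# Because the batch can be paused, it could be held back until the report looks acceptable.
if (report["populated_pct"] < 95).any():
    print("Quality below threshold - hold the batch and notify the data provider.")
```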

Scenarios where data profiling can be applied

For each measurement, a scenario example and its impact are shown below:

  • Completeness – Does the data available represent a complete picture of the data that should be present? Scenario example: the DOB is not populated. Impact: it can't be used as part of security checks when speaking with the customer, and values that depend on the DOB may be miscalculated.
  • Conformity – Is the data conforming to the correct structure as would be expected when you observe it? Scenario example: an email address is incorrect. Impact: emails to customers bounce back, follow-up is needed to correct them, and the customer does not get proper communication.
  • Consistency – If you have the same data in two different systems, are they the same values? Scenario example: data stored on different systems needs to be exactly the same. Impact: the customer could be sent different versions of the same data.
  • Accuracy – There will be a need to ensure that the data present is accurate. Scenario example: inaccurate data means incorrect decisions. Impact: communications are sent out to the wrong set of customers, who don't expect or need the information.
  • Uniqueness – If there are properties of data that are unique, does the data set show that? Scenario example: the same data is populated for different sets of independent customers. Impact: no visibility of the customer and their actual correct data, incorrect information processed for them, and potential financial and reputational risk.

How to data cleanse a database table

In data analytics, that is a very relevant question, and something I look to implement in most projects; it is sometimes too easy to click the shortcut icon to your spreadsheet application!

Here we are looking to bring a little automation into these videos. Building on How to import data from files and Removing characters from an imported CSV file, this video connects to a Microsoft Azure cloud database table, brings in the data with errors in it, fixes the errors, and displays the corrected output on the screen.
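The video walks through its own example; as a rough sketch of the same idea, the snippet below connects to an Azure SQL database with pyodbc, pulls a table into pandas, applies a simple cleansing rule, and prints the result. The connection details, table name and column name are placeholders, and the cleaning rule is illustrative rather than the exact logic from the video.

```python
import pandas as pd
import pyodbc

# Placeholder connection details for an Azure SQL database - substitute your own.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=your-server.database.windows.net;"
    "DATABASE=your_database;UID=your_user;PWD=your_password"
)

# Bring in the table that contains the errors (table and column names are illustrative).
df = pd.read_sql("SELECT * FROM customer_data", conn)

# Example fix: strip whitespace and unwanted characters from a text column.
df["customer_name"] = (
    df["customer_name"].astype(str).str.strip().str.replace(r"[^\w\s'-]", "", regex=True)
)

# Display the corrected output on screen.
print(df.head())
conn.close()
```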

What can this do for organisations?

There are several benefits to automating this step:

  • Less manual intervention if there is a need to fix data issues.
  • Better productivity.
  • Better data flows with no errors and quicker reporting.


Moving away from files

The process of moving away from files and into automation has several steps:

  • Be clear on your data needs.
  • Understand what you are trying to achieve.
  • Build a process that is repeatable but can be updated easily.
  • Ensure that you build in data quality checks; this helps deliver better output to the users.

Thanks for stopping by!

Data Analytics Ireland


How to remove unwanted characters

Estimated reading time: 2 minutes

Removing the unwanted data that is holding you up.
A situation has arisen where you have information that contains erroneous data; what do you do?

Data issues are a common scenario faced by many data analytics professionals and the industry as a whole. Data quality has now become more critical, especially as we move more processes online and the digital landscape expands.

Most pieces of data go through a process of being transferred between systems to be used, or reports rely on their accuracy. If the data in the source system has quality issues and the problem is not addressed before the data goes somewhere else, the quality issues spread further throughout an organisation, expanding like a spider's web.

The next step is looking to fix the problem and planning for it.

To combat this problem, professionals need to come up with a plan for how to tackle it, either:

  • Fix the data at source.
  • Take the data in before moving it on, and investigate the problems.
  • Reject the file, or part of it.

All three options above have scenarios around them with costs and implications; depending on the industry, you need to pick the most appropriate way to handle them. As an example, in the banking industry, payment files can sometimes contain data that is rejected entirely or in part.

But the bank may decide to discard only the records with the wrong data and process everything else, as sketched below.
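A minimal sketch of that "reject in part" approach, assuming the payment file has been read into a pandas DataFrame; the field names and validation rules are made up for the example.

```python
import pandas as pd

# Illustrative payment records - field names and rules are made up for the example.
payments = pd.DataFrame({
    "payment_id": [101, 102, 103],
    "account": ["IE12BOFI90001234567890", "", "IE29AIBK93115212345678"],
    "amount": [250.00, 99.99, -10.00],
})

# A record is valid only if it has an account and a positive amount.
valid = payments["account"].str.strip().ne("") & payments["amount"].gt(0)

# Process the good records, and report the rejected ones separately.
accepted = payments[valid]
rejected = payments[~valid]
print(f"Processing {len(accepted)} payments, rejecting {len(rejected)}:")
print(rejected)
```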

How to go about it and how regular expressions can help

In this video, we look to go through an example of how to cleanse a data set:

(A) We use a list to check what problems we need to find.

(B) We use functions again to process the data, find the problems, and extract them.

(C) Regular expressions also appear as they look to find the special characters in the data set.

The concept of regular expressions is used extensively across several programming languages; it is a good way to test data and find erroneous values. If you are thinking about machine learning, it is quite important to get a more thorough knowledge of how they work. Here is a good link for further reading if you need more information: Regular Expression how to.
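As a rough sketch of the approach described in (A), (B) and (C), the snippet below keeps a list of problem characters, wraps the check in a small function, and uses a regular expression to find and strip them; the sample values and the character list are assumptions for illustration, not the exact ones used in the video.

```python
import re

# (A) A list of the special characters we treat as problems for this example.
problem_chars = ["!", "@", "#", "$", "%", "^", "&", "*"]

# (C) A regular expression built from that list to find the special characters.
pattern = re.compile("[" + re.escape("".join(problem_chars)) + "]")

# (B) A function that scans a value, reports what it found, and extracts it.
def find_and_clean(value: str) -> tuple[list[str], str]:
    found = pattern.findall(value)
    cleaned = pattern.sub("", value)
    return found, cleaned

# Sample values imported from a CSV, some containing unwanted characters.
for value in ["Dublin", "Cork!!", "Gal@way", "Limerick#"]:
    found, cleaned = find_and_clean(value)
    if found:
        print(f"{value!r} contains {found} -> cleaned to {cleaned!r}")
```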

Thanks for watching and if you like, please share and subscribe through the buttons on this page!

Data Analytics Ireland

YouTube channel lists – Python Data Cleansing

Ever had a process where you received a set of data, and it took a bit of effort to cleanse it so it looks the way you want?

The world of data processing and data exchanged between servers and organisations needs careful attention: one person's idea of clean data might not be another's, and the difference between the two can lead to data issues.

Experian has an excellent article, What is data cleansing?, in which they talk about several factors about data:

  • It could be incorrect
  • It could be incomplete
  • It could be duplicated


One of the things they also highlighted is that under GDPR, organisations have to deal with more focus on data being accurate, complete and up to date.

We are putting together several videos in this area over time, so you will start to see them as they go up.

Please like and share through the social media buttons shared on the page here, thanks for watching!

Data Analytics Ireland