What is data profiling and what are its benefits?

Data profiling is the process of generating statistics about a data set so that readers of those metrics can understand the quality of the data.

Usually this is one of the many functions of a data analyst.

Many organisations have data quality issues, and the ability to identify and fix them proactively helps avoid many customer and operational problems.

As a result, it can help to identify errors in data that may:

  • Feed into reports.
  • Reduce the effectiveness of machine learning outputs.
  • Have a regulatory impact on reports submitted and how their effectiveness is measured.
  • Irritate customers who receive communications containing incorrect data.
  • Cause batch processes to fail, reducing the effectiveness of automated tasks.

To implement an effective data profiling process, it is essential to identify where in the data issues may occur:

  • Manual data entry.
  • Imported data that has not been cleansed.
  • Third-party systems feeding you data that contains errors.
  • Company takeovers, where integrated data contains errors.

The amount of data now collected and stored in big data systems needs a process to manage and capture errors.

So what are the different ways to profile data?

To ensure a high level of data quality, you would look at some of the following techniques:

  • Completeness – Does the data available represent a complete picture of the data that should be present?
  • Conformity – Is the data conforming to the correct structure as would be expected when you observe it?
  • Consistency – If you have the same data in two different systems, are the values the same?
  • Accuracy – There will be a need to ensure that the data present is accurate; inaccurate data can make any decisions based on it incorrect, which could have knock-on effects.
  • Uniqueness – If there are properties of the data that should be unique, does the data set show that?
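
To make these checks concrete, here is a minimal sketch in Python with pandas; the data set, column names and rules are all hypothetical, and a real profiling exercise would cover far more rules.

```python
import pandas as pd

# Hypothetical customer extract; in practice this would come from a real source.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "email": ["a@example.com", "bad-email", None, "d@example.com"],
    "dob": ["1980-01-01", "1992-07-14", "1992-07-14", None],
})

# Completeness: what proportion of each column is populated?
completeness = df.notna().mean()

# Conformity: does each email match a simple expected structure?
conforms = df["email"].str.contains(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", na=False)

# Uniqueness: customer_id should be unique; flag any duplicates.
duplicate_ids = df[df["customer_id"].duplicated(keep=False)]

print(completeness)
print(f"{conforms.mean():.0%} of emails conform")
print(duplicate_ids)
```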

When should data profiling take place?

This will depend on the organisation and the process that relies on it.

We will outline some different scenarios that may influence how to approach this.

1. Straight-through processing – If you are looking to automate, there will be a need to ensure that no automated process fails.

As a result, there will be a need to check the data before it feeds a new system. Some steps that could be implemented include:

  • Scan the data source for known data issues.
  • Apply logic to fix any data issues found.
  • Feed the data to its destination once all corrections have been made.
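
A minimal sketch of that scan-fix-feed flow in Python with pandas; the issue rule, the fix logic and the in-memory destination are all hypothetical:

```python
import pandas as pd

def scan(df: pd.DataFrame) -> pd.Series:
    """Scan the source for known data issues (here: blank or missing emails)."""
    return df["email"].isna() | df["email"].str.strip().eq("")

def fix(df: pd.DataFrame) -> pd.DataFrame:
    """Apply logic to fix the issues found (here: a hypothetical placeholder rule)."""
    issues = scan(df)
    df = df.copy()
    df.loc[issues, "email"] = "unknown@placeholder.invalid"
    return df

def feed(df: pd.DataFrame, destination: list) -> None:
    """Feed the data to its destination once all corrections have been made."""
    assert not scan(df).any(), "refuse to feed records that still have issues"
    destination.extend(df.to_dict("records"))

destination: list = []
source = pd.DataFrame({"customer_id": [1, 2], "email": ["a@example.com", "  "]})
feed(fix(source), destination)
print(destination)
```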

Problems that may occur with this:

  • New errors – how do you handle them? Do you let them occur, fix them, and add logic so they are caught in the future?
  • This leads to fixes being required in the destination system, which in turn leads to more downstream fixing of data.
  • You can't control data with errors coming in; you need to report on it and validate the updates that are required.

2. Batch processing – In this scenario, there is a delay in feeding the data, as it has to be fully available before it can be loaded into the destination system.

As with the automated process, there is some level of automation, but there is more control around when the data is provided, and it can be paused or rerun. Some of the steps that can be implemented include:

  • Scan the data and provide a report on its quality; fix the data if errors are found, then upload.
  • Allow the data to load, and then, using a report, fix it in a downstream system.
  • Work with the providers of the data to improve the quality of the data received.
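
As a rough sketch of the first option, here is how a scan-and-report step might gate the load, assuming a hypothetical error threshold that decides whether the batch loads or is paused for fixing:

```python
import pandas as pd

ERROR_THRESHOLD = 0.05  # hypothetical: pause the batch if more than 5% of rows fail

def quality_report(df: pd.DataFrame) -> pd.DataFrame:
    """Report each rule's failure rate so the batch can be fixed or rerun."""
    checks = {
        "missing_dob": df["dob"].isna(),
        "negative_balance": df["balance"] < 0,
    }
    return pd.DataFrame(
        {"failed_rows": {k: int(v.sum()) for k, v in checks.items()},
         "failure_rate": {k: v.mean() for k, v in checks.items()}}
    )

batch = pd.DataFrame({"dob": ["1980-01-01", None], "balance": [100.0, -5.0]})
report = quality_report(batch)
print(report)

if (report["failure_rate"] > ERROR_THRESHOLD).any():
    print("Batch paused: fix the data or rerun before loading.")
else:
    print("Batch loaded.")
```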

Scenarios where data profiling can be applied

  • Completeness – Does the data available represent a complete picture of the data that should be present?
    • Scenario example: Date of birth (DOB) not populated.
    • Impact: Cannot be used as part of security checks when discussing the customer, and values that depend on the DOB may be miscalculated.
  • Conformity – Is the data conforming to the correct structure as would be expected when you observe it?
    • Scenario example: Email address incorrect.
    • Impact: Emails to customers bounce back and need follow-up to correct; the customer does not get proper communication.
  • Consistency – If you have the same data in two different systems, are they the same values?
    • Scenario example: Data stored on different systems needs to be exactly the same.
    • Impact: The customer could be sent different versions of the same data.
  • Accuracy – There will be a need to ensure that the data present is accurate, as inaccurate data could have a knock-on effect.
    • Scenario example: Inaccurate data means incorrect decisions.
    • Impact: Sending out communications to the wrong set of customers, who don't expect or need the information.
  • Uniqueness – If there are properties of data that are unique, does the data set show that?
    • Scenario example: The same data is populated for different sets of independent customers.
    • Impact: No visibility of the customer and their actual correct data, and incorrect information processed for them. Financial and reputational risk could also be a problem.

What does a data analyst do?

You are probably sitting there hearing about big data and databases, data analytics and machine learning, and wondering where a data analyst fits in.

Here we will look to break it down step by step.

Sometimes a data analyst can be confused with a business analyst; there are subtle differences:

  • Business Analyst: Their role is to capture the user’s requirements in a document that describes what the user wants.
    • In this case, a document that all parties can agree to is created, and it can be used as part of the project sign off.
  • Data Analyst: On the other hand, a data analyst will take the business requirements and translate them into data deliverables.
    • They use the document to ensure the project has the right data to meet the project objectives in the right place at the right time.

Data Mapping

In different data projects there will be a need to reconcile the data between systems; a data analyst will help here.

In a data mapping exercise, the data analyst will be expected to look at one or more sources and map them to a destination system.

  • This ensures a match between the two datasets.
  • It enables the two systems to be reconciled.
  • It allows data to be used across multiple systems, knowing consistency is in place.
  • It keeps data types consistent between the systems.
  • It keeps data validation errors to a minimum.

Often a Data Analyst will build a traceability matrix, which tracks the data item from creation through to consumption.
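
As a rough illustration, a mapping-and-reconciliation step might look like this in Python with pandas; the column mapping and the data are hypothetical:

```python
import pandas as pd

# Hypothetical mapping from source field names to destination field names.
COLUMN_MAP = {"cust_id": "customer_id", "mail": "email"}

source = pd.DataFrame({"cust_id": [1, 2], "mail": ["a@example.com", "b@example.com"]})
destination = pd.DataFrame({"customer_id": [1, 2], "email": ["a@example.com", "x@example.com"]})

# Apply the mapping so both sides share the destination's column names.
mapped = source.rename(columns=COLUMN_MAP)

# Reconcile: join on the key and flag rows where the mapped values disagree.
merged = mapped.merge(destination, on="customer_id", suffixes=("_src", "_dst"))
mismatches = merged[merged["email_src"] != merged["email_dst"]]
print(mismatches)
```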

Data Quality

In most companies, there will be teams (depending on their size) dedicated to this, and their input will be pivotal to existing and future data use.

It is an important task that could impact internal and external reporting and a company’s ability to make decisions accurately.

Some of the areas that might be looked at include:

(A) Investigate duplicate data – There could be a number of reasons this has to be checked:

  • Data manually entered multiple times.
  • An automated process ran multiple times.
  • A change to an IT system has unknowingly duplicated data.
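
A short sketch of how such duplicates might be surfaced with pandas; the key column and data are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "loaded_at": ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02"],
})

# Exact duplicates across all columns: often a sign a process ran twice.
exact_dupes = df[df.duplicated(keep=False)]

# Duplicates on the business key only: often a sign of manual double entry.
key_dupes = df[df.duplicated(subset=["customer_id"], keep=False)]

print(exact_dupes)
print(key_dupes)
```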

(B) Finding errors – This could be completed in conjunction with data reporting outlined below.

  • Normally companies will have clearly defined rules that pick up unexpected data errors.
  • A data analyst will analyse why these errors are occurring.

(C) Checking for missing data.

  • Data feeds may have failed; a request to reload the data will be required.
  • Data may not have been requested as part of the business requirements; confirm that this is the case.

(D) Enhancing the data with additional information – Is there additional information that can be added that can enrich the dataset?

(E) Checking data is in the correct format – There are scenarios where this can go wrong; an example is a date field populated with text.
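
For the date-field example, here is a small sketch using pandas' tolerant parsing to flag text values in a date column; the column name and data are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"start_date": ["2024-01-31", "not a date", "2024-02-29"]})

# errors="coerce" turns unparseable values into NaT instead of raising,
# so badly formatted rows can be reported rather than crashing the load.
parsed = pd.to_datetime(df["start_date"], errors="coerce")
bad_format = df[parsed.isna() & df["start_date"].notna()]
print(bad_format)
```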

Data Reporting

In some of the areas above, we touched on the importance of the quality of data.

Ultimately there may be a need to track:

  • Data Quality – Build reports to capture the quality of data based on predefined business measurements.
  • Real-time Reporting – for example, new customers or customers who have left the organisation.
  • Track Targets – Is the target set by the business being met daily, weekly, monthly?
  • Management Reporting – Build reports that provide input to management packs that provide an overview of how the business performs.

Data Testing

Organisations go through change projects where new data is being introduced or enhanced.

As a result, the data analyst will have a number of tasks to complete:

  • Write Test Scripts – Write all scripts for record counts, transformations and table-to-table comparisons.
  • Datatype Validation – Ensure all new data has the same data types as the existing data where it is stored.
  • No loss of data – Check all data is imported correctly with no data truncated.
  • Record count – Write an SQL script that completes a source-to-destination reconciliation.
  • Data Transformation – Ensure any transformations are applied correctly.
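
The record-count check is described above as an SQL script; for consistency with the earlier sketches, here is a rough pandas equivalent of these tests, with hypothetical extracts standing in for the source and destination tables:

```python
import pandas as pd

source = pd.DataFrame({"id": [1, 2, 3], "name": ["Ada", "Grace", "Alan"]})
loaded = pd.DataFrame({"id": [1, 2, 3], "name": ["Ada", "Grace", "Alan"]})

# Record count: source-to-destination reconciliation.
assert len(source) == len(loaded), "row counts differ between source and destination"

# Datatype validation: loaded columns should match the source's types.
assert (source.dtypes == loaded.dtypes).all(), "data types changed during the load"

# No loss of data: no value should be truncated (compare string lengths).
assert (source["name"].str.len() == loaded["name"].str.len()).all(), "text was truncated"

print("All checks passed.")
```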

Supporting data projects

Ad hoc projects are common, and sometimes become a priority for businesses as they deal with requirements arising from an immediate business need.

Data Analysts will be called upon to support projects where there is a need to ensure the data required is of a standard that meets the project deliverables.

Some common areas where this might occur include:

  • Extract data where it has been found to have been corrupted.
  • Investigate data changes, to analyse where a data breach may have occurred.
  • An external regulatory body has requested information to back up some reports submitted.
  • A customer has requested all the company’s information on them; usually the case for a GDPR request.