Data Analytics Ireland

Data Analytics and Video Tutorials

What is data profiling and its benefits?

Posted on April 30, 2021 (November 9, 2022) by admin

Data profiling is the process of producing statistics about a data set so that readers of those metrics can judge the quality of the data.

Usually, this is one of the many functions of a data analyst.

Many organisations have data quality issues, and the ability to identify and fix them proactively helps to prevent many customer and operational problems.

As a result, it can help to identify errors in data that may:

  • Feed into reports.
  • Reduce the effectiveness of machine learning outputs.
  • Have a regulatory impact on reports submitted and how their effectiveness is measured.
  • Irritate customers who receive communications containing incorrect data.
  • Cause batch processes to fail, reducing the effectiveness of automated tasks.

To implement an effective data profiling process, it is essential to identify where data issues may originate:

  • Data entry by a human.
  • Imported data that has not been cleansed.
  • Third-party systems feeding you data that has errors in it.
  • Company takeovers that integrate data with errors in it.

The amount of data now collected and stored in big data systems needs a process to capture and manage errors.

So what are the different ways to profile data?

To ensure a high level of data quality, you would look at some of the following measures:

  • Completeness – Does the data available represent a complete picture of the data that should be present?
  • Conformity – Is the data conforming to the correct structure as would be expected when you observe it?
  • Consistency – If you have the same data in two different systems, are they the same values?
  • Accuracy – There will be a need to ensure that the data present is accurate. Inaccurate data could make any decisions based on it incorrect, which could have knock-on effects.
  • Uniqueness – If there are properties of data that are unique, does the data set show that?
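
As a rough illustration, three of these measures (completeness, conformity and uniqueness) can be expressed as simple ratios. The sketch below is not a definitive implementation: the sample records, the field names (customer_id, dob, email) and the deliberately simple email pattern are all assumptions made for the example.

```python
import re

# Hypothetical sample records; the field names are assumptions for illustration.
records = [
    {"customer_id": 1, "dob": "1980-04-12", "email": "ann@example.com"},
    {"customer_id": 2, "dob": None, "email": "bob@example"},
    {"customer_id": 3, "dob": "1975-11-30", "email": "cara@example.com"},
    {"customer_id": 3, "dob": "1990-01-05", "email": "dan@example.com"},
]

# Deliberately simple pattern; real email validation is more involved.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def completeness(rows, field):
    """Share of rows where the field is populated (not None or empty)."""
    if not rows:
        return 1.0
    populated = sum(1 for row in rows if row.get(field) not in (None, ""))
    return populated / len(rows)


def conformity(rows, field, pattern):
    """Share of populated values that match the expected structure."""
    values = [row[field] for row in rows if row.get(field) not in (None, "")]
    if not values:
        return 1.0
    return sum(1 for value in values if pattern.match(value)) / len(values)


def uniqueness(rows, field):
    """Share of distinct values for a field that is meant to be unique."""
    if not rows:
        return 1.0
    values = [row.get(field) for row in rows]
    return len(set(values)) / len(values)


profile = {
    "dob_completeness": completeness(records, "dob"),
    "email_conformity": conformity(records, "email", EMAIL_PATTERN),
    "customer_id_uniqueness": uniqueness(records, "customer_id"),
}
print(profile)
```

A score of 1.0 means no issue was found for that measure; anything lower points to rows worth investigating. Accuracy is left out because it usually needs a trusted reference to compare against.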

When should data profiling take place?

This will depend on the organisation and the process that relies on it.

We will outline some different scenarios that may influence how to approach this:

1. Straight-through processing – If you are looking to automate, there will be a need to ensure that no automated process fails.

As a result, there will be a need to check the data before it feeds a new system. Some steps that could be implemented include:

  • Scan the data source for known data issues.
  • Apply logic to fix any data issues found.
  • Feed the data to its destination once all corrections have been made.
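
The three steps above could be sketched along the following lines. This is only a minimal example under stated assumptions: the country field, the KNOWN_FIXES lookup and the scan, fix and feed function names are all hypothetical, and holding back rows that cannot be corrected is one possible design choice rather than a rule.

```python
# Hypothetical known issue: country names arriving in inconsistent forms.
KNOWN_FIXES = {"IRL": "Ireland", "Eire": "Ireland", "UK": "United Kingdom"}
VALID_COUNTRIES = {"Ireland", "United Kingdom"}


def scan(rows):
    """Return the indexes of rows whose country is not a valid value."""
    return [i for i, row in enumerate(rows) if row["country"] not in VALID_COUNTRIES]


def fix(rows, flagged):
    """Apply known corrections to copies of the rows; report what is still unfixed."""
    corrected = [dict(row) for row in rows]
    unresolved = []
    for i in flagged:
        value = corrected[i]["country"]
        if value in KNOWN_FIXES:
            corrected[i]["country"] = KNOWN_FIXES[value]
        else:
            unresolved.append(i)
    return corrected, unresolved


def feed(rows, unresolved, destination):
    """Load clean rows into the destination and return the rows held back."""
    held = [rows[i] for i in unresolved]
    destination.extend(row for i, row in enumerate(rows) if i not in unresolved)
    return held


source = [
    {"id": 1, "country": "Ireland"},
    {"id": 2, "country": "IRL"},
    {"id": 3, "country": "Irland"},
]
destination = []
corrected, unresolved = fix(source, scan(source))
held_back = feed(corrected, unresolved, destination)
print(destination)
print(held_back)
```

Here the row with the unknown value "Irland" is held back for review instead of being loaded, which keeps a new type of error out of the destination system until the fixing logic has been updated.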

Problems that may occur with this:

  • New errors: how do you handle them? Do you let them occur, fix them, and then update the logic so that they are caught in the future?
  • This leads to fixes being required in the destination system, which leads to more downstream fixing of data.
  • You can't control data with errors coming in; you need to report on it and validate the updates that are required.

2. Batch processing – In this scenario, there is a delay in feeding the data, as the data must first be available before it can be fed into the destination system.

As with straight-through processing, there is some level of automation, but there is more control around when the data is provided, and the process can be paused or rerun. Some of the steps that can be implemented include:

  • Scan the data and provide a report on its quality. Fix the data if errors are found, then upload.
  • Allow the data to load, and then, using a report, fix it in a downstream system.
  • Work with the providers of the data to improve the quality of the data received.
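
The first option, reporting on quality before deciding whether to upload, might look something like the sketch below. The quality_report and ready_to_load functions, the required fields and the zero-tolerance threshold are all assumptions for illustration; a real batch job would set its own rules.

```python
def quality_report(rows, required_fields):
    """Count missing values per required field across a batch."""
    missing = {field: 0 for field in required_fields}
    for row in rows:
        for field in required_fields:
            if row.get(field) in (None, ""):
                missing[field] += 1
    return {"row_count": len(rows), "missing": missing}


def ready_to_load(report, max_missing_rate=0.0):
    """Decide whether the batch can be uploaded; the threshold is an assumption."""
    if report["row_count"] == 0:
        return False
    worst = max(report["missing"].values(), default=0)
    return worst / report["row_count"] <= max_missing_rate


# Hypothetical batch; the second row has an empty date of birth.
batch = [
    {"id": 1, "dob": "1980-04-12", "email": "ann@example.com"},
    {"id": 2, "dob": "", "email": "bob@example.com"},
]
report = quality_report(batch, ["dob", "email"])
print(report)
if ready_to_load(report):
    print("Upload the batch")
else:
    print("Pause the batch, fix the data, then rerun")
```

Because a batch can be paused and rerun, the report acts as a gate: nothing is uploaded until the missing values have been dealt with.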

Scenarios where data profiling can be applied

Measurement | Scenario example | Impact
Completeness – Does the data available represent a complete picture of the data that should be present? | DOB not populated | It can't be used as part of security checks when speaking with a customer, and values that depend on the DOB may be miscalculated.
Conformity – Is the data conforming to the correct structure as would be expected when you observe it? | Email address incorrect | Emails to customers bounce back and need follow-up to correct; the customer does not get proper communication.
Consistency – If you have the same data in two different systems, are they the same values? | Data stored on different systems needs to be exactly the same. | The customer could be sent different versions of the same data.
Accuracy – There will be a need to ensure that the data present is accurate. Inaccurate data could make any decisions based on it incorrect, which could have a knock-on effect. | Inaccurate data means incorrect decisions. | Communications are sent to the wrong set of customers, who don't expect or need the information.
Uniqueness – If there are properties of data that are unique, does the data set show that? | The same data is populated for different sets of independent customers. | There is no visibility of the customer and their actual correct data, and incorrect information is processed for them. Financial and reputational risk could also be a problem.
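
The consistency measurement in the table above can be illustrated by comparing the same customers across two systems. This is a minimal sketch, assuming two hypothetical systems (named crm and billing here) that each hold records keyed by a shared customer number; the find_inconsistencies function is made up for the example.

```python
def find_inconsistencies(system_a, system_b, fields):
    """Compare records present in both systems and list the fields that differ."""
    differences = []
    for key in sorted(system_a.keys() & system_b.keys()):
        for field in fields:
            a_value = system_a[key].get(field)
            b_value = system_b[key].get(field)
            if a_value != b_value:
                differences.append((key, field, a_value, b_value))
    return differences


# Hypothetical extracts from two systems, keyed by customer number.
crm = {
    101: {"email": "ann@example.com", "dob": "1980-04-12"},
    102: {"email": "bob@example.com", "dob": "1975-11-30"},
}
billing = {
    101: {"email": "ann@example.com", "dob": "1980-04-12"},
    102: {"email": "bob@example.org", "dob": "1975-11-30"},
    103: {"email": "cara@example.com", "dob": "1990-01-05"},
}

differences = find_inconsistencies(crm, billing, ["email", "dob"])
print(differences)
```

Customer 103 exists only in the billing extract, so this check ignores it; a record missing from one system is closer to a completeness question.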