What is data profiling and its benefits?

Estimated reading time: 4 minutes

Data profiling is the process of generating statistics on a data set so that anyone reading those metrics can understand the quality of that data.

Usually this is one of the many functions of a data analyst.

Many organisations have data quality issues, and the ability to identify and fix them proactively helps to avoid many customer and operational problems.

As a result, it can help to identify errors in data that may:

  • Feed into reports.
  • Reduce the effectiveness of machine learning outputs.
  • Have a regulatory impact on reports submitted and how their effectiveness is measured.
  • Irritate customers who receive communications containing incorrect data.
  • Cause batch processes to fail, reducing the effectiveness of automated tasks.

To understand how to implement an effective data profiling process, it is essential to identify where the issues may occur:

  • Data entered manually by a human.
  • Imported data that has not been cleansed.
  • Third-party systems feeding you data that contains errors.
  • Company takeovers, where integrated data contains errors.

The amount of data now collected and stored in big data systems needs a process to manage and capture errors.

So what are the different ways to profile data?

To ensure a high level of data quality, you would look at some of the following techniques (a short pandas sketch after this list shows how a few of these checks look in practice):

  • Completeness – Does the data available represent a complete picture of the data that should be present?
  • Conformity – Does the data conform to the structure you would expect when you observe it?
  • Consistency – If you have the same data in two different systems, are the values the same?
  • Accuracy – The data present needs to be accurate; decisions made on the back of inaccurate data may be incorrect, which can have knock-on effects.
  • Uniqueness – If there are properties of the data that should be unique, does the data set reflect that?
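
As a rough illustration, here is a minimal pandas sketch of a few of these checks (completeness, conformity and uniqueness). The customer data, column names and email pattern are made up for the example rather than taken from any particular system.

import pandas as pd

# Hypothetical customer extract, purely for illustration.
customers = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "email": ["a@example.com", "not-an-email", None, "d@example.com"],
    "dob": ["1980-01-01", "1990-05-17", "1990-05-17", None],
})

# Completeness: percentage of populated values per column.
completeness = customers.notna().mean() * 100

# Conformity: do email values match a simple expected structure?
conforming = customers["email"].str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", na=False)

# Uniqueness: are customer_id values unique, as the business expects?
duplicate_ids = customers["customer_id"].duplicated().sum()

print(completeness)
print("Non-conforming emails:", (~conforming).sum())
print("Duplicate customer ids:", duplicate_ids)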

When should data profiling take place?

This will depend on the organisation and the process that relies on it.

We will outline some different scenarios that may influence how to approach this.

1. Straight-through processing – If you are looking to automate, there will be a need to ensure that no automated process fails.

As a result, there will be a need to check the data before it feeds a new system. Some steps that could be implemented include (see the sketch after this list):

  • Scan the data source for known data issues.
  • Apply logic to fix any data issues found.
  • Feed the data to its destination once all corrections have been made.
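
As a loose sketch of those three steps, the outline below scans a source extract for known issues, applies simple fix logic and then feeds the corrected data to its destination. The file names, columns and fix rules are assumptions for illustration only.

import pandas as pd

def scan_for_known_issues(df):
    # Scan the source for known data issues and return a mask of affected rows.
    missing_dob = df["dob"].isna()
    bad_email = ~df["email"].str.contains("@", na=False)
    return missing_dob | bad_email

def apply_fixes(df):
    # Apply simple fix logic for the issues found.
    fixed = df.copy()
    fixed["email"] = fixed["email"].str.strip().str.lower()
    fixed["dob"] = fixed["dob"].fillna("UNKNOWN")  # placeholder rule, illustration only
    return fixed

source = pd.read_csv("source_extract.csv")           # assumed source file
print(f"{scan_for_known_issues(source).sum()} rows flagged before fixing")

cleaned = apply_fixes(source)
cleaned.to_csv("destination_feed.csv", index=False)  # feed the destination once corrected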

Problems that may occur with this:

  • New errors – how do you handle them? Do you let them occur, fix them, and add logic so they are caught in the future?
  • This leads to fixes being required in the destination system, which leads to more downstream fixing of data.
  • You can't control data with errors coming in; you need to report on it and validate the updates that are required.

2. Batch processing – In this scenario, there is a delay in feeding the data, as the data has to be available to feed into the destination system.

As with the automated process, there is some level of automation, but there is more control around when the data is provided, and it can be paused or rerun. Some of the steps that can be implemented include:

  • Scan the data and provide a report on its quality. Fix the data if errors are found, then upload.
  • Allow the data to load, and then, using a report, fix it in a downstream system.
  • Work with the providers of the data to improve the quality of the data received.

Scenarios where data profiling can be applied

  • Completeness – Does the data available represent a complete picture of the data that should be present?
    • Scenario example: Is the DOB populated?
    • Impact: The DOB cannot be used as part of security checks when discussing the customer, and values that depend on it may be miscalculated.
  • Conformity – Is the data conforming to the correct structure as would be expected when you observe it?
    • Scenario example: An email address is incorrect.
    • Impact: Emails to the customer bounce back and need follow-up to correct; the customer does not get proper communication.
  • Consistency – If you have the same data in two different systems, are the values the same?
    • Scenario example: Data stored on different systems needs to be exactly the same.
    • Impact: The customer could be sent different versions of the same data.
  • Accuracy – There is a need to ensure that the data present is accurate; decisions made on the back of it could otherwise be incorrect, with knock-on effects.
    • Scenario example: Inaccurate data leads to incorrect decisions.
    • Impact: Communications are sent to the wrong set of customers, who do not expect or need the information.
  • Uniqueness – If there are properties of the data that should be unique, does the data set reflect that?
    • Scenario example: The same data is populated for different sets of independent customers.
    • Impact: No visibility of the customer and their actual correct data, and incorrect information is processed for them. There is also a financial and reputational risk.

What does a data analyst do?

Estimated reading time: 4 minutes

Livestream #2 – What does a data analyst do?

You are probably sitting there hearing about big data and databases, data analytics and machine learning, and wondering where a data analyst fits in.

Here we will look to break it down step by step.

Sometimes a data analyst can be confused with a business analyst; there are subtle differences:

  • Business Analyst: Their role is to capture the user’s requirements in a document that describes what the user wants.
    • In this case, a document that all parties can agree to is created, and it can be used as part of the project sign off.
  • Data Analyst: On the other hand, a data analyst will take the business requirements and translate them into data deliverables.
    • They use the document to ensure the project has the right data to meet the project objectives in the right place at the right time.

Data Mapping

In different data projects there will be a need to reconcile the data between systems, and a data analyst will help here.

In a data mapping exercise, the data analyst will be expected to look at one or more sources and map them to a destination system.

  • This ensures a match between the two datasets.
  • It results in the ability to reconcile the two systems.
  • It allows data to be used in multiple systems, knowing the consistency is in place.
  • It keeps data types consistent between the systems.
  • It ensures that data validation errors are kept to a minimum.

Often a Data Analyst will build a traceability matrix, which tracks the data item from creation through to consumption.
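
As a rough illustration of a mapping and reconciliation exercise, the sketch below compares a source and a destination extract on a shared key using pandas. The keys, columns and values are invented for the example.

import pandas as pd

# Hypothetical source and destination extracts keyed on customer_id.
source = pd.DataFrame({"customer_id": [1, 2, 3], "surname": ["Smith", "Jones", "Byrne"]})
destination = pd.DataFrame({"customer_id": [1, 2, 4], "surname": ["Smith", "JONES", "Kelly"]})

# An outer merge with an indicator shows records present on only one side.
recon = source.merge(destination, on="customer_id", how="outer",
                     suffixes=("_src", "_dest"), indicator=True)

missing = recon[recon["_merge"] != "both"]
mismatched = recon[(recon["_merge"] == "both") &
                   (recon["surname_src"] != recon["surname_dest"])]

print(missing)      # records that do not reconcile between the systems
print(mismatched)   # records that exist in both but hold different values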

Data Quality

In most companies, there will be teams (depending on their size) dedicated to this, and their input will be pivotal to existing and future data use.

It is an important task that could impact internal and external reporting and a company’s ability to make decisions accurately.

Some of the areas that might be looked at include:

(A) Investigate duplicate data – There could be a number of reasons this has to be checked:

  • Data manually entered multiple times.
  • An automated process ran multiple times.
  • A change to an IT system has unknowingly duplicated data.

(B) Finding errors – This could be completed in conjunction with data reporting outlined below.

  • Normally, companies will have clear rules that pick up data errors that are not expected.
  • A data analyst will analyse why these errors are occurring.

(C) Checking for missing data.

  • Data feeds have failed, and a request to reload the data will be required.
  • Data that was not requested as part of the business requirements – confirm that this is the case.

(D) Enhancing the data with additional information – Is there additional information that can be added that can enrich the dataset?

(E) Checking data is in the correct format – There are scenarios where this can go wrong; an example is a date field populated with text.
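
To make areas (A), (C) and (E) more concrete, here is a minimal pandas sketch that flags duplicates, missing values and a date field populated with text. The dataset and column names are made up for illustration.

import pandas as pd

# Assumed extract containing the kinds of issues described above.
data = pd.DataFrame({
    "order_id": [100, 101, 101, 102],
    "order_date": ["2021-01-05", "2021-01-06", "2021-01-06", "not a date"],
    "amount": [25.0, None, 40.0, 10.0],
})

# (A) Duplicate data, e.g. the same order loaded twice.
duplicates = data[data.duplicated(subset="order_id", keep=False)]

# (C) Missing data that may point to a failed feed.
missing_amounts = data["amount"].isna().sum()

# (E) Correct format: a date column populated with text fails to parse.
parsed_dates = pd.to_datetime(data["order_date"], errors="coerce")
bad_dates = data[parsed_dates.isna()]

print(duplicates)
print("Missing amounts:", missing_amounts)
print(bad_dates)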

Data Reporting

In some of the areas above, we touched on the importance of the quality of data.

Ultimately there may be a need to track:

  • Data Quality – Build reports to capture the quality of data based on predefined business measurements.
  • Real-time Reporting – For example, the number of new customers or customers who have left an organisation.
  • Track Targets – Is the target set by the business being met daily, weekly, monthly?
  • Management Reporting – Build reports that provide input to management packs that provide an overview of how the business performs.

Data Testing

Organisations go through change projects where new data is being introduced or enhanced.

As a result, the data analyst will have a number of tasks to complete (a short sketch after the list shows a few of them):

  • Write Test Scripts – Write all scripts for record counts, transformations and table-to-table comparisons.
  • Datatype Validation – Ensure all new data has the same data types as the existing data where it is stored.
  • No loss of data – Check all data is imported correctly with no data truncated.
  • Record count – Write an SQL script that completes a source-to-destination reconciliation.
  • Data Transformation – Ensure any transformations are applied correctly.
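
The article refers to SQL scripts for the record-count reconciliation; purely to keep the examples in one language, here is a loose pandas equivalent of a few of these tests. The file names and columns are assumptions.

import pandas as pd

source = pd.read_csv("source_table.csv")             # assumed extract of the source table
destination = pd.read_csv("destination_table.csv")   # assumed extract of the destination table

# Record count: a simple source-to-destination reconciliation.
if len(source) != len(destination):
    print(f"Row counts differ: source={len(source)}, destination={len(destination)}")

# Datatype validation: new data should carry the same types as the destination.
for column in source.columns.intersection(destination.columns):
    if source[column].dtype != destination[column].dtype:
        print(f"Datatype mismatch on {column}: "
              f"{source[column].dtype} vs {destination[column].dtype}")

# No loss of data: check text columns have not been truncated on load.
for column in source.select_dtypes(include="object").columns.intersection(destination.columns):
    if destination[column].dtype == "object":
        if source[column].str.len().max() > destination[column].str.len().max():
            print(f"Possible truncation detected in {column}")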

Supporting data projects

Ad hoc projects are common, and sometimes become a priority for businesses as they deal with requirements that arise from an immediate business need.

Data Analysts will be called upon to support projects where there is a need to ensure the data required is of a standard that meets the project deliverables.

Some common areas where this might occur include:

  • Extract data where it has been found to have been corrupted.
  • Investigate data changes, to analyse where a data breach may have occurred.
  • An external regulatory body has requested information to back up some reports submitted.
  • A customer has requested all the company’s information on them; usually the case for a GDPR request.

Tkinter GUI tutorial python – how to clean excel data

Estimated reading time: 2 minutes

Tkinter is a package within Python that allows users to create GUIs, or graphical user interfaces, to manage data in a more user-friendly way.

We are building our data analytics capability here and looking to provide the user with functionality they can use in their work or college projects.

We have tested this code on over 100,000 records sitting on the Microsoft OneDrive network, and its speed was quite good.

Across five tests, each run completed in under 100 seconds from start to finish.


In this Tkinter GUI tutorial python, you will be shown how to find the data errors, clean them and then export the final result to Excel.

We will take you through the following:

  • Creation of the Tkinter interface.
  • Methods/ functions to find errors.
  • Methods/functions to clean the data.
  • Exporting the clean data to an excel file.

 

To sum up:

The video walks through the creation of a Tkinter window using a canvas and a frame to store the data frame.

Then it looks at importing the data through pd.read_excel, to load the data into a pandas data frame.

Next, there is a function/method that extracts the errors through str.extract, and the results are loaded into separate columns.

Finally, I have exported the clean dataset using rawdata.to_excel and saved the file as a separate new spreadsheet.
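
Leaving the Tkinter window itself aside, a minimal sketch of that pandas pipeline might look like the following. The file names, the amount column and the extraction pattern are assumptions rather than the exact code from the video.

import pandas as pd

# Load the spreadsheet into a pandas data frame with read_excel.
rawdata = pd.read_excel("input_data.xlsx")            # assumed input file

# Use str.extract to pull any stray letters out of a numeric column,
# loading whatever is found into a separate column for review.
rawdata["text_found"] = rawdata["amount"].astype(str).str.extract(r"([A-Za-z]+)", expand=False)

# Rows where letters were extracted are treated as errors and removed here.
clean = rawdata[rawdata["text_found"].isna()].drop(columns="text_found")

# Export the cleaned dataset to a separate new spreadsheet with to_excel.
clean.to_excel("cleaned_data.xlsx", index=False)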

How to create an instance of a class

Estimated reading time: 1 minute

Following on from how to create a class in Python, here we will further explore the instance of a class and how it can be used within a program to assign values to an object. This allows that object to inherit the values contained within the class, making it easier to have consistency in functionality and data.

This video covers (a short sketch at the end of this section illustrates these steps):

(A) Creating an instance of a class.

(B) Using the __init__ method within the class.

(C) Defining the constructor method __init__.

(D) Creating an object that calls a class and uses the class to process some piece of data.

What are the benefits of this?

  • You only need to create one class that holds all the attributes required.
  • That class can be called from anywhere within a program, once an instance of it is created.
  • You can update the class, and once completed, those new values will become available to an instance of that class.
  • It makes for better management of objects and their properties, rather than having multiple different versions contained within a program.
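
A minimal sketch of these ideas, with a made-up Customer class, might look like this:

class Customer:
    # The constructor method __init__ assigns values to each new object.
    def __init__(self, name, balance):
        self.name = name
        self.balance = balance

    def apply_interest(self, rate):
        # Use the values held on the instance to process a piece of data.
        self.balance = round(self.balance * (1 + rate), 2)
        return self.balance


# Creating an instance of the class: the object picks up the class's behaviour.
customer = Customer("Joe Bloggs", 1000.0)
print(customer.apply_interest(0.05))   # 1050.0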

 

 

How to create a class in Python

Estimated reading time: 1 minute

How to create a class in Python: in this video, the main topic will be how classes are constructed and how to create an instance of a class.

When talking about classes, they can also be referred to as object-orientated programming.

Also, we look at what class attributes are and how they can be used to assign key data that can be called anywhere within a program.

The steps involve the following (a short sketch after the list shows them in code):

(A) Create a class.

(B) Assign attributes to the class.

(C) Create a method within the class (similar to a function).

(D) Create an instance of the class to call its attributes and methods.
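
A short sketch of those four steps, using an invented ReportSettings class, might look like this:

class ReportSettings:
    # (B) Class attributes: key data available anywhere the class is used.
    company_name = "Example Ltd"
    currency = "EUR"

    # (C) A method within the class (similar to a function).
    def format_total(self, total):
        return f"{self.company_name}: {total} {self.currency}"


# (D) Create an instance of the class to call its attributes and methods.
settings = ReportSettings()
print(settings.company_name)          # class attribute accessed via the instance
print(settings.format_total(2500))    # Example Ltd: 2500 EUR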

This video is a follow-on from object-oriented programming – Python Classes explained.

Python tutorial: Pandas groupby ( Video 1)

In this first video about pandas groupby, as part of expanding the data analytics information on this website, we explain how you can use a groupby selection to sort your data into similar datasets so they can be better analysed. In the video below, we import our data into a dataframe and then group as follows (a short sketch after the list shows each approach):

  • Directly naming the column
  • Through get_group
  • Using a loop
  • Utilising a lambda function
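
A minimal sketch of those four approaches, using a made-up sales dataset, might look like this:

import pandas as pd

# Assumed dataset imported into a dataframe.
df = pd.DataFrame({
    "region": ["North", "South", "North", "South", "East"],
    "sales":  [100, 200, 150, 50, 75],
})

grouped = df.groupby("region")

# 1. Directly naming the column to aggregate.
print(grouped["sales"].sum())

# 2. Through get_group, to pull back the rows for one group.
print(grouped.get_group("North"))

# 3. Using a loop over the groups.
for name, group in grouped:
    print(name, group["sales"].mean())

# 4. Utilising a lambda function, here to keep only groups whose total sales exceed 150.
print(df.groupby("region").filter(lambda g: g["sales"].sum() > 150))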

 

 

Regular expressions python

Estimated reading time: 3 minutes

Regular expressions explained

Regular expressions are a set of characters usually in a particular sequence that helps find a match/pattern for a specific piece of data in a dataset.

The purpose is to allow a uniform set of characters that can be reused multiple times, based on the requirements of the user, without having to build the pattern each time.

The patterns are similar to those that you would find in Perl.

How are regular expressions built?

To start, in regular expressions, there are metacharacters, which are characters that have a special meaning. Their values are as follows (a few Python examples follow these definitions):

. ^ $ * + ? { } [ ] \ | ( )

.e = Finds all occurrences of "e" together with the one character before it, because . matches any single character. There can be multiple dots, e.g. ..e checks the two characters before the e.

^ = Checks if a string starts with a particular pattern.

* = Matches zero or more occurrences of the preceding character or pattern.

+ = Matches one or more occurrences of the preceding character or pattern; if there is no match at all, nothing is returned.

? = Makes the character before the ? optional, so the pattern matches whether or not that character is present.

For example, t?e is the search pattern and "The" is the string. The result will return only the value e, but if the string is "te", then it will return te, as the letters are directly beside each other.

da{2} = Checks whether a character is followed by a set number of other characters, e.g. whether "d" is followed by exactly two "a" characters.

[abc] = These are the characters you are looking for in the data. [a-c] would give you the same result; change to uppercase to match only uppercase characters.

\ = The backslash escapes metacharacters so that they can be found in a string as literal values, e.g. escaping $ so it can be found as a literal value.

| = Used when you want an "or" in the logic, i.e. check for one or more values from a pattern; either or both can be present.

() = Groups pattern searches or a partial match, to see whether the grouped parts appear together or not.
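
A few short Python examples of these metacharacters, using the re module and a made-up string:

import re

text = "The theme tends to cost $10"

# . matches any single character, so ".e" returns each "e" plus the character before it.
print(re.findall(r".e", text))            # ['he', 'he', 'me', 'te']

# ^ anchors a pattern to the start of the string.
print(re.findall(r"^The", text))          # ['The']

# * allows zero or more of the preceding character ("cost" or "cst" would both match).
print(re.findall(r"co*st", text))         # ['cost']

# ? makes the preceding character optional: "e" alone, or "te" when the letters are adjacent.
print(re.findall(r"t?e", text))           # ['e', 'e', 'e', 'te']

# \ escapes a metacharacter so it can be found as a literal value.
print(re.findall(r"\$", text))            # ['$']

# | acts as an "or" between patterns, and () groups parts of a match.
print(re.findall(r"tends|cost", text))    # ['tends', 'cost']
print(re.findall(r"(t)(ends)", text))     # [('t', 'ends')]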

 

Special sequences, making it easier again

\A = Matches if the specified characters are at the start of the string being searched.

\b = Matches if the specified characters are at the beginning or the end of a word.

\B = Matches if the specified characters are NOT at the beginning or the end of a word.

\d = Matches any digit 0-9.

\D = Matches any character that is not a digit.

\s = Matches where a string contains a whitespace character.

\S = Matches where a string contains a non-whitespace character.

\w = Matches word characters: letters, digits or the underscore (_).

\W = Matches any character that is not a letter, digit or underscore.

\Z = Matches if the specified characters are at the end of the string.
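
And a few of the special sequences in action, again with a made-up string:

import re

text = "Order 42 shipped to Dublin 2"

# \A and \Z anchor a pattern to the start and end of the whole string.
print(re.findall(r"\AOrder", text))       # ['Order']
print(re.findall(r"2\Z", text))           # ['2']

# \d matches digits, \D everything that is not a digit.
print(re.findall(r"\d+", text))           # ['42', '2']

# \b marks a word boundary, so this finds "to" only as a whole word.
print(re.findall(r"\bto\b", text))        # ['to']

# \w matches word characters and \s whitespace, so this finds each word followed by a space.
print(re.findall(r"\w+\s", text))         # ['Order ', '42 ', 'shipped ', 'to ', 'Dublin ']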

 

 

For further references and reading materials, please see the websites below; the last one is really useful for testing any regular expressions you would like to build:

See further reading material here: regular expression RE explained

Another complementary page to the link above: regular expression REGEX explained

I found this link on the internet and would thoroughly recommend you bookmark it. It will also allow you to play around with regular expressions and test them before you put them into your code; a highly recommended resource: Testing regular expressions

 

What are the reserved keywords in Python

What are Python reserved keywords?

When coding in the Python language, there are particular Python reserved words that the system uses, which cannot be used as variable or function names because the language uses them to perform specific tasks.

When you try to use them, the system will block it and throw an error. Running the below code in Python

import keyword
keywordlist = keyword.kwlist
print(keywordlist)

Produces the below keyword values
['False', 'None', 'True', 'and', 'as', 'assert', 'async', 'await', 'break', 'class', 'continue', 'def', 'del',
'elif', 'else', 'except', 'finally', 'for', 'from', 'global', 'if', 'import', 'in', 'is', 'lambda', 'nonlocal',
'not', 'or', 'pass', 'raise', 'return', 'try', 'while', 'with', 'yield']
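
For example, a statement such as class = 5 is rejected by the interpreter with a SyntaxError before the program even runs. The keyword module can also be used to check a name programmatically:

import keyword

print(keyword.iskeyword("class"))    # True  - reserved, cannot be used as a variable name
print(keyword.iskeyword("classes"))  # False - safe to use as a variable name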

When writing your code, it is important to follow the following guidelines:

(A) Research the keywords first for the language you are writing in.

(B) Ensure that your editor or IDE highlights keywords when they are used, so you can fix the issue.

(C) Set up your program in debug mode to highlight keyword use.

With some programs running into thousands of lines of code, with additional functions and variables, it can become harder to spot the problem, so good rigour in the initial stages of coding will help down the road with any issues that you may find need to be fixed.

This code was run in Python version 3.8

Python tutorial: Create an input box in Tkinter

Using a tkinter input box for your data projects

There may be occasions, as you are building out a data science or data analytics project, when checks need to be performed on the dataset, such as:

  • Big data sets with speed requirements, in conjunction with
  • The need to reduce the volume of data returned, which is impeding performance,

and this is where input boxes and Tkinter can help!

In the below video, we are demonstrating an introduction to using an input box and validating the input.

We demonstrate how to validate the data entered into the tkinter input box and return a message; this will ensure the user gets the correct data.

Types of uses for a tkinter input box are varied; here are some thoughts (a minimal sketch follows the list):

  • Use an input box to return a set of data for a particular day.
  • Using them to filter down the results to a particular cohort of data.
  • Conduct a string search to find data quality issues to be fixed.
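
As a minimal sketch of the first two ideas, the example below builds a small window with an entry box, validates what the user types and returns a message. The labels and the numeric check are assumptions for illustration.

import tkinter as tk

def check_input():
    # Validate the text entered in the input box and return a message to the user.
    value = entry.get().strip()
    if value.isdigit():
        message.config(text=f"Looking up records for customer id {value}")
    else:
        message.config(text="Please enter a numeric customer id")

root = tk.Tk()
root.title("Customer lookup")

tk.Label(root, text="Customer id:").pack(padx=10, pady=5)
entry = tk.Entry(root, width=30)             # the tkinter input box
entry.pack(padx=10, pady=5)

tk.Button(root, text="Search", command=check_input).pack(pady=5)
message = tk.Label(root, text="")            # feedback returned to the user
message.pack(padx=10, pady=5)

root.mainloop()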

Python tutorial: How to create a graphical user interface in Tkinter

How would you like to present your data analytics work better?

When starting your data analytics projects, one of the critical considerations is how to present your results quickly and understandably.

Undoubtedly this is true if you are only going to look at the results yourself.

If the work you do is a repeatable process, a more robust, longer-term solution needs to be applied; this is where Tkinter, a Python graphical user interface library, can help.

When you import tkinter, some of the functionality that can be used includes:

  • Use them to build calculators.
  • They can show graphs and bar charts.
  • Show graphics on a screen.
  • Validate user input, through building entry widgets.

Where does this all fit in with data analytics?

While going through a set of data and extracting meaning from it can be challenging, the Python graphical user interface tutorial below can help build the screens that allow a repeatable process to display results in a meaningful way.

Using the tkinter widgets could help achieve the following (a minimal sketch follows the list):

  • Build a screen that shows data analytics errors in a data set, e.g. The number of blank column values in a dataset.
  • Another application is to run your analytics to show the results on a screen that can be printed or exported.
  • Similarly, you could also have a screen where a user selects several parameters that are fed into the data analytics code and produces information for the user to analyse.
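
As a rough sketch of the first idea in that list, the example below counts the blank values per column in a made-up dataset and displays the result in a Tkinter window.

import tkinter as tk
import pandas as pd

# Assumed dataset; in practice this would come from your analytics code.
data = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "email": ["a@example.com", None, "c@example.com", None],
    "dob": ["1980-01-01", "1990-05-17", None, "1975-03-30"],
})

blank_counts = data.isna().sum()

root = tk.Tk()
root.title("Data quality summary")

tk.Label(root, text="Blank values per column", font=("Arial", 12, "bold")).pack(pady=5)
for column, count in blank_counts.items():
    tk.Label(root, text=f"{column}: {count}").pack(anchor="w", padx=10)

root.mainloop()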

There are many more ways that you could do this, but one of the most important things is that data analytics can be built into a windows environment using Tkinter.

These GUI applications are what users are currently used to seeing. As a result, this could help to distribute a solution across an enterprise to lots of different users.

Also, another benefit is that they will work on many different operating systems.

The only thing that needs to happen is that the user's requirements are defined; the developer then builds on those, with the data analytics code running in the background of the Tkinter program and the output presented in a user-friendly screen for review.