What does a data analyst do?

Estimated reading time: 4 minutes

Livestream #2 – What does a data analyst do?

You are probably sitting there hearing about big data, databases, data analytics and machine learning, and wondering where a data analyst fits in.

Here we will look to break it down step by step.

Sometimes a data analyst can be confused with a business analyst; there are subtle differences:

  • Business Analyst: Their role is to capture the user’s requirements in a document that describes what the user wants.
    • All parties can agree to this document, and it can be used as part of the project sign-off.
  • Data Analyst: A data analyst, on the other hand, will take the business requirements and translate them into data deliverables.
    • They use the document to ensure the project has the right data, in the right place at the right time, to meet the project objectives.

Data Mapping

In different data projects there will be a need to reconcile the data between systems; a data analyst will help here.

In a data mapping exercise, the data analyst will be expected to look at one or more sources and map them to a destination system.

  • This ensures a match between the two datasets.
  • As a result, the two systems can be reconciled.
  • Data can be used across multiple systems, knowing consistency is in place.
  • Data types are consistent between the systems.
  • Data validation errors are kept to a minimum.

Often a Data Analyst will build a traceability matrix, which tracks the data item from creation through to consumption.
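As an illustration, here is a minimal pandas sketch of what a source-to-destination reconciliation can look like; the key and column names are illustrative, not from any particular project:

```python
import pandas as pd

# two illustrative extracts sharing an "id" key
source = pd.DataFrame({"id": [1, 2, 3], "amount": [100, 250, 75]})
destination = pd.DataFrame({"id": [1, 2, 3], "amount": [100, 250, 80]})

# map the two systems together on the shared key
mapped = source.merge(destination, on="id", suffixes=("_src", "_dst"))

# flag any rows where the systems do not reconcile
mapped["match"] = mapped["amount_src"] == mapped["amount_dst"]
print(mapped[~mapped["match"]])   # id 3 differs between the systems
```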

Data Quality

In most companies there will be teams dedicated to data quality (depending on the company’s size), and their input will be pivotal to existing and future data use.

It is an important task that could impact internal and external reporting and a company’s ability to make decisions accurately.

Some of the areas that might be looked at include the following, with a combined pandas sketch after item (E):

(A) Investigate duplicate data – There could be a number of reasons this has to be checked:

  • Data manually entered multiple times.
  • An automated process ran multiple times.
  • A change to an IT system has unknowingly duplicated data.

(B) Finding errors – This could be completed in conjunction with data reporting outlined below.

  • Normally, companies will have clearly defined rules that pick up unexpected data errors.
  • A data analyst will analyse why these errors are occurring.

(C) Checking for missing data.

  • Data feeds may have failed; a request to reload the data will be required.
  • Where data was not requested as part of the business requirements, confirm that this is the case.

(D) Enhancing the data with additional information – Is there additional information that can be added to enrich the dataset?

(E) Checking data is in the correct format – There are scenarios where this can go wrong; an example is a date field populated with text.
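Here is the combined sketch mentioned above, covering checks (A), (C) and (E) in pandas; the data frame and its values are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"customer": ["Anne", "Anne", "Carl"],
                   "dob": ["1990-01-01", "1990-01-01", "not a date"]})

# (A) investigate duplicate data
print(df[df.duplicated()])

# (C) check for missing data
print(df[df.isnull().any(axis=1)])

# (E) check data is in the correct format, e.g. a date field holding text
dates = pd.to_datetime(df["dob"], errors="coerce")
print(df[dates.isnull()])   # rows where the date failed to parse
```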

Data Reporting

In some of the areas above, we touched on the importance of the quality of data.

Ultimately there may be a need to track:

  • Data Quality – Build reports to capture the quality of data based on predefined business measurements.
  • Real-time Reporting – For example, the number of new customers, or customers who have left an organisation.
  • Track Targets – Has the target set by the business been met daily, weekly or monthly? (See the sketch after this list.)
  • Management Reporting – Build reports that feed into management packs giving an overview of how the business is performing.
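Here is the target-tracking sketch referenced above, assuming a simple sales extract with made-up dates, amounts and target:

```python
import pandas as pd

sales = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-04", "2021-01-05", "2021-01-11"]),
    "amount": [500, 300, 900],
})

# roll the daily figures up to weekly totals
weekly = sales.resample("W", on="date")["amount"].sum().to_frame()

# flag whether the weekly target was met
target = 700
weekly["target_met"] = weekly["amount"] >= target
print(weekly)
```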

Data Testing

Organisations go through change projects where new data is being introduced or existing data is enhanced.

As a result, the data analyst will have a number of tasks to complete (a sketch follows the list):

  • Write Test Scripts – Write all scripts for record counts, transformations and table-to-table comparisons.
  • Datatype Validation – Ensure all new data has the same datatypes as the existing data where it is stored.
  • No Loss of Data – Check all data is imported correctly with no data truncated.
  • Record Count – Write an SQL script that completes a source-to-destination reconciliation.
  • Data Transformation – Ensure any transformations are applied correctly.
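Here is the sketch mentioned above, comparing a source and destination extract in pandas rather than SQL; the tables and columns are illustrative:

```python
import pandas as pd

source = pd.DataFrame({"id": [1, 2, 3], "name": ["Anne", "Brian", "Carl"]})
dest = pd.DataFrame({"id": [1, 2, 3], "name": ["Anne", "Brian", "Car"]})

# record count: a simple source-to-destination reconciliation
assert len(source) == len(dest), "record counts do not match"

# datatype validation: the new data is stored the same way as the old
assert (source.dtypes == dest.dtypes).all(), "datatypes do not match"

# no loss of data: check nothing was truncated on the way in
truncated = dest["name"].str.len() < source["name"].str.len()
print(dest[truncated])   # 'Car' was truncated from 'Carl'
```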

Supporting data projects

Ad hoc projects are common, and sometimes become a priority for businesses as they deal with requirements arising from an immediate business need.

Data Analysts will be called upon to support projects where there is a need to ensure the data required is of a standard that meets the project deliverables.

Some common areas where this might occur include:

  • Extract data where it has been found to have been corrupted.
  • Investigate data changes, to analyse where a data breach may have occurred.
  • An external regulatory body has requested information to back up some reports submitted.
  • A customer has requested all the information the company holds on them, usually as part of a GDPR request.

Tkinter GUI tutorial Python – how to clean Excel data

Estimated reading time: 2 minutes

Tkinter is a module within Python that allows users to create GUIs, or graphical user interfaces, to manage data in a more user-friendly way.

We are building our data analytics capability here and looking to provide the user with functionality they can use in their work or college projects.

We have tested this code on over 100,000 records sitting on the Microsoft OneDrive network, and its speeds were quite good: across five test runs, each completed in under 100 seconds from start to finish.


In this Tkinter GUI tutorial in Python, you will be shown how to find the data errors, clean them and then export the final result to Excel.

We will take you through the following:

  • Creation of the Tkinter interface.
  • Methods/functions to find errors.
  • Methods/functions to clean the data.
  • Exporting the clean data to an Excel file.

 

To sum up:

The video walks through the creation of a Tkinter window using a canvas and a frame to store the data frame.

Then it looks at importing the data through pd.read_excel, which loads the data into a pandas data frame.

Next, there is a function or method that extracts the errors through str.extract, and these are loaded into separate columns.

Finally, I have exported the clean dataset using rawdata.to_excel, saving the file as a separate new spreadsheet.
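Putting those steps together, here is a minimal sketch of the workflow; the file name "rawdata.xlsx", the "Name" column and the character set are illustrative assumptions, and a simple Text widget stands in for the canvas-and-frame layout built in the video:

```python
import tkinter as tk
import pandas as pd

# load the spreadsheet into a pandas data frame
rawdata = pd.read_excel("rawdata.xlsx")

# pull any special characters found in the Name column into their own column
rawdata["errors"] = rawdata["Name"].str.extract(r"([!£$%^&*@;:#~?<>]+)", expand=False)

# clean the data by stripping those characters out
rawdata["Name"] = rawdata["Name"].str.replace(r"[!£$%^&*@;:#~?<>]", "", regex=True)

# export the clean result to a new Excel file
rawdata.to_excel("cleandata.xlsx", index=False)

# show the cleansed data frame in a simple Tkinter window
root = tk.Tk()
root.title("Data cleansing")
frame = tk.Frame(root)
frame.pack(fill="both", expand=True)
text = tk.Text(frame)
text.insert("end", rawdata.to_string())
text.pack(fill="both", expand=True)
root.mainloop()
```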

How to remove unwanted characters from your data

Using R programming functions to cleanse data
In our recent post Python Tutorial: How do I remove unwanted characters, we walked through the concepts behind data cleansing.

We demonstrated several different approaches to data cleansing, and the use of regular expressions was also shown.

Here we look at that same approach using RStudio.

How to approach data cleansing

In removing unwanted characters, you want to ensure that you have a defined list of what should not appear and what will cause you errors. It is also essential to understand the types of data errors that can occur, as follows:

  • Data entry errors
  • Data columns that have the incorrect format, e.g. telephone numbers with non-numerical characters in them
  • Missing data that is required, e.g. null values
  • Data that does not make sense, e.g. a date of birth beyond the range of what you would typically expect to see
  • Duplicate values for the same piece of data. The problem here is that duplicates can inflate the number of data errors and not give a true count of the actual errors.

In the below video we utilise R programming code, and also use an if statement to check whether an unwanted character is in the data set before proceeding to remove it and return the cleansed data; a Python equivalent of that check is sketched below.
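For comparison, here is a minimal Python equivalent of that check-before-cleansing logic (the video itself uses R); the character list and sample values are illustrative:

```python
import pandas as pd

unwanted = r"[!£$%^&*@;:#~?<>]"
data = pd.Series(["Facebook!", "Twitter", "Linked@In"])   # illustrative data

# only cleanse if an unwanted character is actually present
if data.str.contains(unwanted, regex=True).any():
    data = data.str.replace(unwanted, "", regex=True)

print(data)
```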

Some of the strategies to help counteract data errors could include:

  • Eliminate manual inputs.
  • Put controls at the point of data entry, e.g. for dates, only allow date formats in the field.
  • Reduce duplication of the data across multiple systems; this reduces the number of places where data differences can occur.
  • If integrating different systems with the same data into one network, perform a data cleanse beforehand; this reduces the work needed afterwards to clean up the problems that arise.

 

How to data cleanse a database table

In Data Analytics, that is a very relevant question, and something I look to implement in most projects; sometimes it is too easy to click the shortcut icon to your spreadsheet application!

Here we are looking to bring a little automation into these videos. Building on How to import data from files and Removing characters from an imported CSV file, this video connects to a Microsoft Azure cloud database table, brings in the data with errors in it, fixes the errors and displays the correct output on the screen.
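As a rough sketch of that flow in Python, assuming a pyodbc connection to an Azure SQL database; the server, credentials, table and column names below are placeholders, not the details used in the video:

```python
import pyodbc
import pandas as pd

# connection details are placeholders, not real credentials
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=yourserver.database.windows.net;"
    "DATABASE=yourdb;UID=youruser;PWD=yourpassword"
)

# bring in the table, including the rows with data errors
rawdata = pd.read_sql("SELECT * FROM customers", conn)

# fix the errors: strip unwanted characters from a text column
rawdata["name"] = rawdata["name"].str.replace(r"[!£$%^&*@;:#~?<>]", "", regex=True)

# display the corrected output on the screen
print(rawdata)
```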

What can this do for organisations?

There are several benefits to automating this step:

  • Less manual intervention if there is a need to fix data issues.
  • Better productivity.
  • Better data flows with no errors and quicker reporting.

 

Moving away from files

The process of moving away from files and into automation has several steps:

  • Be clear on your data needs.
  • Understand what you are trying to achieve.
  • Build a process that is repeatable but can be updated easily.
  • Ensure that you build in data quality checks; this helps deliver better output to the users.

Thanks for stopping by!

Data Analytics Ireland

 

How to remove characters from an imported CSV file

Estimated reading time: 2 minutes

Removal of unwanted errors in your data, the easy way.
The process of importing data can take many formats, but have you been looking for a video on how to do this? Even better, are you looking for a video that shows you how to import a CSV file and then data cleanse it, effectively removing any unwanted characters?

As a follow-up to Python – how do I remove unwanted characters (that video focused on cleansing data created within the code), this video runs through several options to open a CSV file, find the unwanted characters, remove them from the dataset and then return the cleansed data.

How to get in amongst the problem data:

The approach here looked at three different scenarios:

(A) Using a variable that is equal to an open function, and then reading the variable into a data frame.

(B) Using a with statement and an open function together, and returning the output to a data frame.

(C) Using read_csv to quickly and efficiently read in the CSV file and then cleanse the dataset.
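Here is a hedged sketch of the three scenarios, assuming a "data.csv" file with a header row and a "name" column (both illustrative):

```python
import pandas as pd

# (A) a variable equal to an open function, then shaped into a data frame
f = open("data.csv")
rows = [line.strip().split(",") for line in f]   # extra code to shape the frame
f.close()
df_a = pd.DataFrame(rows[1:], columns=rows[0])

# (B) a with statement and an open function together
with open("data.csv") as f:
    rows = [line.strip().split(",") for line in f]
df_b = pd.DataFrame(rows[1:], columns=rows[0])

# (C) read_csv: the least code and the easiest to maintain
df_c = pd.read_csv("data.csv")

# the cleansing step is then the same for all three
df_c["name"] = df_c["name"].str.replace(r"[!£$%^&*@;:#~?<>]", "", regex=True)
print(df_c)
```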

Some minor bumps in the road that need some thought

There were some challenges with this that you will see in the video:

  • Options A & B had to deploy additional code just to get the data frame the way we wanted.
  • The additional lines of code made for more complexity if you were trying to manage it.

In the end, read_csv was probably the best way to approach this; it required less code and would have been easier to manage in the longer run.

 

As always thanks for watching, please subscribe to our YouTube channel, and follow us on social media!

Data Analytics Ireland

 

How to remove unwanted characters

Estimated reading time: 2 minutes

Removing the unwanted characters that are holding you up.
A situation has arisen: you have information with erroneous data inside it. What do you do?

Data issues are a common scenario faced by many data analytics professionals and the industry as a whole. Data quality has now become more critical, especially as we move more processes online and the digital landscape grows.

Most pieces of data go through a process of being transferred between systems to be used, or reports rely on their accuracy. If the data in the source system has quality issues, and the problem is not addressed before the data moves somewhere else, the quality issues can spread throughout an organisation, expanding like a spider’s web.

The next step: looking to fix the problem and planning for it.

To combat this problem, professionals need to come up with a plan for how to tackle it, either:

  • Fix at source.
  • Take the data in before moving it on, and investigate the problems.
  • Reject the file or part thereof.

All three options above have scenarios around them, with costs and implications depending on the industry, so you need to pick the most appropriate way to handle them. As an example, in the banking industry, payment files can sometimes contain data that is rejected entirely or in part, but the bank may decide to discard only the records with the wrong data and process everything else.

How to go about it and how regular expressions can help

In this video, we go through an example of how to cleanse a data set:

(A) We use a list to define the problems we need to find.

(B) We use functions to process through the data, find the problems and extract them.

(C) Regular expressions also appear, as they look to find the special characters in the data set.
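A minimal sketch of those three steps in Python; the character list and sample values are illustrative:

```python
import re

problem_chars = ["!", "@", "#", "$", "%"]        # (A) the list of problems to find
data = ["Facebook!", "Twitter", "Linked@In"]     # illustrative data

def find_errors(values):
    """(B) process the data and extract the values with problems."""
    return [v for v in values if any(c in v for c in problem_chars)]

def cleanse(values):
    """(C) use a regular expression to remove the special characters."""
    pattern = "[" + re.escape("".join(problem_chars)) + "]"
    return [re.sub(pattern, "", v) for v in values]

print(find_errors(data))   # ['Facebook!', 'Linked@In']
print(cleanse(data))       # ['Facebook', 'Twitter', 'LinkedIn']
```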

The concept of regular expressions is used extensively across several programming languages; they are a good way to test data and find erroneous values. If you are thinking about machine learning, it is quite important to get a more thorough knowledge of how they work. Here is a good link for further reading if you need more information: Regular Expression how to.

Thanks for watching and if you like, please share and subscribe through the buttons on this page!

Data Analytics Ireland

YouTube channel lists – Python Data Cleansing

Ever had a process where you received a set of data, and it took a bit of effort to cleanse the data, so it looks the way you want it?

The world of data processing, and of data exchanged between servers and organisations, needs careful attention; one person’s idea of clean data might not be another’s, and the difference between the two can lead to data issues.

Experian had an excellent article, What is data cleansing?, in which they talk about several factors about data:

  • It could be incorrect
  • It could be incomplete
  • It could be duplicated

 

One of the things they also highlighted is that under GDPR, organisations face more focus on data being accurate, complete and up to date.

We are putting together several videos in this area over time, so you will start to see them as they go up.

Please like and share through the social media buttons shared on the page here, thanks for watching!

Data Analytics Ireland

 

 

 

Data cleansing in a business environment

I was on LinkedIn recently and, looking at my profile, saw a post I had made about six years ago on data cleansing. One thing that struck me was that the topics I brought up then are as relevant now as they were then, and with big data now mainstream, many companies are wondering how to manage all this data in an ever-changing landscape. So I thought I would share it again.

The Business case

(A) Test Data meets Industry requirements.

In some industries, it is a legal requirement to have all your data displaying the correct format and description. Any pieces of information not included should be removed based on business rules. Today companies operate across multiple platforms (electronic, print, video, etc.), so a process needs to be in place to make sure the data is in sync!

(B) Check for unwanted words appearing

Branding and reputation are critical, and businesses large and small need a mechanism to understand what has been written online in conjunction with any of their profiles. Data cleansing can be the first port of call for spotting unwanted words that will damage the brand.

The Technical case

(A) Remove unwanted characters such as !”£$%^&*@’;:#~?>< etc.

When presented with a set of data from another source, it may be in a raw format, and if you are looking to group the words or numbers in your dataset, this can sometimes lead to incorrect grouping.

(B) Group Data

Sometimes you will need to group specific names or numbers to see how often they appear. This can become problematic if an initial review of the data was not carried out, as with point (A) above. Say you want to check for Facebook in your dataset, and there are six occurrences of it: four as “Facebook!” and two as “Facebook”. Your grouping will be incorrect, giving you the wrong analysis.
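A quick pandas illustration of that grouping problem (the values are made up):

```python
import pandas as pd

# six occurrences of "Facebook", four of them with a stray character
data = pd.Series(["Facebook!"] * 4 + ["Facebook"] * 2)

print(data.value_counts())
# Facebook!    4
# Facebook     2   <- two groups instead of one: the wrong analysis

cleansed = data.str.replace("!", "", regex=False)
print(cleansed.value_counts())
# Facebook     6   <- one group after cleansing
```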

(C) Prepare Data

The main reason for cleansing data is to have it ready to be processed. More often than not, this is an initial step before further processing starts. The downstream processing will have built-in controls to make sure the data is in the correct format; if it is not, the process will fail. This step would be crucial in an automated process.

(D) Check for Null Values

Sometimes a system can be set up to process data or receive data from a third-party vendor. It might be imperative that specific fields are not empty, or are empty, depending on the business need. An initial analysis to identify those values through the data cleansing process should help to mitigate any problems before the data gets loaded into systems that have strict controls on them.
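As a minimal sketch, assuming a pandas data frame and a mandatory "customer_id" field (both illustrative):

```python
import pandas as pd

# an illustrative incoming feed; customer_id must never be empty
incoming = pd.DataFrame({"customer_id": [101, None, 103],
                         "name": ["Anne", "Brian", None]})

mandatory = ["customer_id"]   # fields that must not be null per the rules

failures = incoming[incoming[mandatory].isnull().any(axis=1)]
if not failures.empty:
    print("Rows failing the mandatory-field check:")
    print(failures)   # investigate before loading into the strict system
```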