What does a data analyst do?

Estimated reading time: 4 minutes

Livestream #2 – What does a data analyst do?

You are probably sitting there hearing about big data, databases, data analytics and machine learning, and wondering where a data analyst fits in.

Here we will look to break it down step by step.

Sometimes a data analyst can be confused with a business analyst; there are subtle differences:

  • Business Analyst: Their role is to capture the user’s requirements in a document that describes what the user wants.
    • In this case, a document that all parties can agree to is created, and it can be used as part of the project sign off.
  • Data Analyst: On the other hand, a data analyst will take the business requirements and translate them into data deliverables.
    • They use the document to ensure the project has the right data to meet the project objectives in the right place at the right time.

Data Mapping

In many data projects there will be a need to reconcile data between systems, and a data analyst will help here.

In a data mapping exercise, the data analyst will be expected to look at one or more sources and map them to a destination system.

  • It ensures a match between the two datasets.
  • It makes it possible to reconcile the two systems.
  • It allows data to be used across multiple systems, knowing that consistency is in place.
  • It keeps data types consistent between the systems.
  • It keeps data validation errors to a minimum.

Often a Data Analyst will build a traceability matrix, which tracks the data item from creation through to consumption.

Data Quality

In most companies, there will be teams (depending on the company’s size) dedicated to data quality, and their input will be pivotal to existing and future data use.

It is an important task that could impact internal and external reporting and a company’s ability to make decisions accurately.

Some of the areas that might be looked at include:

(A) Investigate duplicate data – There could be a number of reasons this has to be checked:

  • Data manually entered multiple times.
  • An automated process ran multiple times.
  • A change to an IT system has unknowingly duplicated data.

(B) Finding errors – This could be completed in conjunction with data reporting outlined below.

  • Normally companies will have clearly defined rules that pick up data errors that are not expected.
  • A data analyst will analyse why these errors are occurring.

(C) Checking for missing data.

  • Data feeds have failed, and a request to reload the data will be required.
  • Where data was not requested as part of the business requirements, confirm that this is the case.

(D) Enhancing the data with additional information – Is there additional information that can be added that can enrich the dataset?

(E) Checking data is in the correct format – There are scenarios where this can go wrong; an example is a date field populated with text. A short sketch of some of these checks follows below.
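To make a few of these checks concrete, here is a minimal sketch in Python using pandas; the file name and column names are assumptions for illustration, not from the original post.

import pandas as pd

# Load a hypothetical customer extract (file and column names are assumptions)
df = pd.read_csv("customers.csv")

# (A) Investigate duplicate data – rows that appear more than once
duplicates = df[df.duplicated(keep=False)]

# (C) Check for missing data – count empty values per column
missing_counts = df.isna().sum()

# (E) Check data is in the correct format – flag date fields populated with text
parsed_dates = pd.to_datetime(df["signup_date"], errors="coerce")
bad_dates = df[parsed_dates.isna() & df["signup_date"].notna()]

print("Duplicate rows:", len(duplicates))
print("Missing values per column:")
print(missing_counts)
print("Rows with unparseable dates:", len(bad_dates))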

Data Reporting

In some of the areas above, we touched on the importance of the quality of data.

Ultimately there may be a need to track:

  • Data Quality – Build reports to capture the quality of data based on predefined business measurements.
  • Real-time Reporting – The number of new customers, or customers who have left the organisation.
  • Track Targets – Has the target set by the business been met daily, weekly or monthly?
  • Management Reporting – Build reports that feed into management packs, giving an overview of how the business is performing.

Data Testing

Organisations go through change projects where new data is being introduced or enhanced.

As a result, the data analyst will have a number of tasks to complete:

  • Write Test Scripts – Write all scripts for record counts, transformations and table-to-table comparisons.
  • Datatype Validation – Ensure the data types of new data match those of the destination where the data is stored.
  • No Loss of Data – Check all data is imported correctly with no data truncated.
  • Record Count – Write an SQL script that completes a source-to-destination reconciliation (see the sketch after this list).
  • Data Transformation – Ensure any transformations are applied correctly.
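As a rough illustration of the record count test, the sketch below compares source and destination row counts in SQLite from Python; the database file and table names are assumptions made for the example.

import sqlite3

# Hypothetical database holding both the source extract and the loaded destination table
conn = sqlite3.connect("migration.db")

source_count = conn.execute("SELECT COUNT(*) FROM source_customers").fetchone()[0]
dest_count = conn.execute("SELECT COUNT(*) FROM destination_customers").fetchone()[0]

# A simple source-to-destination reconciliation on record counts
if source_count == dest_count:
    print("Reconciled:", source_count, "records in both systems")
else:
    print("Mismatch: source has", source_count, "records, destination has", dest_count)

conn.close()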

Supporting data projects

Ad hoc projects are common, and sometimes become a priority for businesses as they deal with requirements that arise from an immediate business need.

Data Analysts will be called upon to support projects where there is a need to ensure the data required is of a standard that meets the project deliverables.

Some common areas where this might occur include:

  • Extract data where it has been found to have been corrupted.
  • Investigate data changes, to analyse where a data breach may have occurred.
  • An external regulatory body has requested information to back up some reports submitted.
  • A customer has requested all the company’s information on them; usually the case for a GDPR request.

How to use case statements in SQL

In recent posts on SQL here on this website, we have covered a number of ways to extract your data, from wildcards to filters.

Here we are going to build on that learning, but bring in a new dimension in the use of case statements.

These are very handy when you want to apply specific values to a piece of data based on conditions it meets.

We have several posts on machine learning; in effect, case statements could be used to add data needed to run a model.

Let’s start by looking at the SQLite database we have been using for other posts.

The table we will work from is the one we have used in other posts.

What we are going to do is look at the column name and assign a binary value, 1 or 0, depending on whether the string in this column begins with the letter “A”.

Case statements work on the basis of checking whether a value meets certain criteria.

They are normally constructed as follows:

CASE
    WHEN … criteria you want to meet THEN … value to assign
    ELSE … default value
END

The power of this is that you are starting to add additional attributes about the string in a new column.

An example could be checking whether a column value is the incorrect length, which can be used as a data quality check. See my example below.

Below is the output when we apply a value of 1 in a new column where the name starts with “A”.

It is important to remember that the new column country_check only shows in the output.
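Below is a minimal sketch of what that single case statement could look like when run from Python against the SQLite database; the database file name and the table name countries are assumptions, while the column name and the alias country_check follow the post.

import sqlite3

# Connect to the SQLite database used in the post (file and table names are assumptions)
conn = sqlite3.connect("data_analytics.db")

query = """
SELECT name,
       CASE
           WHEN name LIKE 'A%' THEN 1
           ELSE 0
       END AS country_check
FROM countries
"""

# country_check exists only in the query output; it is not saved to the table
for name, country_check in conn.execute(query):
    print(name, country_check)

conn.close()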

Above we have just one case statement, but what if we want more than one condition in the query?

If we want to apply multiple conditions within a case statement, just add a new WHEN line with your criteria.

The wildcard patterns below will return any value containing the digits 1 to 5, in any combination.
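Here is a minimal sketch of a case statement with more than one WHEN line; the second pattern uses SQLite’s GLOB syntax for the digit check, since LIKE only supports the % and _ wildcards, and the patterns and values matched are assumptions for illustration.

import sqlite3

conn = sqlite3.connect("data_analytics.db")   # file and table names are assumptions

query = """
SELECT name,
       CASE
           WHEN name LIKE 'A%' THEN 1          -- starts with the letter A
           WHEN name GLOB '*[1-5]*' THEN 2     -- contains any of the digits 1 to 5
           ELSE 0
       END AS country_check
FROM countries
"""

for row in conn.execute(query):
    print(row)

conn.close()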

What if we want to have multiple case statements on different columns?

As can be seen below, we have selected two columns to apply our case statements to.

We have just added a check on the column iso_code.

As before, the new columns will not be saved to the database.
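A minimal sketch of case statements applied to two different columns is shown below; the post does not say what the iso_code check is, so a length check stands in as an assumption, and the database file and table names are also assumptions.

import sqlite3

conn = sqlite3.connect("data_analytics.db")   # file and table names are assumptions

query = """
SELECT name,
       iso_code,
       CASE WHEN name LIKE 'A%' THEN 1 ELSE 0 END AS country_check,
       CASE WHEN length(iso_code) = 3 THEN 1 ELSE 0 END AS iso_code_check
FROM countries
"""

# Both new columns appear only in the query output, not in the stored table
for row in conn.execute(query):
    print(row)

conn.close()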

The video showing how this is applied can be found below.

In conclusion, we have shown examples of how to apply case statements in SQL, with the following pointers:

  • Multiple case statements can be created.
  • Wildcards can be used to return different variations of your data.
  • Outputs are populated into a new column, though this is not saved to the table.
  • You can test for multiple scenarios.
  • Using case statements allows you to understand your data better with additional attributes.

We have lots of posts on this website that will help you build your data analytics skills.

Python Tutorial: How to validate data using tuples

Do you want to validate data with tuples? That is easy; making changes to them is not so easy.

In our recent video Python – how do I remove unwanted characters, lists were used as a lookup to validate data that we needed to check for invalid data items. The most apparent difference between the two is that tuples are immutable, hence changing their values is not possible, which makes using them in real-time code a bit hazardous.

So why would you use Tuples?

That is a good question and sometimes not too obvious when you try to put examples down on paper, but here are some cases:

  • You want a set of values that will never change, no matter what.
  • You want a lookup that the program can check against, which can be called anywhere in your code.
  • You want to make sure that you only process what is in the tuple; any additional data can be reported as erroneous, a form of error control.

Getting around the change limitations (well kind of)

This video looks at a few simple steps: take in a set of data, validate the id column against a tuple of values, and then show the differences in a separate output.

The code is then rerun after the error values found are added to the original tuple, giving a new tuple. As a result, the new output shows no errors.
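A minimal sketch of that approach is below; the ids and records are made up to stand in for the dataset used in the video.

# A tuple of valid ids – immutable, so it cannot be changed accidentally (values are made up)
valid_ids = (1, 2, 3, 4, 5)

# Incoming records to validate (made-up data standing in for the video's dataset)
records = [1, 2, 6, 3, 9, 5]

# Collect anything not covered by the tuple and report it separately
errors = [r for r in records if r not in valid_ids]
print("Invalid ids found:", errors)

# A tuple cannot be updated in place, but concatenation produces a new tuple
valid_ids = valid_ids + tuple(errors)

# Rerunning the check against the new tuple now shows no errors
errors = [r for r in records if r not in valid_ids]
print("Invalid ids after update:", errors)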

To sum it all up

In a nutshell, tuples are limited in what they can do; probably the best uses for them are:

  • Use them in your code as a reference for recurring values that need to be validated.
  • Don’t use them in your code where the values need to be updated; use lists instead, as you can update those in real time.

How to remove characters from an imported CSV file

Estimated reading time: 2 minutes

Removal of unwanted errors in your data, the easy way.
The process of importing data can take many forms, but have you been looking for a video on how to do this? Even better, are you looking for a video that shows you how to import a CSV file and then data cleanse it, effectively removing any unwanted characters?

As a follow-up to Python – how do I remove unwanted characters, which focused on cleansing data created within the code, this video runs through several options to open a CSV file, find the unwanted characters, remove them from the dataset, and then return the cleansed data.

How to get in amongst the problem data:

The approach here looked at three different scenarios:

(A) Using a variable assigned to an open function, and then reading the variable into a data frame.

(B) Using a with statement and an open function together, and returning the output to a data frame.

(C) Using read_csv to quickly and efficiently read in the CSV file and then cleanse the dataset.
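Here is a rough sketch of the three approaches; the file name, the column being cleansed and the unwanted characters are assumptions made for illustration.

import pandas as pd

unwanted = ["$", "&", "#"]            # characters to strip out (assumed for the example)

# (A) Assign an open function to a variable, then read the variable into a data frame
f = open("input.csv")
df_a = pd.read_csv(f)
f.close()

# (B) Use a with statement and an open function together
with open("input.csv") as f:
    df_b = pd.read_csv(f)

# (C) Let read_csv open the file directly – the shortest option
df_c = pd.read_csv("input.csv")

# Cleanse a hypothetical 'name' column by removing each unwanted character
for ch in unwanted:
    df_c["name"] = df_c["name"].str.replace(ch, "", regex=False)

print(df_c.head())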

Some minor bumps in the road that need some thought

There were some challenges with this that you will see in the video:

  • Options A & B had to deploy additional code just to get the data frame the way we wanted.
  • The additional lines of code made for more complexity if you were trying to manage them.

In the end, read_csv was probably the best way to approach this; it required less code and would have been easier to manage in the longer run.

 

As always thanks for watching, please subscribe to our YouTube channel, and follow us on social media!

Data Analytics Ireland

 

How to remove unwanted characters

Estimated reading time: 2 minutes

Removing the unwanted data that is holding you up.
A situation has arisen: you have information which has erroneous data inside it. What do you do?

Data issues are a common scenario faced by many data analytics professionals and the industry as a whole. Data quality has now become more critical, especially as we move more processes online and the digital landscape expands.

Most pieces of data go through a process of being transferred between systems to be used, or reports rely on their accuracy. If the data in the source system has quality issues and the problem is not addressed before the data moves on, the quality issues can spread further throughout an organisation, expanding like a spider’s web.

The next step: looking to fix the problem and planning for it.

To combat this problem, professionals need to come up with a plan for how to tackle it, either:

  • Fix at source.
  • Take the data in before moving it on, and investigate the problems.
  • Reject the file or part thereof.

All three options above have scenarios around them with costs and implications; depending on the industry, you need to pick the most appropriate way to handle them. As an example, in the banking industry, payment files can sometimes have data in them that is rejected entirely or in part.

But the bank may decide they will only discard the records with the wrong data and process everything else.

How to go about it and how regular expressions can help

In this video, we go through an example of how to cleanse a data set:

(A) We use a list to check what problems we need to find.

(B) Using functions again to process the data, find the problem values and extract them.

(C) Regular expressions also appear as they look to find the special characters in the data set.
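A minimal sketch of those three steps is shown below; the sample values and the special characters being searched for are assumptions made for the example.

import re

# (A) A list of the problem characters we need to find (assumed for the example)
problem_chars = ["$", "&", "#"]

# Sample data containing some erroneous values (made up for illustration)
data = ["Dublin", "Cork$", "Galway&", "Limerick", "Sligo#"]

# (B) A function that processes the data and extracts the problem records
def find_problems(values, chars):
    # (C) A regular expression that matches any of the special characters
    pattern = re.compile("[" + re.escape("".join(chars)) + "]")
    return [v for v in values if pattern.search(v)]

errors = find_problems(data, problem_chars)
print("Values with special characters:", errors)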

The concept of regular expressions is used extensively across several programming languages; it is a good way to test data and find erroneous values. If you are thinking about machine learning, it is quite important to get a more thorough knowledge of how they work. Here is a good link for further reading if you need more information: Regular Expression how to.

Thanks for watching and if you like, please share and subscribe through the buttons on this page!

Data Analytics Ireland