What does a data analyst do?

Estimated reading time: 4 minutes

Livestream #2 – What does a data analyst do?

You are probably sitting there hearing about big data and databases, data analytics and machine learning and wonder where a data analyst fits in?

Here we will look to break it down step by step.

Sometimes a data analyst can be confused with a business analyst; there are subtle differences:

  • Business Analyst: Their role is to document the user’s requirements in a document that is descriptive of what the user wants.
    • In this case, a document that all parties can agree to is created, and it can be used as part of the project sign off.
  • Data Analyst: On the other hand, a data analyst will take the business requirements and translate them into data deliverables.
    • They use the document to ensure the project has the right data to meet the project objectives in the right place at the right time.

Data Mapping

In different data projects there will be a need to reconcile the data between systems, a data analysis will help here.

In a data mapping exercise, the data analyst will be expected to look at one or more sources and map them to a destination system.

  • This ensures a match between the two datasets.
  • Which results in the ability to reconcile the two systems.
  • Allows the ability to use data in multiple systems, knowing the consistency is in place.
  • Consistency of the data types between the systems.
  • It ensures that data validation errors are kept to a minimum.

Often a Data Analyst will build a traceability matrix, which tracks the data item from creation through to consumption.

Data Quality

In most companies, there will be teams (depending on their size) dedicated to this, and their input will be pivotal to existing and future data use.

It is an important task that could impact internal and external reporting and a company’s ability to make decisions accurately.

Some of the areas that might be looked at include:

(A) Investigate duplicate data – There could be a number of reasons this has to be checked:

  • Data manually entered multiple times.
  • An automated process ran multiple times.
  • A change to an IT system has unknowingly duplicated data.

(B) Finding errors – This could be completed in conjunction with data reporting outlined below.

  • Normally companies will clearly have rules that pick up the data errors that are not expected.
  • A data analyst will analyse why these errors are occurring.

(C) Checking for missing data.

  • Data feeds have failed. A request to reload the data will be required.
  • Data that was not requested as part of the business requirements confirm that this is the case.

(D) Enhancing the data with additional information – Is there additional information that can be added that can enrich the dataset?

(E) Checking data is in the correct format – There are scenarios where this can go wrong, and example is a date field is populated with text.

Data Reporting

In some of the areas above, we touched on the importance of the quality of data.

Ultimately there may be a need to track:

  • Data Quality – Build reports to capture the quality of data based on predefined business measurements.
  • Real-time Reporting – No new customers or customers who have left an organisation.
  • Track Targets – Is the target set by the business been met daily, weekly, monthly?
  • Management Reporting – Build reports that provide input to management packs that provide an overview of how the business performs.

Data Testing

Organisations go through change projects where new data is being introduced or enhanced.

As a result the data analyst will have a number of tasks to complete:

  • Write Test Scripts – Write all scripts for record counts, transformations and table to table comparisons.
  • Datatype Validation – Ensures all new data will be the same as the other data where it is stored.
  • No loss of data – Check all data is imported correctly with no data truncated.
  • Record count – Write an SQL script that would complete a source to the destination reconciliation.
  • Data Transformation – Ensure any transformations are applied correctly.

Supporting data projects

Ad hoc projects are common , and sometimes become a priority for businses as they deal with requirements that result as part of an immediate business need.

Data Analysts will be called upon to support projects where there is a need to ensure the data required is of a standard that meets the project deliverables:

Some common areas where this might occur includes:

  • Extract data where it has been found to have been corrupted.
  • Investigate data changes, to analyse where a data breach may have occurred.
  • An external regulatory body has requested information to back up some reports submitted.
  • A customer has requested all the company’s information on them; usually the case for a GDPR request.

TypeError: cannot unpack non-iterable int object

Often when working on a data analytics project it requires you to split data out into its constituent parts.

There are a number of reasons for this, it can be confusing when you get errors as with the title of this post.

Before we explain this error and what it means, lets us first explain unpacking

Unpacking basically means splitting something up into a number of its parts that make it up.

To demonstrate if you take a,b,c = 123, and look to unpack it, it throws out the error, but why?

Well pure and simple, we have three values on the left “a,b,c”, looking for three values on the right.

a,b,c = 123
print(a)

Output:
 a,b,c = 123
TypeError: cannot unpack non-iterable int object

If you would like to fix this problem change the right hand side to have three values.

a,b,c = 1,2,3
print(a)
print(b)
print(c)
print(a,b,c)

Output:
1
2
3
1 2 3

Process finished with exit code 0

In essence, what is going on is that an evaluation checking that both sides have the same amount of values.

It is important to remember, the code above we used to show the error is an integer, which cannot be unpacked.

So if you take 123 for an example, which we used here it cannot be split into say 100 and 10 and 13.

In this case, even though when they are added up to 123, integers cannot be unpacked.

For this reason in the code for our solution, the difference is that the values used are tuples as follows:

a,b,c = 1,2,3
print(a)
print(b)
print(c)
print(a,b,c)
print(type((a,b,c)))

Or 
a,b,c = (1,2,3)
print(a)
print(b)
print(c)
print(a,b,c)
print(type((a,b,c)))

yield the same result:

1
2
3
1 2 3
<class 'tuple'>

Process finished with exit code 0

So in summary:

When unpacking there are a number of things to remember:

  • Integers on their own cannot be unpacked.
  • You need to make sure that if you have a number of variables, that you have the same number of integers if they the values.
    • This will make it a tuple and unpacking can then happen.

What is a Primary Key and Foreign Key

When working with databases in your data analytics projects, most likely you will deal with more than one table. As a result in certain scenarios there will be a link between the tables

This will ensure that you can use data from the two or more tables that can be joined together. But how would you do that?

In order to join two tables, you will need columns in both tables that have values that are unique and of similar values.

In other words you need to create what is called a primary key, its characteristics are as follows:

  • Its values are unique.
  • They cannot be null.

So where does a foreign key come in? Say we have two tables Table A and Table B.

Table “Customer Table” will have a column that has a primary key call Customer_No

Table “Sales Table” also has a column with a primary key called Sales_No.

  • Now we cant join the two tables, because the primary keys are different, and will have different values
  • But in the Sales Table, we will have an additional column called Customer_No, and it will hold some of the unique values that are held in the Customer Table, Customer_No column.
  • This Customer_No in the Sales Table is called the foreign key. It is unique, has null values and can be connected to a primary key column Customer_No on the Customer Table.

In this way tables can be connected where values in columns are not always the primary key of a table.

So let’s look at tables in SQLite and see how this works in practice.

Below you have a Customer Table and the Sales Table

In both tables, we have a primary key, though they are not the same primary key. As a result, you would not be able to join the tables as the primary keys contained in them do not have the same values, and this is how you would normally join.

In saying that, the two tables can be joined as the foreign key of Sales is related to the primary key of Customer. If invoice_no was not on Sales , then you could make the customer_no in sales the primary key.

So lets look at the tables below with data in them.

Normally when you try to join two tables, it is on the primary key. On the below invoice_no is the primary key of the table Sales, and customer_no is the primary key of the customer table.

While their values in each column are unique, they are not the same , so a join would fail as per the below:

But if you change on the join b.invoice_no TO b.customer_no, it will now bring back all the values asked for. The reason?

The primary key on one is linked to the foreign key on the other, with similar unique values.

As a result, you can still have a primary key column, not in another table and join to that table as long as there is a foreign key column with the same values.

This helps to maintain the table structure, without having to remove primary keys.

how to create and drop a table in SQL

If you are working on data analytics , and have access to databases that you can load data into and manipulate, then read on.

There maybe a reason for you have your own partitioned table with data in it, but you need to know how to create a database.

The creation of a table can be achieved as follows;

  • Decide on the table name.
  • Within the table you will have columns, they need to be named.
  • Next for each column name a datatype needs to be decided upon.

It is important to note that the data type you decide is very important before you create the table.

This is because once you start loading data into the table if you decide to change the data type for a column, it could change the data contained within that column.

As a result a review of the data to be loaded should be completed, looking at the type of data, its length and any attributes of the data that need careful consideration.

Lets create our table and talk through the different aspects of the SQL used.

So in the below we have the create table statement

The SQL is broken down as follows:

  1. The first line creates the table within the schema. It has its own unique area in there.
  2. Lines 2,3,4 create the columns with their datatypes.

When this is run , the schema is updated, one thing to note it is empty, and requires an insert statement to populate the columns.

And if we do a select all from the table, it returns empty:

How to drop a table that you have created?

Sometimes you may have a need to drop tables from your schema.

Before proceeding there are a couple of things to consider:

  • Have you backed up the data contained within it?
  • You understand the deletion is permanent ( unless you can rebuild from backup)
  • Any batch runs that run off the table will fail once it is deleted.
  • Any tables that have queries that join to the table been removed will fail.
  • This might be part of a wider database cleanup and may improve performance.

Below is the one line of code ( yes you are reading that correct), that will delete the table.

In most roles the ability to drop tables will be restricted by the database administrator.

After the deletion it will look like this. As you can see the table no longer exists.

In conclusion the deletion of a database table is very easy to do , but comes with caveats as outlined above, so proceed with caution.

Tkinter GUI tutorial python – how to clean excel data

Estimated reading time: 2 minutes

Tkinter is an application within Python that allows users to create GUI or graphical user interfaces to manage data in a more user-friendly way.

We are building our data analytics capability here, and looking to provide the user with the functionality they use in their work or college projects.

We have tested this code over 100,000 records sitting on the Microsoft OneDrive network so in a way, for this reason, its speeds were quite good.

As a result over five tests, they all were under 100s from start to finish.

data cleansing data cleansing fixed

In this Tkinter GUI tutorial python, you will be shown how to find the data errors, clean them and then export the final result to excel.

We will take you through the following:

  • Creation of the Tkinter interface.
  • Methods/ functions to find errors.
  • Methods/functions to clean the data.
  • Exporting the clean data to an excel file.

 

To sum up:

The video walks through the creation of a Tkinter window using a canvas and a frame to store the data frame.

Then it looks at importing the data through pd.read_excel, to load the data into a pandas data frame.

Next, there is a function and or method that will extract the errors through str.extract , which is loaded into separate columns

Finally, I have exported the clean dataset using rawdata.to_excel , and saved the file as a separate new spreadsheet.

What is data analytics?

Estimated reading time: 4 minutes

In recent years data analytics and the importance of data analysis has gone up significantly.

As a result of this, the quantities of data now been processed has increased.

Here in Data Analytics Ireland, we will look to explain the concept, please share or link to this article so others can see its contents.

To begin with, the reason for this is that the digital economy has taken off, in particular:

(A) Automation of tasks has become easier.

(B) Less use of paper, becoming a greener economy.

(C) Technology improvements have meant that storage and big data processing make the process of delivering services easier.

(D) Career opportunities for professionals with good skills have increased.

(E) A wide range of open source and paid tools are now easily available that help to process and report on the data.

(F) Entry requirements are easy, and this coupled with an ability to quickly acquire knowledge and skills helps entry whether you want to be full time or part-time.

(G)Knowledge and skills have improved as access to online learning has improved significantly.

As a result of all this:

(A) Large data volumes need to be analysed.

(B) Consumers’ habits about how they use a service or the information they look for now has a digital footprint.

(C) Now once a consumer has used a service ( whether purchased or not), the ability to understand their habits can be captured to deliver as follows:

  1. The services they want.
  2. The products they want.
  3. Quicker turnaround time.

How can this help with all the data that is captured and stored?

So in this article, we have already outlined what the background is as to how the industry has evolved to where data analytics is now.

As outlined, all information traditionally would not have been stored in a format that was easily accessible.

Step 1 – Data Capture

To understand what you want to analyse, and help draw conclusions accurately, a data analyst will work with their technical colleagues to ensure that the correct data is captured.

Data capture of raw data, can happen in a number of ways:

  • User input.
  • Interaction with a website or application.
  • Consuming a service.
  • Requesting a service be completed.
  • Social media interaction.
  • In a lot of cases now this happening in real time.

Once the completeness and accuracy are fulfilled, your data quality will become less of an issue.

Step 2 – Analysing

(A) You create visual charts of it; this allows the viewer of the information to get an initial view of the information without looking at the underlying data. Sometimes this will show patterns in data or clusters or the types of data you capture.

(B) Using data science statistics to see if they can explain the data. This could show information such as how data is correlated or otherwise. Also, probabilities could be calculated to show what outcomes might happen in the future.

(C) Data analysts might also need to understand how to build a machine learning model to use complex algorithms, to explain the data better, sometimes patterns that are not immediately understood can be unearthed and investigated further.

Step 3 – Presenting

An emphatic NO is an answer, but this is where some of the visualisation tools come in!

Visually presenting data points, very quickly allows a viewer of the information to come to a decision quickly, and the tools that are outlined below will help with that process.

These are a handful that will allow the data to be sliced and diced, there are many more out there, but they all allow data to be drilled down into and get to a real understanding what is going on.

Some of the tools include TableauPower BI, and Python( it has libraries that do a nice job)

Step 4 – Decision Making

So after all this analysing, there needs to be decisions made:

(A) Do you have the data in order and in the correct format?

(B) In a place that it can be accessed and reviewed.

(C) Relevant to when the decision needs to be made.

From the outset, as part of the work of performing the data analytics, an assessment needs to be made as to how often a decision will need to be made, with what data, and when.

At this point, the decision-makers should have a set of data ready for them to look over and reliably make a decision

Based on the data returned, if they can’t make a decision then possibly, steps 1-3 should be reviewed and revisited.

Often what happens, is what information that was required to make a change needs to be updated, or improved upon.

It is the job of the teams that manage the data sets to source that data and or change how they present it, to now reflect the decision that needs to be made.

how to create an instance of a class

Estimated reading time: 1 minute

Here in how to create an instance of a class, as described herein, how to create a class in Python, we will further explore the instance of class and how this can be used within a program to assign values to an object. This allows that object to inherit those values contained within the class, making it easier to have consistency regards functionality and data.

This video covers off

(a) creating an instance of a class

(B) Using the __init__ within the class

(C) define the constructor method __init__

(D) Creating an object that calls a class and uses the class to process some piece of data.

What are the benefits of this?

  • You only need to create one class that holds all the attributes required.
  • That class can be called from anywhere within a program, once an instance of it is created.
  • You can update the class, and once completed, those new values will become available to an instance of that class.
  • Makes for better management of objects and their properties, not multiple different versions contained within a program

 

 

How to create a class in Python

Estimated reading time: 1 minute

How to create a class in Python: In this video explaining classes will be the main topic on how they are constructed,  explain how to create an instance of a class.

When talking about classes, they can also be referred to as object-orientated programming.

Also, we look at what class attributes are and how they can be used to assign key data that can be called anywhere within a program.

The steps involve the following:

(a) Create a class

(B) Assign attributes to the class

(C) Create a method within the class ( similar to a function)

(D) Create an instance of a class to call its attributes and methods.

This video is a follow on from object oriented programming – Python Classes explained