How to Generate Random Integers Between 0 and 9

Estimated reading time: 3 minutes

You may have a number of machine learning projects where you need a way to generate random integers to test your model assumptions.

There are plenty of ways to achieve this, and here we will take you through a number of them.

Before we walk through the different scenarios, it is worth asking: where would you use one or more of the approaches below?

  1. Computer simulation – generate random numbers as part of a simulation in a computer program.
  2. Statistical sampling – if you are conducting analysis on a set of data, you can use the approaches below to generate samples within a range.
  3. Introduce randomness – helps with testing different scenarios and ensures no bias is introduced.
  4. Cryptography – random values can be used as part of security testing, to help keep data secure in transmission or on a database.
  5. Stress testing – pick extreme values to pass to a program and see how it reacts and whether it returns an error.

You can use the randrange function

This function returns a randomly selected value from the range you define:

from random import randrange
print(randrange(1,10,1))

Result = prints a random number starting at 1, up to but not including 10, in steps of 1

Use the randint function?

import random

print(random.randint(0,9))

Result = This produces a random integer between 0 and 9, inclusive of both ends

What about the randbelow function?

from secrets import randbelow

print(randbelow(10))

Result = Returns a random integer between 0 and 9, as randbelow(n) returns a value from 0 up to but not including n

You could also try NumPy’s random.randint function, as it is also popular:

import numpy as np
print(np.random.randint(10, size=(1, 10)))
Result = prints ten integer values, anything between 0 and 9 (the upper bound of 10 is excluded)

Sample Output:
[[6 8 3 9 4 1 5 3 2 7]]
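
As an aside, newer NumPy code often uses the Generator API rather than np.random.randint; a minimal sketch of the equivalent call (assuming NumPy 1.17 or later):

import numpy as np

rng = np.random.default_rng()             # the newer NumPy Generator API
print(rng.integers(0, 10, size=(1, 10)))  # ten integers between 0 and 9, the 10 is excluded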


Using the range function in a for loop is an option (note this produces the sequential values 0 to 9 rather than random ones)

n={} # Creates a dictionary of key value pairs
for i in range(10):
    n[i]=i
    print(n)

Result = Iterates and creates a dictionary as it moves through each step.

Sample Output:
{0: 0}
{0: 0, 1: 1}
{0: 0, 1: 1, 2: 2}
{0: 0, 1: 1, 2: 2, 3: 3}
{0: 0, 1: 1, 2: 2, 3: 3, 4: 4}
{0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5}
{0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6}
{0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7}
{0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8}
{0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9}

Use random.randrange in a for loop

from random import randrange
for i in range(10):
    print(randrange(0, 10))

Result = Loops ten times to create ten random numbers

Sample Output:
9
1
8
1
7
7
6
3
9
7

While the focus of this post is on integers, here are two examples using floating-point values.

You can generate random floating points using random.uniform

import random
print(random.uniform(0.0, 9.0))

Result = Returns a random floating point number between 0.0 and 9.0.

Sample Output:
4.436111014119316

An alternative is to generate a series of random floating point values between 0 and 1 using random.random()

import random
for i in range(10):
    print(random.random())

Result = Returns ten random floating point numbers between 0 and 1.

Sample Output:
0.8975905667079721
0.8467133530607382
0.757433819303854
0.88431126761758
0.9077189321042094
0.4826489161220854
0.7689417823093723
0.505382717614604
0.3102908090040003
0.01832993383665016

So, in summary, there are various ways to create random numbers, whether a single value or a group of them.

This makes them a very handy tool to help a programmer test a machine learning project.

Equally, if you need to try and break your program with values at the extremes of a range, that can be accommodated this way as well.
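
One related tip: if you need the same “random” values on every run, for example to reproduce a test, you can seed the generator first. A minimal sketch using the random module:

import random

random.seed(42)  # fixing the seed makes the sequence repeatable
print([random.randint(0, 9) for _ in range(5)])

random.seed(42)  # re-seeding with the same value produces the same sequence again
print([random.randint(0, 9) for _ in range(5)])

Both print statements output the same five integers, which is useful when a test needs to be rerun with identical inputs.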

How to Automate Testing With Python | unittest

Estimated reading time: 5 minutes

So you are working with large amounts of data that is having transformations applied to it, or quite simply you need to see, as you move the data, whether it has changed in any way.

The question that arises is if I have lots of validations to complete, how can I quickly see if they pass or fail?

Here we will take you through unittest, a Python testing framework that will enable a quick way to validate data you are working with.

Step 1 – Define your test scenarios

Like in any testing, the first step is to define what you want to test and what is your expected outcome. Here we are testing data that was transformed into a different value.

The test then looks at the new values and compares them to a list of values that are expected in the output.

These before-and-after values are what we will use to create our test cases, which will be outlined below.

While the rules may seem very strict, they emphasise exactly what we are testing for: every transformed value must match the expected output exactly, and every value must be purely alphabetic. Either all data passes the test or it does not.

In reality, these rules can be amended or relaxed depending on what you need to test for; this is purely for illustrative purposes.

Step 2 – Create our Python file that will pass in the transformed data

This is the first file we need, it will be the source file that will be used to transform the data and then pass to the testing module later on.

In this instance, it is called “automate_testing_python.py”

I have created a dictionary here, but you can read in a CSV file or any file that you wish. Just make sure that it is passed to the data frame.

import pandas as pd

data = {'Name_before':['Joe','Liz','Pat'],'Name_after':['Joseph','Elizabeth','Patrick']} #===> Creating a dictionary

df = pd.DataFrame(data) #===> Creating a dataframe


def main():

    transformations_applied = data['Name_after'] #===> Defines the column you want to test against
    return transformations_applied #===> Returns the values that will be used in the automation script

The function “main” above clearly defines which column holds the data to be passed to the automated test, in this instance “transformations_applied”.

Step 3 – Create a test script file

In the below there are a number of steps to follow:

  1. Install the package HTMLTestRunner in whatever solution you are using.
  2. Import the unittest module; it is part of the Python standard library, so no separate install is needed. This will perform the tests asked and provide the result outputs.
  3. Next we import what was returned from the “main” function in the file “automate_testing_python.py”.
  4. The next step is the creation of the class, which serves the following purposes:
    1. It defines the exact tests we want to carry out.
    2. The first method clearly defines what values are expected and what should be tested against.
    3. The second method purely checks that all the values are alphabetic.
  5. The last line runs the tests and returns the values to the output HTML file.
from HTMLTestRunner import HTMLTestRunner
import unittest

from automate_testing_python import main #===> Imports the data that was returned


#This is the code that performs the tests
#Each method is a specific test
class test_values(unittest.TestCase):

    def test_ValueEquals(self):

        ExpectedValues = ['Joseph','Elizabeth','Patrick']
        values_received = main()
        self.assertListEqual(ExpectedValues,values_received)

    def test_Valueisalphabetic(self):

        values_received = main()
        for c in values_received:
            self.assertTrue(c.isalpha())


if __name__ == "__main__":
    HTMLTestRunner.main()

Step 4 – Create the File that will generate the output HTML file with the results.

As a result of the above steps we have created the data, brought it in to test against predefined test scenarios.

Now, all we have to do is present the findings in an HTML file.

In this instance, the output will be easier to understand, and quicker to digest.

The code below follows these steps:

(A) The first and second lines import the libraries we will need to run this bit of code.

(B) The third line is receiving the values that were generated from the class in step 3.

(C) The next four lines pull in the data and attach it to the myreport.html file.

(D) The last two lines do the following:

  1. Tell the program to take the data received and then use the HTMLTestRunner package to process it, and create the template output.
from HTMLTestRunner import HTMLTestRunner
import unittest
from test_automate_testing_python import test_values

if __name__=='__main__':
    suite=unittest.makeSuite(test_values)
    filename=r'C:\Users\haugh\OneDrive\dataanalyticsireland\YOUTUBE\how_to_automate_testing_with_python\myreport.html'
    fp=open(filename,'w')
    runner=HTMLTestRunner.HTMLTestRunner(fp,title=u'Data Analytics Ireland Test Output',description=u'This shows the testing output result')
    runner.run(suite)

Step 5 – Run the file html_report.py

Running this file will also open and run the code in the other files referenced in this post.

As a result, when the program is finished the output will look like this:

Earlier in the program we created a dictionary (Step 2) like this:

data = {'Name_before':['Joe','Liz','Pat'],'Name_after':['Joseph','Elizabeth','Patrick']} #===> Creating a dictionary

and then in Step 3, in our class “test_values”, we defined the expected outputs.

def test_ValueEquals(self): ExpectedValues = ['Joseph','Elizabeth','Patrick']

In this instance, the values are the same, so the output will mark the two tests as a Pass.

But if the values passed in are:

data = {'Name_before':['Joe','Liz','Pat'],'Name_after':['Joseph','Elizabeth','Patrick1']} #===> Creating a dictionary

and are compared to

def test_ValueEquals(self): ExpectedValues = ['Joseph','Elizabeth','Patrick']

The following output will appear, as the two tests will fail:

So we have created two test cases here, but we could add many more and have the automation quickly return the test results for us.
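
For illustration, here is a minimal sketch of what one extra test could look like; the class name, method name and rule below are hypothetical, and it assumes the same “automate_testing_python.py” file as above:

import unittest

from automate_testing_python import main

class test_values_extra(unittest.TestCase):

    def test_ValueIsNotEmpty(self):
        # Hypothetical extra rule: no transformed value should be empty
        values_received = main()
        for value in values_received:
            self.assertTrue(len(value) > 0)

if __name__ == "__main__":
    unittest.main()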

ValueError: cannot convert float NaN to integer

Estimated reading time: 2 minutes

Sometimes in your data analytics project, you will be working with float data types and integers, but the value NaN may also appear, which will give you headaches if you don’t know how to fix the problem at hand.

A NaN is defined as “Not a Number” and represents missing values in the data. If you are familiar with SQL, it is a similar concept to NULLS.

So how does this error occur in Python?

Let’s look at some logic below:

NaN =float('NaN')
print(type(NaN))
print(NaN)

Result:
<class 'float'>
nan

As can be seen, we have a variable called ‘NaN’, and its data type is float.

One of the characteristics of NaN is that it is a special floating-point value and cannot be converted to any type other than float. The example below shows exactly this, and why you would get the error message we are trying to solve.

NaN =float('NaN')
print(type(NaN))
print(NaN)

a= int(NaN)

print(a)

Result:

Traceback (most recent call last):
  File "ValueError_cannot_convert_float_NaN_to_integer.py", line 5, in <module>
    a= int(NaN)
ValueError: cannot convert float NaN to integer

In the variable ‘a’ we are trying to create an integer from the NaN variable, which, as we know, is a floating-point value and cannot be converted to any type other than float.

How do we fix this problem?

The easiest way to fix this is to change the ‘NaN’ value to an actual number, as per the below:

NaN =float(1)
print(type(NaN))
print(NaN)

a= int(NaN)

print(a)
print(type(a))

Result:
<class 'float'>
1.0
1
<class 'int'>

So, in summary, if you come across this error:

  1. Check to see if you have any ‘NaN’ values in your data.
  2. If you do, replace them with an integer value, or a value that you need for your project; that should solve your problem (a pandas sketch of this follows below).
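
If the NaN values are sitting in a pandas data frame, one way to apply that fix is to replace them with fillna before converting the column to integers; a minimal sketch (the column name and values here are just an example):

import pandas as pd

df = pd.DataFrame({'No Purchases': [2.0, float('NaN'), 4.0]})  # example column containing a NaN

df['No Purchases'] = df['No Purchases'].fillna(0).astype(int)  # replace the NaN, then convert to int
print(df)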

TypeError: List Indices Must Be Integers Or Slices, Not Tuple

Estimated reading time: 3 minutes

When working with Python lists in your data analytics projects, a common problem occurs when reading the data if you have a list of lists that is not properly formatted.

In this instance, Python will not be able to read one or more lists and as a result, will throw this error.

In order to understand how this problem occurs, we need to understand how to create a list of lists.

How to create a list of lists

Let’s look at a simple list:

a = [1,2,3]
print(a)
print(type(a))

Result:
[1, 2, 3]
<class 'list'>

Let’s create a second list called b:

b = [4,5,6]
print(b)
print(type(b))

Result:
[4, 5, 6]
<class 'list'>

So if we want to join the lists together into one list (hence a list of lists), then:

a = [1,2,3]
b = [4,5,6]

list_of_lists = []
list_of_lists.append(a)
list_of_lists.append(b)
print(list_of_lists)
print(type(list_of_lists))

Result:
[[1, 2, 3], [4, 5, 6]]
<class 'list'>

So as can be seen the two lists are contained within a master list called “list_of_lists”.

So why does the error list indices must be integers or slices, not tuple occur?

Reason 1 (Missing commas between lists)

If you manually type them in and forget the comma between the lists this will cause your problem:

a=[[1,2,3][4,5,6]]
print(a)

Result (Error):
Traceback (most recent call last):
  line 10, in <module>
    a=[[1,2,3][4,5,6]]
TypeError: list indices must be integers or slices, not tuple

But if you put a comma between the two lists then it returns no error:

a=[[1,2,3],[4,5,6]]
print(a)

Result (no error):
[[1, 2, 3], [4, 5, 6]]
Process finished with exit code 0

Reason 2 (Comma in the wrong place)

Sometimes you have a list, and you only want to bring back some elements of the list, but not others:

In the below, we are trying to bring back the first two elements of the list.

a= [1,2,3,4,5]
print(a[0:,2])

Result:
Traceback (most recent call last):
   line 14, in <module>
    print(a[0:,2])
TypeError: list indices must be integers or slices, not tuple

The reason the same error happens is the additional comma in a[0:,2]: Python reads the index expression 0:,2 as a tuple (a slice followed by the value 2), and a list cannot be indexed with a tuple.

This is easily fixed by removing the additional comma as follows:

a= [1,2,3,4,5]
print(a[0:2])

Result:
[1, 2]
Process finished with exit code 0

So why is there a mention of tuples in the error output?

The final piece of the jigsaw is understanding why there is a reference to a tuple in the error output.

If we return to a looking at a list of lists and look at their index values:

a=[[1,2,3],[4,5,6]]
z = [index for index, value in enumerate(a)]
print(z)

Result:
[0, 1]
Process finished with exit code 0

As can be seen, the index values are 0,1, which is correct.

As before removing the comma gives the error we are trying to solve for:

a=[[1,2,3][4,5,6]]
z = [index for index, value in enumerate(a)]
print(z)

Result:
Traceback (most recent call last):
  line 16, in <module>
    a=[[1,2,3][4,5,6]]
TypeError: list indices must be integers or slices, not tuple

BUT the reference to the tuple comes from how Python parses the expression: without the comma between the two bracketed lists, the second [4,5,6] is treated as an index into the first list [1,2,3], and the expression 4,5,6 inside those brackets forms the tuple (4, 5, 6), so Python tries to index the list with a tuple.

This fails because __getitem__ for the built-in list class only accepts integer or slice indices (like the 0 and 1 returned by enumerate above); when it receives a tuple instead, the error is thrown.
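
A quick way to see the tuple being formed is to build the index expression on its own; a minimal sketch:

a = [1, 2, 3, 4, 5]

index = 0, 2          # a bare "0, 2" is already a tuple
print(type(index))    # <class 'tuple'>

try:
    print(a[0:, 2])   # equivalent to a[(slice(0, None), 2)], a tuple index
except TypeError as error:
    print(error)      # list indices must be integers or slices, not tuple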

How to Create an XML file from Excel using Python

Estimated reading time: 3 minutes

Are you working on a data analytics project where you need to feed your data to a location that is able to process an XML file?

The ability to get your data into a structured format like XML has many benefits:

(A) You can transfer the data to a web service for processing.

(B) Multiple different formats of your raw data can be standardised, enabling quick conversion and processing.

(C) XML files can be read by multiple different programs, as long as you deliver them in the correct format.

(D) The receiver of data can easily read the XML file and store it on their database.

The ability to use this method to read and transfer data is a very powerful way to help a data analyst process large data sets.

In fact, if you are using cloud-based applications to analyse the information you are storing, this will quickly enable you to deliver the data.

What packages do I need in Python?

The first step is to import the following:

import pandas as pd
from lxml import etree as et

Next we want to read in the source data

In this instance, we are reading an Excel file

raw_data = pd.read_excel(r'Link to where your data is stored including the full file name')

Now we want to start building the XML structure

The FIRST STEP is to define the root

root = et.Element('root')

The root is the parent of all the data items (tags) contained in the XML file and is needed as part of the structure

The SECOND STEP is to define the tag names that will store each row of the source data

for row in raw_data.iterrows(): # This is a loop that runs through each record and populates each tag.
    root_tags = et.SubElement(root, 'ExportData') #=== > Root name
# These are the tag names for each row (SECTION 1)
    Column_heading_1 = et.SubElement(root_tags, 'Name')
    Column_heading_2 = et.SubElement(root_tags, 'Area')
    Column_heading_3 = et.SubElement(root_tags, 'NoPurchases')
    Column_heading_4 = et.SubElement(root_tags, 'Active')

###These are the values that will be populated for each row above
# The values inside the [] are the raw file column headings.(SECTION 2)
    Column_heading_1.text = str(row[1]['Name'])
    Column_heading_2.text = str(row[1]['Area'])
    Column_heading_3.text = str(row[1]['No Purchases'])
    Column_heading_4.text = str(row[1]['Active'])

The raw file looks like this:

Name      Area       No Purchases  Active
John      Dublin     2             Y
Mary      Galway     3             N
Joe       Limerick   4             N
Jimmy     Kilkenny   55            Y
Jennifer  Belfast    6             N
Susan     Waterford  3             Y
Jake      Cork       1             Y
Bobby     Dundalk    11            N
Sarah     Sligo      9             N
Cian      Ennis      8             Y

Raw file data that will be imported into the XML file

The THIRD STEP is to create the XML file and populate it with the data from the source file

# This Section outputs the data to an xml file
# Unless you tell it otherwise it saves it to the same folder as the script.
tree = et.ElementTree(root) # The variable tree holds all the values of "root"
et.indent(tree, space="\t", level=0) # This just formats the XML so that it is readable
tree.write('output.xml', encoding="utf-8") # The data is saved to an XML file

The XML output should look like the below

<root>
	<ExportData>
		<Name>John</Name>
		<Area>Dublin</Area>
		<NoPurchases>2</NoPurchases>
		<Active>Y</Active>
	</ExportData>
	<ExportData>
		<Name>Mary</Name>
		<Area>Galway</Area>
		<NoPurchases>3</NoPurchases>
		<Active>N</Active>
	</ExportData>
	<ExportData>
		<Name>Joe</Name>
		<Area>Limerick</Area>
		<NoPurchases>4</NoPurchases>
		<Active>N</Active>
	</ExportData>
	<ExportData>
		<Name>Jimmy</Name>
		<Area>Kilkenny</Area>
		<NoPurchases>55</NoPurchases>
		<Active>Y</Active>
	</ExportData>
	<ExportData>
		<Name>Jennifer</Name>
		<Area>Belfast</Area>
		<NoPurchases>6</NoPurchases>
		<Active>N</Active>
	</ExportData>
	<ExportData>
		<Name>Susan</Name>
		<Area>Waterford</Area>
		<NoPurchases>3</NoPurchases>
		<Active>Y</Active>
	</ExportData>
	<ExportData>
		<Name>Jake</Name>
		<Area>Cork</Area>
		<NoPurchases>1</NoPurchases>
		<Active>Y</Active>
	</ExportData>
	<ExportData>
		<Name>Bobby</Name>
		<Area>Dundalk</Area>
		<NoPurchases>11</NoPurchases>
		<Active>N</Active>
	</ExportData>
	<ExportData>
		<Name>Sarah</Name>
		<Area>Sligo</Area>
		<NoPurchases>9</NoPurchases>
		<Active>N</Active>
	</ExportData>
	<ExportData>
		<Name>Cian</Name>
		<Area>Ennis</Area>
		<NoPurchases>8</NoPurchases>
		<Active>Y</Active>
	</ExportData>
</root>

Additional XML data can be added

  1. Add more rows – All you need to do is add to the source file and save. When you rerun the logic, it will read in the extra information.
  2. Add more columns – Go to the second step above and add a tag name to SECTION 1. Separately, you will need to add an additional column with data to the source file, and then add that column name to SECTION 2 as well (see the sketch after this list).
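
As an illustration of the second point, below is a minimal, self-contained sketch with one hypothetical extra column called “Email” added to both SECTION 1 and SECTION 2; the column name and sample row are invented for the example:

import pandas as pd
from lxml import etree as et

# Hypothetical source data with an extra 'Email' column
raw_data = pd.DataFrame({'Name': ['John'], 'Area': ['Dublin'],
                         'No Purchases': [2], 'Active': ['Y'],
                         'Email': ['john@example.com']})

root = et.Element('root')
for row in raw_data.iterrows():
    root_tags = et.SubElement(root, 'ExportData')
    # SECTION 1 - the existing tag names plus the new one
    Column_heading_1 = et.SubElement(root_tags, 'Name')
    Column_heading_2 = et.SubElement(root_tags, 'Area')
    Column_heading_3 = et.SubElement(root_tags, 'NoPurchases')
    Column_heading_4 = et.SubElement(root_tags, 'Active')
    Column_heading_5 = et.SubElement(root_tags, 'Email')  # new tag name
    # SECTION 2 - the existing values plus the new one
    Column_heading_1.text = str(row[1]['Name'])
    Column_heading_2.text = str(row[1]['Area'])
    Column_heading_3.text = str(row[1]['No Purchases'])
    Column_heading_4.text = str(row[1]['Active'])
    Column_heading_5.text = str(row[1]['Email'])  # new value

tree = et.ElementTree(root)
et.indent(tree, space="\t", level=0)
tree.write('output_with_email.xml', encoding="utf-8")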

How to Add Formulas to Excel using Python

Estimated reading time: 3 minutes

You may be working on automating some exports to Excel using Python to compare files or just pure and simple adding formulas to an Excel file before you open it up.

Here we explain adding formulas to your Excel output using Numpy or adding the calculations to specific cells in the output.

Adding formulas to specific cells

First of all, let’s look at the normal spreadsheet with some calculations; these have the formulas typed in. The ultimate objective is to have the Python code do this for us, one less step.

As can be seen, the cells have the formulas in them, but this would be a very time-consuming process if you had to do it multiple times, in multiple spreadsheets.

To get around this we can write the Python logic as follows:

  1. Create three lists and three dataframes as follows.
import numpy as np
import pandas as pd
from openpyxl import load_workbook

datasetA_list = np.array([1,2,3,4,5,6,7,8,9,10])

datasetB_list = np.array([9,8,65,43,3,21,3,2,1,7])

dataset_list = ('sum','average','median','standard deviation','count','correlation')


datasetA = pd.DataFrame(datasetA_list,columns=['ValueA'])
datasetB = pd.DataFrame(datasetB_list,columns=['ValueB'])
dataset_list_calcs = pd.DataFrame(dataset_list, columns=['Calcs'])

2. Next create a path to where you are going to store the data as follows:

path = 'output.xlsx'

3. In this next step create the workbook and location where the data will be stored. This will load the headings created in step 1 to a particular location on the spreadsheet.

workbook = pd.ExcelWriter(path, engine='openpyxl')
workbook.book = load_workbook(path)
workbook.sheets = dict((ws.title,ws) for ws in workbook.book.worksheets)

datasetA.to_excel(workbook,sheet_name="Sheet1", startrow=1,index=False, header=True,)
datasetB.to_excel(workbook,sheet_name="Sheet1", startrow=1, startcol=2,index=False, header=True)
dataset_list_calcs.to_excel(workbook,sheet_name="Sheet1", startrow=1, startcol=4,index=False, header=True)

4. Load the formulas into cells beside their relevant headings. This places each formula beside the relevant heading created in step 1.

###Creating calculations for datasetA

sheet = workbook.sheets['Sheet1']
sheet['E2'] = 'CalcsA'
sheet['F3'] = '=SUM(A3:A12)'
sheet['F4'] = '=AVERAGE(A3:A12)'
sheet['F5'] = '=MEDIAN(A3:A12)'
sheet['F6'] = '=STDEV(A3:A12)'
sheet['F7'] = '=COUNT(A3:A12)'
sheet['F8'] = '=CORREL(A3:A12,C3:C12)'

###Creating calculations for datasetB

sheet = workbook.sheets['Sheet1']
sheet['H2'] = 'CalcsB'
sheet['H3'] = '=SUM(C3:C12)'
sheet['H4'] = '=AVERAGE(C3:C12)'
sheet['H5'] = '=MEDIAN(C3:C12)'
sheet['H6'] = '=STDEV(C3:C12)'
sheet['H7'] = '=COUNT(C3:C12)'
sheet['H8'] = '=CORREL(A3:A12,C3:C12)'

Use Numpy to create the calculations

a. Create the calculations that you will populate into the spreadsheet, using Numpy

a = np.sum(datasetA_list)
b = np.average(datasetA_list)
c = np.median(datasetA_list)
d = np.std(datasetA_list,ddof=1) ## Setting ddof=0 would give a different figure; ddof=1 matches the STDEV formula output above.
f = np.count_nonzero(datasetA_list)
g = np.corrcoef(datasetA_list,datasetB_list)
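
One detail worth noting: np.corrcoef returns a 2x2 correlation matrix rather than a single number, which is why the value is converted with str() before being written to a cell further down. A quick check:

import numpy as np

datasetA_list = np.array([1,2,3,4,5,6,7,8,9,10])
datasetB_list = np.array([9,8,65,43,3,21,3,2,1,7])

g = np.corrcoef(datasetA_list, datasetB_list)
print(g.shape)   # (2, 2) - a full correlation matrix
print(g[0, 1])   # the single correlation coefficient, if that is all you need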

b. Create the headings and assign them to particular cells

sheet['E14'] = 'Numpy Calculations'
sheet['E15'] = 'Sum'
sheet['E16'] = 'Average'
sheet['E17'] = 'Median'
sheet['E18'] = 'Standard Deviation'
sheet['E19'] = 'Count'
sheet['E20'] = 'Correlation'

c. Assign the variables in step a to a set of cells

sheet['F15'] = a
sheet['F16'] = b
sheet['F17'] = c
sheet['F18'] = d
sheet['F19'] = f
sheet['F20'] = str(g)

d. Save the workbook and close it – this step is important and should always be included.

workbook.save()
workbook.close()

And the final output looks like…

How To Compare CSV Files for Differences

Estimated reading time: 5 minutes

Often a lot of data analytics projects involve comparisons, some of it could be to look at data quality problems, other times it could be to check that data you are loading has been saved to the database with no errors.

As a result of this, there is a need to quickly check all your data is correct. But rather than do visual comparisons, wouldn’t it be nice to use Python to quickly resolve this?

Luckily for you in this blog post, we will take you through three ways to quickly get answers, they could be used together or on their own.

Let’s look at the data we want to compare

We have two CSV files, with four columns in them:

The objective here is to compare the two and show the differences in the output.

Import the files to a dataframe.

import pandas as pd
import numpy as np
df1 = pd.read_csv('CSV1.csv')
df2 = pd.read_csv('CSV2.csv')

The above logic is very straightforward: it looks for the files in the same folder as the Python script and imports the data from the CSV files into the respective data frames.

The purpose of this is that the following steps will use these data frames for comparison.

Method 1 – See if the two data frames are equal

The output for this shows differences through a boolean value, in this instance “True” or “False”.


array1 = np.array(df1) # Storing the data in an array allows the comparison below to show the differences.
array2 = np.array(df2)

df_CSV_1 = pd.DataFrame(array1, columns=['No','Film','Year','Length (min)'])
df_CSV_2 = pd.DataFrame(array2, columns=['No','Film','Year','Length (min)'])

df_CSV_1.index += 1 # This resets the index to start at 1 not 0, which helps when reading the output.
df_CSV_2.index += 1

print(df_CSV_1.eq(df_CSV_2).to_string(index=True)) # This shows the differences between the two data frames.

Your output will look like this; as can be seen, rows 3 and 13 contain False. These are the yellow values in the CSV2 file that differ from the CSV1 file values; all other data is the same, which is correct.

The obvious advantage of the below is that you can quickly see what is different and on what line, but not what values are different; we will explore that in the next methods.

        No  Film   Year  Length (min)
1   True  True   True          True
2   True  True   True          True
3   True  True  False          True
4   True  True   True          True
5   True  True   True          True
6   True  True   True          True
7   True  True   True          True
8   True  True   True          True
9   True  True   True          True
10  True  True   True          True
11  True  True   True          True
12  True  True   True          True
13  True  True  False          True
14  True  True   True          True
15  True  True   True          True
16  True  True   True          True
17  True  True   True          True
18  True  True   True          True
19  True  True   True          True
20  True  True   True          True

Method 2 – Find and print the values only that are different

So in the first approach, we could see which lines are different, but not the values in those lines.

In the below code it will again look at the data frames but this time print the values from the CSV1 file that have different values in the CSV2 file.

a = df1[df1.eq(df2).all(axis=1) == False] # This compares the data frames, but only returns the rows from df1 that have a different value in one of the columns of df2.

a.index += 1 # This resets the index to start at 1, not 0, which helps when reading the output.

print(a.to_string(index=False))

As a result, the output from this as expected is:

No        Film  Year  Length (min)
  3    Parasite  2019           132
 13  Midsommar   2019           148

Method 3 – Show your differences and the values that are different

The final way to look for any differences between CSV files is to use some of the above but show where the difference is.

In the below code, the first line compares the two years between the two sets of data, and then applies a true to the column if they match, otherwise a false.


df1['Year_check_to_DF2'] = np.where(df1['Year'] == df2['Year'], 'True', 'False')
df1.index += 1 #resets the index to start from one.

df2_year = df2['Year'] # We create this column to store the df2 year value.
df2_year = pd.Series(df2_year) # A Series is a one-dimensional labelled array capable of holding data of any type.

df1 = df1.assign(df2_year=df2_year.values) # This adds the df2 year value to the df1 data frame.
print(df1.to_string(index=False))

In this instance, this returns the below output. As can be seen, it allows us to complete a line-by-line visual check of what is different.
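
As an aside, if you are on pandas 1.1 or later, the built-in DataFrame.compare method offers another option that shows only the differing cells side by side; a minimal sketch with two small stand-in data frames:

import pandas as pd

# Small stand-ins for the two CSV files loaded earlier
df1 = pd.DataFrame({'No': [1, 2, 3], 'Year': [2019, 2018, 2019]})
df2 = pd.DataFrame({'No': [1, 2, 3], 'Year': [2019, 2018, 2017]})

# 'self' holds the df1 value and 'other' holds the df2 value for each differing cell
print(df1.compare(df2))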

So in summary we have completed a comparison of what is different between files.

There are a number of practical applications of this:

(A) If you are loading data to a database table, the uploaded data can be exported and compared to the original CSV file uploaded.

(B) Data Quality – Export key data to a CSV file and compare.

(C) Data Transformations – Compare two sets of data, and ensure transformations have worked as expected. In this instance, differences are expected, but as long as they are what you programmed for, then the validation has worked.

If you like this blog post, there are others that may interest you:

How to Compare Column Headers in CSV to a List in Python

How to count the no of rows and columns in a CSV file

How to group your data in Tableau

Estimated reading time: 3 minutes

Have you learnt how to connect to your data in Tableau and now want to understand how to group your Tableau data?

Here we go through a number of steps to help you understand better how to approach this, and what benefits it will bring to your data analytics project.

Why would I group in Tableau?

When you are working with large data sets, sometimes it is easier to understand their meaning when similar data items are stored together. Grouping the data has the following benefits:

(A) It allows a quick summary of data, and how large that data set is.

(B) Groupings can also alert you to small subsets of data you may not have been aware of.

(C) Another benefit is that groups containing errors can be highlighted, and fixing them will put those records in with the correct data.

(D) You can visually see groups, and using Tableau’s grouping will then keep them together.

Grouping by using a field in the data pane

The main way to group is in the data pane: right click on the field you want to group by, then click Create Group.

For this example we can choose a number of values within Channel that we want to group by; here we pick all the items that have the value web.

You will notice that even before we click apply, it shows there are some data quality issues: the names are not consistent. You could use this to run metrics that catch these problems and count the number that occur.

When they are fixed then these should not appear anymore.

The output of this appears like this:

And on the screen , with the grouping now assigned, everything for Channel with web in it, is on one area:

Finally, sometimes within your group you may want an “Other” category. The purpose of this is to catch items that don’t fall into the group you have assigned; sometimes they may come into the dataset later as it expands.

You can achieve this as follows:

Giving the output:

So, in summary, grouping can help you identify a number of similar items to keep together, and it is also very useful for tracking data quality items as they arise and are fixed.

What does a data analyst do?

Estimated reading time: 4 minutes

Livestream #2 – What does a data analyst do?

You are probably sitting there hearing about big data and databases, data analytics and machine learning and wonder where a data analyst fits in?

Here we will look to break it down step by step.

Sometimes a data analyst can be confused with a business analyst; there are subtle differences:

  • Business Analyst: Their role is to document the user’s requirements in a document that is descriptive of what the user wants.
    • In this case, a document that all parties can agree to is created, and it can be used as part of the project sign-off.
  • Data Analyst: On the other hand, a data analyst will take the business requirements and translate them into data deliverables.
    • They use the document to ensure the project has the right data to meet the project objectives in the right place at the right time.

Data Mapping

In different data projects, there will be a need to reconcile the data between systems; a data analyst will help here.

In a data mapping exercise, the data analyst will be expected to look at one or more sources and map them to a destination system.

  • This ensures a match between the two datasets.
  • Which results in the ability to reconcile the two systems.
  • Allows the ability to use data in multiple systems, knowing the consistency is in place.
  • Consistency of the data types between the systems.
  • It ensures that data validation errors are kept to a minimum.

Often a Data Analyst will build a traceability matrix, which tracks the data item from creation through to consumption.

Data Quality

In most companies, there will be teams (depending on their size) dedicated to this, and their input will be pivotal to existing and future data use.

Data quality is an important task that could impact internal and external reporting and a company’s ability to make decisions accurately.

Some of the areas that might be looked at include:

(A) Investigate duplicate data – There could be a number of reasons this has to be checked:

  • Data was manually entered multiple times.
  • An automated process ran multiple times.
  • A change to an IT system has unknowingly duplicated data.

(B) Finding errors – This could be completed in conjunction with the data reporting outlined below.

  • Normally companies will clearly have rules that pick up the data errors that are not expected.
  • A data analyst will analyse why these errors are occurring.

(C) Checking for missing data.

  • Data feeds have failed. A request to reload the data will be required.
  • Data that was not requested as part of the business requirements confirm that this is the case.

(D) Enhancing the data with additional information – Is there additional information that can be added that can enrich the dataset?

(E) Checking data is in the correct format – There are scenarios where this can go wrong, and an example is a date field is populated with text.

Data Reporting

In some of the areas above, we touched on the importance of the quality of data.

Ultimately there may be a need to track:

  • Data Quality – Build reports to capture the quality of data based on predefined business measurements.
  • Real-time Reporting – Number of new customers or customers who have left an organisation.
  • Track Targets – Has the target set by the business been met daily, weekly, or monthly?
  • Management Reporting – Build reports that provide input to management packs that provide an overview of how the business performs.

Data Testing

Organisations go through change projects where new data is being introduced or enhanced.

As a result, the data analyst will have a number of tasks to complete:

  • Write Test Scripts – Write all scripts for record counts, transformations and table-to-table comparisons.
  • Datatype Validation – Ensures all new data will be the same as the other data where it is stored.
  • No loss of data – Check all data is imported correctly with no data truncated.
  • Record count – Write an SQL script that completes a source-to-destination reconciliation.
  • Data Transformation – Ensure any transformations are applied correctly.

Supporting data projects

Ad hoc projects are common and sometimes become a priority for businesses as they deal with requirements arising from an immediate business need.

Data analysts will be called upon to support projects where there is a need to ensure the data required is of a standard that meets the project deliverables.

Some common areas where this might occur include:

  • Extract data where it has been found to have been corrupted.
  • Investigate data changes, to analyse where a data breach may have occurred.
  • An external regulatory body has requested information to back up some reports submitted.
  • A customer has requested all the company’s information on them; usually the case for a GDPR request.