How to show percentage differences between files in Python

Estimated reading time: 5 minutes

In our previous post on how to compare CSV files for differences, we showed how to see the differences, but what if you wanted to know how closely a record matched, expressed as a percentage?

Here we are going to use SequenceMatcher, a Python class from the standard library's difflib module that compares two sequences and returns how closely they match as a ratio.
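
If you want a quick feel for it before working with whole files, SequenceMatcher can be tried on its own with two short strings (this standalone snippet is separate from the script below):

from difflib import SequenceMatcher

#Compare two strings; ratio() returns a match score between 0 and 1
print(SequenceMatcher(None, 'Joker', 'Jokers').ratio())
#0.9090909090909091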

Let’s look at the code

Import statements and read in the CSV files. Here we import the libraries we need and read in the two CSV files we will use.

Note that we also set up the data frame display settings here, so we can see the full data frame when we print it further down.

import pandas as pd
import numpy as np
from difflib import SequenceMatcher
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)


#Read in the CSV files
df1 = pd.read_csv('CSV1.csv')
df2 = pd.read_csv('CSV2.csv')

The two CSV files, CSV1 and CSV2, each contain film records with a number, film name, year, and length.

Next, we are going to convert each data frame to a NumPy array. This strips the original column headers and gives the rows a purely positional structure, so the records in each file can be lined up index by index and the percentage difference calculated.

#Create an array for both dataframes
array1 = np.array(df1)
array2 = np.array(df2)

Our next step is to transfer the arrays back into data frames, change all integer values to strings, and then join both data frames into one.

Converting the values to strings allows them to be iterated over; otherwise you will get an error: TypeError: 'int' object is not iterable

#Transfer the arrays to a dataframe
df_CSV_1 = pd.DataFrame(array1, columns=['No1','Film1','Year1','Length1'])
df_CSV_2 = pd.DataFrame(array2, columns=['No2','Film2','Year2','Length2'])

#Change all the values to a string, as numbers cannot be iterated over.
df_CSV_1['Year1'] = df_CSV_1['Year1'].astype('str')
df_CSV_2['Year2'] = df_CSV_2['Year2'].astype('str')
df_CSV_1['Length1'] = df_CSV_1['Length1'].astype('str')
df_CSV_2['Length2'] = df_CSV_2['Length2'].astype('str')

#join the dataframes
df = pd.concat([df_CSV_1,df_CSV_2], axis=1)

We are now moving to the main part of the program, which gives us the answers we need. Here we create a function that does the calculations for us:

#Create a function to calculate the differences and show as a ratio.
def create_ratio(df, columna, columnb):
    return SequenceMatcher(None,df[columna],df[columnb]).ratio()

Next, we calculate the differences and format the output:

#Here we use apply, which passes each row of the data frame to the function above.
df['Film_comp'] = df.apply(create_ratio,args=('Film1','Film2'),axis=1)
df['Year_comp'] = df.apply(create_ratio,args=('Year1','Year2'),axis=1)
df['Length_comp'] = df.apply(create_ratio,args=('Length1','Length2'),axis=1)

#This creates the percentage values we are looking for, rounded to whole numbers
df['Film_comp'] = (df['Film_comp'].astype('float') * 100).round().astype('int')
df['Year_comp'] = (df['Year_comp'].astype('float') * 100).round().astype('int')
df['Length_comp'] = (df['Length_comp'].astype('float') * 100).round().astype('int')
#Print the output
print(df)

The final output shows the joined data with the three comparison columns added.

An explanation of the output

As can be seen, the last three columns hold the percentage match obtained, with 100 being an exact match.

For index value 1 there is Joker in the first file, but Jokers is in the second file.

The ratio is calculated as 2*M/T, where T is the total number of characters across both strings and M is the number of characters they have in common.

Joker is length 5 and Jokers is length 6, so T = 11 characters in total.

Iterating through the sequences, the logic finds that all 5 characters of Joker are matched in Jokers, so M = 5, and as a result it calculates the following:

(2 * 5 / 11) * 100 = 90.90

Finally, the round function sets the value we are looking for to 91.

On the same line, we shall compare the year:

2019 and 2008 are a total of eight characters, so T = 8.

In the sequence, only the leading '20' of each year matches, so M = 2, and the ratio is as follows:

(2 * 2 / 8) * 100 = 50
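
If you want to verify these figures yourself, get_matching_blocks (also part of difflib's SequenceMatcher) exposes the matching runs that feed into the ratio; this small check is separate from the main script:

from difflib import SequenceMatcher

m = SequenceMatcher(None, '2019', '2008')
#get_matching_blocks() lists the matching runs; here only '20' matches
matches = sum(block.size for block in m.get_matching_blocks())
print(matches)                #2
print(2 * matches / (4 + 4))  #0.5, the same value ratio() returns
print(m.ratio())              #0.5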

For index 20 we also compared the film names, The Dirt and The Dirty. In total there are 17 characters (the space counts too), and all 8 characters of The Dirt are found in The Dirty, so the ratio is (2 * 8 / 17) * 100 = 94.12, which rounds to 94.

In order to understand this better, the three comparisons above can be summarised as follows:

Comparison               M (matching characters)   T (total characters)   Ratio (2*M/T)
2019 vs 2008             2 ('20')                  8                      4/8 = 0.50
Joker vs Jokers          5 ('Joker')               11                     10/11 = 0.91
The Dirt vs The Dirty    8 ('The Dirt')            17                     16/17 = 0.94

How to Create an XML file from Excel using Python

Estimated reading time: 3 minutes

Are you working on a data analytics project where you need to feed your data to a location that is able to process an XML file?

The ability to get your data into a structured format like XML has many benefits:

(A) You can transfer the data to a web service for processing.

(B) Multiple different formats of your raw data can be standardised, enabling quick conversion and processing.

(C) XML files can be read by multiple different programs, as long as you deliver them in the correct format.

(D) The receiver of data can easily read the XML file and store it on their database.

The ability to use this method to read and transfer data is a very powerful way to help a data analyst process large data sets.

In fact, if you are using cloud-based applications to analyse the information you are storing, this will quickly enable you to deliver the data.

What packages do I need in Python?

The first step is to import the following:

import pandas as pd
from lxml import etree as et

Next, we want to read in the source data.

In this instance, we are reading an Excel file:

raw_data = pd.read_excel(r'Link to where your data is stored including the full file name')

Now we want to start building the XML structure

The FIRST STEP is to define the root

root = et.Element('root')

The root is the parent of all the data items (tags) contained in the XML file and is needed as part of the structure

The SECOND STEP is to define the tag names that will store each row of the source data

#This loop runs through each record in the source data and populates the tags for each row
for idx, row in raw_data.iterrows():
    root_tags = et.SubElement(root, 'ExportData') #Root name for each record
# These are the tag names for each row (SECTION 1)
    Column_heading_1 = et.SubElement(root_tags, 'Name')
    Column_heading_2 = et.SubElement(root_tags, 'Area')
    Column_heading_3 = et.SubElement(root_tags, 'NoPurchases')
    Column_heading_4 = et.SubElement(root_tags, 'Active')

###These are the values that will be populated for each row above
# The values inside the [] are the raw file column headings. (SECTION 2)
    Column_heading_1.text = str(row['Name'])
    Column_heading_2.text = str(row['Area'])
    Column_heading_3.text = str(row['No Purchases'])
    Column_heading_4.text = str(row['Active'])

The raw file looks like this:

Name       Area        No Purchases   Active
John       Dublin      2              Y
Mary       Galway      3              N
Joe        Limerick    4              N
Jimmy      Kilkenny    55             Y
Jennifer   Belfast     6              N
Susan      Waterford   3              Y
Jake       Cork        1              Y
Bobby      Dundalk     11             N
Sarah      Sligo       9              N
Cian       Ennis       8              Y

Raw file data that will be imported into the XML file

The THIRD STEP is to create the XML file and populate it with the data from the source file

# This section outputs the data to an XML file.
# Unless you tell it otherwise, it saves to the same folder as the script.
tree = et.ElementTree(root)                 #tree holds all the values of "root"
et.indent(tree, space="\t", level=0)        #formats the XML so it is readable
tree.write('output.xml', encoding="utf-8")  #the data is saved to an XML file

The XML output should look like the below

<root>
	<ExportData>
		<Name>John</Name>
		<Area>Dublin</Area>
		<NoPurchases>2</NoPurchases>
		<Active>Y</Active>
	</ExportData>
	<ExportData>
		<Name>Mary</Name>
		<Area>Galway</Area>
		<NoPurchases>3</NoPurchases>
		<Active>N</Active>
	</ExportData>
	<ExportData>
		<Name>Joe</Name>
		<Area>Limerick</Area>
		<NoPurchases>4</NoPurchases>
		<Active>N</Active>
	</ExportData>
	<ExportData>
		<Name>Jimmy</Name>
		<Area>Kilkenny</Area>
		<NoPurchases>55</NoPurchases>
		<Active>Y</Active>
	</ExportData>
	<ExportData>
		<Name>Jennifer</Name>
		<Area>Belfast</Area>
		<NoPurchases>6</NoPurchases>
		<Active>N</Active>
	</ExportData>
	<ExportData>
		<Name>Susan</Name>
		<Area>Waterford</Area>
		<NoPurchases>3</NoPurchases>
		<Active>Y</Active>
	</ExportData>
	<ExportData>
		<Name>Jake</Name>
		<Area>Cork</Area>
		<NoPurchases>1</NoPurchases>
		<Active>Y</Active>
	</ExportData>
	<ExportData>
		<Name>Bobby</Name>
		<Area>Dundalk</Area>
		<NoPurchases>11</NoPurchases>
		<Active>N</Active>
	</ExportData>
	<ExportData>
		<Name>Sarah</Name>
		<Area>Sligo</Area>
		<NoPurchases>9</NoPurchases>
		<Active>N</Active>
	</ExportData>
	<ExportData>
		<Name>Cian</Name>
		<Area>Ennis</Area>
		<NoPurchases>8</NoPurchases>
		<Active>Y</Active>
	</ExportData>
</root>

Additional XML data can be added

  1. Add more rows – all you need to do is add rows to the source file and save it. When you rerun the logic, it will read in the extra information.
  2. Add more columns – go to the SECOND STEP above and add a tag name to SECTION 1. Separately, add the new column with its data to the source file, and then add that column name to SECTION 2 as well; a short example follows below.
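
For example, if the source file gained a column called City (a made-up column name, purely for illustration), the two additions inside the loop would look like this:

#SECTION 1 - create the new tag for each row
    Column_heading_5 = et.SubElement(root_tags, 'City')

#SECTION 2 - populate it from the new source column
    Column_heading_5.text = str(row['City'])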

How to check if a file is empty

Estimated reading time: 2 minutes

Ever wondered how to go about checking if a file is empty?

A problem you may come across in data analytics, when importing a file as outlined in this post Python – How to import data from files, is how to know whether a file is empty before import.

In the world of data, there are several reasons to check:

  • You have an automated process relying on the import not being empty.
  • A process that preceded you receiving the file did not work.
  • Investigating the problem after the fact costs undue time and effort to fix.

The nuts and bolts of it all

Here we have a video that looks at different scenarios on how to bring in files. The following functionality appears in this video:

  • os.path.getsize – this gets the size of the file. *Please see the note below.
  • pd.read_csv
  • pd.read_excel

The add-on bits

*One note about os.path.getsize, which we found:

  • It only works in the logic provided if the size of the file is zero KB.
  • CSV and XLSX files, even though they were created empty, had a file size greater than zero once saved.
  • TXT files, when created empty and saved, had a file size of zero.
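
Putting that together, here is a minimal sketch of this kind of check (myfile.csv is a hypothetical file name for illustration):

import os
import pandas as pd

file_path = 'myfile.csv'  #hypothetical file name

#A zero-byte file can be caught before any parsing is attempted
if os.path.getsize(file_path) == 0:
    print('File is empty')
else:
    #A CSV or XLSX file can have a size greater than zero yet hold no data,
    #so catch pandas' EmptyDataError as a second line of defence
    try:
        df = pd.read_csv(file_path)
        print(df.head())
    except pd.errors.EmptyDataError:
        print('File has no data to parse')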

 

Hope this video helps explain further how empty files can be checked in Python before they are processed.

Thanks!

Data Analytics Ireland

How to remove characters from an imported CSV file

Estimated reading time: 5 minutes

Removing unwanted errors from your data, the easy way.

The process of importing data can take many formats, but have you been looking for a video on how to do this? Even better, are you looking for a video that shows you how to import a CSV file and then cleanse the data, removing any unwanted characters?

As a follow-up to Python – how do I remove unwanted characters, which focused on cleansing data created within the code, this video runs through several options to open a CSV file, find the unwanted characters, remove them from the dataset, and then return the cleansed data.

How to get in amongst the problem data:

The approach here looked at three different scenarios:

(A) Using a variable that is equal to an open function, and then reading the variable to a data frame.

(B) Using a “with” statement and an open function together, and returning the output to a data frame.

(C) Using read_csv to quickly and efficiently read in the CSV file and then cleanse the dataset.

Some minor bumps in the road that need some thought

There were some challenges with this that you will see in the video:

  • Options A & B had to deploy additional code just to get the data frame the way we wanted.
  • The additional lines of code made for more complexity if you were trying to manage it.

In the end, read_csv was probably the best way to approach this; it required less code and would have been easier to manage in the longer run.

As always thanks for watching, please subscribe to our YouTube channel, and follow us on social media!

Data Analytics Ireland

The code used in this video is here:

##########   How to import a  CSV dataset and remove unwanted characters ###########


#The characters to check for; each must be a string so the list can be looped through
import pandas as pd

l = ['$','*','€','&','£','/','@']
print("Error list we will be checking against in all three scenarios")
print(l)

############ Scenario 1  - Using the Open Function #####################

#Can't use engine='python' here, as it is an invalid keyword argument for open()
my_file = open("C:/dataanalyticsireland/Python Code/youtube/codeused/files/importandcleandatacsv.csv", "r").read().split('\n')
print("Senario 1 printed as a dataframe")
df = pd.DataFrame(my_file)
df.columns = ['Temp']
df[['Number','Error']] = df['Temp'].str.split(',',expand=True)
#Drop the temporary column; columns= targets columns rather than rows.
df = df.drop(columns='Temp')
#Dropping the first row as it has the headers in it.
df = df.drop(0)
#Resetting the index to start at value 0
df = df.reset_index(drop=True)
display(df)


def run_scenario1(*l):
    #Using a regular expression, extract any values that should not appear into a new column
    df['Error'] = df['Number'].str.extract(r'(\D+|^0$)', expand=False)
    #Keep only the rows where an error value was found
    df.dropna(subset=['Error'], inplace=True)
    #Loop through each character in the error list and strip it from the Number column.
    #regex=False makes sure characters like '$' and '*' are treated literally.
    df['Fix'] = df['Number']
    for i in l:
        for ch in i:
            df['Fix'] = df['Fix'].str.replace(ch, "", regex=False)
    print("Scenario 1 output with fixes")
    display(df)


run_scenario1(l)



############ Scenario 2  - Using a with statement #####################

#Once Python exits the with block, it closes the file
#Can't use engine='python' here, as it is an invalid keyword argument for open()
with open("C:/dataanalyticsireland/Python Code/youtube/codeused/files/importandcleandatacsv.csv") as f:
    emptylist=[]
    for line in f:
        emptylist.append(line)
df1 = pd.DataFrame(emptylist)
df1.columns = ['Temp']
df1[['Number','Error']] = df1['Temp'].str.split(',',expand=True)
#Drop the temporary column; columns= targets columns rather than rows.
df1 = df1.drop(columns='Temp')
#Dropping the first row as it has the headers in it.
df1 = df1.drop(0)
#Resetting the index to start at value 0
df1 = df1.reset_index(drop=True)
#Replacing the Error field value that was created above with a value of 0.
df1['Error'] = df1['Error'].str.replace('0\n','0')
print("Scenario 2 printed as a dataframe")
display(df1)  
        

#Function to loop through the dataset, find any of the list values and remove them
def run_scenario2(*l):
    #Using a regular expression, extract any values that should not appear into a new column
    df1['Error'] = df1['Number'].str.extract(r'(\D+|^0$)', expand=False)
    #Keep only the rows where an error value was found
    df1.dropna(subset=['Error'], inplace=True)
    #Loop through each character in the error list and strip it from the Number column.
    #regex=False makes sure characters like '$' and '*' are treated literally.
    df1['Fix'] = df1['Number']
    for i in l:
        for ch in i:
            df1['Fix'] = df1['Fix'].str.replace(ch, "", regex=False)
    print("Scenario 2 output with fixes")
    display(df1)


run_scenario2(l)


############ Scenario 3  - Using read CSV #####################

#Using engine='python' removes any need for the additional code used above to display the dataframe correctly
df2 = pd.read_csv("C:/dataanalyticsireland/Python Code/youtube/codeused/files/importandcleandatacsv.csv",engine ='python')
print("Scenario 3 printed as a dataframe")
display(df2)


def run_scenario3(*l):
    #Using a regular expression, extract any values that should not appear into a new column
    df2['Error'] = df2['Number'].str.extract(r'(\D+|^0$)', expand=False)
    #Keep only the rows where an error value was found
    df2.dropna(subset=['Error'], inplace=True)
    #Loop through each character in the error list and strip it from the Number column.
    #regex=False makes sure characters like '$' and '*' are treated literally.
    df2['Fix'] = df2['Number']
    for i in l:
        for ch in i:
            df2['Fix'] = df2['Fix'].str.replace(ch, "", regex=False)
    print("Scenario 3 output with fixes")
    display(df2)


run_scenario3(l)

YouTube channel lists – Python working with files

Have you ever wondered how to work with files and get access to them and look at their contents?

In today’s environment, there is a frequent need to work with files: import them, do some work on them, and then either save them somewhere or pass them onwards for further processing.

Working with files can happen in several ways:

  • You import them from a local or network drive.
  • The files are transmitted electronically to be stored on a destination server and then processed.
  • They could be received via email and need to be extracted and processed.

The videos in this data analytics series will hopefully help to explain the different concepts and ways of working with files that you may need to address, or write a program to handle.

https://www.youtube.com/embed?listType=playlist&list=PL2nlwZUZ5tFLTwTqKHi8cHY8ZH34HQSJM

We hope you like this content; please subscribe through the social media buttons on the page if you want to hear more from us!

Data Analytics Ireland

 

Python Tutorial: How to import data from files

Estimated reading time: 1 minute

Do you need to quickly open files and import the data into a data frame?

In this post and video on Python, we will look at several options for you to do this as well as some additional things to consider.

The import of files covered here is as follows, with a short sketch after the list:

  • Reading data from a CSV file.
  • Reading data from a TXT file.
  • Reading data from an XLSX file.
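
A minimal sketch of all three (data.csv, data.txt and data.xlsx are placeholder file names for your own data):

import pandas as pd

df_csv = pd.read_csv('data.csv')                  #comma-separated file
df_txt = pd.read_csv('data.txt', delimiter='\t')  #tab-delimited text file
df_xlsx = pd.read_excel('data.xlsx')              #Excel workbook (requires openpyxl)

print(df_csv.head())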

On importing, there are many things to consider; here are a few:

(A) The file format.

(B) How the data looks within the file.

(C) Special requirements to get the data looking correct when loaded.

This file-importing example covers tab delimiters, headers, and sorting. If you are looking for alternatives, the standard library documentation on CSV File Reading and Writing approaches things a little differently.

Thanks for watching, please follow us by clicking on the links to the right!

Need to check if a file is empty? Have a look here Python – How to check if a file is empty

Thanks!

Data Analytics Ireland