How to show percentage differences between files in Python

Estimated reading time: 5 minutes

In our previous post on how to compare CSV files for differences we showed how you could see the differences, but what if you wanted to see if a record was 100% matched or not?

Here we are going to use SequenceMatcher which is a python class that allows you to compare files and return if they are matched as a percentage.

Lets look at the code

Import statements and read in the CSV files. Here we do the usual of importing the libraries we need and read in the two CSV files we use:

Note we also set up the data frame settings here as well, so we can see the data frame properly. See further down.

import pandas as pd
import numpy as np
import math
from difflib import SequenceMatcher
pd.set_option('max_columns', None)
pd.set_option('display.width', None)


#Read in the CSV files
df1 = pd.read_csv('CSV1.csv')
df2 = pd.read_csv('CSV2.csv')

The two CSV files look like this:

CSV1

CSV2

Next, we are going to create an array. Creating an array allows the comparison to be completed as it creates indexes, and the indexes in each array can be compared and then the percentage difference can be calculated.

#Create an array for both dataframes
array1 = np.array(df1)
array2 = np.array(df2)

Our next step is to transfer the arrays to a data frame, change all integer values to a string, then join both data frames into one.

In this instance the changing of values into a string allows those values to be iterated over, otherwise, you will get an error ” TypeError: ‘int’ object is not iterable

#Transfer the arrays to a dataframe
df_CSV_1 = pd.DataFrame(array1, columns=['No1','Film1','Year1','Length1'])
df_CSV_2 = pd.DataFrame(array2, columns=['No2','Film2','Year2','Length2'])

#Change all the values to a string, as numbers cannot be iterated over.
df_CSV_1['Year1'] = df_CSV_1['Year1'].astype('str')
df_CSV_2['Year2'] = df_CSV_2['Year2'].astype('str')
df_CSV_1['Length1'] = df_CSV_1['Length1'].astype('str')
df_CSV_2['Length2'] = df_CSV_2['Length2'].astype('str')

#join the dataframes
df = pd.concat([df_CSV_1,df_CSV_2], axis=1)

We are now moving to the main part of the program, which gives us the answers we need. Here we create a function that does the calculations for us:

#Create a function to calculate the differences and show as a ratio.
def create_ratio(df, columna, columnb):
    return SequenceMatcher(None,df[columna],df[columnb]).ratio()

Next we calculate the differences and format the output

#Here we use apply which will pull in the data that needs to be passed to the fuction above.
df['Film_comp'] = df.apply(create_ratio,args=('Film1','Film2'),axis=1)
df['Year_comp'] = df.apply(create_ratio,args=('Year1','Year2'),axis=1)
df['Length_comp'] = df.apply(create_ratio,args=('Length1','Length2'),axis=1)

#This creates the values we are looking for
df['Film_comp'] = round(df['Film_comp'].astype('float'),2)*100
df['Year_comp'] = round(df['Year_comp'].astype('float'),2)*100
df['Length_comp'] = round(df['Length_comp'].astype('float'),2)*100

#this removes the decimal point that is added as a result of using the datatype 'Float'
df['Film_comp'] = df['Film_comp'].astype('int')
df['Year_comp'] = df['Year_comp'].astype('int')
df['Length_comp'] = df['Length_comp'].astype('int')
#Print the output
print(df)

And the final output looks like this:

An explanation of the output

As can be seen, the last three columns are the percentages of the match obtained, 100 been an exact match.

For index value 1 there is Joker in the first file, but Jokers is in the second file.

The ratio is calculated as follows:

Joker is length 5, Jokers is length 6 = 11 characters in length

So the logic looks at the sequence, iterating through the sequence it can see that the first 10 characters are in the same order, but the 11th character is not, as a result, it calculates the following:

(10/11) * 100 = 90.90

Finally, the round function sets the value we are looking for to 91.

On the same line, we shall compare the year:

2019 and 2008 are a total of eight characters.

In the sequence, the first two of each match, and as they are also found, gives us four, and the ratio is as follows:

4/8 *100 = 50

For Index 20 we also compared the film name, in total there are 17 characters, but the program ignores what they call junk, so the space is not included, for that reason the ratio only calculates over sixteen characters.

In order to understand this better I have compiled the below:

Index 1Index 1
201920088JokerJokers11
2Correct Spot1JCorrect Spot1
0Correct Spot1oCorrect Spot1
1Incorrect spot0KCorrect Spot1
9Incorrect spot0eCorrect Spot1
rCorrect Spot1
2Found1JFound1
0Found1oFound1
1Found0KFound1
9Found0eFound1
rFound1
Total in Comparison8Total in Comparison11
Ratio0.50Ratio0.91
Index 20
The DirtThe Dirty16
TCorrect Spot1
HCorrect Spot1
ECorrect Spot1
Correct Spot1
TFound1
HFound1
EFound1
DCorrect Spot1
ICorrect Spot1
RCorrect Spot1
TCorrect Spot1
DFound1
IFound1
RFound1
TFound1
Total in Comparison16
Ratio0.94

How to Compare Column Headers in CSV to a List in Python

Estimated reading time: 3 minutes

So you have numerous different automation projects in Python. In order to ensure a clean and smooth straight-through processing, checks need to be made to ensure what was received is in the right format.

Most but not all files used in an automated process will be in the CSV format. It is important there that the column headers in these files are correct so you can process the file correctly.

This ensures a rigorous process that has errors limited.

How to compare the headers

The first step would be to load the data into a Pandas data frame:

import pandas as pd

df = pd.read_csv("csv_import.csv") #===> Include the headers
print(df)

The actual original file is as follows:

Next we need to make sure that we have a list that we can compare to:

header_list = ['Name','Address_1','Address_2','Address_3','Address_4','City','Country']

The next step will allow us to save the headers imported in the file to a variable:

import_headers = df.axes[1] #==> 1 is to identify columns
print(import_headers)

Note that the axis chosen was 1, and this is what Python recognises as the column axes.

Finally we will apply a loop as follows:

a = [i for i in import_headers if i not in header_list]
print(a)

In this loop, the variable “a” is taking the value “i” which represents each value in the import_headers variable and through a loop checks each one against the header_list to see if it is in it.

It then prints out the values not found.

Pulling this all together gives:

import pandas as pd

df = pd.read_csv("csv_import.csv") #===> Include the headers
print(df)

#Expected values to receive in CSV file
header_list = ['Name','Address_1','Address_2','Address_3','Address_4','City','Country']

import_headers = df.axes[1] #==> 1 is to identify columns
print(import_headers)


a = [i for i in import_headers if i not in header_list]
print(a)

Resulting in the following output:

As can be seen the addresses below where found not to be valid, as they where not contained within our check list “header_list”