How to show percentage differences between files in Python

Estimated reading time: 5 minutes

In our previous post on how to compare CSV files for differences, we showed how you could see the differences, but what if you wanted to know whether a record was a 100% match or not?

Here we are going to use SequenceMatcher, a Python class from the difflib module that compares two sequences and returns how closely they match as a ratio, which we can express as a percentage.
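As a quick taste of what SequenceMatcher does before we get to the files, the sketch below compares two short strings and prints the match as a percentage (a minimal standalone example, separate from the CSV files used later):

```python
from difflib import SequenceMatcher

# ratio() returns a float between 0 and 1; multiply by 100 for a percentage
score = SequenceMatcher(None, 'Joker', 'Jokers').ratio()
print(round(score * 100, 2))  # 90.91
```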

Let’s look at the code

Import statements and reading in the CSV files. First we import the libraries we need and read in the two CSV files:

Note that we also set up the data frame display settings here, so we can see the full data frame properly further down.

import pandas as pd
import numpy as np
from difflib import SequenceMatcher

#Display settings so the full data frame prints without truncation
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)


#Read in the CSV files
df1 = pd.read_csv('CSV1.csv')
df2 = pd.read_csv('CSV2.csv')

The two CSV files look like this:

CSV1

CSV2

Next, we are going to create an array from each data frame. Converting to arrays lets us rebuild the data with our own column names in the next step, so that the corresponding entries from each file line up index by index and the percentage match can be calculated for each pair.

#Create an array for both dataframes
array1 = np.array(df1)
array2 = np.array(df2)

Our next step is to transfer the arrays back into data frames, change all integer values to strings, and then join both data frames into one.

Converting the values to strings allows them to be iterated over; otherwise you will get an error: TypeError: 'int' object is not iterable
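To see why the conversion matters, this small standalone sketch reproduces the error with integers and shows that strings work:

```python
from difflib import SequenceMatcher

# Passing integers raises the error mentioned above
try:
    SequenceMatcher(None, 2019, 2008).ratio()
    failed = False
except TypeError as err:
    failed = True
    print(err)  # 'int' object is not iterable

# Converting the values to strings first works
ratio = SequenceMatcher(None, str(2019), str(2008)).ratio()
print(ratio)  # 0.5
```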

#Transfer the arrays to a dataframe
df_CSV_1 = pd.DataFrame(array1, columns=['No1','Film1','Year1','Length1'])
df_CSV_2 = pd.DataFrame(array2, columns=['No2','Film2','Year2','Length2'])

#Change all the values to a string, as numbers cannot be iterated over.
df_CSV_1['Year1'] = df_CSV_1['Year1'].astype('str')
df_CSV_2['Year2'] = df_CSV_2['Year2'].astype('str')
df_CSV_1['Length1'] = df_CSV_1['Length1'].astype('str')
df_CSV_2['Length2'] = df_CSV_2['Length2'].astype('str')

#join the dataframes
df = pd.concat([df_CSV_1,df_CSV_2], axis=1)

We are now moving to the main part of the program, which gives us the answers we need. Here we create a function that does the calculations for us:

#Create a function to calculate the differences and show as a ratio.
def create_ratio(df, columna, columnb):
    return SequenceMatcher(None,df[columna],df[columnb]).ratio()

Next, we calculate the differences and format the output:

#Here we use apply, which passes each row of the data frame to the function above.
df['Film_comp'] = df.apply(create_ratio,args=('Film1','Film2'),axis=1)
df['Year_comp'] = df.apply(create_ratio,args=('Year1','Year2'),axis=1)
df['Length_comp'] = df.apply(create_ratio,args=('Length1','Length2'),axis=1)

#This creates the values we are looking for
df['Film_comp'] = round(df['Film_comp'].astype('float'),2)*100
df['Year_comp'] = round(df['Year_comp'].astype('float'),2)*100
df['Length_comp'] = round(df['Length_comp'].astype('float'),2)*100

#Converting to int removes the decimal point added by the float datatype
df['Film_comp'] = df['Film_comp'].astype('int')
df['Year_comp'] = df['Year_comp'].astype('int')
df['Length_comp'] = df['Length_comp'].astype('int')
#Print the output
print(df)

And the final output looks like this:

An explanation of the output

As can be seen, the last three columns show the percentage match obtained, with 100 being an exact match.

For index value 1 there is Joker in the first file, but Jokers is in the second file.

The ratio is calculated as follows:

Joker is length 5 and Jokers is length 6, so there are 11 characters in the comparison.

SequenceMatcher counts the characters the two strings have in common, here the five characters of Joker, doubles that count, and divides by the combined length:

(2 × 5) / 11 × 100 = 90.90

Finally, the round function sets the value we are looking for to 91.

On the same line, we shall compare the year:

2019 and 2008 are a total of eight characters.

Only the leading 20 of each year matches, which is two characters; doubled, this gives us four, and the ratio is as follows:

(2 × 2) / 8 × 100 = 50

For index 20 we also compared the film names, The Dirt and The Dirty. Together they are 17 characters, of which the eight characters of The Dirt match, so the ratio works out at (2 × 8) / 17 = 0.94. Note that with isjunk set to None nothing is treated as junk, so the space is counted like any other character.
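These hand calculations can be checked directly with a short standalone snippet:

```python
from difflib import SequenceMatcher

# ratio() is (2 x M) / T: M matching characters, T the combined length of both strings
pairs = [('Joker', 'Jokers'), ('2019', '2008'), ('The Dirt', 'The Dirty')]
ratios = [round(SequenceMatcher(None, a, b).ratio(), 2) for a, b in pairs]
print(ratios)  # [0.91, 0.5, 0.94]
```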

In order to understand this better I have compiled the below:

Index 1 – Year: 2019 vs 2008 (8 characters in comparison)

Char  Correct spot?  Found?
2     Yes (1)        Yes (1)
0     Yes (1)        Yes (1)
1     No  (0)        No  (0)
9     No  (0)        No  (0)

Match score: 2 + 2 = 4    Ratio: 4 / 8 = 0.50

Index 1 – Film: Joker vs Jokers (11 characters in comparison)

Char  Correct spot?  Found?
J     Yes (1)        Yes (1)
o     Yes (1)        Yes (1)
k     Yes (1)        Yes (1)
e     Yes (1)        Yes (1)
r     Yes (1)        Yes (1)

Match score: 5 + 5 = 10    Ratio: 10 / 11 = 0.91

Index 20 – Film: The Dirt vs The Dirty (17 characters in comparison)

Char     Correct spot?  Found?
T        Yes (1)        Yes (1)
h        Yes (1)        Yes (1)
e        Yes (1)        Yes (1)
(space)  Yes (1)        Yes (1)
D        Yes (1)        Yes (1)
i        Yes (1)        Yes (1)
r        Yes (1)        Yes (1)
t        Yes (1)        Yes (1)

Match score: 8 + 8 = 16    Ratio: 16 / 17 = 0.94

How To Compare CSV Files for Differences

Estimated reading time: 5 minutes

Many data analytics projects involve comparisons: some to look for data quality problems, others to check that data you are loading has been saved to the database without errors.

As a result, there is a need to quickly check that all your data is correct. But rather than doing visual comparisons, wouldn’t it be nice to use Python to resolve this quickly?

Luckily for you, in this blog post we will take you through three ways to quickly get answers; they can be used together or on their own.

Let’s look at the data we want to compare

We have two CSV files, with four columns in them:

The objective here is to compare the two and show the differences in the output.

Import the files to a dataframe.

import pandas as pd
import numpy as np
df1 = pd.read_csv('CSV1.csv')
df2 = pd.read_csv('CSV2.csv')

The above logic is very straightforward: it looks for the files in the same folder as the Python script and imports the data from the CSV files into the respective data frames.

The purpose of this is that the following steps will use these data frames for comparison.

Method 1 – See if the two data frames are equal

The output for this shows the differences as boolean values, in this instance “True” or “False”.


#Storing the data in arrays will allow the comparison below to show the differences
array1 = np.array(df1)
array2 = np.array(df2)

df_CSV_1 = pd.DataFrame(array1, columns=['No','Film','Year','Length (min)'])
df_CSV_2 = pd.DataFrame(array2, columns=['No','Film','Year','Length (min)'])

#Reset the index to start at 1 rather than 0; this helps when reading the output
df_CSV_1.index += 1
df_CSV_2.index += 1

#Show the element-by-element comparison of the two data frames
print(df_CSV_1.eq(df_CSV_2).to_string(index=True))

Your output will look like this. As can be seen, rows 3 and 13 contain False; these correspond to the yellow values in the CSV2 file that differ from the CSV1 file values. All other data is the same, which is correct.

The obvious advantage of the below is that you can quickly see what is different and on what line, but not what the values are; we will explore that in the next methods.

        No  Film   Year  Length (min)
1   True  True   True          True
2   True  True   True          True
3   True  True  False          True
4   True  True   True          True
5   True  True   True          True
6   True  True   True          True
7   True  True   True          True
8   True  True   True          True
9   True  True   True          True
10  True  True   True          True
11  True  True   True          True
12  True  True   True          True
13  True  True  False          True
14  True  True   True          True
15  True  True   True          True
16  True  True   True          True
17  True  True   True          True
18  True  True   True          True
19  True  True   True          True
20  True  True   True          True

Method 2 – Find and print the values only that are different

So in the first approach, we could see where there are differences, but not the actual values that differ between the two files.

The below code will again compare the data frames, but this time print the rows from the CSV1 file that have different values in the CSV2 file.

#Compare the data frames, returning only the rows from df1 that have a different value in one of the columns of df2
a = df1[df1.eq(df2).all(axis=1) == False]

#Reset the index to start at 1 rather than 0; this helps when reading the output
a.index += 1

print(a.to_string(index=False))

As a result, the output from this as expected is:

No        Film  Year  Length (min)
  3    Parasite  2019           132
 13  Midsommar   2019           148

Method 3 – Show the differences and the values that are different

The final way to look for any differences between CSV files is to use some of the above but show where the difference is.

In the below code, the first line compares the years between the two sets of data and applies a True to the new column if they match, otherwise a False.


#Compare the years and mark each row True if they match, otherwise False
df1['Year_check_to_DF2'] = np.where(df1['Year'] == df2['Year'], 'True', 'False')
df1.index += 1 #resets the index to start from one

#Store the DF2 year values so they can be shown alongside DF1
df2_year = df2['Year']
df2_year = pd.Series(df2_year) #a Series is a one-dimensional labelled array capable of holding data of any type

#Add the DF2 year values to the DF1 data frame
df1 = df1.assign(df2_year=df2_year.values)
print(df1.to_string(index=False))

In this instance, this returns the below output. As can be seen, it gives us a line-by-line view of what is different.
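As a self-contained illustration, the same approach can be run on two small hypothetical data frames (these values are examples, not the films data from the post):

```python
import pandas as pd
import numpy as np

# Hypothetical data frames standing in for the two CSV files
df1 = pd.DataFrame({'Film': ['Joker', 'Parasite'], 'Year': [2019, 2019]})
df2 = pd.DataFrame({'Film': ['Joker', 'Parasite'], 'Year': [2019, 2008]})

# Mark each row True if the years match, otherwise False
df1['Year_check_to_DF2'] = np.where(df1['Year'] == df2['Year'], 'True', 'False')

# Add the df2 year alongside for a visual comparison
df1 = df1.assign(df2_year=df2['Year'].values)
print(df1.to_string(index=False))
```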

So, in summary, we have completed a comparison of what is different between two files.

There are a number of practical applications of this:

(A) If you are loading data to a database table, the uploaded data can be exported and compared to the original CSV file uploaded.

(B) Data Quality – Export key data to a CSV file and compare.

(C) Data Transformations – Compare two sets of data, and ensure transformations have worked as expected. In this instance, differences are expected, but as long as they are what you programmed for, then the validation has worked.

If you like this blog post, there are others that may interest you:

How to Compare Column Headers in CSV to a List in Python

How to count the no of rows and columns in a CSV file

How to Compare Column Headers in CSV to a List in Python

Estimated reading time: 3 minutes

So you have numerous different automation projects in Python. To ensure clean and smooth straight-through processing, checks need to be made that what was received is in the right format.

Most, but not all, files used in an automated process will be in the CSV format. It is important, therefore, that the column headers in these files are correct so you can process the file correctly.

This ensures a rigorous process with limited errors.

How to compare the headers

The first step would be to load the data into a Pandas data frame:

import pandas as pd

df = pd.read_csv("csv_import.csv") #===> Include the headers
print(df)

The actual original file is as follows:

Next we need to make sure that we have a list that we can compare to:

header_list = ['Name','Address_1','Address_2','Address_3','Address_4','City','Country']

The next step will allow us to save the headers imported in the file to a variable:

import_headers = df.axes[1] #==> axis 1 refers to the columns
print(import_headers)

Note that the axis chosen was 1, which is what pandas uses for the column axis.

Finally we will apply a loop as follows:

a = [i for i in import_headers if i not in header_list]
print(a)

In this list comprehension, the variable “a” collects every value “i” from import_headers that is not found in header_list.

It then prints out the values not found.

Pulling this all together gives:

import pandas as pd

df = pd.read_csv("csv_import.csv") #===> Include the headers
print(df)

#Expected values to receive in CSV file
header_list = ['Name','Address_1','Address_2','Address_3','Address_4','City','Country']

import_headers = df.axes[1] #==> axis 1 refers to the columns
print(import_headers)


a = [i for i in import_headers if i not in header_list]
print(a)

Resulting in the following output:

As can be seen, the addresses below were found not to be valid, as they were not contained within our check list “header_list”.
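To try the whole header check end to end without a CSV file, here is a standalone sketch; the headers here are hypothetical examples, not the ones from the post:

```python
import pandas as pd

# Hypothetical file headers, two of which do not match the expected list
df = pd.DataFrame(columns=['Name', 'Addr1', 'Addr2', 'City', 'Country'])
header_list = ['Name', 'Address_1', 'Address_2', 'Address_3', 'Address_4', 'City', 'Country']

# Collect every imported header not found in the expected list
import_headers = df.axes[1]
a = [i for i in import_headers if i not in header_list]
print(a)  # ['Addr1', 'Addr2']
```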

Import a CSV file with an SQL query

Estimated reading time: 3 minutes

In many SQL programming software tools, there will most likely be an option for you to import your data through a screen they provide.

In SQL Server Management Studio the below screen is available to pick your file to import, and it quickly guides you through the steps.

But what if you wanted to do this through code, as you may have lots of different files to be loaded at different times?

Below we will take you through the steps.

Check the file you want to import

Below we have created a CSV file, with the relevant headers and data. It is important to have the headers correct as this is what we will use in a step further down.

Create the code that will import the data

-- Drop the existing table, to allow a refreshed copy to be made
drop table sales.dbo.test_import_csv;

-- Recreate the table with the column names and data types
create table sales.dbo.test_import_csv(
customer_name varchar(10),
age int,
place_of_birth varchar(10));

-- Open the file and read the data into the table created above
insert into sales.dbo.test_import_csv
select * FROM
OPENROWSET(BULK 'C:\Users\haugh\OneDrive\dataanalyticsireland\YOUTUBE\SQL\how_to_import_a_csv_file_in_sql\csv_file_import.csv',
formatfile = 'C:\Users\haugh\OneDrive\dataanalyticsireland\YOUTUBE\SQL\how_to_import_a_csv_file_in_sql\csv_file_import_set.txt',
FIRSTROW=2,
FORMAT='CSV'
) as tmp;

There are two important parts to the above code:

(A) OPENROWSET – This enables the connection to the CSV file, reads it, and then inserts the information into the database table.

(B) Formatfile – This is an important part of the code. Its purpose is to tell OPENROWSET how many columns are in the CSV file and how each one is laid out. Its contents are below:
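The original screenshot of the format file is not reproduced here, but a non-XML format file for the three columns above would look roughly like this; the version number, field lengths, and terminators are illustrative assumptions, not the author’s actual file:

```
14.0
3
1  SQLCHAR  0  10  ","     1  customer_name    ""
2  SQLCHAR  0  12  ","     2  age              ""
3  SQLCHAR  0  10  "\r\n"  3  place_of_birth   ""
```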

As can be seen, it outlines each column within the file and its name, matching the exact column name in the header.

The last line also indicates that this is the final column, so the program reads only as far as this.

Running the code and the output

When the code is run, the following is the output:

Eleven rows were inserted; this is the number of rows in the CSV file, excluding the header.

The database table was created:

And finally, the table was populated with the data:

So, in summary, we have demonstrated how easily a CSV file can be imported with the above steps.

In essence you could incorporate this into an automated process.

How to count the no of rows and columns in a CSV file

So you are working on a number of different data analytics projects, and as part of some of them, you are bringing data in from a CSV file.

One area you may want to look at is How to Compare Column Headers in CSV to a List in Python, which could be coupled with the outputs of this post.

As part of the process if you are manipulating this data, you need to ensure that all of it was loaded without failure.

With this in mind, we will look to help you with a possible automation task to ensure that:

(A) All rows and columns are totalled on loading of a CSV file.

(B) As part of the process, if the same dataset is exported, the total on the export can be counted.

(C) This ensures that all the required table rows and columns are always available.

Python Code that will help you with this

So in the below code, there are a number of things to look at.

Lets look at the CSV file we will read in:

In total there are ten rows with data. The top row is not included in the count as it is deemed a header row. There are also seven columns.

This first bit just reads in the data, and it automatically skips the header row.

import pandas as pd

df = pd.read_csv("csv_import.csv") #===> reads in all the rows, but skips the first one as it is a header.



Next, it creates two variables that count the number of rows and columns and prints them out.

Note that it uses df.axes, which returns the row and column labels, so we can count the labels rather than looking at the individual cells.

total_rows=len(df.axes[0]) #===> Axes of 0 is for a row
total_cols=len(df.axes[1]) #===> Axes of 1 is for a column
print("Number of Rows: "+str(total_rows))
print("Number of Columns: "+str(total_cols))

And bringing it all together

import pandas as pd

df = pd.read_csv("csv_import.csv") #===> reads in all the rows, but skips the first one as it is a header.

total_rows=len(df.axes[0]) #===> Axes of 0 is for a row
total_cols=len(df.axes[1]) #===> Axes of 1 is for a column
print("Number of Rows: "+str(total_rows))
print("Number of Columns: "+str(total_cols))

Output:
Number of Rows: 10
Number of Columns: 7
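As a side note, df.shape returns the same two counts in one call. This small standalone check, using a hypothetical in-memory frame rather than the CSV from the post, confirms the two approaches agree:

```python
import pandas as pd

# A hypothetical frame standing in for the imported CSV: 10 rows, 2 columns
df = pd.DataFrame({'a': range(10), 'b': range(10)})

total_rows = len(df.axes[0])  # axis 0 holds the row labels
total_cols = len(df.axes[1])  # axis 1 holds the column labels
print((total_rows, total_cols))              # (10, 2)
print((total_rows, total_cols) == df.shape)  # True
```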

In summary, this would be very useful if you are trying to reduce the amount of manual effort in checking the population of a file.

As a result it would help with:

(A) Checking that scripts which process data don’t remove rows or columns unnecessarily.

(B) Batch runs that know the size of a dataset in advance of processing can make sure they have the data they need.

(C) Control logs – databases can store this data to show that what was processed is correct.

(D) Where an automated run has to be paused, this can help with identifying the problem and then fixing it quickly.

(E) Finally, if you are receiving agreed data from a third party, it can be used to alert them if too much or too little information was received.

Here is another post you should read!

How to change the headers on a CSV file

How to change the headers on a CSV file

Problem statement

You are working away on some data analytics projects and you receive files that have incorrect headings on them. The problem is without opening the file, how can you change the headers on the columns within it?

To start off, let’s look at the file we want to change. Below is a screenshot of the data with its headers contained inside:

So as you can see, we have names and addresses. But what if we want to change address1, address2, address3, and address4 to something different?

This could be for a number of reasons:

(A) You are going to use those columns as part of an SQL statement to insert into a database table, so you need to change the headers so that SQL statement won’t fail.

(B) Some other part of your code is using that data, but requires the names to be corrected so that does not fail.

(C) Your organisation has a naming convention that requires all column names to be a particular structure.

(D) All data of a similar type has to be of the same format, so it can be easily identified.

What would be the way to implement this in Python, then?

Below you will see the code I have used for this, looking to keep it simple:

import pandas as pd
#df = pd.read_csv("csv_import.csv",skiprows=1) #==> use to skip the first row (header) if required
df = pd.read_csv("csv_import.csv") #===> Include the headers
correct_df = df.copy()
#Rename the address columns; 'Name' is already correct, so it is left out of the mapping
correct_df.rename(columns={'Address1': 'Address_1','Address2': 'Address_2','Address3': 'Address_3','Address4': 'Address_4'}, inplace=True)
print(correct_df)
#Exporting to CSV file
correct_df.to_csv(r'csv_export', index=False, header=True)

As can be seen, only a few lines of code are needed. The steps are as follows:

1. Import the CSV file.

2. Make a copy of the dataframe.

3. In the new dataframe, use the rename function to change any of the column headers you require: Address1, Address2, Address3, Address4.

4. Once the updates are completed, re-export the file with the corrected headers to a folder of your choice.

As a result of the above steps, the output will appear like this:

And there you go. If you had an automated process, you could incorporate this to ensure there were no failures when loading any data.

Another article that may interest you? How to count the no of rows and columns in a CSV file