Skip to content
  • YouTube
  • FaceBook
  • Twitter
  • Instagram

Data Analytics Ireland

Data Analytics and Video Tutorials

  • Home
  • Contact
  • About Us
    • Latest
    • Write for us
    • Learn more information about our website
  • Useful Links
  • Glossary
  • All Categories
  • Faq
  • Livestream
  • Toggle search form
  • how to remove unwanted characters from your data R Programming
  • TypeError: ‘NoneType’ object is not iterable class
  • R Tutorial: How to pass data between functions R Programming
  • Python Tutorial: Pandas groupby columns ( video 2) Python
  • ValueError: cannot convert float NaN to integer Null values
  • planning your machine learning model machine learning
  • how to remove spaces from a string regular expressions
  • How to Create Multiple XML Files From Excel With Python Python

How To Compare CSV Files for Differences

Posted on August 7, 2021October 24, 2022 By admin 4 Comments on How To Compare CSV Files for Differences

Estimated reading time: 5 minutes

Often a lot of data analytics projects involve comparisons, some of it could be to look at data quality problems, other times it could be to check that data you are loading has been saved to the database with no errors.

As a result of this, there is a need to quickly check all your data is correct. But rather r than do visual comparisons, wouldn’t it be nice to use Python to quickly resolve this?

Luckily for you in this blog post, we will take you through three ways to quickly get answers, they could be used together or on their own.

Let’s look at the data we want to compare

We have two CSV files, with four columns in them:

The objective here is to compare the two and show the differences in the output.

Import the files to a dataframe.

import pandas as pd
import numpy as np
df1 = pd.read_csv('CSV1.csv')
df2 = pd.read_csv('CSV2.csv')

The above logic is very straightforward, it looks for the files in the same folder as the python script and looks to import the data from the CSV files to the respective data frames.

The purpose of this is that the following steps will use these data frames for comparison.

Method 1 – See if the two data frames are equal

In the output for this, it shows differences through a boolean value in this instance “True” or “False”.


array1 = np.array(df1) ===> Storing the data in an array will allow the equation below to show the differences.
array2 = np.array(df2)

df_CSV_1 = pd.DataFrame(array1, columns=['No','Film','Year','Length (min)'])
df_CSV_2 = pd.DataFrame(array2, columns=['No','Film','Year','Length (min)'])

df_CSV_1.index += 1 ===> This resets the index to start at 1 not 0, helps with the output when trying to understand the differences.
df_CSV_2.index += 1

df_CSV_1.index += 1 ===> This resets the index to start at 1 not 0, helps with the output when trying to understand the differences.
df_CSV_2.index += 1

print(df_CSV_1.eq(df_CSV_2).to_string(index=True)) ===> This shows the differences between the two arrays.

Your output will look like this, and as can be seen on lines 3 and 13 are false, these are the yellow values in the CSV2 file that are different to the CSV1 file values, all other data is the same which is correct.

The obvious advantage of the below is that you can quickly what is different and on what line, but now what values are different, we will explore that in the next methods.

        No  Film   Year  Length (min)
1   True  True   True          True
2   True  True   True          True
3   True  True  False          True
4   True  True   True          True
5   True  True   True          True
6   True  True   True          True
7   True  True   True          True
8   True  True   True          True
9   True  True   True          True
10  True  True   True          True
11  True  True   True          True
12  True  True   True          True
13  True  True  False          True
14  True  True   True          True
15  True  True   True          True
16  True  True   True          True
17  True  True   True          True
18  True  True   True          True
19  True  True   True          True
20  True  True   True          True

Method 2 – Find and print the values only that are different

So in the first approach, we could see there are differences, but not what lines are different between the two files.

In the below code it will again look at the data frames but this time print the values from the CSV1 file that have different values in the CSV2 file.

a = df1[df1.eq(df2).all(axis=1) == False] ===> This compares the data frames, but only returns the rows from DF1 that have a different value in one of the columns on DF2

a.index += 1 ===>This resets the index to start at 1, not 0, which helps with the output when trying to understand the differences. 

print(a.to_string(index=False))

As a result, the output from this as expected is:

No        Film  Year  Length (min)
  3    Parasite  2019           132
 13  Midsommar   2019           148

Method 3 – Show your differences and the value that are different

The final way to look for any differences between CSV files is to use some of the above but show where the difference is.

In the below code, the first line compares the two years between the two sets of data, and then applies a true to the column if they match, otherwise a false.


df1['Year_check_to_DF2'] = np.where(df1['Year'] == df2['Year'], 'True', 'False')
df1.index += 1 #resets the index to start from one.

df2_year = df2['Year'] ===> We create this column to store the DF2 year value.
df2_year = pd.Series(df2_year) #Series is a one-dimensional labeled array capable of holding data of any type.

df1 = df1.assign(df2_year=df2_year.values) = This adds the DF2 year value to the DF1 data frame
print(df1.to_string(index=False))

In this instance, this returns the below output. As can be seen, it allows us to complete a line visual of what is different.

So in summary we have completed a comparison of what is different between files.

There are a number of practical applications of this:

(A) If you are loading data to a database table, the uploaded data can be exported and compared to the original CSV file uploaded.

(B) Data Quality – Export key data to a CSV file and compare.

(C) Data Transformations – Compare two sets of data, and ensure transformations have worked as expected. In this instance, differences are expected, but as long as they are what you programmed for, then the validation has worked.

If you like this blog post, there are others that may interest you:

How to Compare Column Headers in CSV to a List in Python

How to count the no of rows and columns in a CSV file

CSV, Python, Python Dataframe Tags:compare CSV files, compare data frames, Data, Data Analytics, Python compare CSV values, Python Tutorial

Post navigation

Previous Post: TypeError: the first argument must be callable
Next Post: How to Add Formulas to Excel using Python

Related Posts

  • How to change the headers on a CSV file CSV
  • python constructor self and __init__explained Python
  • What are the reserved keywords in Python Definition
  • Create a HTML Table From Python Using Javascript Javascript
  • how to copy/paste special a range of cells with xlwings Python
  • How Would You Change The Name Of a Key in a Python Dictionary Python

Comments (4) on “How To Compare CSV Files for Differences”

  1. Nosam pavan says:
    November 5, 2022 at 2:43 pm

    I’m getting this error – ValueError: Shape of passed values is (120111, 54), indices imply (120111, 3) .
    I’ve specified selected columns , I tried with all methods but neither of them are working for me . can you please help me here

    Thank you !

    Reply
    1. admin says:
      January 2, 2023 at 4:23 pm

      Hi,

      It looks like what you have passed is mismatched to what the code is expecting, are you able to show the code, so I can look at further?

      Thanks,

      Data analytics Ireland

      Reply
  2. ispep says:
    February 16, 2022 at 2:08 am

    But how do you set it up if the number of rows are different lengths and the data in the rows doesn’t match.

    Reply
    1. admin says:
      February 17, 2022 at 9:58 pm

      Hi,

      One possible way to approach this could be:

      1. For each column sort it ascending or descending.
      2. Compare each column, between the the two dataframes.
      3. Return rows that do not match with their corresponding index value

      Could you use Method 2 from this video for this scenario? It returns the differences only.

      Thanks!

      Data Analytics Ireland

      Reply

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

Select your language!

  • हिंदी
  • Español
  • Português
  • Français
  • Italiano
  • How To Validate Cell Values In Excel Python
  • How To Add Values to a Python Dictionary Python
  • How to use parameters in Python Python
  • TypeError: type object is not subscriptable strings
  • how to use case statements in SQL SQL
  • how to select columns with SQL SQL
  • ValueError: invalid literal for int() with base 10 Value Error
  • How to data cleanse a database table Python Data Cleansing

Copyright © 2023 Data Analytics Ireland.

Powered by PressBook Premium theme

This website uses cookies to improve your experience. We'll assume you're ok with this, but you can opt-out if you wish. Cookie settingsACCEPT
Privacy & Cookies Policy

Privacy Overview

This website uses cookies to improve your experience while you navigate through the website. Out of these cookies, the cookies that are categorized as necessary are stored on your browser as they are essential for the working of basic functionalities of the website. We also use third-party cookies that help us analyze and understand how you use this website. These cookies will be stored in your browser only with your consent. You also have the option to opt-out of these cookies. But opting out of some of these cookies may have an effect on your browsing experience.
Necessary
Always Enabled
Necessary cookies are absolutely essential for the website to function properly. This category only includes cookies that ensures basic functionalities and security features of the website. These cookies do not store any personal information.
Non-necessary
Any cookies that may not be particularly necessary for the website to function and is used specifically to collect user personal data via analytics, ads, other embedded contents are termed as non-necessary cookies. It is mandatory to procure user consent prior to running these cookies on your website.
SAVE & ACCEPT