How to Compare Column Headers in CSV to a List in Python

Estimated reading time: 3 minutes

So you have numerous different automation projects in Python. In order to ensure a clean and smooth straight-through processing, checks need to be made to ensure what was received is in the right format.

Most but not all files used in an automated process will be in the CSV format. It is important there that the column headers in these files are correct so you can process the file correctly.

This ensures a rigorous process that has errors limited.

How to compare the headers

The first step would be to load the data into a Pandas data frame:

import pandas as pd

df = pd.read_csv("csv_import.csv") #===> Include the headers
print(df)

The actual original file is as follows:

Next we need to make sure that we have a list that we can compare to:

header_list = ['Name','Address_1','Address_2','Address_3','Address_4','City','Country']

The next step will allow us to save the headers imported in the file to a variable:

import_headers = df.axes[1] #==> 1 is to identify columns
print(import_headers)

Note that the axis chosen was 1, and this is what Python recognises as the column axes.

Finally we will apply a loop as follows:

a = [i for i in import_headers if i not in header_list]
print(a)

In this loop, the variable “a” is taking the value “i” which represents each value in the import_headers variable and through a loop checks each one against the header_list to see if it is in it.

It then prints out the values not found.

Pulling this all together gives:

import pandas as pd

df = pd.read_csv("csv_import.csv") #===> Include the headers
print(df)

#Expected values to receive in CSV file
header_list = ['Name','Address_1','Address_2','Address_3','Address_4','City','Country']

import_headers = df.axes[1] #==> 1 is to identify columns
print(import_headers)


a = [i for i in import_headers if i not in header_list]
print(a)

Resulting in the following output:

As can be seen the addresses below where found not to be valid, as they where not contained within our check list “header_list”

How to change the headers on a CSV file

Problem statement

You are working away on some data analytics projects and you receive files that have incorrect headings on them. The problem is without opening the file, how can you change the headers on the columns within it?

To start off, lets look at file we want to change , below is a screen shot of the data with its headers contained inside:

So as you can see we have names and addresses. But what it if we want to change the address1 ,address2,address3,address4 to something different?

This could be for a number of reasons:

(A) You are going to use those columns as part of an SQL statement to insert into a database table, so you need to change the headers so that SQL statement won’t fail.

(B) Some other part of your code is using that data, but requires the names to be corrected so that does not fail.

(C) Your organisation has a naming convention that requires all column names to be a particular structure.

(D) All data of a similar type has to be of the same format, so it can be easily identified.

What would be the way to implement this in Python, then?

Below you will see the code I have used for this, looking to keep it simple:

import pandas as pd
#df = pd.read_csv("csv_import.csv",skiprows=1) #==> use to skip first row (header if required)
df = pd.read_csv("csv_import.csv") #===> Include the headers
correct_df = df.copy()
correct_df.rename(columns={'Name': 'Name', 'Address1': 'Address_1','Address2': 'Address_2','Address3': 'Address_3','Address4': 'Address_4'}, inplace=True)
print(correct_df)
#Exporting to CSV file
correct_df.to_csv(r'csv_export', index=False,header=True)

As can be seen there are eight rows in total. The steps are as follows:

  1. Import the CSV file .

2. Make a copy of the dataframe.

3. In the new dataframe, use the rename function to change any of the column headers you require, Address1, Address2, Address3, Address4.

4. Once the updates are completed then re-export the file with the corrected headers to a folder you wish.

As a result of the above steps, the output will appear like this:

And there you go. If you had an automated process, you could incorporate this in to ensure there was no failures on the loading of any data.

Another article that may interest you? How to count the no of rows and columns in a CSV file

How to save data frame changes to a file

Estimated reading time: 2 minutes

Changing a file is the natural step; tracking those changes are just as important.
Change is part and parcel of life, but in the technology world with the complexity  and interdependency of systems, not effectively been able to track what goes on leads to:

  • Countless hours are trying to figure out where it went wrong.
  • You do not understand what needs fixing.
  • Systems/processes that ultimately work seamlessly, slow down unnecessarily.

In data analysis, as the volumes can be quite large, the human cannot feasibly review a set of data and find out where the underlying problem is. Well, they can, but it would take so long, nothing else would get done. This article by Forbes – predictions-about-data-in-2020-and-the-coming-decade predicts the consumption of data will just be getting bigger.

 

Let the script work, see the log of changes in the output.

How do you remove characters from an imported CSV file, looked at some data cleansing techniques, but there was no way of knowing what was changed other than a visual inspection. Here we introduce the data set into a data frame, change some of the values and show the output on the screen. But more importantly, as we progress through these steps, we are saving the changes as we go along.

The reality is that in large corporate settings, visual inspections would take up too much time and resources. An IT solution to help with giving the vital information required will reduce the data errors happening and allow for a more unobstructed view of how the companies data has changed over time.

 

Where does the trail lead to next?

  • Changes made to data needs a clear way of being able to be tracked.
  • How you captured those changes on your systems, needs to be addressed.
  • Implementing better systems will help you have confidence in your data changes.

Showing audit trail of changes

hide a column from a data frame

Estimated reading time: 2 minutes

They say there is nowhere to hide, we disagree!
As an addition to How to add a column to a dataframe would you like to learn to go and hide it?! This video has several steps in it; following each one will give you a good introduction.

To start why you would like to hide a column?

  • You may not want to reveal its output as it is sensitive information.
  • The data in the column is not in the correct format, you will want to repurpose it, so it is the way you want it.
  •  The column could be a calculated column. Hence it serves as an intermediary step before your data frame is output.

Finding the best way to hide unwanted data:

In this video, we introduce several concepts to help not show a column:

  • Specify the actual columns you want to include in the data frame, by default doing this you are excluding the column or columns you don’t want to see.
  •  We use drop, to explicitly tell the data frame not to show a particular column.
  •  Also, we display a scenario whereby you have a calculated column but do not want to show its output, based on one of the reasons outlined above.
  • Finally, the index of the column can appear in the output, so we have shown through set_index how to hide it from what is displayed.

This latest in the Python Dataframe series looks to build on the knowledge in the previous examples. We hope as you learn python online, it will increase your programming skills.

Thanks for watching and don’t forget to like and share through our social media buttons to the right of this page.

Data Analytics Ireland

Python Tutorial: Add a column to a data frame

Estimated reading time: 1 minute

You have learnt addition, now learn how to add a column to a data frame!
 
In our last post on what are Python data frames, we introduced the concept,  but are you now searching how to add a column to a data frame?
 
To start, I was working away and wondering how I could accomplish this, as there were many posts about it.
 
Searching through the jungles of website articles, some topics of interest that gave me ideas whereas follows:
(A)List comprehension
(B) Lambda
(C) Numpy

 

Having tested the waters to see how you can approach:

After working through the above to:

  • Figure out how to use them.
  • Write some code to see how it all comes together.

The best thing to do was to put code into action!

Python for beginners or advanced programmers does not have to be hard!

Remember to subscribe to our channel if you like what we do!

Support is on the way:

TutorialsPoint is an excellent resource if looking to understand some other examples see this post here: TutorialsPoint: Add Column

To see a related post on how to hide a column from a data frame look no further How to hide a column from a data frame

Data Analytics Ireland

 

YouTube channel lists – Python DataFrames

Estimated reading time: 1 minute

Welcome to this new blogging website! We are all about data analytics to have a look at this page here About Data Analytics Ireland

To keep it simple we have created some lists here and on our YouTube Channel

As we progress over the next while, the website will be updated as we go along, and while there may be a  lot of video content, we will look to mix it up with different formats.

We have started with Python Data frames :

We hope you enjoy and don’t forget if you like what we are doing subscribe to our channel!

Data Analytics Ireland