Problem statement
You are working away on some data analytics projects and you receive files that have incorrect headings on them. The problem is without opening the file, how can you change the headers on the columns within it?
To start off, lets look at file we want to change , below is a screen shot of the data with its headers contained inside:
So as you can see we have names and addresses. But what it if we want to change the address1 ,address2,address3,address4 to something different?
This could be for a number of reasons:
(A) You are going to use those columns as part of an SQL statement to insert into a database table, so you need to change the headers so that SQL statement won’t fail.
(B) Some other part of your code is using that data, but requires the names to be corrected so that does not fail.
(C) Your organisation has a naming convention that requires all column names to be a particular structure.
(D) All data of a similar type has to be of the same format, so it can be easily identified.
What would be the way to implement this in Python, then?
Below you will see the code I have used for this, looking to keep it simple:
import pandas as pd
#df = pd.read_csv("csv_import.csv",skiprows=1) #==> use to skip first row (header if required)
df = pd.read_csv("csv_import.csv") #===> Include the headers
correct_df = df.copy()
correct_df.rename(columns={'Name': 'Name', 'Address1': 'Address_1','Address2': 'Address_2','Address3': 'Address_3','Address4': 'Address_4'}, inplace=True)
print(correct_df)
#Exporting to CSV file
correct_df.to_csv(r'csv_export', index=False,header=True)
As can be seen there are eight rows in total. The steps are as follows:
- Import the CSV file .
2. Make a copy of the dataframe.
3. In the new dataframe, use the rename function to change any of the column headers you require, Address1, Address2, Address3, Address4.
4. Once the updates are completed then re-export the file with the corrected headers to a folder you wish.
As a result of the above steps, the output will appear like this:
And there you go. If you had an automated process, you could incorporate this in to ensure there was no failures on the loading of any data.
Another article that may interest you? How to count the no of rows and columns in a CSV file