Estimated reading time: 5 minutes
Removal of unwanted errors in your data, the easy way.
The process of importing data can take many formats, but have you been looking for a video on how you do this? Even better are you looking for a video that shows you how to import a CSV file and then data cleanse it, effectively removing any unwanted characters?
As a follow up to Python – how do I remove unwanted characters, that video focused on data cleansing the data created within the code, this video runs through several options to open a CSV file, find the unwanted characters, remove the unwanted characters from the dataset and then return the cleansed data.
How to get in amongst the problem data:
The approach here looked at three different scenarios:
(A) Using a variable that is equal to an open function, and then reading the variable to a data frame.
(B)Using a “with statement” and an open function together, and returning the output to a data frame.
(C) Using read_csv to quickly and efficiently read in the CSV file and then cleanse the dataset.
Some minor bumps in the road that need some thought
There where some challenges with this that you will see in the video:
- Options A & B had to deploy additional code just to get the data frame the way we wanted.
- The additional lines of code made for more complexity if you were trying to manage it.
In the end, read_csv was probably the best way to approach this; it required less code and would have been easier to manage in the longer run.
As always thanks for watching, please subscribe to our YouTube channel, and follow us on social media!
Data Analytics Ireland
The code used in this video is here:
########## How to import a CSV dataset and remove unwanted characters ###########
#Numbers need to be put between "" to be looped through
import pandas as pd
l = ['$','*','€','&','£','/','@']
print("Error list we will be checking against in all three scenarios")
print(l)
############ Scenario 1 - Using the Open Function #####################
#can't use engine ='python' here , as it is an invalid keyword argument for open()
my_file = open("C:/dataanalyticsireland/Python Code/youtube/codeused/files/importandcleandatacsv.csv", "r").read().split('\n')
print("Senario 1 printed as a dataframe")
df = pd.DataFrame(my_file)
df.columns = ['Temp']
df[['Number','Error']] = df['Temp'].str.split(',',expand=True)
#Drop the column Temp ,0 for rows and 1 for columns.
df = df.drop('Temp', 1)
#Dropping the first row as it has the headers in it.
df = df.drop(0)
#Resetting the index to start at value 0
df = df.reset_index(drop=True)
display(df)
def run_scenario1(*l):
#This line extracts all the special characters into a new column
#Using regular expressions it finds values that should not appear
df['Error'] = df['Number'].str.extract('(\D+|^(0)$)')
#This line removes anything with a null value
df.dropna(subset=['Error'], inplace=True)
#This line reads in the list and assigns it a value i, to each element of the list.
#Each i value assigned also assigns an index number to the list value.
#The index value is then used to check whether the value associated with it is in the df['Number'] column
#and then replaces if found
for i in l:
df['Fix']= df['Number'].str.replace(i[0],"").str.replace(i[1],"").str.replace(i[2],"").str.replace(i[3],"") \
.str.replace(i[4],"").str.replace(i[5],"").str.replace(i[6],"")
#print("Error list to check against")
#print(i[0])
#print(i[1])
#print(i[2])
#print(i[3])
#print(i[4])
#print(i[5])
#print(i[6])
print("Scenario 1 output with fixes")
display(df)
run_scenario1(l)
############ Scenario 2 - Using a with statement #####################
#Once Python exits the with block, it closes the file
#can't use engine ='python' here , as it is an invalid keyword argument for open()
with open("C:/dataanalyticsireland/Python Code/youtube/codeused/files/importandcleandatacsv.csv") as f:
emptylist=[]
for line in f:
emptylist.append(line)
df1 = pd.DataFrame(emptylist)
df1.columns = ['Temp']
df1[['Number','Error']] = df1['Temp'].str.split(',',expand=True)
#Drop the column Temp ,0 for rows and 1 for columns.
df1 = df1.drop('Temp', 1)
#Dropping the first row as it has the headers in it.
df1 = df1.drop(0)
#Resetting the index to start at value 0
df1 = df1.reset_index(drop=True)
#Replacing the Error field value that was created above witha value of 0.
df1['Error'] = df1['Error'].str.replace('0\n','0')
print("Scenario 2 printed as a dataframe")
display(df1)
#Function to loop through the dataset and see if any of the list values below are found and replace them
def run_scenario2(*l):
#This line extracts all the special characters into a new column
#Using regular expressions it finds values that should not appear
df1['Error'] = df1['Number'].str.extract('(\D+|^(0)$)')
#This line removes anything with a null value
df1.dropna(subset=['Error'], inplace=True)
#This line reads in the list and assigns it a value i, to each element of the list.
#Each i value assigned also assigns an index number to the list value.
#The index value is then used to check whether the value associated with it is in the df['Number'] column
#and then replaces if found
for i in l:
df1['Fix']= df1['Number'].str.replace(i[0],"").str.replace(i[1],"").str.replace(i[2],"").str.replace(i[3],"") \
.str.replace(i[4],"").str.replace(i[5],"").str.replace(i[6],"")
#print("Error list to check against")
#print(i[0])
#print(i[1])
#print(i[2])
#print(i[3])
#print(i[4])
#print(i[5])
#print(i[6])
print("Scenario 2 output with fixes")
display(df1)
run_scenario2(l)
############ Scenario 3 - Using read CSV #####################
#putting in engine ='python' removes any need to do additional code to display in a dataframe correctly unlike above
df2 = pd.read_csv("C:/dataanalyticsireland/Python Code/youtube/codeused/files/importandcleandatacsv.csv",engine ='python')
print("Senario 3 printed as a dataframe")
display(df2)
def run_scenario2(*l):
#This line extracts all the special characters into a new column
#Using regular expressions it finds values that should not appear
df2['Error'] = df2['Number'].str.extract('(\D+|^(0)$)')
#This line removes anything with a null value
df1.dropna(subset=['Error'], inplace=True)
#This line reads in the list and assigns it a value i, to each element of the list.
#Each i value assigned also assigns an index number to the list value.
#The index value is then used to check whether the value associated with it is in the df['Number'] column
#and then replaces if found
for i in l:
df2['Fix']= df1['Number'].str.replace(i[0],"").str.replace(i[1],"").str.replace(i[2],"").str.replace(i[3],"") \
.str.replace(i[4],"").str.replace(i[5],"").str.replace(i[6],"")
print("Scenario 3 output with fixes")
display(df2)
run_scenario2(l)