Estimated reading time: 3 minutes
On this website, we have posted about how to remove unwanted characters from your data and How to remove characters from an imported CSV file both will show you different approaches.
In this blog posting, we are going to approach the process by using loops within a function. Essentially we are going to pass a list and then we are going to loop through the strings to check the data against it.
Step 1 – Import the data and create a data frame
The first bit of work we need to complete is to load the data. Here we create a dictionary with their respective key-value pairs.
In order to prepare the data to be processed through the function in step 2, we then load it into a data frame.
import pandas as pd
#Create a dataset locally
data = {'Number': ["00&00$000", '111$11111','2222€2222','3333333*33','444444444£','555/55555','666666@666666'],
'Error' : ['0','0','0','0','0','0','0']}
#Create a dataframe and tell it where to get its values from
df = pd.DataFrame (data, columns = ['Number','Error'])
Step 2 – Create the function that checks for invalid data
This is the main piece of logic that gives the output. As you can see there is a list “L” that is fed to the function run.
One thing to note is that *l is passed to the function, as there is more than one value in the list, otherwise the program would not execute properly.
To start off we create a data frame, which extracts using a regular expression the characters we don’t want to have.
Next, we then need to drop a column that is generated with NAN values, as these are not required.
Then we updated the original data fame with the values that we found.
Just in case if there are any NAN values in this updated column “Error”, we remove them on the next line.
The main next is the loop that creates a new column called “Fix”. This holds the values that will be populated into the fix after the data we don’t want is removed and is data cleansed.
The data we do not need is in str.replace.
#Function to loop through the dataset and see if any of the list values below are found and replace them
def run(*l):
#This line extracts all the special characters into a new column
#Using regular expressions it finds values that should not appear
df2 = df['Number'].str.extract('(\D+|^(0)$)') # New dataframe to get extraction we need
print(df2)
df2 = df2.drop(1, axis=1) # Drops the column with NAN in it, not required
df['Error'] = df2[0] # Updates original dataframe with values that need to be removed.
#This line removes anything with a null value
df.dropna(subset=['Error'], inplace=True)
#This line reads in the list and assigns it a value i, to each element of the list.
#Each i value assigned also assigns an index number to the list value.
#The index value is then used to check whether the value associated with it is in the df['Number'] column
#and then replaces if found
for i in l:
df['Fix']= df['Number'].str.replace(i[0],"").str.replace(i[1],"").str.replace(i[2],"").str.replace(i[3],"") \
.str.replace(i[4],"").str.replace(i[5],"").str.replace(i[6],"")
print("Error list to check against")
print(i[0])
print(i[1])
print(i[2])
print(i[3])
print(i[4])
print(i[5])
print(i[6])
print(df)
#This is the list of errors you want to check for
l = ['$','*','€','&','£','/','@']
Step 3 – Run the program
To run this program, we just execute the below code. All this does is read in the list “L” to the function “run” and then the output in step 4 is produced
run(l)
Step 4 – Output
Error list to check against
$
*
€
&
£
/
@
Number Error Fix
0 00&00$000 & 0000000
1 111$11111 $ 11111111
2 2222€2222 € 22222222
3 3333333*33 * 333333333
4 444444444£ £ 444444444
5 555/55555 / 55555555
6 666666@666666 @ 666666666666
Process finished with exit code 0