How To Check For Unwanted Characters Using Loops With Python

Estimated reading time: 3 minutes

On this website, we have posted about how to remove unwanted characters from your data and How to remove characters from an imported CSV file both will show you different approaches.

In this blog posting, we are going to approach the process by using loops within a function. Essentially we are going to pass a list and then we are going to loop through the strings to check the data against it.

Step 1 – Import the data and create a data frame

The first bit of work we need to complete is to load the data. Here we create a dictionary with their respective key-value pairs.

In order to prepare the data to be processed through the function in step 2, we then load it into a data frame.

import pandas as pd
#Create a dataset locally
data = {'Number':  ["00&00$000", '111$11111','2222€2222','3333333*33','444444444£','555/55555','666666@666666'],
        'Error' : ['0','0','0','0','0','0','0']}

#Create a dataframe and tell it where to get its values from
df = pd.DataFrame (data, columns = ['Number','Error'])

Step 2 – Create the function that checks for invalid data

This is the main piece of logic that gives the output. As you can see there is a list “L” that is fed to the function run.

One thing to note is that *l is passed to the function, as there is more than one value in the list, otherwise the program would not execute properly.

To start off we create a data frame, which extracts using a regular expression the characters we don’t want to have.

Next, we then need to drop a column that is generated with NAN values, as these are not required.

Then we updated the original data fame with the values that we found.

Just in case if there are any NAN values in this updated column “Error”, we remove them on the next line.

The main next is the loop that creates a new column called “Fix”. This holds the values that will be populated into the fix after the data we don’t want is removed and is data cleansed.

The data we do not need is in str.replace.

#Function to loop through the dataset and see if any of the list values below are found and replace them
def run(*l):
    #This line extracts all the special characters into a new column
    #Using regular expressions it finds values that should not appear
    df2 = df['Number'].str.extract('(\D+|^(0)$)') # New dataframe to get extraction we need
    print(df2)
    df2 = df2.drop(1, axis=1) # Drops the column with NAN in it, not required

    df['Error'] = df2[0] # Updates original dataframe with values that need to be removed.
    #This line removes anything with a null value
    df.dropna(subset=['Error'], inplace=True)
    #This line reads in the list and assigns it a value i, to each element of the list.
    #Each i value assigned also assigns an index number to the list value.
    #The index value is then used to check whether the value associated with it is in the df['Number'] column 
    #and then replaces if found
    for i in l:
        df['Fix']= df['Number'].str.replace(i[0],"").str.replace(i[1],"").str.replace(i[2],"").str.replace(i[3],"") \
        .str.replace(i[4],"").str.replace(i[5],"").str.replace(i[6],"")
        print("Error list to check against")
        print(i[0])
        print(i[1])
        print(i[2])
        print(i[3])
        print(i[4])
        print(i[5])
        print(i[6])
    print(df)

#This is the list of errors you want to check for
l = ['$','*','€','&','£','/','@']

Step 3 – Run the program

To run this program, we just execute the below code. All this does is read in the list “L” to the function “run” and then the output in step 4 is produced

run(l)

Step 4 – Output

Error list to check against
$
*
€
&
£
/
@
          Number Error           Fix
0      00&00$000     &       0000000
1      111$11111     $      11111111
2      2222€2222     €      22222222
3     3333333*33     *     333333333
4     444444444£     £     444444444
5      555/55555     /      55555555
6  666666@666666     @  666666666666

Process finished with exit code 0

TypeError object of type ‘int’ has no len()

I have seen this data type error come up numerous times while working on my data analytics projects, and recently decided to investigate further. On initial inspection, it can seem a bit of a funny one, but in actual fact, it is quite straight forward.

Lets break it down and see what is going on

So in the below code, there are a number of things:

On line 1 we have a variable that is an integer. If we think about this logically, something that is a single numeric number cannot have a length.

An integer by design is purely to count up a number of apples or no of people, it cannot be viewed as having a length as it is descriptive of the number of occurrences of an object.

data = 100
print(type(data))
print(len(data))

Output Error:
<class 'int'>
Traceback (most recent call last):
  File "object of type int.py", line 3, in <module>
    print(len(data))
TypeError: object of type 'int' has no len()

So for it to in anyway allow a length to be calculated, the object needs to be one of the following data types:

  • List
  • String
  • Tuple
  • Dictionary

Opposite to an integer, these are datatypes that have values that would be more appropriate to having values that a length can be calculated on.

data = "100"
print(type(data))
print("Length of string is: ", len(data))

data = [100,200,300]
print(type(data))
print("Length of list is: ", len(data))

data = (100,200,300)
print(type(data))
print("Length of tuple is: ", len(data))

data = {"Age": 1, "Name": 2}
print(type(data))
print("Length of dictionary is: ", len(data))

And the output is:
<class 'str'>
Length of string is:  3
<class 'list'>
Length of list is:  3
<class 'tuple'>
Length of tuple is:  3
<class 'dict'>
Length of dictionary is:  2

In summary, to understand this error and fix it:

An integer describes the number of things that exist for an object, they are actually not the actual object in existence.

Anything that can have a length method applied to it actually exists and can be counted. In the above four examples, they are actually values that you could describe as existing as you can count each one of them.

The explanation here hopefully clears up the matter, if you have any questions leave a comment and I will answer for you!