How To Check For Unwanted Characters Using Loops With Python

Estimated reading time: 3 minutes

On this website, we have posted about how to remove unwanted characters from your data and How to remove characters from an imported CSV file both will show you different approaches.

In this blog posting, we are going to approach the process by using loops within a function. Essentially we are going to pass a list and then we are going to loop through the strings to check the data against it.

Step 1 – Import the data and create a data frame

The first bit of work we need to complete is to load the data. Here we create a dictionary with their respective key-value pairs.

In order to prepare the data to be processed through the function in step 2, we then load it into a data frame.

import pandas as pd
#Create a dataset locally
data = {'Number':  ["00&00$000", '111$11111','2222€2222','3333333*33','444444444£','555/55555','666666@666666'],
        'Error' : ['0','0','0','0','0','0','0']}

#Create a dataframe and tell it where to get its values from
df = pd.DataFrame (data, columns = ['Number','Error'])

Step 2 – Create the function that checks for invalid data

This is the main piece of logic that gives the output. As you can see there is a list “L” that is fed to the function run.

One thing to note is that *l is passed to the function, as there is more than one value in the list, otherwise the program would not execute properly.

To start off we create a data frame, which extracts using a regular expression the characters we don’t want to have.

Next, we then need to drop a column that is generated with NAN values, as these are not required.

Then we updated the original data fame with the values that we found.

Just in case if there are any NAN values in this updated column “Error”, we remove them on the next line.

The main next is the loop that creates a new column called “Fix”. This holds the values that will be populated into the fix after the data we don’t want is removed and is data cleansed.

The data we do not need is in str.replace.

#Function to loop through the dataset and see if any of the list values below are found and replace them
def run(*l):
    #This line extracts all the special characters into a new column
    #Using regular expressions it finds values that should not appear
    df2 = df['Number'].str.extract('(\D+|^(0)$)') # New dataframe to get extraction we need
    print(df2)
    df2 = df2.drop(1, axis=1) # Drops the column with NAN in it, not required

    df['Error'] = df2[0] # Updates original dataframe with values that need to be removed.
    #This line removes anything with a null value
    df.dropna(subset=['Error'], inplace=True)
    #This line reads in the list and assigns it a value i, to each element of the list.
    #Each i value assigned also assigns an index number to the list value.
    #The index value is then used to check whether the value associated with it is in the df['Number'] column 
    #and then replaces if found
    for i in l:
        df['Fix']= df['Number'].str.replace(i[0],"").str.replace(i[1],"").str.replace(i[2],"").str.replace(i[3],"") \
        .str.replace(i[4],"").str.replace(i[5],"").str.replace(i[6],"")
        print("Error list to check against")
        print(i[0])
        print(i[1])
        print(i[2])
        print(i[3])
        print(i[4])
        print(i[5])
        print(i[6])
    print(df)

#This is the list of errors you want to check for
l = ['$','*','€','&','£','/','@']

Step 3 – Run the program

To run this program, we just execute the below code. All this does is read in the list “L” to the function “run” and then the output in step 4 is produced

run(l)

Step 4 – Output

Error list to check against
$
*
€
&
£
/
@
          Number Error           Fix
0      00&00$000     &       0000000
1      111$11111     $      11111111
2      2222€2222     €      22222222
3     3333333*33     *     333333333
4     444444444£     £     444444444
5      555/55555     /      55555555
6  666666@666666     @  666666666666

Process finished with exit code 0

ValueError: cannot convert float NaN to integer

Estimated reading time: 2 minutes

Sometimes in your data analytics project, you will be working with float data types and integers, but the value NaN may also appear, which will give you headaches if you don’t know how to fix the problem at hand.

A NaN is defined as “Not a Number” and represents missing values in the data. If you are familiar with SQL, it is a similar concept to NULLS.

So how does this error occur in Python?

Let’s look at some logic below:

NaN =float('NaN')
print(type(NaN))
print(NaN)

Result:
<class 'float'>
nan

As can be seen, we have a variable called ‘NaN’, and it is a data type ‘Float’

One of the characteristics of NaN is that it is a special floating-point value and cannot be converted to any other type than float; thus, when you look at the example below, it shows us this exactly and why you would get the error message we are trying to solve.

NaN =float('NaN')
print(type(NaN))
print(NaN)

a= int(NaN)

print(a)

Result:

Traceback (most recent call last):
  File "ValueError_cannot_convert_float_NaN_to_integer.py", line 5, in <module>
    a= int(NaN)
ValueError: cannot convert float NaN to integer

In the variable ‘a’ we are trying to make that an integer number from the NaN variable, which, as we know, is a floating-point value and cannot be converted to any other type than float.

How do we fix this problem?

The easiest way to fix this is to change the ‘NaN’ actual value to an integer as per the below:

NaN =float(1)
print(type(NaN))
print(NaN)

a= int(NaN)

print(a)
print(type(a))

Result:
<class 'float'>
1.0
1
<class 'int'>

So, in summary, if you come across this error:

  1. Check to see if you have any ‘Nan’ values in your data.
  2. If you do replace them with an integer value, or a value that you need for your project, that should solve your problem.