How To Check For Unwanted Characters Using Loops With Python

Estimated reading time: 3 minutes

On this website, we have posted about how to remove unwanted characters from your data and How to remove characters from an imported CSV file both will show you different approaches.

In this blog posting, we are going to approach the process by using loops within a function. Essentially we are going to pass a list and then we are going to loop through the strings to check the data against it.

Step 1 – Import the data and create a data frame

The first bit of work we need to complete is to load the data. Here we create a dictionary with their respective key-value pairs.

In order to prepare the data to be processed through the function in step 2, we then load it into a data frame.

import pandas as pd
#Create a dataset locally
data = {'Number':  ["00&00$000", '111$11111','2222€2222','3333333*33','444444444£','555/55555','666666@666666'],
        'Error' : ['0','0','0','0','0','0','0']}

#Create a dataframe and tell it where to get its values from
df = pd.DataFrame (data, columns = ['Number','Error'])

Step 2 – Create the function that checks for invalid data

This is the main piece of logic that gives the output. As you can see there is a list “L” that is fed to the function run.

One thing to note is that *l is passed to the function, as there is more than one value in the list, otherwise the program would not execute properly.

To start off we create a data frame, which extracts using a regular expression the characters we don’t want to have.

Next, we then need to drop a column that is generated with NAN values, as these are not required.

Then we updated the original data fame with the values that we found.

Just in case if there are any NAN values in this updated column “Error”, we remove them on the next line.

The main next is the loop that creates a new column called “Fix”. This holds the values that will be populated into the fix after the data we don’t want is removed and is data cleansed.

The data we do not need is in str.replace.

#Function to loop through the dataset and see if any of the list values below are found and replace them
def run(*l):
    #This line extracts all the special characters into a new column
    #Using regular expressions it finds values that should not appear
    df2 = df['Number'].str.extract('(\D+|^(0)$)') # New dataframe to get extraction we need
    print(df2)
    df2 = df2.drop(1, axis=1) # Drops the column with NAN in it, not required

    df['Error'] = df2[0] # Updates original dataframe with values that need to be removed.
    #This line removes anything with a null value
    df.dropna(subset=['Error'], inplace=True)
    #This line reads in the list and assigns it a value i, to each element of the list.
    #Each i value assigned also assigns an index number to the list value.
    #The index value is then used to check whether the value associated with it is in the df['Number'] column 
    #and then replaces if found
    for i in l:
        df['Fix']= df['Number'].str.replace(i[0],"").str.replace(i[1],"").str.replace(i[2],"").str.replace(i[3],"") \
        .str.replace(i[4],"").str.replace(i[5],"").str.replace(i[6],"")
        print("Error list to check against")
        print(i[0])
        print(i[1])
        print(i[2])
        print(i[3])
        print(i[4])
        print(i[5])
        print(i[6])
    print(df)

#This is the list of errors you want to check for
l = ['$','*','€','&','£','/','@']

Step 3 – Run the program

To run this program, we just execute the below code. All this does is read in the list “L” to the function “run” and then the output in step 4 is produced

run(l)

Step 4 – Output

Error list to check against
$
*
€
&
£
/
@
          Number Error           Fix
0      00&00$000     &       0000000
1      111$11111     $      11111111
2      2222€2222     €      22222222
3     3333333*33     *     333333333
4     444444444£     £     444444444
5      555/55555     /      55555555
6  666666@666666     @  666666666666

Process finished with exit code 0

ValueError: pattern contains no capture groups

Estimated reading time: 2 minutes

In Python, there are a number of re-occurring value errors that you will come across.

In this particular error it is usually related to when you are running regular expressions as part of a pattern search.

So how does the problem occur?

In the below, the aim of the code is to purely create a data frame, that can then be searchable.

To search the data frame we will use str.extract

import pandas as pd
rawdata = [['Joe', 'Jim'],
           ['Jane', 'Jennifer'],
           ['Ann','Alison']]
datavalue = pd.DataFrame(data=rawdata, columns=['A', 'B'])

We then add the below code to complete the extract of the string “Joe”.

a = datavalue['A'].str.extract('Joe')
print(a)

But it gives the below error, what we are trying to solve for:

ValueError: pattern contains no capture groups
Process finished with exit code 1

But why did the error occur , and how can we fix it?

In essence when you try to complete a str.extract, the value you are looking for should be enclosed in brackets i.e ()

In the above, it views ‘Joe’ as an incorrect value to be passed into the str.extract function, and returns the error.

So to fix this problem, we would change this line to:

a = datavalue['A'].str.extract('(Joe)')

As a result the program runs without error, and returns the below result:

     0
0  Joe
1  NaN
2  NaN

The full corrected code to be used is then:

import pandas as pd
rawdata = [['Joe', 'Jim'],
           ['Jane', 'Jennifer'],
           ['Ann','Alison']]
datavalue = pd.DataFrame(data=rawdata, columns=['A', 'B'])
a = datavalue['A'].str.extract('(Joe)')
print(a)

Regular expressions python

Estimated reading time: 3 minutes

Regular expressions explained

Regular expressions are a set of characters usually in a particular sequence that helps find a match/pattern for a specific piece of data in a dataset.

The purpose is to allow a uniform of set characters that can be reused multiple times, based on the requirements of the user, without having to build each time.

The patterns are similar to those that you would find in Perl.

How are regular expressions built?

To start, in regular expressions, there are metacharacters, which are characters that have a special meaning. Their values are as follows:

. ^ $ * + ? { } [ ] \ | ( )

.e = All occurrences which have one “e”, and value before that e. There can be multiple e, eg ..e means check two characters before e.

^ =Check if a string starts with a particular pattern.

*  = Match zero or more occurrences of a pattern, at least one of the characters can be found.

+ = Looks to match exact patterns, one or more times, and if they are not precisely equal, then nothing is returned.

? =Check if a string after ? exists in a pattern and returns it. If a value before the ? is directly beside the value after ? then returns both values.

—> e.g. t?e is the search pattern. “The” is the string. The result will return only the value e, but if the string is “te”, then it will return te, as the letters are directly beside each other.

da{2} = Check to see if a character has a set of other characters following it. E.g. sees if d has two “a” following it.

[abc] = These are the characters you are looking for in the data. Could also use [a-c] and will give you the same result. Change to uppercase to get only those with uppercase.

\ = Denoting a backslash used to escape all metacharacters, so if they need to be found in a string, they can be. Used to escape $ in a string so they can be found as a literal value.

| = This is used when you want an “or” operator in the logic, i.e. check for one or more values from a pattern, either or both can be present.

() = Looks to group pattern searches or a partial match, to see if they are together or not.

 

Special sequences, making it easier again

\a = Matches if the specified characters are at the start of the string been searched.

\b = Matches if the specified characters are at the beginning or the end of the string been searched.

\B = Matches if the specified characters are NOT at the beginning or the end of the string been searched.

\d = Matches any digits 0-9.

\D = Matches any character is not a digit.

\s = Matches where a string contains a whitespace character.

\S = Matches where a string contains a non-whitespace character.

\w = Matches if digits or character or _ found

\W = Matches if non-digits and or characters or _found

\z = matches if the specified characters are at the end of the string.

 

 

For further references and reading materials, please see the below websites, the last one is really useful in testing any regular expressions you would like to build:

See further reading material here: regular expression RE explained

Another complementary page to the link above regular expression REGEX explained

I found this link on the internet, and would thoroughly recommend you bookmark it. It will also allow you to play around with regular expressions and test them before you put into your code, a very recommended resource Testing regular expressions