How To Check For Unwanted Characters Using Loops With Python

Estimated reading time: 3 minutes

On this website, we have posted about how to remove unwanted characters from your data and How to remove characters from an imported CSV file both will show you different approaches.

In this blog posting, we are going to approach the process by using loops within a function. Essentially we are going to pass a list and then we are going to loop through the strings to check the data against it.

Step 1 – Import the data and create a data frame

The first bit of work we need to complete is to load the data. Here we create a dictionary with their respective key-value pairs.

In order to prepare the data to be processed through the function in step 2, we then load it into a data frame.

import pandas as pd
#Create a dataset locally
data = {'Number':  ["00&00$000", '111$11111','2222€2222','3333333*33','444444444£','555/55555','666666@666666'],
        'Error' : ['0','0','0','0','0','0','0']}

#Create a dataframe and tell it where to get its values from
df = pd.DataFrame (data, columns = ['Number','Error'])

Step 2 – Create the function that checks for invalid data

This is the main piece of logic that gives the output. As you can see there is a list “L” that is fed to the function run.

One thing to note is that *l is passed to the function, as there is more than one value in the list, otherwise the program would not execute properly.

To start off we create a data frame, which extracts using a regular expression the characters we don’t want to have.

Next, we then need to drop a column that is generated with NAN values, as these are not required.

Then we updated the original data fame with the values that we found.

Just in case if there are any NAN values in this updated column “Error”, we remove them on the next line.

The main next is the loop that creates a new column called “Fix”. This holds the values that will be populated into the fix after the data we don’t want is removed and is data cleansed.

The data we do not need is in str.replace.

#Function to loop through the dataset and see if any of the list values below are found and replace them
def run(*l):
    #This line extracts all the special characters into a new column
    #Using regular expressions it finds values that should not appear
    df2 = df['Number'].str.extract('(\D+|^(0)$)') # New dataframe to get extraction we need
    print(df2)
    df2 = df2.drop(1, axis=1) # Drops the column with NAN in it, not required

    df['Error'] = df2[0] # Updates original dataframe with values that need to be removed.
    #This line removes anything with a null value
    df.dropna(subset=['Error'], inplace=True)
    #This line reads in the list and assigns it a value i, to each element of the list.
    #Each i value assigned also assigns an index number to the list value.
    #The index value is then used to check whether the value associated with it is in the df['Number'] column 
    #and then replaces if found
    for i in l:
        df['Fix']= df['Number'].str.replace(i[0],"").str.replace(i[1],"").str.replace(i[2],"").str.replace(i[3],"") \
        .str.replace(i[4],"").str.replace(i[5],"").str.replace(i[6],"")
        print("Error list to check against")
        print(i[0])
        print(i[1])
        print(i[2])
        print(i[3])
        print(i[4])
        print(i[5])
        print(i[6])
    print(df)

#This is the list of errors you want to check for
l = ['$','*','€','&','£','/','@']

Step 3 – Run the program

To run this program, we just execute the below code. All this does is read in the list “L” to the function “run” and then the output in step 4 is produced

run(l)

Step 4 – Output

Error list to check against
$
*
€
&
£
/
@
          Number Error           Fix
0      00&00$000     &       0000000
1      111$11111     $      11111111
2      2222€2222     €      22222222
3     3333333*33     *     333333333
4     444444444£     £     444444444
5      555/55555     /      55555555
6  666666@666666     @  666666666666

Process finished with exit code 0

how do I merge two dictionaries in Python?

Python dictionaries which are used in data analytics frequently and by their nature are enclosed in {} and have key:value pairs, so the data in them can be retrieved easily.

There maybe a scenario where you need to merge two dictionaries, but how would you acheive this?

The good thing is that Python dictionaries are unordered and mutable, meaning that what makes them up can be changed.

Lets start off by creating two dictionaries

dict1 = {"A":"1", "B":"2", "C":"3"}
dict2 = {"D":"4", "E":"5", "F":"6"}
print("dictionary 1 is:", dict1)
print("dictionary 2 is:", dict2)
print(type(dict1))
print(type(dict2))

Its output is as follows:
dictionary 1 is: {'A': '1', 'B': '2', 'C': '3'}
dictionary 2 is: {'D': '4', 'E': '5', 'F': '6'}
<class 'dict'>
<class 'dict'>

So the objective is to get these two dictionaries into one, how is this achieved?

Approach 1 – Use PEP 448

This approach uses PEP 448 which allows * iterable unpacking operator and ** dictionary unpacking operators to be implemented.

As can be seen below , it is a quick and efficient way to quickly merge, without impacting the two dictionaries structure.

dict3 = {**dict1, **dict2}
print(dict3)
print(type(3))

With output:
{'A': '1', 'B': '2', 'C': '3', 'D': '4', 'E': '5', 'F': '6'}
<class 'int'>

Approach 2 – Update when some values not required.

You maybe faced with a situation where you only want certain values from the second dictionary.

In the below there are common keys to both dictionares, namely “A” and “B”.

What the update is doing is it keeps all the values of dictionary 1, and adds in any key value pair that is not A or B.

This scenario could be encountered where dict1 is the master dictionary and always correct, and just needs new values added that do not exist already.

dict1 = {"A":"1", "B":"2", "C":"3"}
dict2 = {"A":"2", "E":"5", "B":"6"}
dict2.update((dict1))
print(dict2)
print(type(dict2))

Resulting in:
{'A': '1', 'E': '5', 'B': '2', 'C': '3'}
<class 'dict'>

Approach 3 – Iterating over the dictionaries

In this scenario, there are a few things going on, that should be explained.

dict1.copy ===> This is done so that you have the original, as it maybe updated if there where duplicate keys.

The loop then just goes through dict2 key value pairs, and adds the key value pairs to dict3, which was originally dict1.

dict1 = {"A":"1", "B":"2", "C":"3"}
dict2 = {"D":"4", "E":"5", "F":"6"}

dict3 = dict1.copy()

for key,value in dict2.items():
    dict3[key] = value
print(dict3)

Which gives you:
{'A': '1', 'B': '2', 'C': '3', 'D': '4', 'E': '5', 'F': '6'}
<class 'dict'>

Summing it all up

In conclusion, depending on what way you would like to approach, we have outlined options.

Probably the most important thing that came out of this, is that dictionaries can be changed, as a result when applying some of the

techniques above, before proceeding be sure to check if you want to keep the original values.