Skip to content
  • YouTube
  • FaceBook
  • Twitter
  • Instagram

Data Analytics Ireland

Data Analytics and Video Tutorials

  • Home
  • Contact
  • About Us
    • Latest
    • Write for us
    • Learn more information about our website
  • Useful Links
  • Glossary
  • All Categories
  • Faq
  • Livestream
  • Toggle search form
  • supervised machine learning vs unsupervised machine learning? machine learning
  • how to copy/paste special a range of cells with xlwings Python
  • How To Run Python Validation From Javascript Javascript
  • Python tutorial: Create an input box in Tkinter Python
  • YouTube channel lists – Python Lists Python Lists
  • TypeError: cannot unpack non-iterable int object Python
  • How To Validate Cell Values In Excel Python
  • Python Tutorial: How to create charts in Excel Python Tutorial

How to remove characters from an imported CSV file

Posted on April 27, 2020November 11, 2022 By admin No Comments on How to remove characters from an imported CSV file

Estimated reading time: 5 minutes

Removal of unwanted errors in your data, the easy way.
The process of importing data can take many formats, but have you been looking for a video on how you do this? Even better are you looking for a video that shows you how to import a CSV file and then data cleanse it, effectively removing any unwanted characters?

As a follow up to Python – how do I remove unwanted characters, that video focused on data cleansing the data created within the code, this video runs through several options to open a CSV file, find the unwanted characters, remove the unwanted characters from the dataset and then return the cleansed data.

How to get in amongst the problem data:

The approach here looked at three different scenarios:

(A) Using a variable that is equal to an open function, and then reading the variable to a data frame.

(B)Using a “with statement” and an open function together, and returning the output to a data frame.

(C) Using read_csv to quickly and efficiently read in the CSV file and then cleanse the dataset.

Some minor bumps in the road that need some thought

There where some challenges with this that you will see in the video:

  • Options A & B had to deploy additional code just to get the data frame the way we wanted.
  •  The additional lines of code made for more complexity if you were trying to manage it.

In the end, read_csv was probably the best way to approach this; it required less code and would have been easier to manage in the longer run.

As always thanks for watching, please subscribe to our YouTube channel, and follow us on social media!

Data Analytics Ireland

The code used in this video is here:

##########   How to import a  CSV dataset and remove unwanted characters ###########


#Numbers need to be put between "" to be looped through
import pandas as pd

l = ['$','*','€','&','£','/','@']
print("Error list we will be checking against in all three scenarios")
print(l)

############ Scenario 1  - Using the Open Function #####################

#can't use engine ='python' here , as it is an invalid keyword argument for open()
my_file = open("C:/dataanalyticsireland/Python Code/youtube/codeused/files/importandcleandatacsv.csv", "r").read().split('\n')
print("Senario 1 printed as a dataframe")
df = pd.DataFrame(my_file)
df.columns = ['Temp']
df[['Number','Error']] = df['Temp'].str.split(',',expand=True)
#Drop the column Temp ,0 for rows and 1 for columns.
df = df.drop('Temp', 1)
#Dropping the first row as it has the headers in it.
df = df.drop(0)
#Resetting the index to start at value 0
df = df.reset_index(drop=True)
display(df)


def run_scenario1(*l):
    #This line extracts all the special characters into a new column
    #Using regular expressions it finds values that should not appear
    df['Error'] = df['Number'].str.extract('(\D+|^(0)$)')   
    #This line removes anything with a null value
    df.dropna(subset=['Error'], inplace=True)
    #This line reads in the list and assigns it a value i, to each element of the list.
    #Each i value assigned also assigns an index number to the list value.
    #The index value is then used to check whether the value associated with it is in the df['Number'] column 
    #and then replaces if found
    for i in l:
        df['Fix']= df['Number'].str.replace(i[0],"").str.replace(i[1],"").str.replace(i[2],"").str.replace(i[3],"") \
        .str.replace(i[4],"").str.replace(i[5],"").str.replace(i[6],"")
        #print("Error list to check against")
        #print(i[0])
        #print(i[1])
        #print(i[2])
        #print(i[3])
        #print(i[4])
        #print(i[5])
        #print(i[6])
    print("Scenario 1 output with fixes")
    display(df)


run_scenario1(l)



############ Scenario 2  - Using a with statement #####################

#Once Python exits the with block, it closes the file
#can't use engine ='python' here , as it is an invalid keyword argument for open()
with open("C:/dataanalyticsireland/Python Code/youtube/codeused/files/importandcleandatacsv.csv") as f:
    emptylist=[]
    for line in f:
        emptylist.append(line)
df1 = pd.DataFrame(emptylist)
df1.columns = ['Temp']
df1[['Number','Error']] = df1['Temp'].str.split(',',expand=True)
#Drop the column Temp ,0 for rows and 1 for columns.
df1 = df1.drop('Temp', 1)
#Dropping the first row as it has the headers in it.
df1 = df1.drop(0)
#Resetting the index to start at value 0
df1 = df1.reset_index(drop=True)
#Replacing the Error field value that was created above witha value of 0.
df1['Error'] = df1['Error'].str.replace('0\n','0')
print("Scenario 2 printed as a dataframe")
display(df1)  
        

#Function to loop through the dataset and see if any of the list values below are found and replace them
def run_scenario2(*l):
    #This line extracts all the special characters into a new column
    #Using regular expressions it finds values that should not appear
    df1['Error'] = df1['Number'].str.extract('(\D+|^(0)$)')   
    #This line removes anything with a null value
    df1.dropna(subset=['Error'], inplace=True)
    #This line reads in the list and assigns it a value i, to each element of the list.
    #Each i value assigned also assigns an index number to the list value.
    #The index value is then used to check whether the value associated with it is in the df['Number'] column 
    #and then replaces if found
    for i in l:
        df1['Fix']= df1['Number'].str.replace(i[0],"").str.replace(i[1],"").str.replace(i[2],"").str.replace(i[3],"") \
        .str.replace(i[4],"").str.replace(i[5],"").str.replace(i[6],"")
        #print("Error list to check against")
        #print(i[0])
        #print(i[1])
        #print(i[2])
        #print(i[3])
        #print(i[4])
        #print(i[5])
        #print(i[6])
    print("Scenario 2 output with fixes")
    display(df1)

   
        
run_scenario2(l)


############ Scenario 3  - Using read CSV #####################

#putting in engine ='python' removes any need to do additional code to display in a dataframe correctly unlike above
df2 = pd.read_csv("C:/dataanalyticsireland/Python Code/youtube/codeused/files/importandcleandatacsv.csv",engine ='python')
print("Senario 3 printed as a dataframe")
display(df2)


def run_scenario2(*l):
    #This line extracts all the special characters into a new column
    #Using regular expressions it finds values that should not appear
    df2['Error'] = df2['Number'].str.extract('(\D+|^(0)$)')   
    #This line removes anything with a null value
    df1.dropna(subset=['Error'], inplace=True)
    #This line reads in the list and assigns it a value i, to each element of the list.
    #Each i value assigned also assigns an index number to the list value.
    #The index value is then used to check whether the value associated with it is in the df['Number'] column 
    #and then replaces if found
    for i in l:
        df2['Fix']= df1['Number'].str.replace(i[0],"").str.replace(i[1],"").str.replace(i[2],"").str.replace(i[3],"") \
        .str.replace(i[4],"").str.replace(i[5],"").str.replace(i[6],"")
        
    print("Scenario 3 output with fixes")
    display(df2)
   
        
run_scenario2(l)

Python Tutorial, Python working with files Tags:Cleansed data, Data Cleansing, Data Quality, Import CSV, Python, Python Tutorial

Post navigation

Previous Post: How to remove unwanted characters
Next Post: How to check if a file is empty

Related Posts

  • How To Pass Data Between Functions Python Functions
  • How to data cleanse a database table Python Data Cleansing
  • How To Add Values to a Python Dictionary Python
  • How to import data into excel Python Tutorial
  • Python Overview Interview Questions automation
  • How to save data frame changes to a file Python Dataframe

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

Select your language!

  • हिंदी
  • Español
  • Português
  • Français
  • Italiano
  • TypeError: ‘int’ object is not callable Python
  • How to group your data in Tableau data visualisation
  • YouTube channel lists – Python Data Cleansing Python Data Cleansing
  • create read update delete using Tkinter class
  • TypeError object of type ‘int’ has no len() Python
  • How to Generate Random Integers Between 0 and 9 Python
  • ValueError: invalid literal for int() with base 10 Value Error
  • Python tutorial: Create an input box in Tkinter Python

Copyright © 2023 Data Analytics Ireland.

Powered by PressBook Premium theme

This website uses cookies to improve your experience. We'll assume you're ok with this, but you can opt-out if you wish. Cookie settingsACCEPT
Privacy & Cookies Policy

Privacy Overview

This website uses cookies to improve your experience while you navigate through the website. Out of these cookies, the cookies that are categorized as necessary are stored on your browser as they are essential for the working of basic functionalities of the website. We also use third-party cookies that help us analyze and understand how you use this website. These cookies will be stored in your browser only with your consent. You also have the option to opt-out of these cookies. But opting out of some of these cookies may have an effect on your browsing experience.
Necessary
Always Enabled
Necessary cookies are absolutely essential for the website to function properly. This category only includes cookies that ensures basic functionalities and security features of the website. These cookies do not store any personal information.
Non-necessary
Any cookies that may not be particularly necessary for the website to function and is used specifically to collect user personal data via analytics, ads, other embedded contents are termed as non-necessary cookies. It is mandatory to procure user consent prior to running these cookies on your website.
SAVE & ACCEPT