How to remove unwanted characters from your data

Using R programming functions to cleanse data
In our recent post, Python Tutorial: How do I remove unwanted characters, we walked through the concepts behind data cleansing.

We demonstrated several approaches to data cleansing, including the use of regular expressions.

Here we look at that approach using RStudio and the following functions:

How to approach data cleansing

When removing unwanted characters, start from a defined list of the characters that should not appear and that will cause errors. It is also essential to understand the types of data error that can occur:

  • Data entry errors
  • Data columns in the incorrect format, e.g. telephone numbers that contain non-numerical characters
  • Missing data that is required – e.g. null values
  • Data that does not make sense, e.g. date of birth that is beyond the range of what you would typically expect to see
  • Duplicate values for the same piece of data. The problem here is that duplicates can inflate the number of data errors and hide the true count of actual errors.
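
As a rough illustration of the error types listed above, here is a minimal Python sketch (Python is used here rather than R, to match the companion Python tutorials). The sample records, the field names and the `find_errors` helper are all hypothetical and invented for this example. Duplicates are skipped before the other checks so that they do not inflate the error counts.

```python
from collections import Counter
from datetime import date

# Hypothetical sample records, invented purely for illustration.
records = [
    {"id": 1, "phone": "0871234567", "dob": date(1985, 4, 12)},
    {"id": 2, "phone": "087-123-4567", "dob": date(1990, 1, 30)},  # non-numeric phone
    {"id": 3, "phone": None, "dob": date(1978, 7, 2)},             # missing value
    {"id": 4, "phone": "0869876543", "dob": date(1850, 3, 9)},     # implausible date
    {"id": 1, "phone": "0871234567", "dob": date(1985, 4, 12)},    # duplicate row
]


def find_errors(rows, earliest_dob=date(1900, 1, 1), today=None):
    """Count each error type, ignoring repeated rows after the first."""
    today = today or date.today()
    counts = Counter()
    seen = set()
    for row in rows:
        key = (row["id"], row["phone"], row["dob"])
        if key in seen:
            # Count the duplicate once and skip the remaining checks,
            # so the same underlying error is not counted twice.
            counts["duplicate"] += 1
            continue
        seen.add(key)

        phone = row["phone"]
        if phone is None:
            counts["missing"] += 1
        elif not phone.isdigit():
            counts["bad_format"] += 1

        # Assumed plausibility window for a date of birth.
        if not (earliest_dob <= row["dob"] <= today):
            counts["out_of_range"] += 1
    return dict(counts)


error_counts = find_errors(records, today=date(2024, 1, 1))
```

The thresholds (for example, 1900 as the earliest plausible year of birth) are assumptions and would need to be set for your own data. General data entry errors are not covered, as they usually need field-specific rules.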

In the video below, we use R code with the above functions, together with an if statement that first checks whether an unwanted character is in the data set before removing it and returning the cleansed data.
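
The check-then-remove idea can be sketched as follows. This is a Python illustration of the same pattern, not the R code from the video; the `UNWANTED` character list and the `cleanse` function name are assumptions made for the example.

```python
# Hypothetical list of characters that should not appear in the data.
UNWANTED = "#$%*"


def cleanse(value, unwanted=UNWANTED):
    """Return value with unwanted characters removed, if any are present."""
    # Check first: only do the removal work when an unwanted character exists.
    if any(ch in value for ch in unwanted):
        # str.maketrans with a third argument maps those characters to None,
        # so translate() deletes them.
        removal_table = str.maketrans("", "", unwanted)
        return value.translate(removal_table)
    # Nothing to remove, so return the original value unchanged.
    return value


cleaned = cleanse("Jo#hn $Smith*")
untouched = cleanse("Mary Murphy")
```

Here `cleaned` holds the string with the listed characters stripped out, while `untouched` is returned as-is because the check finds nothing to remove.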

Some of the strategies to help counteract data errors could include:

  • Eliminate manual inputs
  • Controls at the point of data entry, e.g. only allow date formats in date fields.
  • Reduce duplication of the data across multiple systems, which reduces the number of places where data differences can occur.
  • If integrating different systems with the same data into one network, perform a data cleanse beforehand; this reduces the work needed afterwards to clean up the problems that integration brings.
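
A point-of-entry control for dates, as mentioned in the list above, might look something like this minimal Python sketch. The `parse_date_field` name and the choice of the YYYY-MM-DD format are assumptions for illustration; a real form would use whatever format the system requires.

```python
from datetime import datetime


def parse_date_field(raw, fmt="%Y-%m-%d"):
    """Return a date if raw matches fmt and is a real calendar date, else None."""
    try:
        return datetime.strptime(raw.strip(), fmt).date()
    except ValueError:
        # Wrong format or an impossible date (e.g. 29 February in a non-leap year).
        return None


accepted = parse_date_field("2024-02-29")
rejected_format = parse_date_field("29/02/2024")
rejected_date = parse_date_field("2023-02-29")
```

Rejecting the value at entry means the bad date never reaches the data set, so there is nothing to cleanse later.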

 

Python Tutorial: How to import data from files

Do you need to quickly open files and import the data into a data frame?

In this post and video on Python, we will look at several options for you to do this as well as some additional things to consider.

The import of files covered here is as follows:

  • Reading data from a CSV file.
  • Reading data from a TXT file.
  • Reading data from an XLSX file.

There are many things to consider when importing data; here are a few:

(A) The file format

(B) How the data looks within the file.

(C) Special requirements to get the data looking correct when loaded.

This file-importing example covers tab delimiters, headers and sorting. If you are looking for alternative approaches, see CSV File Reading and Writing.
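
As one possible alternative to loading into a data frame, here is a minimal sketch using only Python's standard `csv` module to read a tab-delimited file with a header row and sort the result. The `read_delimited` helper, the file name and the sample contents are hypothetical. XLSX files are not covered here, since reading them needs a third-party library.

```python
import csv
import os
import tempfile


def read_delimited(path, delimiter="\t", sort_by=None):
    """Read a delimited text file with a header row into a list of dicts."""
    with open(path, newline="", encoding="utf-8") as handle:
        # DictReader uses the first row as the column names.
        rows = list(csv.DictReader(handle, delimiter=delimiter))
    if sort_by is not None:
        # Values are read as strings, so this is a text sort, not a numeric one.
        rows.sort(key=lambda row: row[sort_by])
    return rows


# Build a small hypothetical tab-delimited file to demonstrate the call.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.txt")
    with open(sample_path, "w", newline="", encoding="utf-8") as handle:
        handle.write("name\tcity\nZoe\tCork\nAdam\tDublin\n")
    sample_rows = read_delimited(sample_path, sort_by="name")
```

Passing `delimiter=","` would read a standard CSV file with the same function.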

Need to check if a file is empty? Have a look at Python – How to check if a file is empty.
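
For a quick idea of what such a check can look like, here is a minimal sketch based on the file size. It is one possible approach and not necessarily the one used in the linked post; the `is_file_empty` name and the sample files are assumptions.

```python
import os
import tempfile


def is_file_empty(path):
    """Return True when the file at path contains zero bytes."""
    return os.path.getsize(path) == 0


# Create one empty and one non-empty hypothetical file to try the check.
with tempfile.TemporaryDirectory() as tmp_dir:
    empty_path = os.path.join(tmp_dir, "empty.csv")
    full_path = os.path.join(tmp_dir, "full.csv")
    open(empty_path, "w", encoding="utf-8").close()
    with open(full_path, "w", encoding="utf-8") as handle:
        handle.write("name,city\n")
    empty_result = is_file_empty(empty_path)
    full_result = is_file_empty(full_path)
```

Note that a file containing only a header row or only whitespace is not zero bytes, so it would not count as empty under this check.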

Thanks!

Data Analytics Ireland