Using R programming functions to cleanse data
In our recent post, "Python Tutorial: How do I remove unwanted characters", we walked through the concepts behind data cleansing.
We demonstrated several approaches to data cleansing, including the use of regular expressions.
Here we look at that approach using RStudio and the following functions:
How to approach data cleansing
When removing unwanted characters, make sure you have a defined list of the characters that should not appear and that would cause errors. It is also essential to understand the types of data errors that can occur:
- Data entry errors
- Data columns in the incorrect format, e.g. telephone numbers that contain non-numerical characters
- Missing data that is required, e.g. null values
- Data that does not make sense, e.g. a date of birth outside the range you would typically expect to see
- Duplicate values for the same piece of data. These can inflate the number of data errors and hide the true count of actual errors.
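Most of the error types above can be flagged with base R before any cleansing starts. The following is a minimal sketch rather than a definitive implementation: the `customers` data frame, its column names, and the assumed valid date-of-birth range (1900-01-01 to today) are all made up for illustration.

```r
# Hypothetical sample data; every column name and value is illustrative only.
customers <- data.frame(
  id        = c(1, 2, 3, 3),
  telephone = c("0871234567", "087-123-4567", NA, NA),
  dob       = as.Date(c("1985-04-12", "1890-01-01", "1990-07-30", "1990-07-30")),
  stringsAsFactors = FALSE
)

# Incorrect format: telephone numbers containing non-numeric characters.
bad_format <- !is.na(customers$telephone) & grepl("[^0-9]", customers$telephone)

# Missing data: required values that are NA.
missing_phone <- is.na(customers$telephone)

# Data that does not make sense: dates of birth outside the assumed range.
dob_out_of_range <- customers$dob < as.Date("1900-01-01") |
  customers$dob > Sys.Date()

# Duplicates: rows that repeat an earlier row exactly.
is_duplicate <- duplicated(customers)

# Count errors on unique rows only, so duplicates do not inflate the totals.
unique_rows <- !is_duplicate
error_counts <- c(
  bad_format       = sum(bad_format[unique_rows]),
  missing_phone    = sum(missing_phone[unique_rows]),
  dob_out_of_range = sum(dob_out_of_range[unique_rows])
)
print(error_counts)
```

Counting on unique rows only addresses the last point in the list: the duplicated record with the missing telephone number is reported once rather than twice.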
In the video below, we use R code with the functions listed above, and also use an if statement to check whether an unwanted character is present in the data set before removing it and returning the cleansed data.
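As a rough illustration of this check-then-remove pattern, the sketch below uses base R's `grepl()` and `gsub()`. The helper name `remove_unwanted` and the list of unwanted characters are hypothetical, and the functions used here are an assumption, not necessarily the ones shown in the video.

```r
# Hypothetical helper: strip unwanted characters only if any are present.
# `unwanted` is a regular-expression character class chosen for illustration.
remove_unwanted <- function(x, unwanted = "[#@!$%]") {
  if (any(grepl(unwanted, x))) {
    # At least one value contains an unwanted character, so clean the vector.
    x <- gsub(unwanted, "", x)
  }
  x  # Returned unchanged when nothing needed removing.
}

raw <- c("Jo#hn", "Ma@ry", "Peter")
cleansed <- remove_unwanted(raw)
print(cleansed)
```

The check is not strictly required, since `gsub()` leaves a string untouched when there is no match, but it makes the intent explicit and gives a natural place to log or count how often cleansing was actually needed.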
Some of the strategies to help counteract data errors could include:
- Eliminate manual inputs
- Add controls at the point of data entry, e.g. allow only date formats in date fields.
- Reduce duplication of data across multiple systems, which reduces the number of places where data differences can occur.
- If integrating different systems that hold the same data into one network, perform a data cleanse beforehand, which reduces the clean-up work needed afterwards.
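Two of these strategies, entry controls for dates and cleansing before integration, can be sketched in base R. This is an illustration under stated assumptions only: the helper names `is_valid_date` and `normalise`, the ISO `YYYY-MM-DD` date format, and the two sample systems are all hypothetical.

```r
# Hypothetical entry control: accept a value only if it looks like YYYY-MM-DD
# and also parses as a real calendar date.
is_valid_date <- function(x) {
  looks_like_date <- grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", x)
  parses_as_date  <- !is.na(as.Date(x, format = "%Y-%m-%d"))
  looks_like_date & parses_as_date
}

entries <- c("2023-05-12", "12/05/2023", "2023-02-30", "unknown")
valid <- is_valid_date(entries)
print(valid)

# Hypothetical records held in two systems that are about to be integrated.
system_a <- data.frame(id = c(1, 2),
                       email = c(" Ann@Example.com", "bob@example.com"),
                       stringsAsFactors = FALSE)
system_b <- data.frame(id = c(2, 3),
                       email = c("BOB@example.com ", "cy@example.com"),
                       stringsAsFactors = FALSE)

# Cleanse each source first: trim whitespace and use a single letter case.
normalise <- function(df) {
  df$email <- tolower(trimws(df$email))
  df
}

# Combine the cleansed sources and drop rows that are now exact duplicates.
combined <- unique(rbind(normalise(system_a), normalise(system_b)))
print(combined)
```

Without the normalisation step, the two records for `id` 2 would differ only in case and trailing whitespace, so `unique()` would keep both and the difference would have to be resolved after integration.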