How to remove unwanted characters from your data

Using R programming functions to cleanse data
In our recent post, Python Tutorial: How do I remove unwanted characters, we walked through the concepts behind data cleansing.

We demonstrated several different approaches to data cleansing, including the use of regular expressions.

Here we look at that approach using RStudio and the following functions:

How to approach data cleansing

When removing unwanted characters, start with a defined list of the characters that should not appear and that will cause errors. It is also essential to understand the types of data error that can occur:

  • Data entry errors
  • Data columns with the incorrect format, e.g. telephone numbers that contain non-numeric characters
  • Missing data that is required – e.g. null values
  • Data that does not make sense, e.g. date of birth that is beyond the range of what you would typically expect to see
  • Duplicate values for the same piece of data. These can inflate the number of data errors reported, so you do not get a true count of the actual errors.
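
Several of the error types above can be checked for directly in base R. The following is a minimal sketch, not taken from the video: the `customers` data frame and its column names are hypothetical, and the pattern `[^0-9]` assumes a valid phone number contains digits only.

```r
# Hypothetical sample data; the column names are illustrative only
customers <- data.frame(
  name  = c("Ann", "Bob", "Bob", "Cara"),
  phone = c("0871234567", "087-123-4567", "087-123-4567", NA),
  stringsAsFactors = FALSE
)

# Missing data: count NA values in each column
missing_counts <- colSums(is.na(customers))

# Incorrect format: phone numbers containing anything other than digits
bad_phone <- !is.na(customers$phone) & grepl("[^0-9]", customers$phone)

# Duplicates: rows that repeat an earlier row exactly
dup_rows <- duplicated(customers)

# Counting format errors with and without the duplicate rows shows
# how duplicates can inflate the error count
errors_with_dups <- sum(bad_phone)
errors_without_dups <- sum(bad_phone & !dup_rows)

print(missing_counts)
print(errors_with_dups)
print(errors_without_dups)
```

Comparing `errors_with_dups` against `errors_without_dups` is one way to see whether duplicated rows are distorting the count.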

In the video below, we use R code with the functions above, and add an if statement that checks whether an unwanted character is in the data set before removing it and returning the cleansed data.
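
The check-then-remove idea can be sketched as follows. This is an illustration under assumptions rather than the code from the video: it assumes the base R functions `grepl()` and `gsub()` are suitable for the job, and `remove_unwanted` is a hypothetical helper name.

```r
# Hypothetical helper: remove every character matched by `pattern`,
# but only if at least one value actually contains such a character
remove_unwanted <- function(values, pattern = "[^0-9]") {
  if (any(grepl(pattern, values))) {
    values <- gsub(pattern, "", values)
  }
  values
}

# Sample phone numbers with dashes, brackets and spaces to strip out
phones <- c("087-123-4567", "(01) 555 0199", "0861112222")
cleaned <- remove_unwanted(phones)
print(cleaned)
```

The default pattern keeps digits only; pass a different regular expression as `pattern` to target another set of unwanted characters.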

Some of the strategies to help counteract data errors could include:

  • Eliminate manual inputs
  • Controls at the point of entry for data, e.g. only allow date formats in date fields.
  • Reduce duplication of the data across multiple systems, which reduces the number of places where data differences can occur.
  • If integrating different systems that hold the same data into one network, perform a data cleanse beforehand; this reduces the work needed afterwards to clean up the problems the integration brings.


R tutorial – How to sort lists using RStudio

You are kidding me; we can sort lists!
Yes, we have managed to bring you the groundbreaking video that will help transform your project! This video is just an introductory view of how to complete a sort on a list, but all is not what it seems.

A couple of things to note:

  • Creating a list in RStudio gives you exactly that: a list.
  • You need to create a vector so that a sort of the data gives you the outcome you need.

So using:

# create the list
print("Example 1")
list1 <- list(5, 4, 3, 2, 1, "a", "b")
print(list1)  # prints the elements in the order they were added

will get you started, but there is more to this, so read on.

How we got this sorted

As you will see in the video below, we have taken an initial list and converted it to a vector using the c() function in RStudio.

Some of the functions used to sort are as follows:

  • sort

The video explains the ways to use these, and some of the caveats you should watch out for as well.
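
As a rough sketch of the sort step (not necessarily the exact code in the video), the example below builds one vector directly with `c()` and flattens the earlier list with `unlist()`, which is an assumption on our part about how to turn an existing list into a vector. Because the list mixes numbers and letters, the flattened vector is a character vector, so its sort order is alphabetical rather than numeric.

```r
# A numeric vector built directly with c() sorts numerically
nums <- c(5, 4, 3, 2, 1)
nums_up <- sort(nums)
nums_down <- sort(nums, decreasing = TRUE)

# The list from the example mixes numbers and strings
list1 <- list(5, 4, 3, 2, 1, "a", "b")

# unlist() flattens the list to a vector; the strings force
# every element to become character
vec1 <- unlist(list1)
vec1_sorted <- sort(vec1)

print(nums_up)
print(nums_down)
print(class(vec1))
print(vec1_sorted)
```

One caveat worth watching for: numbers stored as character sort as text, so "10" would come before "2".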



Leaving you with this final thought

This video has a Python equivalent, so if you want to see how we completed it there, see the blog post How to sort Python lists.

Data analytics Ireland

What is the R programming language?

The R Project for Statistical Computing is beneficial to anyone who needs a statistical analysis performed on a dataset.

The language:

  • Is open source, so anyone can use it.
  • Works cross-platform, whatever operating system is in use.
  • Can work with other similar packages; an example being Python, which can execute within R.
  • As a result, gives you the power of the statistical side of R and the wide variety of functionality of Python.

A further introduction to R

To get you started, we introduce some fundamental functionality in this video and give a tour of some of the screens you will see as you work through your data analytics project. Some of the things you will see:

  • Creating variables
  • Addition of variables
  • Writing variables to a CSV file
  • Saving variables to a txt file
  • Ensuring no headers are in the file
  • Loading data from a file and printing its contents to screen.
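
The steps in this list can be sketched in a few lines of base R. This is an illustrative outline rather than the code from the video: the variable names are made up, and temporary files stand in for whatever file paths you would use in your own project.

```r
# Create two variables and add them
a <- 5
b <- 10
total <- a + b

# Collect the values in a data frame so they can be written out
results <- data.frame(a = a, b = b, total = total)

# Temporary file paths are used here as placeholders
csv_path <- tempfile(fileext = ".csv")
txt_path <- tempfile(fileext = ".txt")

# Write to a CSV file (header row included, row names dropped)
write.csv(results, csv_path, row.names = FALSE)

# Write to a txt file with no header row and no row names
write.table(results, txt_path, row.names = FALSE, col.names = FALSE)

# Load the data back from each file and print it to screen
loaded_csv <- read.csv(csv_path)
loaded_txt <- read.table(txt_path, header = FALSE)
print(loaded_csv)
print(loaded_txt)
```

Because the txt file has no header row, `read.table()` assigns its own default column names when the file is loaded.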


How we got this far

To get started and write your first R program, a couple of steps are needed, as follows:

Both installs are free and well supported, easy to download, and easy to install.
