How to remove unwanted characters from your data

Using R programming functions to cleanse data
In our recent post, Python Tutorial: How do I remove unwanted characters, we walked through the concepts behind data cleansing.

We demonstrated several different approaches to data cleansing, including the use of regular expressions.

Here we look at that approach using RStudio and the following functions:

How to approach data cleansing

When removing unwanted characters, you want a defined list of what should not appear in the data and would cause you errors. It is also essential to understand the types of data error that can occur:

  • Data entry errors
  • Data columns that have the incorrect format, e.g. telephone numbers which have non-numerical characters in them
  • Missing data that is required – e.g. null values
  • Data that does not make sense, e.g. date of birth that is beyond the range of what you would typically expect to see
  • Duplicate values for the same piece of data. The problem here is that duplicates can inflate the number of data errors and hide the true count of actual errors.
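To make a few of these error types concrete, here is a minimal sketch in base R. The customers data frame and its values are made up purely for illustration; the checks use grepl(), is.na() and duplicated().

```r
# Made-up records containing some of the kinds of errors listed above
customers <- data.frame(
  name  = c("Anne", "Brian", "Anne", NA),
  phone = c("0871234567", "086-123-4567", "0871234567", "0859876543"),
  stringsAsFactors = FALSE
)

# Phone numbers holding anything other than digits
bad_phone <- grepl("[^0-9]", customers$phone)

# Required values that are missing
missing_name <- is.na(customers$name)

# Rows that repeat an earlier row exactly
duplicate_row <- duplicated(customers)

# Show every row flagged by at least one check
print(customers[bad_phone | missing_name | duplicate_row, ])
```

Counting each flag separately (for example with sum(bad_phone)) helps avoid the inflated error count that duplicates can cause.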

In the video below, we use R code with the above functions, together with an if statement that first checks whether an unwanted character is in the data set before removing it and returning the cleansed data.
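The exact code from the video is not reproduced here, so the following is only a sketch of the check-then-remove idea. It assumes the base R functions grepl() and gsub(); the remove_unwanted helper, the "#" character and the sample phone numbers are all hypothetical.

```r
# Hypothetical sample data; the "#" characters are the unwanted ones here
phone_numbers <- c("087#1234567", "0861234567", "085#765#4321")

# Hypothetical helper: only clean the data when the unwanted character is present
remove_unwanted <- function(values, unwanted = "#") {
  if (any(grepl(unwanted, values, fixed = TRUE))) {
    # fixed = TRUE treats the character literally, not as a regular expression
    values <- gsub(unwanted, "", values, fixed = TRUE)
  }
  values
}

cleaned <- remove_unwanted(phone_numbers)
print(cleaned)
```

Using fixed = TRUE keeps the example safe for characters such as "." or "*" that would otherwise have a special meaning in a regular expression.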

Some of the strategies to help counteract data errors could include:

  • Eliminate manual inputs.
  • Add controls at the point of data entry, e.g. only allow date formats in date fields.
  • Reduce duplication of the data across multiple systems, which reduces the number of places where data differences can occur.
  • If integrating different systems with the same data into one network, perform a data cleanse beforehand; this reduces the work needed afterwards to clean up the problems that integration brings.
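As an illustration of a control at the point of entry, the sketch below accepts a value only if it parses as a date. The is_valid_date helper and the YYYY-MM-DD format are assumptions chosen for this example; as.Date() returns NA when a value does not match the format.

```r
# Hypothetical validation helper: TRUE only for values that parse as YYYY-MM-DD dates
is_valid_date <- function(value, format = "%Y-%m-%d") {
  !is.na(as.Date(value, format = format))
}

print(is_valid_date("1985-04-12")) # a well-formed date
print(is_valid_date("12th April")) # free text that is not in the expected format
```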

 

R Tutorial: How to pass data between functions

When we started looking at functions, having already tested them in Python and JavaScript, it quickly became apparent how similar programming languages are.

Except for the syntax you use in each, the programming is much the same.

The purpose of this video is to:

  • Start on using functions from the ground up.
  • Don’t over-complicate the example; keep it easy enough to follow.

How to write the code to pass data between functions

As this is a short video, the code that went into making it is pretty straightforward:

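# create a function that returns a value
function.a <- function(){
  newvarb <- 2
  return(newvarb) # return the value explicitly so the caller can use it
}

function.b <- function(){
  newvarb <- function.a()*2 # this takes the value returned by function.a and multiplies it by two
  return(newvarb)
}
print(function.b()) # Prints out the value returned by function.b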

Below is the video that will take you through each line and show the output that we are looking to achieve.

How can we use this in our projects?

No matter what programming language you use or choose to learn, the concept of functions will appear in some shape or form. A function runs a repeatable process and returns a value, and it can be called from anywhere in a program. That lets the programmer cut their coding time and write a repetitive task only once.
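A related pattern, sketched below, is to pass the value in as an argument so the same function can be reused with different inputs. The function names get_base_value and double_value are hypothetical and do not come from the video.

```r
# Hypothetical helper that supplies a starting value
get_base_value <- function() {
  2
}

# Hypothetical helper that works on whatever value it is given
double_value <- function(x) {
  x * 2
}

# The result of one function is passed straight into the other
result <- double_value(get_base_value())
print(result)

# The same function can be reused with any other input
print(double_value(10))
```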

This video has an equivalent in Python, and you can see it here: Python Functions – passing data between them

Data Analytics Ireland

R tutorial – How to sort lists using RStudio

You are kidding me; we can sort lists!
Yes, we have managed to bring you the groundbreaking video that will help transform your project! This video is just an introductory view of how to complete a sort on a list, but all is not what it seems.

A couple of things to note:

  • Creating a list in RStudio gives you just that: a list, and that is it.
  • You need to create a vector so that a sort of the data can give you the outcome you need.

So using:

# create the list
print("Example 1")
list1 <- list(5,4,3,2,1,"a","b")
print(list1) # defaults to list order
print(typeof(list1))

will get you started, but there is more to this, so read on.

How we got this sorted

As you will see in the video below, we have taken an initial list and converted it to a vector using the c() function in RStudio.

Some of the functions used to sort are as follows:

  • sort
  • sort.int

The video explains the ways to use these, and some of the caveats you should watch out for as well.
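The following is a small sketch of both functions, not the code from the video. It assumes the vector is built directly with c(); one caveat worth knowing is that c() coerces mixed numbers and letters to character strings, so the values are sorted as text.

```r
# Build a vector with c(); mixing numbers and letters turns everything into text
vector1 <- c(5, 4, 3, 2, 1, "a", "b")
print(typeof(vector1))

# sort() returns a sorted copy and leaves the original vector unchanged
sorted1 <- sort(vector1)
print(sorted1)

# sort.int() works on plain vectors; decreasing = TRUE reverses the order
numbers <- c(3, 1, 2)
sorted2 <- sort.int(numbers, decreasing = TRUE)
print(sorted2)
```

If the numbers need to be sorted as numbers, keeping them in their own numeric vector avoids the text coercion.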

 

 

Leaving you with this final thought

This video has a Python equivalent, so if you want to see how we completed it there, see this blog post: How to sort Python lists


R – How to check a file exists and is not empty

Here is another way to check if a file is empty, this time in R!
In the recent past, we posted about How to check if a file is empty in Python. This post builds on that and shows you how to get the same effect in R.

Comparing the two programming languages, the concepts are pretty much the same, but the syntax used to achieve the outcome can be slightly different; hence you have to remember which language you are in when writing the code!

Ensure the file exists before you open it; you have to be confident it is there 🙂

As a preliminary step we check to see if the file exists using:

if (file.exists("emptyfilea.txt")) {
  # ... rest of your code
}

 

Then we go and see if the file is empty or not. The full code can be viewed here.
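The full code is not reproduced here, so the following is only a sketch of one way to combine both checks. It assumes the base R functions file.exists() and file.info(), where a size of 0 bytes is treated as empty; the describe_file helper is hypothetical, and temporary files stand in for real data files.

```r
# Hypothetical helper; returns a short description of the file's state
describe_file <- function(path) {
  if (!file.exists(path)) {
    return("missing")
  }
  if (file.info(path)$size == 0) {
    return("empty")
  }
  "has content"
}

# A temporary file with nothing in it
empty_file <- tempfile(fileext = ".txt")
file.create(empty_file)

# A temporary file with one line of text
full_file <- tempfile(fileext = ".txt")
writeLines("some data", full_file)

print(describe_file(empty_file))
print(describe_file(full_file))
print(describe_file(tempfile())) # a path that was never created
```

Checking existence first matters because file.info() reports a size of NA for a file that is not there.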

Additional information to help you along the way.

I have referenced this site before; it has some very useful explanations of different functions and methods. For files, check R Documentation.


 

R – How to open a file

In RStudio, have you been searching high and low for how to open a file? Well, search no more, as we have a video that will answer your questions.

Today we are going to cover the following ways in which you can access files:

  • read.table
  • read_excel
  • read.csv
  • readLines

Whether you have a txt, CSV or XLSX file, this video will help you get to your information so that you can complete data analysis.

What can those functions do for you? Let’s delve a bit further.

From the documentation found here R Documentation – read table, you will be able to see that this function creates a data frame based on the file you have opened. It also allows you to test the feature.

If R Documentation – read excel is your thing, then you will see that read_excel (from the readxl package) does what it says on the tin, and also has some additional validations available if you need them.

If you are looking for an almost identical function to read.table, then read.csv is the one; it differs only in its defaults, such as expecting a header row and a comma separator. Additional information can be found at R Documentation – more on read.csv

Last but not least, R Documentation – readLines describes a useful way to open a file, one that can be used to read some or all of the text within it.
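To see how the base R functions compare, here is a short sketch that writes a small, made-up CSV file to a temporary location and reads it back three ways. read_excel is left out because it needs the readxl package and an XLSX file to hand.

```r
# Write a small, made-up CSV file to a temporary location
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,score", "Anne,10", "Brian,7"), csv_path)

# read.csv assumes a header row and a comma separator by default
scores_csv <- read.csv(csv_path)

# read.table needs both stated explicitly to read the same file
scores_table <- read.table(csv_path, header = TRUE, sep = ",")

# readLines returns the raw text, one element per line; n limits how many lines are read
first_line <- readLines(csv_path, n = 1)

print(scores_csv)
print(scores_table)
print(first_line)
```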

Wrapping it up

This blog post has described some excellent choices for opening a file, and the documentation above will help explain them further. There is a Python equivalent of this, and you can find it at the following link: Python – How to import data from files


 

What is the R programming language?

The R Project for Statistical Computing is beneficial to anyone who needs a statistical analysis performed on a dataset.

The language is:

  • Open source, so anyone can use it.
  • Cross-platform, so it can work on any operating system in use.
  • Able to work with other similar tools; an example is Python, which can execute within R.
  • As a result, a way to get the power of the statistical side of R and the wide variety of functionality of Python.

A further introduction to R

To get you started, we have introduced some fundamental functionality in this video, and we give a tour around some of the screens that are visible as you work through your data analytics project. Some of the things you will see:

  • Creating variables
  • Addition of variables
  • Writing variables to a CSV file
  • Saving variables to a txt file
  • Ensuring no headers are in the file.
  • Loading data from a file and printing its contents to screen.
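The steps above can be sketched in a few lines of base R. This is not the code from the video; the variable names, the values and the use of temporary files are assumptions made for this example.

```r
# Create two variables and add them together
a <- 5
b <- 10
total <- a + b

# Collect the values in a data frame so they can be written to disk
results <- data.frame(a = a, b = b, total = total)

# Temporary paths are used so the sketch leaves no files behind
csv_path <- tempfile(fileext = ".csv")
txt_path <- tempfile(fileext = ".txt")

# Write to a CSV file (a header row is included by default)
write.csv(results, csv_path, row.names = FALSE)

# Write to a txt file with no headers and no row names
write.table(results, txt_path, row.names = FALSE, col.names = FALSE)

# Load the data back in and print its contents to the screen
loaded <- read.csv(csv_path)
print(loaded)
print(readLines(txt_path))
```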

Video: https://www.youtube.com/watch?v=uZWF1RCbXHU

How we got this far

To get started and write your first R program, there are a couple of steps to follow:

Both installs are free and well supported, easy to download, and easy to install.


 

 

