Python Tutorial: How to validate data using tuples

Do you want to validate data with tuples? That part is easy; making changes to them is not.

In our recent video, Python – how do I remove unwanted characters, lists were used as a lookup to validate data and check for invalid items. The most apparent difference between the two is that tuples are immutable: their values cannot be changed, which makes using them in real-time code a bit hazardous.

So why would you use Tuples?

That is a good question and sometimes not too obvious when you try to put examples down on paper, but here are some cases:

  • You want a set of values that will never change, no matter what.
  • Use as a lookup that the program can check against; these could be called anywhere in your code.
  • Make sure that you only process what is in the tuple; any additional data can be reported as erroneous, a form of error control.
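As a minimal sketch of the lookup idea (the tuple contents and sample values here are made up for illustration, not taken from the video):

```python
# Hypothetical fixed lookup - the values are illustrative only.
VALID_IDS = ("A001", "A002", "A003")

def is_valid(record_id):
    """Check a value against the tuple lookup; callable anywhere in the code."""
    return record_id in VALID_IDS

print(is_valid("A002"))  # True
print(is_valid("B999"))  # False - can be reported as erroneous

# Tuples are immutable, so an accidental change raises an error:
try:
    VALID_IDS[0] = "X000"
except TypeError as err:
    print("cannot modify tuple:", err)
```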

Getting around the change limitations (well kind of)

This video looks at a few simple steps: take in a set of data, validate the id column against a tuple of valid values, and then show the differences in a separate output.

The code is then rerun after adding the error values found to the original tuple, giving a new tuple. As a result, the new output shows no errors.
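The steps above can be sketched roughly as follows; the data, the column name and the tuple values are assumptions for illustration, and the video's own code may differ:

```python
import pandas as pd

# Illustrative lookup and data set.
valid_ids = (1, 2, 3)
df = pd.DataFrame({"id": [1, 2, 3, 4, 5],
                   "value": ["a", "b", "c", "d", "e"]})

# First run: rows whose id is not in the tuple go to a separate error output.
errors = df[~df["id"].isin(valid_ids)]
print(errors["id"].tolist())  # [4, 5]

# Tuples cannot be changed in place, so we build a NEW tuple by
# concatenating the original with the error values found.
new_valid_ids = valid_ids + tuple(errors["id"])

# Second run against the new tuple: no errors remain.
errors_after = df[~df["id"].isin(new_valid_ids)]
print(errors_after.empty)  # True
```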

To sum it all up

In a nutshell, tuples are limited in what they can do; the best uses for them are probably:

  • Use with your code as a reference for recurring values that need to be validated.
  • Don’t use tuples where values need to be updated; use lists instead, as they can be updated in real time.

 

How to remove characters from an imported CSV file

Estimated reading time: 2 minutes

Removal of unwanted errors in your data, the easy way.
The process of importing data can take many forms, but have you been looking for a video on how to do this? Even better, are you looking for a video that shows you how to import a CSV file and then cleanse it, effectively removing any unwanted characters?

As a follow-up to Python – how do I remove unwanted characters, which focused on cleansing data created within the code, this video runs through several options to open a CSV file, find the unwanted characters, remove them from the dataset and then return the cleansed data.

How to get in amongst the problem data:

The approach here looked at three different scenarios:

(A) Using a variable that is equal to an open function, and then reading the variable to a data frame.

(B) Using a “with statement” and an open function together, and returning the output to a data frame.

(C) Using read_csv to quickly and efficiently read in the CSV file and then cleanse the dataset.
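Sketched in code, the three scenarios might look like this. The file contents, column names and the set of unwanted characters are assumptions for illustration; an in-memory buffer stands in for the CSV file on disk:

```python
import io
import pandas as pd

# In-memory CSV standing in for a file; with a real file, replace
# io.StringIO(csv_text) with open("data.csv").
csv_text = "id,name\n1,al/ice\n2,b)ob\n3,car?ol\n"

# (A) A variable equal to an open function, then read into a data frame.
handle = io.StringIO(csv_text)
df_a = pd.read_csv(handle)
handle.close()

# (B) A "with statement" and an open function together.
with io.StringIO(csv_text) as handle:
    df_b = pd.read_csv(handle)

# (C) read_csv directly - the least code of the three options.
df_c = pd.read_csv(io.StringIO(csv_text))

# Cleansing: strip each unwanted character from the name column.
for ch in "/)?":
    df_c["name"] = df_c["name"].str.replace(ch, "", regex=False)
print(df_c["name"].tolist())  # ['alice', 'bob', 'carol']
```

All three produce the same data frame; option (C) simply gets there with the fewest lines.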

Some minor bumps in the road that need some thought

There were some challenges with this that you will see in the video:

  • Options A & B had to deploy additional code just to get the data frame the way we wanted.
  • The additional lines of code made the solution more complex to manage.

In the end, read_csv was probably the best approach; it required less code and would be easier to manage in the long run.

 

As always thanks for watching, please subscribe to our YouTube channel, and follow us on social media!

Data Analytics Ireland

 

How to remove unwanted characters

Estimated reading time: 2 minutes

Removing the unwanted data that is holding you up.
A situation has arisen: you have information that contains erroneous data. What do you do?

Data issues are a common scenario faced by many data analytics professionals and the industry as a whole. Data quality now has become more critical, especially as we move more processes online and the digital landscape increases.

Most data goes through a process of being transferred between systems before it is used, and reports rely on its accuracy. If the data in the source system has quality issues and the problem is not addressed before the data moves on, those quality issues can spread throughout an organisation, expanding like a spider’s web.

The next step, looking to fix the problem and planning for it.

To combat this, data professionals need to come up with a plan for how to tackle it, either:

  • Fix at source.
  • Take the data in before moving it on, and investigate the problems.
  • Reject the file or part thereof.

All three options above carry costs and implications; depending on the industry, you need to pick the most appropriate way to handle them. For example, in the banking industry, payment files can sometimes contain data that is rejected entirely or in part, but the bank may decide to discard only the records with the bad data and process everything else.

How to go about it and how regular expressions can help

In this video, we go through an example of how to cleanse a data set:

(A) We use a list to define the problem characters we need to find.

(B) We use functions to process the data, finding the problem values and extracting them.

(C) Regular expressions then find the special characters in the data set.
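A rough sketch of the three steps above; the list of problem characters, the column name and the sample data are all assumptions for illustration:

```python
import re
import pandas as pd

# (A) An assumed list of problem characters to search for.
problem_chars = ["£", "$", "%", "&"]

# Made-up sample data containing some of those characters.
df = pd.DataFrame({"name": ["ann£e", "bob", "ca$rl", "dee"]})

# (C) Build a regular expression character class, escaping each
# character so it is matched literally.
pattern = "[" + "".join(re.escape(c) for c in problem_chars) + "]"

# (B) A function that processes the data and extracts the problem rows.
def find_problems(series):
    return series[series.str.contains(pattern, regex=True)]

print(find_problems(df["name"]).tolist())  # ['ann£e', 'ca$rl']

# Cleansing: remove the special characters and return the cleansed data.
df["name"] = df["name"].str.replace(pattern, "", regex=True)
print(df["name"].tolist())  # ['anne', 'bob', 'carl', 'dee']
```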

Regular expressions are used extensively across many programming languages; they are a good way to test data and find erroneous values. If you are thinking about machine learning, it is quite important to get a more thorough knowledge of how they work. Here is a good link for further reading if you need more information: Regular Expression how to

Thanks for watching and if you like, please share and subscribe through the buttons on this page!

Data Analytics Ireland