How to data cleanse a database table

In Data Analytics, that is a very relevant question, and something I look to implement in most projects, sometimes it is too easy to click the shortcut icon to your spreadsheet application!

Here we are looking to bring a little automation into these videos. Building on How to import data from files and Removing characters from an imported CSV file this video connects to a Microsoft Azure cloud database table, brings in the data with errors on it, fixes the errors and displays the correct output on the screen.

What can this do for organisations?

There are several benefits to automating this step:

  • Less manual intervention if there is a need to fix data issues.
  • Better productivity.
  • Better data flows with no errors and quicker reporting.

 

Moving away from files

The process of moving away from files and into automation has several steps:

  • Be clear on your data needs.
  • Understand what you are trying to achieve.
  • Build a process that is repeatable but can be updated easily.
  • Ensure that you build in data quality checks, helps deliver the better output to the users.

Thanks for stopping by!

Data Analytics Ireland

 

How to remove unwanted characters

Estimated reading time: 2 minutes

Removing the unwanted, that is holding you up.
A situation has arisen of you having information, which has erroneous data inside it, what do you do?

Data issues are a common scenario faced by many data analytics professionals and the industry as a whole. Data quality now has become more critical, especially as we move more processes online and the digital landscape increases.

Most pieces of data go through a process of been transferred somewhere between systems to be used or reports rely on the accuracy of them. If the data in the source system has quality issues, the problem if not addressed before going somewhere else can then push the data quality issues more throughout an organisation, like a spiders web it expands further.

The next step, looking to fix the problem and planning for it.

To combat this data, professionals need to come up with a plan on how to tackle this, either:

  • Fix at source
  • Take the data in before moving it on, and investigate the problems.
  •  Reject the file or part thereof.

All three options above have scenarios around them with costs and implications, depending on the industry, you need to pick the most appropriate way to handle. As an example, in the banking industry payment files can sometimes have data in them that is rejected entirely or in part. But the bank may decide they will only discard the records with the wrong data and process everything else.

How to go about it and how regular expressions can help

In this video, we look to go through an example of how to cleanse a data set;

(A) We use a list to check what problems we need to find.

(B) Using functions again to process through the data to find the problem and extract them.

(C) Regular expressions also appear as they look to find the special characters in the data set.

The concept of regular expressions is used extensively across several programming languages; it is a good way to test data and find erroneous values. If you are thinking about machine learning, it is quite important to get a more thorough knowledge of how they work. Here is a good link for further reading if you need more information Regular Expression how to

Thanks for watching and if you like, please share and subscribe through the buttons on this page!

Data Analytics Ireland

YouTube channel lists – Python Data Cleansing

Ever had a process where you received a set of data, and it took a bit of effort to cleanse the data, so it looks the way you want it?

The world of data processing and data exchanged between servers and organisations needs careful attention, one person’s idea of clean data might not be other persons, hence the difference between the two can lead to data issues.

Experian had an excellent article What is data cleansing? in that they talk about several factors about data:

  • It could be incorrect
  • and incomplete
  •  and Duplicated

 

One of the things they highlighted also is that under GDPR, organisations have to deal with more focus on data been accurate, complete and up to date.

We are putting together several videos in this area over time, so you will start to see them as they go up.

Please like and share through the social media buttons shared on the page here, thanks for watching!

Data Analytics Ireland