How to Create an XML file from Excel using Python

Estimated reading time: 3 minutes

Are you working on a data analytics project where you need to feed your data to a location that is able to process an XML file?

The ability to get your data into a structured format like XML has many benefits:

(A) You can transfer the data to a web service for processing.

(B) Multiple different formats of your raw data can be standardised, enabling quick conversion and processing.

(C) XML files can be read by multiple different programs, as long as you deliver them in the correct format.

(D) The receiver of data can easily read the XML file and store it on their database.

The ability to use this method to read and transfer data is a very powerful way to help a data analyst process large data sets.

In fact, if you are using cloud-based applications to analyse the information you are storing, this will quickly enable you to deliver the data.

What packages do I need in Python?

The first step is to import the following:

import pandas as pd
from lxml import etree as et

Next we want to read in the source data

In this instance, we are reading an excel file

raw_data = pd.read_excel(r'Link to where your data is stored including the full file name')

Now we want to start building the XML structure

The FIRST STEP is to define the root

root = et.Element('root')

The root is the parent of all the data items (tags) contained in the XML file and is needed as part of the structure

The SECOND STEP is to define the tag names that will store each row of the source data

for row in raw_data.iterrows(): ==> This is a loop that takes runs through each record and populates for each tag.
    root_tags = et.SubElement(root, 'ExportData') #=== > Root name
# These are the tag names for each row (SECTION 1)
    Column_heading_1 = et.SubElement(root_tags, 'Name')
    Column_heading_2 = et.SubElement(root_tags, 'Area')
    Column_heading_3 = et.SubElement(root_tags, 'NoPurchases')
    Column_heading_4 = et.SubElement(root_tags, 'Active')

###These are the values that will be populated for each row above
# The values inside the [] are the raw file column headings.(SECTION 2)
    Column_heading_1.text = str(row[1]['Name'])
    Column_heading_2.text = str(row[1]['Area'])
    Column_heading_3.text = str(row[1]['No Purchases'])
    Column_heading_4.text = str(row[1]['Active'])

The raw file looks like this:

NameAreaNo PurchasesActive
JohnDublin2Y
MaryGalway3N
JoeLimerick4N
JimmyKilkenny55Y
JenniferBelfast6N
SusanWaterford3Y
JakeCork1Y
BobbyDundalk11N
SarahSligo9N
CianEnnis8Y
Raw file data that will be imported into the XML file

The THIRD STEP is to create the XML file and populate it with the data from the source file

# This Section outputs the data to an xml file
# Unless you tell it otherwise it saves it to the same folder as the script.
tree = et.ElementTree(root) ==> The variable tree is to hold all the values of "root"
et.indent(tree, space="\t", level=0) ===> This just formats in a way that the XML is readable
tree.write('output.xml', encoding="utf-8") ==> The data is saved to an XML file

The XML output should look like the below

<root>
	<ExportData>
		<Name>John</Name>
		<Area>Dublin</Area>
		<NoPurchases>2</NoPurchases>
		<Active>Y</Active>
	</ExportData>
	<ExportData>
		<Name>Mary</Name>
		<Area>Galway</Area>
		<NoPurchases>3</NoPurchases>
		<Active>N</Active>
	</ExportData>
	<ExportData>
		<Name>Joe</Name>
		<Area>Limerick</Area>
		<NoPurchases>4</NoPurchases>
		<Active>N</Active>
	</ExportData>
	<ExportData>
		<Name>Jimmy</Name>
		<Area>Kilkenny</Area>
		<NoPurchases>55</NoPurchases>
		<Active>Y</Active>
	</ExportData>
	<ExportData>
		<Name>Jennifer</Name>
		<Area>Belfast</Area>
		<NoPurchases>6</NoPurchases>
		<Active>N</Active>
	</ExportData>
	<ExportData>
		<Name>Susan</Name>
		<Area>Waterford</Area>
		<NoPurchases>3</NoPurchases>
		<Active>Y</Active>
	</ExportData>
	<ExportData>
		<Name>Jake</Name>
		<Area>Cork</Area>
		<NoPurchases>1</NoPurchases>
		<Active>Y</Active>
	</ExportData>
	<ExportData>
		<Name>Bobby</Name>
		<Area>Dundalk</Area>
		<NoPurchases>11</NoPurchases>
		<Active>N</Active>
	</ExportData>
	<ExportData>
		<Name>Sarah</Name>
		<Area>Sligo</Area>
		<NoPurchases>9</NoPurchases>
		<Active>N</Active>
	</ExportData>
	<ExportData>
		<Name>Cian</Name>
		<Area>Ennis</Area>
		<NoPurchases>8</NoPurchases>
		<Active>Y</Active>
	</ExportData>
</root>

Additional XML data can be added

  1. Add more rows – All you need to do is add onto the source file and save. When you rerun the logic it will read in the extra information.
  2. Add more columns – All you need to do is go to the second step above add in a tag name to SECTION 1. Seperately you will need to add an additional column with data to the source file, and then add that column name to SECTION 2 as well

How to Add Formulas to Excel using Python

Estimated reading time: 3 minutes

You may be working on automating some exports to Excel using Python to compare files or just pure and simple adding formulas to an Excel file before you open it up.

Here we explain adding formulas to your Excel output using Numpy or adding the calculations to specific cells in the output.

Adding formulas to specific cells

First of all, let’s look at the normal spreadsheet with some calculations, these have the formulas typed in. The ultimate objective is to have the Python code do this for us, one less step.

As can be seen, the cells have the formulas in them, but this would be a very time-consuming process if you had to do it multiple times, in multiple spreadsheets.

To get around this we can write the Python logic as follows:

  1. Create three lists and three dataframes as follows.
datasetA_list = np.array([1,2,3,4,5,6,7,8,9,10])

datasetB_list = np.array([9,8,65,43,3,21,3,2,1,7])

dataset_list = ('sum','average','median','standard deviation','count','correlation')


datasetA = pd.DataFrame(datasetA_list,columns=['ValueA'])
datasetB = pd.DataFrame(datasetB_list,columns=['ValueB'])
dataset_list_calcs = pd.DataFrame(dataset_list, columns=['Calcs'])

2. Next create a path to where you are going to store the data as follows:

path = 'output.xlsx'

3. In this next step create the workbook and location where the data will be stored. This will load the headings created in step 1 to a particular location on the spreadsheet.

workbook = pd.ExcelWriter(path, engine='openpyxl')
workbook.book = load_workbook(path)
workbook.sheets = dict((ws.title,ws) for ws in workbook.book.worksheets)

datasetA.to_excel(workbook,sheet_name="Sheet1", startrow=1,index=False, header=True,)
datasetB.to_excel(workbook,sheet_name="Sheet1", startrow=1, startcol=2,index=False, header=True)
dataset_list_calcs.to_excel(workbook,sheet_name="Sheet1", startrow=1, startcol=4,index=False, header=True)

4. Load the formulas into cells besides their relevant headings. This should line post these formulas beside the relevant heading created in step 1.

###Creating calculations for datasetA

sheet = workbook.sheets['Sheet1']
sheet['E2'] = 'CalcsA'
sheet['F3'] = '=SUM(A3:A12)'
sheet['F4'] = '=AVERAGE(A3:A12)'
sheet['F5'] = '=MEDIAN(A3:A12)'
sheet['F6'] = '=STDEV(A3:A12)'
sheet['F7'] = '=COUNT(A3:A12)'
sheet['F8'] = '=CORREL(A3:A12,C3:C12)'

###Creating calculations for datasetB

sheet = workbook.sheets['Sheet1']
sheet['H2'] = 'CalcsB'
sheet['H3'] = '=SUM(C3:C12)'
sheet['H4'] = '=AVERAGE(C3:C12)'
sheet['H5'] = '=MEDIAN(C3:C12)'
sheet['H6'] = '=STDEV(C3:C12)'
sheet['H7'] = '=COUNT(C3:C12)'
sheet['H8'] = '=CORREL(A3:A12,C3:C12)'

Use Numpy to create the calculations

a. Create the calculations that you will populate into the spreadsheet, using Numpy

a = np.sum(datasetA_list)
b = np.average(datasetA_list)
c = np.median(datasetA_list)
d = np.std(datasetA_list,ddof=1) ## Setting DDOF = 0 will give a differnt figure, this corrects to match the output.
f = np.count_nonzero(datasetA_list)
g = np.corrcoef(datasetA_list,datasetB_list)

b. Create the headings and assign them to particular cells

sheet['E14'] = 'Numpy Calculations'
sheet['E15'] = 'Sum'
sheet['E16'] = 'Average'
sheet['E17'] = 'Median'
sheet['E18'] = 'Standard Deviation'
sheet['E19'] = 'Count'
sheet['E20'] = 'Correlation'

c. Assign the variables in step a to a set of cells

sheet['F15'] = a
sheet['F16'] = b
sheet['F17'] = c
sheet['F18'] = d
sheet['F19'] = f
sheet['F20'] = str(g)

d. Save the workbook and close it – This step is important, and always include.

workbook.save()
workbook.close()

And the final output looks like…

How to count the no of rows and columns in a CSV file

So you are working on a number of different data analytics projects, and as part of some of them, you are bringing data in from a CSV file.

One area you may want to look at is How to Compare Column Headers in CSV to a List in Python, but that could be coupled with this outputs of this post.

As part of the process if you are manipulating this data, you need to ensure that all of it was loaded without failure.

With this in mind, we will look to help you with a possible automation task to ensure that:

(A) All rows and columns are totalled on loading of a CSV file.

(B) As part of the process, if the same dataset is exported, the total on the export can be counted.

(C) This ensures that all the required table rows and columns are always available.

Python Code that will help you with this

So in the below code, there are a number of things to look at.

Lets look at the CSV file we will read in:

In total there are ten rows with data. The top row is not included in the count as it is deemed a header row. There are also seven columns.

This first bit just reads in the data, and it automatically skips the header row.

import pandas as pd

df = pd.read_csv("csv_import.csv") #===> reads in all the rows, but skips the first one as it is a header.


Output with first line used:
Number of Rows: 10
Number of Columns: 7

Next it creates two variables that count the no of rows and columns and prints them out.

Note it used the df.axes to tell python to not look at the individual cells.

total_rows=len(df.axes[0]) #===> Axes of 0 is for a row
total_cols=len(df.axes[1]) #===> Axes of 1 is for a column
print("Number of Rows: "+str(total_rows))
print("Number of Columns: "+str(total_cols))

And bringing it all together

import pandas as pd

df = pd.read_csv("csv_import.csv") #===> reads in all the rows, but skips the first one as it is a header.

total_rows=len(df.axes[0]) #===> Axes of 0 is for a row
total_cols=len(df.axes[1]) #===> Axes of 0 is for a column
print("Number of Rows: "+str(total_rows))
print("Number of Columns: "+str(total_cols))

Output:
Number of Rows: 10
Number of Columns: 7

In summary, this would be very useful if you are trying to reduce the amount of manual effort in checking the population of a file.

As a result it would help with:

(A) Scripts that process data doesn’t remove rows or columns unnecessarily.

(B) Batch runs who know the size of a dataset in advance of processing can make sure they have the data they need.

(C) Control logs – databases can store this data to show that what was processed is correct.

(D) Where an automated run has to be paused, this can help with identifying the problem and then quickly fixing.

(E) Finally if you are receiving agreed data from a third party it can be used to alert them of too much or too little information was received.

Here is another post you should read!

How to change the headers on a CSV file

how to copy/paste special a range of cells with xlwings

Are you using Microsoft Excel in conjunction with Python for your data analytics projects, but have a need to automate certain tasks?

In this blog post we will take you through how to remove formulas in a cell , and replace them with their returned values.

This is achieved through using xlwings, a very powerful library that can be used with Python.

So what we want to do is remove the formulas in an excel sheet, normally this is achieved through “copy and paste special values” in excel.

Below is a screenshot of the before:

In order to remove the formulas we use the following code:

This code basically loads the file( input) and looks for the range F2:F5.

Then using the the xlwings functionality, it makes the old file range values equal to the new range values.

The difference is that it looks at what the vlookup returned value to the cell and not the formula.

from openpyxl import load_workbook
import xlwings as xlfile

filepath_input = r'your file path here'
filepath_output = r'your file path here'

input_workbook = load_workbook(filepath_input)
output_workbook = load_workbook(filepath_output)

ws = input_workbook['Sheet1']

### Removing formulas in the spreadsheet

oldlist = xlfile.Book(filepath_input)
newlist = xlfile.Book(filepath_output)

my_values = oldlist.sheets['Sheet1'].range('F2:F5').options(ndim=2).value

my_values1 = newlist.sheets['Sheet1'].range('F2:F5').options(ndim=2).value

newlist.sheets['Sheet1'].range('F2:F5').value = my_values1

The output is a new file , with the formulas removed!

And there you go, there are other options though.

Theoretically you don’t have to create a new sheet like I did above, that was done to show the before and after, otherwise the input file is overwritten, and if that is what you need then your problem is solved!

In rolling out this solution, there are other options out there as well, I found this the simplest to implement.

Openpyxl can be used and it was the most common suggestion , but I found its implementation not as straight forward.

Python Tutorial: How to create charts in Excel

Estimated reading time: 2 minutes

Taking it further with a python excel chart.

We have created some video content here using Python with Excel that illustrates the different ways you can leverage Python to

  • Data cleanse
  • Find unwanted characters
  • And see if the file is empty before import!

What this video is giving as output

Here we are looking to introduce charts in Excel, and how how to use Python to easily work with your data and export to an excel sheet.

Below is the final output of our two charts, this is for illustrative purposes, and is taken from the Irish Governments website as at 1st May 2020, importing the cell ranges associated with them.

A barchart of covid cases bullt in Python a line chart of covid cases built in Python

How we went about it

In the below video on creating a python excel chart, we have approached this as follows:

 

  • Created four separate data frames, they are the four regions that will feed into the creation of the graphs.
  • Separate to this, we merged the four data frames into one, to use with the bar chart.

And to finish off

If you like what this video has explained, please click to see our YouTube channel for more informative videos.

Data Analytics Ireland

 

how to validate cell values in excel

Estimated reading time: 2 minutes

Validating cells in Excel quickly – how to do it easily!
Are you working with large spreadsheets and looking to quickly at data validation exercise to save you time?

The aim would be to run your code and test it against some predefined rules you or your data analyst would have written to make sure it brings back the expected checks.

If you look at the below, this is the final output of this video, highlighting two cells that are over budget based on the companies predefined budget.

data validation example

The structure of this code can be broken down into the following steps:

  • Read in the excel file, see a previous example here How to import data into excel
  •  Run the first function,  checks if the spreadsheet cell value is over or under budget.
  •  Run the second function that takes the value from the first function and applies the colour red to the cell if it is over budget.

 

Finally

You can expand this code to incorporate more functionality, such as:

  • Change the colour of the cells, to have multiple colours returned.
  •  Update the two functions to include more business rules.
  •  You could check if the file is empty before processing as shown here How to check if a file is empty

Please subscribe to our YouTube channel, the button is the right-hand side of the page if you would like to see more like these.

Data Analytics Ireland

How to import data into excel

Estimated reading time: 2 minutes

This import will not cost you anything except running some code!
The need to productively have an all in one solution to manage your data as your code has become more critical as volumes of data become larger. Do you, as a data analyst, therefore, need to send your data into an excel file? Previously we would have posted a video How do I remove unwanted characters, and here we build on that theme, linking in with Excel.

Two techniques used here to achieve this are  XLSX writer explained, and Openpyxl explained.

The elements we cover off are:

  • Load data from a data frame and populate it into an excel file.
  •  Renaming of a sheet.
  • Creating a new sheet and giving it a name.
  • We look at properties, namely changing the colour of a tab to yellow.
  • You may need to put some text in a sheet cell, to act as a header or to show you have a total figure there.
  • And the final piece of functionality covered in this video is how to copy data from one sheet into another one.

There are several benefits to putting all the upfront work in Python:

  1. The benefits of cleansing the data or format will save time further down the road.
  2. After you receive the document, you can quickly review it without fixing errors in the data.
  3. If you are distributing the output to several people, it quickly gets them what they want, without manual intervention after the logic has completed.

I have certainly benefited from this data cleansing and importing it into excel exercise, as the two are combined now, makes it a more efficient process.

Please remember to subscribe to our channel if you like the work we are doing, thanks!

Data Analytics Ireland