So you have an Excel file that has hundreds or maybe thousands of records, but you want to convert it to an XML file, and have each row as a separate file?
This can be achieved very easily using Python, in three easy steps.
First of all, what is XML?
XML stands for “extensible markup language”, and is used extensively with websites, Webservices and where companies want to transfer data between each other in a predefined format that follows a recognized standard.
To understand more about what is xml, follow the link. One thing to note is that it may reference the financial services industry, but XML is used widely across multiple industries too, as we are now digitizing a lot of our processes, the transfer of data needs to follow a standard that everyone can work with seamlessly.
Step 1 – Let’s look at our raw data
Below is an excel file with three columns and eight rows. The objective is to create eight separate XML files for each row.
Step 2 – Import the data and create a master XML file
Below is the logic I used in How to Create an XML File from Excel using Python, I have attached comments where necessary to explain what each line is doing.
The purpose at this point is to create one file with all the data from excel called output.xml. We will then use this to create the separate files we need and then delete it.
The next steps will take you through that.
import pandas as pd
from lxml import etree as et
import os
# STEP 1 - This step creates a master XML file
raw_data = pd.read_excel(r'\Users\haugh\OneDrive\dataanalyticsireland\YOUTUBE\how_to_read_each_excel_row_into_a_separate_XML_file\excel_raw_data_city.xlsx')
root = et.Element('root')
for row in raw_data.iterrows(): #==> This is a loop that takes runs through each record and populates for each tag.
root_tags = et.SubElement(root, 'ExportData') #=== > Root name
# These are the tag names for each row (SECTION 1)
Column_heading_1 = et.SubElement(root_tags, 'City')
Column_heading_2 = et.SubElement(root_tags, 'Area')
Column_heading_3 = et.SubElement(root_tags, 'Population')
###These are the values that will be populated for each row above
# The values inside the [] are the raw file column headings.(SECTION 2)
Column_heading_1.text = str(row[1]['City'])
Column_heading_2.text = str(row[1]['Area'])
Column_heading_3.text = str(row[1]['Population'])
# This Section outputs the data to an xml file
# Unless you tell it otherwise it saves it to the same folder as the script.
tree = et.ElementTree(root) #==> The variable tree is to hold all the values of "root"
et.indent(tree, space="\t", level=0) #===> This just formats in a way that the XML is readable
tree.write('output.xml', encoding="utf-8") #==> The data is saved to an XML file
Step 3 – Loop through each row and create the separate files
In the below code, it opens the output.xml file and using the loop looks for all the data between the tags “ExportData”.
Then it copies them and then creates a new file.
Once one is done, it moves to the next set of tags and repeats the process, till it reaches the end of the output.xml file.
master_file = et.iterparse('output.xml', events=('end', ))
index = 0
for i, j in master_file:
if j.tag == 'ExportData':
index += 1
filename = format(str(index) + ".xml")
with open(filename, 'wb') as f:
f.write(et.tostring(j))
Step 4 – Delete the output.xml file – purely optional
The objective here was to create separate XML files, based on the input of the original Excel file.
We have also created an additional file called output.xml to get us to the end goal.
If we have no need for this file then it can be removed with the below code, be sure that it is at the very end of your logic.
The import statement for os.unlink is in step two.
os.unlink(r'\Users\haugh\OneDrive\dataanalyticsireland\YOUTUBE\how_to_read_each_excel_row_into_a_separate_XML_file\output.xml')