Parse a huge XML file in Python
These days I am learning data science, and since there are many ways to parse a huge file in Python, today I want to introduce one that I recently learned. Here is the official documentation for iterparse; it doesn't give clear examples, so I will make up for that here.
First of all, you need to import cElementTree:
import xml.etree.cElementTree as etree
It's the C-accelerated implementation of ElementTree. Both modules offer the same API, including the incremental XML parser iterparse; the C version is simply much faster.
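One note before going on: cElementTree was deprecated in Python 3.3 and removed in Python 3.9, because plain xml.etree.ElementTree now picks up the C implementation automatically. If you want the import to keep working across versions, a common pattern is the following sketch:

try:
    import xml.etree.cElementTree as etree  # old accelerator module (pre-3.3 era)
except ImportError:
    import xml.etree.ElementTree as etree   # modern Python: the C speedups are built in

Everything below works the same either way.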
To use it, you only need two lines:
for event, elem in etree.iterparse("filename", events=("start", "end")):
    if event == "end" and elem.tag == "element_name":   # plus whatever other conditions you need
        ...
iterparse supports four kinds of events (start, end, start-ns, end-ns); to keep things simple we only need two of them, start and end. A start event fires when the parser reads an element's opening tag, and an end event fires at the closing tag, when the element and all of its children have been fully parsed.
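To make that concrete, here is a minimal, self-contained sketch; the file name and the catalog/book tags are made up just for illustration:

import xml.etree.ElementTree as etree  # plain module; the C parser is used automatically

# Write a tiny sample file so the example can actually run.
with open("books.xml", "w") as f:
    f.write('<catalog><book id="1">Dune</book><book id="2">Hyperion</book></catalog>')

for event, elem in etree.iterparse("books.xml", events=("start", "end")):
    print(event, elem.tag)
# Output:
#   start catalog
#   start book
#   end book      <- the first <book> is now fully parsed
#   start book
#   end book
#   end catalog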
Every element we get is handed to us in the elem variable. Use
elem.get("attribute_name")
to read the value of one of its attributes, and elem.tag to get the element's name. In other words, elem.tag tells you which element you are looking at, while elem.get("...") pulls the individual values out of it.
In short, if you want to build up a new XML structure as the document is being read, handle the start event; if you just want to read data out of each finished element, handle the end event, because at that point the element is complete. And if you want to do both at once, use two conditional statements like
if event == "end"......
......
if event == "start"......
......
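For example, here is a sketch that uses start events to keep track of the current nesting path and end events to read the data out; the file and the tag names are invented for the example:

import xml.etree.ElementTree as etree

# A tiny sample file so the sketch actually runs.
with open("library.xml", "w") as f:
    f.write("<library><book><title>Dune</title></book>"
            "<book><title>Hyperion</title></book></library>")

path = []  # stack of currently open element names

for event, elem in etree.iterparse("library.xml", events=("start", "end")):
    if event == "start":
        path.append(elem.tag)      # opening tag: remember where we are
    else:  # "end"
        if elem.tag == "title":
            print("/".join(path), "->", elem.text)   # e.g. library/book/title -> Dune
        path.pop()                 # this element is finished
        elem.clear()               # free its memory right away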
Finally, call
elem.clear()
on every element once you are done with it. This throws away the element's contents and frees the memory, so there is always room to parse the next one.
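One caveat: even after elem.clear(), the root element still keeps a reference to every child it has seen, so for really huge files it is common to also clear the root as you go. A sketch of the full pattern, with huge.xml and the record tag as placeholders:

import xml.etree.ElementTree as etree

context = etree.iterparse("huge.xml", events=("start", "end"))
_, root = next(context)              # the first event is the start of the root element

for event, elem in context:
    if event == "end" and elem.tag == "record":   # "record" is a placeholder tag name
        # ... process elem here, e.g. elem.get("id"), elem.text ...
        elem.clear()                 # drop this element's contents
        root.clear()                 # drop the references root keeps to finished children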
In general, iterparse is one of the most efficient ways to parse a huge XML file; it is said to handle files of up to 40 GB. Still, make sure your computer has at least 8 GB of memory, since the operating system needs a few gigabytes of headroom for temporary files.
You can find a complete working program on my GitHub.