Web Technologies/2021-2022/Laboratory 5

Parsing XML documents

An important issue when dealing with XML is parsing the documents. There are several parser types including:

DOM parsers:
- allow the navigation of the XML document as it were a tree.
- the main drawback is that the document needs to be completely loaded into memory before actually parsing it.
- DOM documents can be either created by parsing an XML file, or by users which want to create an XML file programmatic.
SAX parsers:
- event-driven API in which the XML document is read sequentially by using callbacks that are triggered when different element types are meet.
- overcomes the DOM’s memory problem, and is fast and efficient at reading files sequentially.
- its problem comes from the fact that it is quite difficult to read random information from inside an XML file.
FlexML parsers:
- follow the SAX approach and rely on events during the parsing process.
- it does not constitute a parsing library by itself, but instead it converts the DTD file into a parser specification usable with the classical Flex parser generator.
Pull parsers:
- use an iterator design pattern in order to sequentially read various XML items such as elements, attributes or data.
- this method allows the programmer to write recursive-descent parsers:
  - applications in which the structure of the code that handles the parsing looks like the XML they process.
  - examples of parsers from this category include: StAX13, and the .NET System.Xml.XmlReader.
Non-extractive parsers:
- a new technology in which the object oriented modeling of the XML is replaced with 64-bit Virtual Token Descriptors.
- one of the most expressive parser belonging to this category is VTD-XML.

SAX

SAX (Simple API for XML) is a serial access XML parser. A SAX parser can be found in the xml.sax module found here.

The following fragment of code shows how we could use SAX to parse an XML document:

Using Python3

from xml.sax import handler, make_parser, SAXParseException


class MyHandler(handler.ContentHandler):
    def __init__(self):
        pass
    
    def startDocument(self):
        print('Entering document')

    def endDocument(self):
        print('Leaving document')      

    def startElement(self, name, attrs):
        print(f'Found element: {name}')
        
        for attr_name, value in attrs.items():
            print(f'Found attribute: {attr_name} with value: {value}')

    def endElement(self, name):
        print(f'Leaving element: {name}')

    def characters(self, content):
        print(f'Found text node: {content}')

    def ignorableWhitespace(self, content):
        pass
        
    def processingInstruction(self, target, data):
        pass


try:
    parser = make_parser()
    parser.setContentHandler(MyHandler())
    parser.parse('queue.xml')
    print('The document is well formed')
except SAXParseException:
    print('The document is not well formed')

Links:

Python3 SAX documentation

DOM

DOM (Document Object Model) is a convention for representing XML documents. A DOM parser can be found in xml.dom.minidom found here.

DOM handles XML files as being made of the following types of nodes:

Document node
Element nodes
Attribute nodes
Leaf nodes:
- Text nodes
- Comment nodes
- Processing instruction nodes
- CDATA nodes
- Entity reference nodes
- Document type nodes
Non-tree nodes;

Using xml.dom.minidom with Python3

The following fragment of code shows how we could use DOM to traverse an XML tree:

DOM also allows users to create a new XML document or change the structure of an already existing one:

import xml.dom.minidom as mdom

try:
    dom = mdom.parse('queue.xml')
except Exception:
    print('Document is not well formed')
    exit(1)


def follow_nodes(element):
    print(f'Node name: {element.nodeName}')
    print(f'Node type: {type(element)}')
    try:
        print(f'Node local name: {element.nodeLocalName}')
    except Exception:
        pass
    print(f'Node Value: {element.nodeValue}')

    for elem in element.childNodes:
        follow_nodes(elem)

follow_nodes(dom.firstChild)

To programmatically create an XML document using the xml.dom module:

import xml.dom.minidom as mdom

def create_xml():
    doc = mdom.Document()

    root = doc.createElement('books')
    doc.appendChild(root)

    book = doc.createElement('book')
    book.setAttribute('title', 'Processing XML with Java')
    book.setAttribute('author', 'Elliotte Rusty Harold')
    book.appendChild(doc.createTextNode("A complete tutorial about writing Java programs that read and write XML documents."));
    
    root.appendChild(book)
    
    return root

with open('out.xml', 'w') as f:
    create_xml().writexml(f)

Links:

Exercises

Parse the XML created in your assignment from Laboratory 3 using both SAX and DOM. Print out the parsing time of each method (hint: use time.time() to get the start and end time).
Create the XML from your assignment in Laboratory 3 using DOM. Print the result to an XML file using the writexml method from the minimal DOM implementation.

Gabriel Iuhasz, 2019-10-30, iuhasz.gabriel@e-uvt.ro

Alexandru Munteanu, 2021-10-25, alexandru.munteanu@e-uvt.ro