Web technologies/2014-2015/Laboratory 5
Revision as of 11:06, 15 November 2016
Parsing XML documents
An important issue when dealing with XML is parsing the documents. There are several parser types, including:
- DOM parsers:
  - allow the XML document to be navigated as if it were a tree.
  - the main drawback is that the whole document must be loaded into memory before it can be navigated.
  - DOM documents can be created either by parsing an XML file or programmatically, by users who want to build an XML file from code.
- SAX parsers:
  - provide an event-driven API in which the XML document is read sequentially and callbacks are triggered as the different item types are encountered.
  - overcome DOM's memory problem and are fast and efficient at reading files sequentially.
  - their drawback is that random access to information inside the XML file is quite difficult.
- FlexML parsers:
  - follow the SAX approach and rely on events during the parsing process.
  - FlexML is not a parsing library in itself; instead, it converts a DTD file into a specification usable with the classic Flex scanner generator.
- Pull parsers:
  - use the iterator design pattern to read XML items such as elements, attributes or data sequentially.
  - this method allows the programmer to write recursive-descent parsers, i.e. applications in which the structure of the parsing code mirrors the structure of the XML it processes.
  - examples of parsers in this category include StAX and the .NET System.Xml.XmlReader.
- Non-extractive parsers:
  - a newer technology in which the object-oriented modeling of the XML is replaced with 64-bit Virtual Token Descriptors.
  - one of the most representative parsers in this category is VTD-XML.
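To make the pull-parser style from the list above more concrete, here is a minimal sketch using StAX (the standard javax.xml.stream package). The class name PullParserSketch, the method elementNames and the sample XML are assumptions made for illustration only. The point is that the application drives the loop and asks for the next event, rather than being called back by the parser.

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical class name; only the javax.xml.stream calls are standard API.
class PullParserSketch {

    // Returns the local names of all elements, in document order.
    static List<String> elementNames(String xml) {
        List<String> names = new ArrayList<>();
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            try {
                // The application pulls events one by one, like iterating a collection.
                while (reader.hasNext()) {
                    int event = reader.next();
                    if (event == XMLStreamConstants.START_ELEMENT) {
                        names.add(reader.getLocalName());
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            // Raised, for example, when the input is not well-formed.
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return names;
    }

    public static void main(String[] args) {
        String xml = "<books><book title=\"XML\">text</book></books>";
        System.out.println(elementNames(xml)); // prints [books, book]
    }
}
```

Because the loop belongs to the application, it can stop reading as soon as it has found what it needs, which is harder to arrange with callbacks.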
SAX
SAX (Simple API for XML) is a serial-access API for parsing XML. A SAX parser is available in the Xerces library.
The following fragment of code shows how we could use SAX to parse an XML document:
// SAXExample.java
import org.xml.sax.*;
import org.xml.sax.helpers.XMLReaderFactory;
import java.io.IOException;

public class SAXExample {
    public static void main(String[] args) throws IOException {
        try {
            // Obtain the default SAX parser and register our callbacks.
            XMLReader parser = XMLReaderFactory.createXMLReader();
            ContentHandler handler = new TextExtractor();
            parser.setContentHandler(handler);
            // args[0] is the path or URI of the XML document to parse.
            parser.parse(args[0]);
            System.out.println(args[0] + " is well-formed.");
        }
        catch (SAXException e) {
            System.out.println(args[0] + " is not well-formed.");
        }
    }
}
// TextExtractor.java
import org.xml.sax.*;

public class TextExtractor implements ContentHandler {
    public TextExtractor() { }

    // Handles the #PCDATA, i.e. the text nodes. Only the slice
    // [start, start + length) of the buffer belongs to this event.
    public void characters(char[] text, int start, int length) throws SAXException {
        System.out.println("Found text node: ");
        System.out.println(new String(text, start, length));
    }

    public void setDocumentLocator(Locator locator) {}

    // Handles the start of a document event.
    public void startDocument() {
        System.out.println("Entering document");
    }

    // Handles the end of a document event.
    public void endDocument() {
        System.out.println("Leaving document");
    }

    // Handles the beginning of the scope of a prefix-URI Namespace mapping.
    public void startPrefixMapping(String prefix, String uri) {}

    // Handles the ending of the scope of a prefix-URI Namespace mapping.
    public void endPrefixMapping(String prefix) {}

    // Triggers each time a start tag is found.
    public void startElement(String namespaceURI, String localName, String qualifiedName, Attributes atts) {
        System.out.println("Found element: " + localName);
        System.out.println("Attributes:");
        for (int i = 0; i < atts.getLength(); i++) {
            System.out.println("Found attribute: " + atts.getLocalName(i) + " with value: " + atts.getValue(i));
        }
    }

    // Triggers each time an end tag is found.
    public void endElement(String namespaceURI, String localName, String qualifiedName) {
        System.out.println("Leaving element: " + localName);
    }

    // Handles ignorable whitespace in element content.
    public void ignorableWhitespace(char[] text, int start, int length) throws SAXException {}

    // Handles a processing instruction such as <?target data?>.
    // The XML declaration (<?xml version="1.0" ...?>) is not reported here.
    public void processingInstruction(String target, String data) {}

    // Handles a skipped entity.
    public void skippedEntity(String name) {}
}
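Implementing ContentHandler directly means writing every callback, even the empty ones. As a possible alternative, the sketch below extends org.xml.sax.helpers.DefaultHandler, which supplies empty implementations, and overrides only what it needs. It also buffers text, since a SAX parser is allowed to deliver one run of text through several characters() calls. The class name TextCollector and the method collect are assumptions made for this illustration; the parser is obtained through JAXP's SAXParserFactory rather than XMLReaderFactory.

```java
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical handler: collects the non-blank text runs of a document.
class TextCollector extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    private final List<String> texts = new ArrayList<>();

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times for one run of text, so only accumulate here.
        buffer.append(ch, start, length);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        flush(); // a start tag ends the text run that preceded it
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        flush(); // an end tag ends the text run as well
    }

    private void flush() {
        String text = buffer.toString().trim();
        if (!text.isEmpty()) {
            texts.add(text);
        }
        buffer.setLength(0);
    }

    // Parses the given XML string and returns its text runs in document order.
    static List<String> collect(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            TextCollector handler = new TextCollector();
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
            return handler.texts;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(collect("<a>one<b>two</b>three</a>"));
    }
}
```

Whether to buffer per element or per text run is a design choice; this sketch treats every start or end tag as a boundary.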
DOM
DOM (Document Object Model) is a convention for representing XML documents as a tree of nodes. A DOM parser is available in the Xerces library.
DOM treats an XML document as being made up of the following types of nodes:
- Document node
- Element nodes
- Attribute nodes
- Leaf nodes:
  - Text nodes
  - Comment nodes
  - Processing instruction nodes
  - CDATA nodes
  - Entity reference nodes
  - Document type nodes
- Non-tree nodes
The following fragment of code shows how we could use DOM to traverse an XML tree:
import javax.xml.parsers.*; // JAXP
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;
import java.io.IOException;

public class DOMExample {
    public static void main(String[] args) {
        DOMExample iterator = new DOMExample();
        try {
            // Use JAXP to find a parser
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Turn on namespace support
            factory.setNamespaceAware(true);
            DocumentBuilder parser = factory.newDocumentBuilder();
            // Read the entire document into memory
            Node document = parser.parse(args[0]);
            // Process it starting at the root
            iterator.followNode(document);
        }
        catch (SAXException e) {
            System.out.println(args[0] + " is not well-formed.");
            System.out.println(e.getMessage());
        }
        catch (IOException e) {
            System.out.println(e);
        }
        catch (ParserConfigurationException e) {
            System.out.println("Could not locate a JAXP parser");
        }
    }

    public void followNode(Node node) {
        // Print information on node.
        System.out.println("Node name:" + node.getNodeName());
        System.out.println("Node type:" + node.getNodeType());
        System.out.println("Node local name:" + node.getLocalName());
        System.out.println("Node value:" + node.getNodeValue());
        // Process the children.
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            // Recursion on child.
            followNode(child);
        }
    }
}
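The traversal above prints the node type as a numeric code, and it never reaches attributes, because attribute nodes are not children of their element. The sketch below is one way to address both points: it maps the org.w3c.dom.Node type constants to readable names and visits attributes through getAttributes(). The class name NodeTypeSketch, its method names and the sample document are assumptions made for illustration.

```java
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; the Node constants are standard DOM API.
class NodeTypeSketch {

    // Translates the numeric code returned by Node.getNodeType() into a label.
    static String typeName(short type) {
        switch (type) {
            case Node.DOCUMENT_NODE:               return "Document";
            case Node.ELEMENT_NODE:                return "Element";
            case Node.ATTRIBUTE_NODE:              return "Attribute";
            case Node.TEXT_NODE:                   return "Text";
            case Node.COMMENT_NODE:                return "Comment";
            case Node.PROCESSING_INSTRUCTION_NODE: return "ProcessingInstruction";
            case Node.CDATA_SECTION_NODE:          return "CDATA";
            case Node.ENTITY_REFERENCE_NODE:       return "EntityReference";
            case Node.DOCUMENT_TYPE_NODE:          return "DocumentType";
            default:                               return "Other(" + type + ")";
        }
    }

    // Pre-order traversal that records the type of every node it meets.
    static void collect(Node node, List<String> out) {
        out.add(typeName(node.getNodeType()));
        // getAttributes() returns null for every node that is not an element.
        NamedNodeMap attributes = node.getAttributes();
        if (attributes != null) {
            for (int i = 0; i < attributes.getLength(); i++) {
                out.add(typeName(attributes.item(i).getNodeType()));
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), out);
        }
    }

    // Parses an XML string and returns the node types in traversal order.
    static List<String> typesOf(String xml) {
        try {
            Node document = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            List<String> out = new ArrayList<>();
            collect(document, out);
            return out;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(typesOf("<a id=\"1\"><!--c--><b>t</b></a>"));
    }
}
```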
DOM also allows users to create a new XML document or change the structure of an already existing one:
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class DOMCreatorExample {
    public static void main(String[] av) {
        DOMCreatorExample dc = new DOMCreatorExample();
        Document doc = dc.makeXML();
    }

    public Document makeXML() {
        try {
            DocumentBuilderFactory fact = DocumentBuilderFactory.newInstance();
            DocumentBuilder parser = fact.newDocumentBuilder();
            // Start from an empty in-memory document.
            Document doc = parser.newDocument();
            // The root element is appended to the document itself.
            Element root = doc.createElement("books");
            doc.appendChild(root);
            // Nodes are always created through the owning document.
            Element book = doc.createElement("book");
            book.setAttribute("title", "Processing XML with Java");
            book.setAttribute("author", "Elliotte Rusty Harold");
            book.appendChild(doc.createTextNode("A complete tutorial about writing Java programs that read and write XML documents."));
            root.appendChild(book);
            return doc;
        } catch (ParserConfigurationException ex) {
            ex.printStackTrace();
            return null;
        }
    }
}
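The example above builds the document only in memory. One common way to turn a DOM tree back into XML text is the JAXP identity Transformer; the sketch below is a minimal illustration of that approach, not the only option. The class name DomSerializerSketch, the methods toXml and sample, and the sample content are assumptions made here.

```java
import org.w3c.dom.Document;
import org.w3c.dom.Element;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import java.io.StringWriter;

// Hypothetical helper class; the javax.xml.transform calls are standard JAXP.
class DomSerializerSketch {

    // Serializes a DOM document to a string using an identity transformation.
    static String toXml(Document doc) {
        try {
            // A Transformer created without a stylesheet copies its input unchanged.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Could not serialize document", e);
        }
    }

    // Builds a tiny document, in the same way as DOMCreatorExample does.
    static Document sample() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element root = doc.createElement("books");
            doc.appendChild(root);
            Element book = doc.createElement("book");
            book.setAttribute("title", "X");
            book.appendChild(doc.createTextNode("text"));
            root.appendChild(book);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("Could not create a DOM builder", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toXml(sample()));
    }
}
```

To write to a file instead of a string, the same transform call can be given a StreamResult constructed from a java.io.File.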
Exercises
- Parse the XML created in your assignment from Laboratory 3 using both SAX and DOM. Print out the parsing time of each method (hint: use System.currentTimeMillis() to get the start and end time).
- Create the XML from your assignment in Laboratory 3 using DOM. Print the result to an XML file.
Gabriel Iuhasz, 2014-10-01, iuhasz.gabriel@e-uvt.ro