Web Science/Part1: Foundations of the web/Web content/Working with XML

From Wikiversity

Working with XML


Learning goals

  1. Understand the Document Object Model and the DOM tree
  2. Understand that HTML, when written in its XML-conformant form (XHTML), is just a special dialect of XML
  3. Understand the relationship between HTML and XML

Video


Script

In the previous lesson we saw several requirements that call for separating the content proper from structural or layout information. In this lesson, I will show you how to work with the XML file from the previous lesson such that it becomes easy to generate structure and/or layout.

Let us first rephrase the core ideas of XML:

  1. Mark up pieces of content with understandable "parentheses", e.g. <artist>Bobby McFerrin</artist>: "<artist></artist>" is markup, "Bobby McFerrin" is content (problem 2)
  2. Allow new applications, e.g. multimedia display instead of text display, to use new markup (problem 3)
  3. Allow for nesting of markup such that larger pieces of content may be treated in a meaningful way, e.g. "<record><artist>Bobby McFerrin</artist><track>...</track><track>Koblenz</track>...</record>" (problem 2)
  4. Simplify tools by having some simple rules (problem 4):
    1. each opening tag like "<city>" requires a corresponding closing tag "</city>"
    2. proper nesting of markup: "<record><artist>Bobby McFerrin</artist><track>...</track><track>Koblenz</track></record>" is allowed; "<record><artist>Bobby McFerrin</artist><track>...</track><track>Koblenz</record></track>" is disallowed
  5. Each XML document should start with a preamble (the XML declaration) specifying the XML version and the character encoding used, most often nowadays <?xml version="1.0" encoding="UTF-8"?>, which allows for internationalization (problem 5)

An XML document that follows these rules is called "well-formed". You can check the well-formedness of a document with an XML validator, such as http://www.w3schools.com/xml/xml_validator.asp
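
Well-formedness can also be checked programmatically. The following is a minimal sketch using the XML parser from Python's standard library; the helper name is_well_formed and the two sample strings are illustrative assumptions modelled on the nesting rule above, not part of the original lesson.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as one well-formed XML document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        # Raised for unclosed tags, improper nesting, stray text, etc.
        return False


# Properly nested markup (allowed by rule 4.2)
good = "<record><artist>Bobby McFerrin</artist><track>Koblenz</track></record>"
# Improperly nested markup (disallowed by rule 4.2)
bad = "<record><artist>Bobby McFerrin</artist><track>Koblenz</record></track>"

print(is_well_formed(good))  # True
print(is_well_formed(bad))   # False
```

Note that a parser only tells you whether the rules are followed; it does not tell you whether the element names make sense for your application.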

Implications that arise from these rules are that:

  • Properly nested markup gives us a data structure: a DOM tree
  • The DOM tree can be navigated recursively by different applications and repurposed for different types of devices and device sizes (problem 1)
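
To make the recursive navigation concrete, here is a small sketch that walks a DOM tree with xml.dom.minidom from Python's standard library and lists each node with an indentation that reflects its depth. The function name tree_lines and the sample record are assumptions chosen for illustration.

```python
from xml.dom import minidom


def tree_lines(node, depth=0):
    """Recursively collect one indented line per element or text node."""
    lines = []
    if node.nodeType == node.ELEMENT_NODE:
        lines.append("  " * depth + node.tagName)
        # Descend into the children: this is the recursive navigation step.
        for child in node.childNodes:
            lines.extend(tree_lines(child, depth + 1))
    elif node.nodeType == node.TEXT_NODE and node.data.strip():
        lines.append("  " * depth + '"' + node.data.strip() + '"')
    return lines


doc = minidom.parseString(
    "<record><artist>Bobby McFerrin</artist><track>Koblenz</track></record>"
)
lines = tree_lines(doc.documentElement)
print("\n".join(lines))
```

An application for a different device would use the same traversal and only change what it does at each node, for example reading text aloud instead of printing it.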

Let's look back at our running example. Unfortunately, I have not found a DOM viewer that works without installation and shows the DOM tree correctly. Below is a pretty print of the XML code, which is not a proper DOM tree view. My previous uploads in this place used http://software.hixie.ch/utilities/js/live-dom-viewer/, but it delivers a buggy tree, e.g. track titles are not correctly placed beneath tracks. Here is the pretty print:

DOM-Tree-Example-For-WebScience-MOOC.png

And here is the buggy DOM Tree

[[1]]

Let us assume some simple rules that map our element names as follows:

from              to (long name)    to (short name)
<playlist>        <document>        <html>
</playlist>       </document>       </html>
<playlistTitle>   <heading>         <h1>
</playlistTitle>  </heading>        </h1>
<owner>           <emphasized>      <em>
</owner>          </emphasized>     </em>
<tracks>          <numberedList>    <ol>
</tracks>         </numberedList>   </ol>
<track>           <listItem>        <li>
</track>          </listItem>       </li>
<playDate>        <emphasized>      <em>
</playDate>       </emphasized>     </em>
<trackTitle>      <subheading>      <h3>
</trackTitle>     </subheading>     </h3>
<record>          -suppress-        -suppress-
</record>         -suppress-        -suppress-
<recordTitle>     <emphasized>      <em>
</recordTitle>    </emphasized>     </em>
<artist>          <strong>          <strong>
</artist>         </strong>         </strong>

Then executing a corresponding transformation gives us:

 <document>
	<heading>
		This is <emphasized>Rene Pickhardt</emphasized>'s playlist. 
	</heading>
	<numberedList>
		<listItem>
			<emphasized>2013-11-02 01:54 CET</emphasized>
			<subheading>Lords of the boards</subheading>
			on <emphasized>Proud like a God</emphasized> by <strong>Guano Apes</strong>
		</listItem>
		<listItem>
			<emphasized>2013-11-02 01:57 CET</emphasized>
			<subheading>Toxicity</subheading>
			on <emphasized>Toxicity</emphasized> by <strong>System of A Down</strong>
		</listItem>
		<listItem>
			<emphasized>2013-11-02 02:01 CET</emphasized>
			<subheading>B.Y.O.B</subheading>
			on <emphasized>Mezmerize</emphasized> by <strong>System of A Down</strong>
		</listItem>
	</numberedList>
 </document>

Or using our shorthand description:

 <html>
	<h1>
		This is <em>Rene Pickhardt</em>'s playlist.
	</h1>
	<ol>
		<li>
			<em>2013-11-02 01:54 CET</em>
			<h3>Lords of the boards</h3>
			on <em>Proud like a God</em> by <strong>Guano Apes</strong>
		</li>
		<li>
			<em>2013-11-02 01:57 CET</em>
			<h3>Toxicity</h3> 
			on <em>Toxicity</em> by <strong>System of A Down</strong>
		</li>
		<li>
			<em>2013-11-02 02:01 CET</em>
			<h3>B.Y.O.B</h3> 
			on <em>Mezmerize</em> by <strong>System of A Down</strong>
		</li>
	</ol>
 </html>
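
The mapping of element names onto the short names can be sketched as a plain tag-renaming pass over the text. The source playlist file comes from the previous lesson and is not reproduced here, so the one-track sample below is a hypothetical reconstruction; the names TAG_MAP and rename_tags are likewise assumptions made for this illustration. Following the output shown above, <artist> is mapped to <strong>.

```python
import re

# Source element name -> short (HTML) element name.
# None means: suppress the tag itself but keep its content.
TAG_MAP = {
    "playlist": "html",
    "playlistTitle": "h1",
    "owner": "em",
    "tracks": "ol",
    "track": "li",
    "playDate": "em",
    "trackTitle": "h3",
    "record": None,
    "recordTitle": "em",
    "artist": "strong",
}

# Matches opening and closing tags without attributes, e.g. <track>, </track>.
TAG_PATTERN = re.compile(r"<(/?)([A-Za-z][\w.-]*)>")


def rename_tags(xml_text):
    """Rename every tag listed in TAG_MAP; leave unknown tags untouched."""
    def replace(match):
        slash, name = match.group(1), match.group(2)
        if name not in TAG_MAP:
            return match.group(0)
        target = TAG_MAP[name]
        if target is None:
            return ""  # suppressed tag
        return "<" + slash + target + ">"
    return TAG_PATTERN.sub(replace, xml_text)


# Hypothetical source snippet (assumed structure, one track only)
source = (
    "<playlist><playlistTitle>This is <owner>Rene Pickhardt</owner>'s playlist."
    "</playlistTitle><tracks><track><playDate>2013-11-02 01:54 CET</playDate>"
    "<trackTitle>Lords of the boards</trackTitle> on <record>"
    "<recordTitle>Proud like a God</recordTitle> by <artist>Guano Apes</artist>"
    "</record></track></tracks></playlist>"
)

html = rename_tags(source)
print(html)
```

This sketch works purely on the text and never builds a tree, which is exactly why it depends on the source elements already appearing in the desired order.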

The shorthand notation is actually an example of HTML, the Hypertext Markup Language used on the Web. You may also recognize that this piece of HTML is well-formed XML, just like the description of records that we started with. Strictly speaking, this holds for HTML written in its XML-conformant form (XHTML, which reformulates HTML 4 in XML); HTML in general also permits constructs, such as omitted closing tags, that are not well-formed XML. Hence, XML is not one language for structuring data and content; it is a meta-language that allows you to define arbitrarily many different languages for structuring data and content. In fact, element names and content are not limited to Latin characters: you can also use, say, Kanji or Arabic script.
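
As a small illustration of the meta-language idea, the sketch below parses a document whose element names are written in Kanji. The vocabulary (録音 for a recording, 歌手 for a singer, 曲 for a track) is invented for this example and is not part of the lesson's playlist format.

```python
import xml.etree.ElementTree as ET

# Hypothetical vocabulary with Kanji element names (illustrative only).
document = "<録音><歌手>Bobby McFerrin</歌手><曲>Koblenz</曲></録音>"

root = ET.fromstring(document)
child_names = [child.tag for child in root]

print(root.tag)
print(child_names)
```

The parser treats these names exactly like <record>, <artist> and <track>: the rules of well-formedness are the same, only the vocabulary differs.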

You may also have observed that the kind of mapping we have shown, which simply maps elements onto new elements, is awkward, as it requires the XML elements in the source file to follow a certain order very strictly. However, you can program a better transformation tool, use a generic tool like AWK, or use existing tools for manipulating XML, such as XPath, XQuery and XSL (eXtensible Stylesheet Language), which consists of XSLT (XSL Transformations) and XSL-FO (XSL Formatting Objects). We do not go into the details of all these tools, as this would require a whole course of its own. For you it is important to know that such languages and tools exist, so that you can look up their descriptions whenever you have a problem in this direction and build your solution on standardized mechanisms.
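
To give a flavour of what such tools offer, the sketch below selects content with the limited XPath subset built into Python's xml.etree.ElementTree (it is not a full XPath implementation). The playlist structure is the same hypothetical reconstruction of the running example used for illustration; the original file from the previous lesson may be structured differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical playlist (assumed structure, modelled on the running example)
playlist = ET.fromstring(
    "<playlist><tracks>"
    "<track><trackTitle>Lords of the boards</trackTitle>"
    "<record><recordTitle>Proud like a God</recordTitle>"
    "<artist>Guano Apes</artist></record></track>"
    "<track><trackTitle>Toxicity</trackTitle>"
    "<record><recordTitle>Toxicity</recordTitle>"
    "<artist>System of A Down</artist></record></track>"
    "<track><trackTitle>B.Y.O.B</trackTitle>"
    "<record><recordTitle>Mezmerize</recordTitle>"
    "<artist>System of A Down</artist></record></track>"
    "</tracks></playlist>"
)

# Path expression: every trackTitle directly below a track below tracks.
titles = [t.text for t in playlist.findall("./tracks/track/trackTitle")]

# Predicate: record titles of records whose artist child has the given text.
soad_records = [
    r.text
    for r in playlist.findall(".//record[artist='System of A Down']/recordTitle")
]

print(titles)
print(soad_records)
```

Unlike the tag-renaming approach, these expressions address content by its position in the tree, so they do not depend on the exact textual order of the source file.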

The core lesson to be learned about XML here is that XML-based markup gives you a very flexible handle for:

  • selecting content
  • repurposing content
  • reformatting content

What have we not considered here?

What we will show you next are some details about HTML and about formatting HTML pages with HTML and Cascading Style Sheets before we then go towards multimedia.


Quiz

1. A DOM tree

is only another representation of the XML text
comes with an API to select parts of an XML document
can be mapped bijectively to the XML text

2. By what means is Web content accessed?

screen reader
Braille display
touch screen
tablet
mobile
TV set
Web crawler

3. Determine well-formedness of the following XML snippets

<tracktitle id="42">Aguas di Marco</tracktitle id="42">
<tracktitle id="42"/>Aguas di Marco</tracktitle>
<track><tracktitle>Aguas di Marco</track></track>
<track><tracktitle>Aguas di Marco</track></tracktitle>
<track/>
<track><tracktitle>Aguas di Marco<trackartist>Aguas di Marco</track>



Other Quiz ideas

1. Which of the following three DOM trees correctly represents the following XML snippet

Further reading

  1. https://en.wikipedia.org/wiki/XML
  2. Further references (not same as further reading!) http://www.w3.org/TR/xml/

Discussion