Automatic transformation of XML namespaces/Recursive downloading the information

From Wikiversity

The user specifies a list of nonempty lists whose members are retrieval methods (for example, a mapping from a namespace to a local file, retrieval from the namespace URL, etc.). The traversal described below is repeated (unless it is stuck in an infinite loop) for every element of the outer list. Within an inner list, the retrieval for a particular namespace is repeated in a row, once for each retrieval method in that list.

Priority queue

The priority queue object is described below.

We maintain a list of elements. Each list element has an associated priority (an integer).

The following methods are available:

  • append(element, priority) - adds an element to the end of the list
  • pop_front() - removes and returns the first of the elements with maximum priority; throws an exception if all elements are disabled (or the list is empty)
  • pop_back() - removes and returns the last of the elements with maximum priority; throws an exception if all elements are disabled (or the list is empty)
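
A minimal Python sketch of such a list-backed priority queue follows. It is an illustration rather than the project's implementation: the class name and the use of IndexError are assumptions, and "disabled" elements are left out because their meaning is not defined on this page.

```python
class PriorityQueue:
    """List-backed priority queue; elements stay in insertion order."""

    def __init__(self):
        self._items = []  # (element, priority) pairs in insertion order

    def append(self, element, priority):
        """Add an element to the end of the list."""
        self._items.append((element, priority))

    def _max_priority(self):
        if not self._items:
            raise IndexError("pop from an empty priority queue")
        return max(priority for _, priority in self._items)

    def pop_front(self):
        """Remove and return the first element that has maximum priority."""
        top = self._max_priority()
        for i, (element, priority) in enumerate(self._items):
            if priority == top:
                del self._items[i]
                return element

    def pop_back(self):
        """Remove and return the last element that has maximum priority."""
        top = self._max_priority()
        for i in range(len(self._items) - 1, -1, -1):
            if self._items[i][1] == top:
                return self._items.pop(i)[0]


queue = PriorityQueue()
queue.append("a", 1)
queue.append("b", 2)
queue.append("c", 2)
queue.append("d", 1)
first = queue.pop_front()  # "b": the first of the two priority-2 elements
last = queue.pop_back()    # "c": the only priority-2 element left
```

Both pop methods scan the whole list, which is simple but linear in the list length; that seems acceptable for a sketch.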

Traversal algorithms

A traversal algorithm is an object with two methods: put(obj), which adds an object to the set of "discovered" objects, and get(), which either returns the next object or raises an exception if there are no more objects.

See also https://softwareengineering.stackexchange.com/a/358937/45576.

Modified breadth-first search

It is breadth-first search with a priority queue instead of a queue.

Modified depth-first search

It is depth-first search with a priority queue instead of a stack.
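
The two modified searches can be sketched in Python as objects with put and get methods. This is only an illustration under stated assumptions: the class names, the optional priority argument of put, the use of LookupError, and the walk helper are all hypothetical and not taken from this page.

```python
class Traversal:
    """Shared part: the set of discovered objects plus the pending ones."""

    def __init__(self):
        self._seen = set()
        self._pending = []  # (obj, priority) pairs in discovery order

    def put(self, obj, priority=0):
        # The priority argument is an assumption of this sketch.
        if obj not in self._seen:
            self._seen.add(obj)
            self._pending.append((obj, priority))

    def _candidates(self):
        """Indices of the pending objects that have maximum priority."""
        if not self._pending:
            raise LookupError("no more objects")
        top = max(p for _, p in self._pending)
        return [i for i, (_, p) in enumerate(self._pending) if p == top]


class ModifiedBreadthFirst(Traversal):
    def get(self):
        # Queue-like: the oldest object among those with maximum priority.
        return self._pending.pop(self._candidates()[0])[0]


class ModifiedDepthFirst(Traversal):
    def get(self):
        # Stack-like: the newest object among those with maximum priority.
        return self._pending.pop(self._candidates()[-1])[0]


def walk(traversal, root, children):
    """Visit everything reachable from root; children maps node -> branches."""
    traversal.put(root)
    order = []
    while True:
        try:
            node = traversal.get()
        except LookupError:
            return order
        order.append(node)
        for child in children.get(node, []):
            traversal.put(child)


tree = {"A": ["B", "C"], "B": ["D"], "C": ["E"]}
bfs_order = walk(ModifiedBreadthFirst(), "A", tree)  # A, B, C, D, E
dfs_order = walk(ModifiedDepthFirst(), "A", tree)    # A, C, E, B, D
```

Note that the stack-like variant visits the most recently discovered branch first, so siblings come out in reverse order; pushing the branches in reverse would restore the usual left-to-right order.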

The algorithm

When we speak about downloading the next RDF file, we mean downloading the next RDF file in depth-first or breadth-first order (as specified by user options).

We enumerate the list of RDF assets downloaded before the main loop before enumerating any other RDF assets. For breadth-first search, this means that we add this list as an additional first level of the search (with priority above all other priorities). For depth-first search, this means that we first enumerate these assets and then do a depth-first search starting at each element of this list (without repeating previously discovered nodes). TODO: Explain clearly.

The tree of RDF files (which we traverse in depth-first or breadth-first order) is defined as follows:

The branches of a node (RDF file) are [TODO: Does it mean that document namespaces are processed before sources/targets?]

  1. the namespaces of the current XML document, in order of first occurrence in document order; [TODO: We may consider more sophisticated loading order (based on precedences).]
  2. at the user's option, the branches that correspond to the current sources and/or targets of scripts of the current assets, as well as transformation targets (if so specified by user options); Remark: transformation targets do not change, so they are iterated at most once.
  3. the RDF files that are linked through rdf:seeAlso, in the order of the list.

So we may first load "sources" and "targets" and only then "see also". Rationale: our purpose is to find a connection between sources and targets, while following "see also" may lead us astray from that purpose. (For example, it could lead to an infinite loop even though the task can be accomplished by loading only "sources" and "targets".)
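
The branch order of a node can be sketched in Python as follows. This is a hedged illustration: the function names, the follow_scripts flag, and the urn:example: identifiers are hypothetical, "first occurrence" is taken to mean first use in an element or attribute name, and repeated URLs are dropped on the assumption that a node should not be listed twice.

```python
import xml.etree.ElementTree as ET


def namespaces_in_document_order(xml_text):
    """Return namespace URIs ordered by first use in an element or attribute name."""
    found = []
    for elem in ET.fromstring(xml_text).iter():
        for name in (elem.tag, *elem.attrib):
            # ElementTree spells qualified names as "{uri}local".
            if name.startswith("{"):
                uri = name[1:].split("}", 1)[0]
                if uri not in found:
                    found.append(uri)
    return found


def branches(namespaces, sources_and_targets, see_also, follow_scripts=True):
    """Concatenate the three groups in the order listed above, dropping repeats."""
    groups = [namespaces, sources_and_targets if follow_scripts else [], see_also]
    ordered = []
    for group in groups:
        for url in group:
            if url not in ordered:
                ordered.append(url)
    return ordered


doc = (
    '<a:root xmlns:a="urn:example:a" xmlns:b="urn:example:b" xmlns:c="urn:example:c">'
    '<c:item b:flag="1"/><b:item/>'
    "</a:root>"
)
ns = namespaces_in_document_order(doc)
# ns lists urn:example:a, then urn:example:c, then urn:example:b
todo = branches(ns, ["urn:example:src"], ["urn:example:more", "urn:example:a"])
```

Here urn:example:c precedes urn:example:b because it is used first, although it is declared later.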

Upon reading an RDF file, the list of the namespaces, the list of transformations, and the list of relations between precedences should be updated according to the grammar described above.

Implementation note: If our task is validation, updating the list of transformations and precedences is not necessary. If our task is transformation, updating the list of namespaces is not necessary.

Note: Retrieving some existing documents designated by namespace URLs yields an RDF Schema. Examples: http://www.w3.org/1999/02/22-rdf-syntax-ns# and http://purl.org/dc/elements/1.1/. In this case :namespaceInfo in the RDFS points to namespace information, which should be downloaded instead.

As another alternative, the RDF file pointed to by the namespace URL may contain namespace and transformer info.

Several namespaces may refer to one RDF file because of the use of # in URLs.
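
To illustrate, the fragment can be stripped with the standard urllib.parse.urldefrag function so that each RDF file is downloaded once for all namespaces that share it. The helper name group_by_document and the example.org URLs are hypothetical; this is a sketch, not the project's code.

```python
from collections import defaultdict
from urllib.parse import urldefrag


def group_by_document(namespace_urls):
    """Map each document URL (without fragment) to the namespaces that use it."""
    groups = defaultdict(list)
    for ns in namespace_urls:
        document_url, _fragment = urldefrag(ns)
        groups[document_url].append(ns)
    return dict(groups)


groups = group_by_document([
    "http://example.org/vocab#",
    "http://example.org/vocab#extra",
    "http://purl.org/dc/elements/1.1/",
])
# Both http://example.org/vocab# namespaces share the document http://example.org/vocab
```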

Backward-compatible feature: If there is an RDDL document at the URL we download from, download the RDF specified as an RDDL resource with xlink:role="http://portonvictor.org/ns/trans/". (This is not done recursively; that is, an RDDL document linked from another RDDL document is not downloaded.)
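
A possible way to find such resources, sketched in Python with the standard ElementTree parser. The function name, the base_url parameter, and the sample document are hypothetical; the RDDL namespace is taken to be http://www.rddl.org/ and the link target is assumed to be carried by xlink:href.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

RDDL_NS = "http://www.rddl.org/"            # assumed RDDL namespace
XLINK_NS = "http://www.w3.org/1999/xlink"
TRANS_ROLE = "http://portonvictor.org/ns/trans/"


def rdf_links_from_rddl(rddl_text, base_url):
    """Return absolute URLs of the RDDL resources that have the wanted role."""
    links = []
    for res in ET.fromstring(rddl_text).iter("{%s}resource" % RDDL_NS):
        if res.get("{%s}role" % XLINK_NS) == TRANS_ROLE:
            href = res.get("{%s}href" % XLINK_NS)
            if href:
                links.append(urljoin(base_url, href))
    return links


sample = """<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:rddl="http://www.rddl.org/"
      xmlns:xlink="http://www.w3.org/1999/xlink">
  <body>
    <rddl:resource xlink:role="http://portonvictor.org/ns/trans/"
                   xlink:href="info.rdf"/>
    <rddl:resource xlink:role="http://www.w3.org/2001/XMLSchema"
                   xlink:href="schema.xsd"/>
  </body>
</html>"""

links = rdf_links_from_rddl(sample, "http://example.org/ns/")
```

The returned RDF files are then downloaded directly; the sketch never follows a link to a further RDDL document, which matches the non-recursive rule.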

If a valid RDF document was not retrieved from a URL (say, because of a 404 HTTP error or because of invalid XML), it should not be retrieved again (during the pipeline). However, it is not forbidden to repeat HTTP attempts that failed because of a timeout.
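
This rule could be sketched as a thin wrapper around whatever function actually fetches and parses a URL. Everything named here (the two exception classes, CachingRetriever, fake_fetch) is hypothetical and serves only to show the distinction between permanent failures and timeouts.

```python
class RetrievalTimeout(Exception):
    """Hypothetical error for an attempt that timed out."""


class PermanentFailure(Exception):
    """Hypothetical error for a 404, invalid XML, and similar failures."""


class CachingRetriever:
    def __init__(self, fetch):
        self._fetch = fetch   # callable: url -> parsed RDF document
        self._failed = set()  # URLs that must not be tried again

    def retrieve(self, url):
        if url in self._failed:
            return None
        try:
            return self._fetch(url)
        except RetrievalTimeout:
            return None       # not recorded, so a later retry is allowed
        except PermanentFailure:
            self._failed.add(url)
            return None


calls = []


def fake_fetch(url):
    """Stand-in for a real downloader; records every attempt."""
    calls.append(url)
    if url.endswith("/missing"):
        raise PermanentFailure(url)
    if url.endswith("/slow"):
        raise RetrievalTimeout(url)
    return "<rdf for %s>" % url


retriever = CachingRetriever(fake_fetch)
for _ in range(2):
    retriever.retrieve("http://example.org/missing")
    retriever.retrieve("http://example.org/slow")
```

After the loop, the permanently failing URL has been attempted once and the timed-out URL twice.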

Recursive retrieval may be limited to certain URLs. For example, it may be limited only to file:// URLs.
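
Such a restriction might be expressed as a simple check on the URL scheme. The function name is_allowed and its allowed_schemes parameter are assumptions made for this sketch; other kinds of limits (for example by host or path prefix) would need a different test.

```python
from urllib.parse import urlsplit


def is_allowed(url, allowed_schemes=("file",)):
    """Return True if the URL may be followed during recursive retrieval."""
    return urlsplit(url).scheme in allowed_schemes


local_ok = is_allowed("file:///home/user/ns/info.rdf")
remote_ok = is_allowed("http://example.org/ns/info.rdf")
```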