Applied Programming/Internet Data

From Wikiversity

This lesson introduces Internet data, including XML and JSON.

Objectives and Skills

Objectives and skills for this lesson include:

Readings

  1. Wikipedia: HTML
  2. Wikipedia: JSON
  3. Wikipedia: Markup language
  4. Wikipedia: Query string
  5. Wikipedia: Representational state transfer
  6. Wikipedia: Tree (data structure)
  7. Wikipedia: XML

Multimedia

  1. YouTube: JSON overview
  2. YouTube: Python 3 Programming Tutorial - Parsing Websites with re and urllib
  3. YouTube: Nested Dictionaries with For Loops
  4. YouTube: JSON in Python
  5. YouTube: Iterating Through JSON
  6. YouTube: Python Tutorial: Datetime Module - How to work with Dates, Times, Timedeltas, and Timezones
  7. YouTube: XML Video Tutorial
  8. YouTube: JSON in Python with json Library

Examples

Activities

XML Activities

  1. Review mw:Help:Export and mw:Manual:Parameters to Special:Export. Create a program that uses a link to Special:Export to export all pages in Category:Applied_Programming.
  2. Use an XML library to parse the XML and display site name, namespace (<ns>) by name, page title, and page text for each page in the category. You will need to build and use a dictionary to save namespaces as key-value pairs with the ns value and corresponding name.
  3. Display a list of all page titles and last revision date sorted in descending order by timestamp. Display the date and time in local time rather than Zulu (UTC) time.
  4. For each of the above, use separate functions for each type of processing. Reuse functions where possible, such as in sorting and searching. Avoid using global variables by passing parameters and returning results. Include appropriate data validation and parameter validation. Add program and function documentation, consistent with the documentation standards for your selected programming language.
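
The XML activities above can be approached along the following lines. This is a minimal sketch, not a reference solution: the SAMPLE_EXPORT string is a hypothetical, heavily trimmed stand-in for real Special:Export output, the xmlns value is an assumption (the export schema version varies), and the function names are illustrative. It uses the `{*}` namespace wildcard, which requires Python 3.8 or later.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Hypothetical, trimmed sample; real exports contain many more elements.
SAMPLE_EXPORT = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <siteinfo>
    <sitename>Wikiversity</sitename>
    <namespaces>
      <namespace key="0" />
      <namespace key="1">Talk</namespace>
    </namespaces>
  </siteinfo>
  <page>
    <title>Talk:Applied Programming</title>
    <ns>1</ns>
    <revision>
      <timestamp>2019-06-01T12:00:00Z</timestamp>
      <text>Older sample text</text>
    </revision>
  </page>
  <page>
    <title>Applied Programming/Internet Data</title>
    <ns>0</ns>
    <revision>
      <timestamp>2020-01-02T03:04:05Z</timestamp>
      <text>Newer sample text</text>
    </revision>
  </page>
</mediawiki>"""


def parse_namespaces(root):
    """Return a dictionary mapping each ns key (as text) to its name."""
    names = {}
    for namespace in root.iterfind("{*}siteinfo/{*}namespaces/{*}namespace"):
        # The main namespace has no name in the sample, so label it here.
        names[namespace.get("key")] = namespace.text or "(Main)"
    return names


def parse_pages(root, namespaces):
    """Return one dictionary per page, using the first revision found."""
    pages = []
    for page in root.iterfind("{*}page"):
        stamp = page.findtext("{*}revision/{*}timestamp")
        # The trailing Z marks Zulu (UTC) time, so attach the UTC zone.
        utc = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(
            tzinfo=timezone.utc)
        pages.append({
            "title": page.findtext("{*}title"),
            "namespace": namespaces.get(page.findtext("{*}ns"), "unknown"),
            "text": page.findtext("{*}revision/{*}text"),
            "timestamp": utc,
        })
    return pages


def newest_first(pages):
    """Return pages sorted in descending order by timestamp."""
    return sorted(pages, key=lambda page: page["timestamp"], reverse=True)


root = ET.fromstring(SAMPLE_EXPORT)
namespaces = parse_namespaces(root)
print(root.findtext("{*}siteinfo/{*}sitename"))
for page in newest_first(parse_pages(root, namespaces)):
    # astimezone() with no argument converts to the machine's local zone.
    local = page["timestamp"].astimezone()
    print(page["title"], page["namespace"], local.strftime("%Y-%m-%d %H:%M:%S"))
```

Keeping the timestamp as a timezone-aware UTC value until display time means the sort does not depend on the local zone of the machine running the program.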

JSON Activities

  1. Review Wikitech:Analytics/AQS/Pageviews. Create a program that gets the top 1000 most visited articles for each month within the previous 12 months. Only one month may be requested at a time, so 12 different requests will be required. Use the all-days option to get data for all days within the month.
  2. Use a JSON library to parse the JSON and add each article and views count to a dictionary of key-value pairs, summing the corresponding views counts for pages visited during multiple months. Display the top 1000 articles and views in descending order by views, and alphabetically in the case of a tie.
  3. The format for an article title is Title[/Subpage...]. The title without subpages may be considered the overall learning project. Iterate over the dictionary and use RegEx to separate titles from subpages. Create a separate dictionary with a key for each learning project and the sum of the page and its subpage views as the value. Display the top 100 learning projects and corresponding views sorted in descending order by views, and alphabetically in the case of a tie.
  4. For each of the above, use separate functions for each type of processing. Reuse functions where possible, such as in sorting and searching. Avoid using global variables by passing parameters and returning results. Include appropriate data validation and parameter validation. Add program and function documentation, consistent with the documentation standards for your selected programming language.
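
The summing, sorting, and title-splitting steps above can be sketched as follows. This is only an illustration under stated assumptions: SAMPLE_MONTHS holds two hypothetical, trimmed response bodies, and the "items", "articles", "article", and "views" keys are assumed to match the response layout documented for the Pageviews service. The function names are illustrative, and no network request is made.

```python
import json
import re

# Hypothetical, trimmed response bodies standing in for two monthly requests.
SAMPLE_MONTHS = [
    '{"items": [{"articles": ['
    '{"article": "Python/Lists", "views": 10},'
    '{"article": "Python", "views": 5},'
    '{"article": "Algebra", "views": 15}]}]}',
    '{"items": [{"articles": ['
    '{"article": "Python/Lists", "views": 5},'
    '{"article": "Biology", "views": 15}]}]}',
]


def sum_views(responses):
    """Parse each JSON response and total the views per article."""
    totals = {}
    for text in responses:
        data = json.loads(text)
        for item in data["items"]:
            for entry in item["articles"]:
                title = entry["article"]
                totals[title] = totals.get(title, 0) + entry["views"]
    return totals


def sum_projects(totals):
    """Total views per learning project (the title before the first slash)."""
    projects = {}
    for title, views in totals.items():
        # [^/]* matches everything up to, but not including, the first slash.
        project = re.match(r"([^/]*)", title).group(1)
        projects[project] = projects.get(project, 0) + views
    return projects


def rank(totals, limit):
    """Sort by views descending, then alphabetically to break ties."""
    ordered = sorted(totals.items(), key=lambda pair: (-pair[1], pair[0]))
    return ordered[:limit]


article_totals = sum_views(SAMPLE_MONTHS)
print(rank(article_totals, 1000))
print(rank(sum_projects(article_totals), 100))
```

Negating the views count in the sort key lets a single sorted() call order views from high to low while still ordering tied titles from A to Z.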

Lesson Summary

  • A query string communicates key-value pairs to a web server in order to pull up a desired resource through a GET HTTP request.[1]
    • By convention, the query string follows a question mark (?) separator; each key is joined to its value with an equals sign (=), and multiple pairs are chained together with ampersands (&).[1]
    • Some characters cannot be part of a URL (for example, the space) and some other characters have a special meaning in a URL: for example, the character # can be used to further specify a subsection (or fragment) of a document.[1]
    • For example, given the URL https://en.wikiversity.org?title=Applied_Programming/Internet_Data#Lesson_Summary, the substring before the question mark delimiter is the location of the Internet resource (which should point to a script). The substring between the question mark and the # character is the query string. Here there is only a single key-value pair: the key title and its value, which names this lesson page. The fragment after the # character refers to this very lesson summary.[1]
      • Behind the scenes, Wikiversity has a server program that extracts this information from our query string, handling the desired action; in this case, we are simply redirected to a page, but much can be accomplished through such a request.[1]
  • Before forms were added to HTML, browsers rendered the <isindex> element as a single-line text-input control.[1]
    • The text entered into this control was sent to the server as a query string addition to a GET request for the base URL.[1]
      • When the text input into the indexed search control is submitted, it is encoded as a query string as follows: argument1+argument2+argument3...[1]
        • The query string is composed of a series of arguments by parsing the text into words at the spaces.[1]
        • The series is separated by the plus sign, '+'.[1]
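
The query-string conventions described above can be explored with Python's standard urllib.parse module. The following is a small sketch; the "action" key and the "a b#c" string are illustrative values chosen for the example, not taken from the lesson.

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit

url = ("https://en.wikiversity.org"
       "?title=Applied_Programming/Internet_Data#Lesson_Summary")

# Split the URL into its components; the query and fragment are separate.
parts = urlsplit(url)
print(parts.query)     # title=Applied_Programming/Internet_Data
print(parts.fragment)  # Lesson_Summary

# Parse the query string into a dictionary of key -> list of values.
pairs = parse_qs(parts.query)
print(pairs)

# Build a query string from key-value pairs; spaces become plus signs.
query = urlencode({"title": "Internet Data", "action": "raw"})
print(query)           # title=Internet+Data&action=raw

# Percent-encode characters that cannot appear literally in a URL path.
encoded = quote("a b#c")
print(encoded)         # a%20b%23c
```

parse_qs returns a list for each key because the same key may legally appear more than once in a query string.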
  • A markup language is a way to give semantic meaning to pure text.[2]
    • Examples include typesetting instructions such as those found in troff, TeX and LaTeX, or structural markers such as XML tags. Markup instructs the software that displays the text to carry out appropriate actions, but is omitted from the version of the text that users see.[2]
  • Some markup languages, such as the widely used HTML, have pre-defined presentation semantics—meaning that their specification prescribes how to present the structured data. Others, such as XML, do not have them and are general purpose.[2]
    • Perhaps the most well-known markup language is HTML, powering the underlying technology of the web. If a phrase is marked by opening and closing <i>...</i> tags (known holistically as the <i> element), the phrase is targeted for italicization.[3]
      • By itself, HTML does nothing. The use of tags, however, gives meaning and context to words, allowing the browser's rendering engine to interpret and construct the document according to HTML's standard.[3]
  • HTML elements are the building blocks of HTML pages. With HTML constructs, images and other objects, such as interactive forms, may be embedded into the rendered page. HTML provides a means to create structured documents by denoting structural semantics for text such as headings, paragraphs, lists, links, quotes, and other items.[3]
  • HTML elements are delineated by tags, written using angle brackets.[3]
  • HTML can embed programs written in a scripting language such as JavaScript which affect the behavior and content of web pages.[3]
  • Unlike HTML, which defines a fixed set of elements, XML lets authors define their own elements, optionally constrained by a user-written schema. If you mark a string with <employee>...</employee> tags, an application that reads the document can treat this information as referring to an employee (assuming the schema were defined this way).[4]
    • In modern word-processing systems, presentational markup is often saved in descriptive-markup-oriented systems such as XML, and then processed procedurally by implementations.[2]
  • In order for search-engine spiders to be able to rate the significance of pieces of text they find in HTML documents, and also for those creating mashups and other hybrids as well as for more automated agents as they are developed, the semantic structures that exist in HTML need to be widely and uniformly applied to bring out the meaning of published text.[3]
  • In computing, Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable.[4]
    • The design goals of XML emphasize simplicity, generality, and usability across the Internet. It is a textual data format with strong support via Unicode for different human languages.[4]
  • XML allows the use of any of the Unicode-defined encodings, and any other encodings whose characters also appear in Unicode. XML also provides a mechanism whereby an XML processor can reliably, without any prior knowledge, determine which encoding is being used.[4]
  • The design goals of XML include, "It shall be easy to write programs which process XML documents."[4]
  • The XML specification defines an XML document as a well-formed text, meaning that it satisfies a list of syntax rules provided in the specification. Some key points in the fairly lengthy list include:[4]
    • The document contains only properly encoded legal Unicode characters.[4]
    • None of the special syntax characters such as < and & appear except when performing their markup-delineation roles.[4]
    • The start-tag, end-tag, and empty-element tag that delimit elements are correctly nested, with none missing and none overlapping.[4]
    • Tag names are case-sensitive; the start-tag and end-tag must match exactly.[4]
    • Tag names cannot contain any of the characters !"#$%&'()*+,/;<=>?@[\]^`{|}~, nor a space character, and cannot begin with "-", ".", or a numeric digit.[4]
    • A single root element contains all the other elements.[4]
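
Several of the well-formedness rules above can be observed with Python's standard xml.etree.ElementTree parser, which raises ParseError when a document breaks them. This is a minimal sketch; the helper name is_well_formed and the sample strings are illustrative, and the check covers syntax only, not validity against a schema.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts text as a well-formed document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


samples = {
    "<a><b>ok</b></a>": "correctly nested",
    "<a><b>text</a></b>": "overlapping elements",
    "<Note>x</note>": "start-tag and end-tag differ in case",
    "<a>fish & chips</a>": "unescaped ampersand",
    "<a>fish &amp; chips</a>": "escaped ampersand",
    "<a/><b/>": "more than one root element",
}

for text, description in samples.items():
    print(is_well_formed(text), description)
```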
  • A tree is a widely used abstract data type (ADT)—or data structure implementing this ADT—that simulates a hierarchical tree structure, with a root value and subtrees of children with a parent node, represented as a set of linked nodes.[5]
    • A node is a structure which may contain a value or condition, or represent a separate data structure (which could be a tree of its own).[5]
  • A tree data structure can be defined recursively (locally) as a collection of nodes (starting at a root node), where each node is a data structure consisting of a value, together with a list of references to nodes (the "children"), with the constraints that no reference is duplicated, and none points to the root.[5]
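
The recursive definition above maps directly onto code: a node holds a value and a list of child nodes, and functions over the tree call themselves on each child. The Node class and the two helper functions below are illustrative names for this sketch, not part of any library.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A tree node: a value plus references to its child nodes."""
    value: str
    children: list = field(default_factory=list)


def count_nodes(node):
    """Count this node and, recursively, every node beneath it."""
    return 1 + sum(count_nodes(child) for child in node.children)


def count_levels(node):
    """Return the number of levels from this node down to its deepest leaf."""
    return 1 + max((count_levels(child) for child in node.children), default=0)


# A small document-like tree: a root with two children, one of which
# has a child of its own.
tree = Node("html", [
    Node("head", [Node("title")]),
    Node("body"),
])

print(count_nodes(tree))   # 4
print(count_levels(tree))  # 3
```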
  • JSON is a language-independent data format.[6]
    • JSON was originally devised for JavaScript, but many languages now include their own standardized JSON libraries (including Python).[6]
    • JSON files use the extension .json; the associated media or MIME type is application/json.[6]
    • To some, JSON is a preferable alternative to XML when dealing with AJAX applications.[6]
  • JSON grew out of a need for a stateful, real-time server-to-browser communication protocol without using browser plugins such as Flash or Java applets, the dominant methods used in the early 2000s.[6]
  • JSON's basic data types are number, string, boolean, array, object, and null.[6]
    • Numbers in JSON are agnostic with regard to their representation within programming languages. No differentiation is made between an integer and floating-point value: some implementations may treat 42, 42.0, and 4.2E+1 as the same number while others may not.[6]
  • JSON does not allow for comments whereas XML does.[6]
  • In JSON limited whitespace is allowed and ignored around or between syntactic elements (values and punctuation, but not within a string value).[6]
    • Only four specific characters are considered whitespace for this purpose: space, horizontal tab, line feed, and carriage return.[6]
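
The points above about data types, numbers, comments, and whitespace can be checked with Python's standard json module. This sketch shows how one particular implementation behaves; as the summary notes, other implementations may treat numbers differently. The helper name parses and the sample document are illustrative.

```python
import json

# Whitespace (space, tab, line feed, carriage return) around values and
# punctuation is ignored by the parser.
document = (' \t\r\n{ "count" : 42 , "ratio" : 42.0 , "big" : 4.2E+1 ,'
            ' "tags" : [ "a  b" , true , null ] } \n')
data = json.loads(document)

# Python's json module maps 42 to int but 42.0 and 4.2E+1 to float.
print(type(data["count"]).__name__)  # int
print(type(data["ratio"]).__name__)  # float
print(type(data["big"]).__name__)    # float

# Arrays become lists, true becomes True, null becomes None, and
# whitespace inside a string value is preserved.
print(data["tags"])


def parses(text):
    """Return True if json.loads accepts text."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


# Comments are not part of JSON, so the standard parser rejects them.
print(parses('{"a": 1}'))             # True
print(parses('{"a": 1} // note'))     # False
print(parses('{"a": 1 /* note */}'))  # False
```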
  • JSON Schema:[6]
    • It specifies a JSON-based format to define the structure of JSON data for validation, documentation, and interaction control.[6]
    • It provides a contract for the JSON data required by a given application, and how that data can be modified.[6]
    • JSON Schema is based on the concepts from XML Schema (XSD), but is JSON-based.[6]
  • JSON vs. XML[7][8]
    • JSON - quicker and requires less memory, simpler to parse, smaller in size/lightweight, more human readable, doesn't support commenting, better for data transfer, multiple data types, less secure, lacks namespace support[7][8]
    • XML - slower to parse, more verbose, handles metadata better in tag attributes, in use for longer time/more support in some language libraries, better for formatting, lacks intrinsic data type support, more secure, has namespace support[7][8]
  • REST, or representational state transfer, is an architectural style that imposes a set of constraints on how resources are exposed. If a resource satisfies every architectural constraint imposed, the resource is said to be RESTful.[9]
  • REST allows clients and servers to exchange and make use of information on the Internet.[9]
    • RESTful services typically permit you to access or manipulate web resources through a predefined set of stateless operations.[9]
  • The constraints of the REST architectural style affect the following architectural properties:[9]
    • Performance – component interactions can be the dominant factor in user-perceived performance and network efficiency.[9]
    • Scalability to support large numbers of components and interactions among components.[9]
    • Simplicity of a uniform interface.[9]
    • Modifiability of components to meet changing needs (even while the application is running).[9]
    • Visibility of communication between components by service agents.[9]
    • Portability of components by moving program code with the data.[9]
    • Reliability is the resistance to failure at the system level in the presence of failures within components, connectors, or data.[9]
  • This lesson requires you to find the current date to calculate the previous dates. In Python, you will need to use the proper datetime syntax: JournalDev: Python Datetime Syntax.[10]
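
One way to work out the previous months from the current date is simple year and month arithmetic with the standard datetime module. The sketch below is illustrative: previous_months is a hypothetical helper name, and the "YYYY/MM/all-days" string is only an assumed way of formatting each month for a request, based on the all-days option mentioned in the activities.

```python
from datetime import date


def previous_months(today, count=12):
    """Return (year, month) pairs for the count months before today's month."""
    months = []
    year, month = today.year, today.month
    for _ in range(count):
        month -= 1
        if month == 0:
            # Stepping back from January wraps to December of the prior year.
            year, month = year - 1, 12
        months.append((year, month))
    return months


# A fixed date keeps the example reproducible; date.today() would be used
# in a real program.
months = previous_months(date(2024, 3, 15), count=3)
print(months)  # [(2024, 2), (2024, 1), (2023, 12)]

# Assumed formatting of each month as a path segment for a request.
segments = [f"{year}/{month:02d}/all-days" for year, month in months]
print(segments)
```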
  • Python’s JSON library has multiple functions for working with Internet data:
    • json.load() parses JSON data from a file or a file-like object and returns the corresponding Python object.[11]
    • json.loads() parses JSON data from a string and returns the corresponding Python object.[11]
    • json.dump() writes a Python object as JSON to a file or a file-like object.[11]
    • json.dumps() returns a Python object as a JSON-formatted string.[11]
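
The four functions listed above pair up: the names ending in "s" work with strings, and the others work with file-like objects. A short sketch follows, using io.StringIO as a stand-in for an open file; the record dictionary is an illustrative value.

```python
import io
import json

record = {"article": "Main_Page", "views": 42}

# dumps() and loads() convert to and from a string.
text = json.dumps(record)
print(text)  # {"article": "Main_Page", "views": 42}
from_text = json.loads(text)

# dump() and load() write to and read from a file-like object.
buffer = io.StringIO()
json.dump(record, buffer)
buffer.seek(0)  # rewind so load() reads from the start
from_file = json.load(buffer)

print(from_text == record and from_file == record)  # True
```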

Key Terms

application programming interface (API)
When used in the context of web development, an API is typically defined as a set of specifications, such as Hypertext Transfer Protocol (HTTP) request messages, along with a definition of the structure of response messages, which is usually in an Extensible Markup Language (XML) or JavaScript Object Notation (JSON) format.[12]
attribute
An attribute is a markup construct consisting of a name–value pair that exists within a start-tag or empty-element tag. An example is <img src="madonna.jpg" alt="Madonna" />, where the names of the attributes are "src" and "alt", and their values are "madonna.jpg" and "Madonna" respectively.[4]
document type declaration / DTD
HTML documents are required to start with a Document Type Declaration (informally, a "doctype"). The Document Type Definition (DTD) to which the DOCTYPE refers contains a machine-readable grammar specifying the permitted and prohibited content for a document.[3]
element
An element is a logical document component that either begins with a start-tag and ends with a matching end-tag or consists only of an empty-element tag. The characters between the start-tag and end-tag, if any, are the element's content, and may contain markup, including other elements, which are called child elements.[4]
escape character
Due to the nature of XML’s syntax, there are some characters that pose problems. For example, “<” is reserved, and “&lt;” would have to be used in its place.[4]
HTML
Hypertext Markup Language, a regimented standard that describes the structure and content of a document, facilitating its rendering by a web browser.[3]
JSON
JavaScript Object Notation, a format that houses data so they can be exchanged fluidly.[6]
markup language
A means of tagging or marking up textual content that is to be interpreted a certain way.[2]
MIME Type
A media type (also known as a Multipurpose Internet Mail Extensions or MIME type) is a standard that indicates the nature and format of a document, file, or assortment of bytes.[13]
node
A structure which may contain a value or a condition or represent an entirely separate data structure.[5]
path
A sequence of nodes and edges connecting a node with a descendant.[5]
query string
A variadic list of key-value pairs appended to a URL and sent to a web server for processing.[1]
REST
Representational State Transfer, a design philosophy or architectural style that defines a set of constraints for accessing resources in order to improve reliability and scalability.[14]
serializable
Describes data that can undergo serialization, the process of translating data structures or object state into a format that can be stored (for example, in a file or memory buffer) or transmitted (for example, across a network connection link) and reconstructed later (possibly in a different computer environment).[15]
tag
A tag is a markup construct that begins with < and ends with >. Tags come in three flavors:[4]
  • start-tag, such as <section>
  • end-tag, such as </section>
  • empty-element tag, such as <line-break />
tree
A hierarchical data structure composed of nodes with a single element, known as the root, on the highest or topmost layer. Both HTML and XML documents are best represented as trees.[5]
web browser
A program that receives HTML documents from a web server or from local storage and renders the documents into multimedia web pages.[3]
XML
Extensible Markup Language, a language that generalizes the marking up of documents, so users can define their own semantics.[4]

See Also

References
