Digital Libraries/Protocols

Older versions of the draft developed by UNC/VT Project Team (2009-10-07 PDF)

Module name

Protocols

Scope

This module addresses the concepts, development and implementation of digital library protocols and covers the roles of protocols in Information retrieval systems (IR) and Service Oriented Architectures (SOA).

Learning objectives:

By the end of this module, the student will be able to:

a. Explain the basic concepts and methods related to DL protocols.

b. Design and implement a protocol for harvesting metadata from a digital library.

c. Design and implement a digital library service based on the principles and guidelines of DL protocols.

5S characteristics of the module:

a. Stream: Protocols enable the data in digital libraries to be harvestable. Protocols specify the flows of streams of data.

b. Structure: Protocols enable digital libraries to expand and disseminate digital objects and provide various information organization services.

c. Spaces: Protocols connect spatially separate digital libraries.

d. Scenario: Different protocols implement specific scenarios of communication among digital libraries.

e. Society: Different protocols implement and design for different use cases and communities.

Level of effort required:

a. Class time: 3 hour

b. Student time outside class: 8 hours

i. Reading before the class starts: 4 hours

ii. Homework assignment: 4 hours

Relationships with other modules:

a. 4-b: Metadata: DL protocols are harvesting metadata stored in a digital library. 4-b should be taught before this module.

b. 5-a: Architecture overview: This module should be taught around the same time in the semester.

c. 5-b: Application software: This module should be taught around the same time in the semester.

Prerequisite knowledge required:

a. Knowledge of client-server, P2P, SOA architectures

b. Understanding of client request and server response

c. Basics of other networking protocols

Introductory remedial instruction:

a. Quick overview of software engineering architectures and basics of simple networking protocols.

Body of knowledge:

1. Introduction:

This module discusses the following digital library protocols:

a. OAI-PMH

b. SOA

c. P2P

d. VIDI

e. Z39.50

f. DLIOP

Later in the module parallel and distributed information retrieval is discussed.

2. Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)

a. OAI-PMH version history

i. It is a protocol developed by the Open Archives Initiative.

ii. The Santa Fe Convention led to the first incarnation of the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH).

iii. OAI-PMH 1.0 introduced the unqualified Dublin Core element set as a baseline for metadata interoperability. OAI-PMH 1.1 was a revision of the 1.0 specification taking account of changes to the emerging XML Schema specification. Both v.1.0 and 1.1 were experimental in nature.

iv. OAI-PMH 2.0 is a major revision of the protocol and it is a stable protocol.

b. Flexible deployment:

i. Multiple service providers can harvest from multiple data providers.

ii. Aggregators can sit between data providers and service providers.

iii. The harvesting approach can be complemented with searching.

c. Main ideas of OAI

i. world-wide consolidation of scholarly archives

ii. free access to the archives

iii. consistent interfaces for archives and service providers

iv. low barrier protocol / effortless implementation

d. Structure model

i. It is a simple protocol based on HTTP and XML.

ii. This protocol is carried within HTTP POST or GET methods.

e. Protocol details

i. Records

A record is the metadata of a resource in a specific format. A record has three parts: a header and metadata, both of which are mandatory, and an optional about statement.

ii. Datestamps

A datestamp is the date of last modification of a metadata record. Datestamp is a mandatory characteristic of every item. It has two possible levels of granularity: YYYY-MM-DD or YYYY-MM-DDThh:mm:ssZ.

iii. Metadata schema

OAI-PMH supports dissemination of multiple metadata formats from a repository. The properties of metadata formats are:

ID string to specify the format (metadataPrefix)
metadata schema URL (XML schema to test validity)
XML namespace URI (global identifier for metadata format)

iv. Sets

Sets enable a logical partitioning of repositories. They are optional.

v. Request format

Requests must be submitted using the GET or POST methods of HTTP, and repositories must support both methods. At least one key=value pair: verb=RequestType must be provided. Additional key=value pairs depend on the request type.

vi. Response

Responses are formatted as HTTP responses. The content type must be text/xml. HTTP-based status codes, as distinguished from OAI-PMH errors, such as 302 (redirect) and 503 (service not available) may be returned. Compression codes are optional in OAI-PMH, only identity encoding is mandatory. The response format must be well-formed XML with markup

vii. Flow control

OAI-PMH supports partitioning. Those managing a repository make the decisions on partitioning: whether to partition and how.

viii. Errors and exceptions

Repositories must indicate OAI-PMH errors by the inclusion of one or more error elements.

ix. Request types

There are six different request types:

Identify
ListMetadataFormats
ListSets
ListIdentifiers
ListRecords
GetRecord

f. Implementing protocol

i. Before implementing

Data Provider:

1. Which data do you want to deliver?

2. Which Service Providers do you want to provide with data?

Service Provider:

1. From which Data Providers do you get the metadata?

2. Which services do you want to provide, and to whom?

3. What kind of metadata do you want to provide through your services?

Other things one also needs to consider:

1. Update frequency

2. Metadata format

3. Subject schema

ii. Data Provider

Requisites:

1. Web server

2. Programming interface / API

3. Archive identifier / base URL

4. Metadata format

5. Datestamps for metadata

6. Unique identifier for each item

7. Logical set hierarchy

8. Flow control

Components:

1. Argument Parser validates OAI requests.

2. Error Generator creates XML responses with encoded error messages.

3. Database Query / Local Metadata Extraction retrieves metadata from the repository, according to the required metadata format.

4. XML Generator / Response Creation creates XML responses with encoded metadata information.

5. Flow Control realizes incomplete list sequences for 'larger' repositories. It uses resumption tokens as the control mechanism.

iii. Service Provider

Requisites:

1. An Internet-connected server

2. A database system on equivalent

3. A programming environment

Components and architecture:

1. Archive management involves the selection of repositories to be harvested.

2. Request Component creates HTTP requests and sends them to OAI repositories (Data Provider).

3. Scheduler realizes timed and regular retrieval of the associated archives.

4. Flow Control is implemented via resumption tokens, partitioning of the result list into incomplete sections with a new request to retrieve more results.

5. Update Mechanism realizes the consolidation of metadata which have been harvested earlier (merge old and new data).

6. XML Parser analyses the responses received from the repositories, with validation using the XML schema, and transforms the metadata encoded in XML into the internal data structure.

7. Normaliser transforms data in different metadata formats into a homogenous structure.

8. Database receives the output of the normaliser mapping the XML structure of the metadata into a relational database that will handle multiple values of elements. An alternative is to use an XML database.

9. Duplication Checker merges identical records from different data providers.

10. Service Module provides the actual service to the 'public'. The basis for a service provided is the harvested and stored records of the associated archives.

3. Service Oriented Architecture (SOA) for Digital Library

a. Simple definition: SOA provides methods for system development and integration where systems group functionality around business processes and package these as interoperable services

b. The development of SOA

i. It is an architecture developed with XML and Web Services. It is commonly built using Web Services standards.

ii. SOA mostly refer to a platform that could realize the concept of partitioning an enterprise into a series of autonomous services.

c. SOA deployment

i. Run over standard Web protocols: SOA uses XML and HTTP packaging.

ii. Communication: SOA uses SOAP (Simple Object Access Protocol) to exchange information, uses WSDL (Web Service Definition Language) to describe Web Services and uses UDDI (Universal Description Discovery and Integration) to register Web Services.

d. SOA for digital library

i. Services Technology is widely used in Digital Libraries: Fedora, DSpace and CiteSeer

ii. The requirement for mapping a Digital Library application/service into web service:

An API interface to that application/service
A standard access layer such as SOAP
A layer describing the service in a standard fashion encoded using WSDL

iii. Role of digital library: Service provider.

e. Case Study: Fedora

i. Fedora Repository system is a Web Service:

Access/Search (API-A) and Management (API-M) use web service
Service descriptions use WSDL
Fedora uses both SOAP and HTTP bindings
Fedora acts as mediator to these services

ii. Fedora Repository Service Interfaces

Access Service (API-A and API-A-LITE)

1. Search - search repository for objects

2. Object Reflection - what disseminations can the object provide?

3. Object Dissemination - request a view of the object’s content

Management Service (API-M)

1. Ingest - XML-encoded object submission

2. Create - interactive object creation via API requests

3. Maintain - interactive object modification via API requests

4. Validate – application of integrity rules to objects

5. Identify - generate unique object identifiers

6. Security - authentication and access control

7. Preserve - automatic content versioning and audit trail

8. Export - XML-encoded object formats

4. Peer-to-Peer Computing (P2P Computing) for Digital Library

a. Simple definition:

i. Peers act as clients and servers

ii. Peers communicate directly

b. The development of P2P

i. Unstructured P2P (such as Gnutella, Napster) to structured P2P (such as Pastry, Chord)

ii. P2P used for provide resources: services, for example, storage services, computational services, streaming media services

c. P2P for Digital Library

i. Main idea

Build Digital Library on the P2P structure, use P2P to store digital objects
Use P2P to share and route digital documents

ii. Implementation

Automatically digital documents can be discovered and routed
Uniform client-level metadata query services that are compatible with heterogeneous underlying collections
Sharing of digital documents on P2P structure rather than storing the documents on a center server

d. Case Study: The ADEPT Digital Library Architecture

i. ADEPT (Alexandria Digital Earth Prototype) architecture

A framework for building distributed digital libraries of georeferenced information
Distributed, heterogeneous, scalable

ii. Objects

Items: Items are fundamental objects in ADEPT, referring to digital objects. Items only have identity, no other innate properties
Collections: Collections are set of items, information about individual items is maintained at the collection level
Libraries: Libraries are set of collections, which expose a single standard set of interfaces to the collections

iii. ADEPT Collection Discovery Service

ADEPT takes unstructured P2P mechanism: Reason: simple, sufficiently manageable and scalable
ADEPT uses central CDS server and Register server.
Collection registry polls known library CDS server

5. VIDI

a. Simple definition: A lightweight protocol between visualization systems and digital libraries.

b. VIDI Development

i. Extended from OAI Protocol for Metadata Harvesting

ii. Based on Digital Library likes MARIAN and Visualization Systems likes Infosphere

c. Protocol

i. The labels

‘s’ means server-server structure
‘c’ means client-server structure
‘x’ means in this specific scenario, this command is unused

ii. Four scenarios from left to right

Most commands in server-server structure
Least commands in server-server structure
Most commands in client-server structure
Least commands in client-server structure

iii. Most commands in client-server structure scenarios

d. VIDI Implementation

i. Use of HTTP Request and Response

The VIDI protocol requests are expressed as HTTP requests
HTTP requests may be expressed using either HTTP GET or POST

ii. Protocol Entities Definitions: Metadata Format, Visdata format, Transformer, Result Set, Result Record, Result Set Identifier, Unique Identifier

iii. Dates and Times

The date used in VIDI protocol is encoded using “Complete date” variant of ISO8601, the format is YYYY-MM-DD
The time used in VIDI protocol is encoded using the “complete date plus hours, minutes and seconds” variant of ISO8601, the format is YYYY-MM-DDThh:mm:ssTZD

iv. VIDI Commands

Identify: Retrieve system information about a DL or VIS
ListMetaFormats: Retrieve the metadata format available from a DL
ListVisdataFormats: Retrieve the metadata format available from a VIS
ListTransformers: Retrieve the transformers that VIS supports in order to transform the metadata format to the visdata format
RequestResultSet: Transfer query data

6. Special Protocols: Z39.50

a. Introduction:

i. A client server protocol for searching and retrieving information from remote computer databases

ii. Covered by ANSI/NISO (National Information Standards Organization) standard Z39.50

iii. Accepted by ISO (International Standards Organization) as standard 23950

iv. The standard is maintained by the Library of Congress (LOC)

v. First version of Z39.50 in 1988, Z39.50 – 1992 (version 2), Z39.50 – 1995 (version 3)

b. Important components:

i. Target: Implements the abstract database and is a ready made server module

ii. Gateway: A program that has two interfaces. First, it acts as Origin to a Z39.50 Target. Second, it handles communication with a client application. Client protocol may be HTML, Telnet, Z39.50, etc. Advanced Gateway like the one in the figure below can connect to several Z39.50 targets for parallel search, serial search, merging of results. Advanced Gateways can handle several different protocols on both interfaces SQL, LDAP, HTML, DNS, etc.

iii. Origin: Normally part of graphical client which hides complexity from the user. It can access several targets simultaneously. There are clients with a “raw” Origin interface

7. Special Protocols: Stanford’s Digital Library Interoperation Protocol (DLIOP)

a. Stanford Digital Library test bed

i. Stanford Digital Library test bed was a platform for experimentation with interoperation among online services.

ii. The figure 13 shows the InfoBus, central to the architecture of the Stanford test bed. It is based on a hardware bus metaphor to suggest that services, repositories, and clients are 'plugged in', and interoperate by taking advantage of interoperability mechanisms built into the testbed.

iii. DLIOP is an asynchronous protocol, providing robustness in the face of network or server outages

iv. The approach is to use distributed objects to allow integrated access to heterogeneous service across networks.

v. The distributed approach allows the interaction of processes on different machines, with different architectures, implemented in different languages.

vi. It uses CORBA to provide communication between remote processes. Xerox PARC’s ILU, a free implementation of a CORBA superset, is used.

b. Z39.50 Interoperability Example of InfoBus: Top part of Figure 14 shows how a Z39.50 client communicates with other services (can be Z39.50 or any other Service). Bottom part of Figure 14 shows how non-Z39.50 clients consume services provided by a Z39.50 Service. The rounded rectangles in the figure are parts interfacing with the InfoBus communicating via DLIOP to provide the interoperable framework.

8. Parallel and Distributed Information Retrieval

a. Motivation

i. Exponential growth in size of online scientific data.

ii. Managing the size and growth of data demands scalable and multitasking algorithms.

b. Requirements for Distributed IR Protocols

i. Enable retrieval of data from digital collections that are geographically separated.

ii. Provide methods for linking and accessing heterogeneous collections of scientific data.

c. Features of Parallel IR

i. Computation model divides the main task into sub-tasks and executes the sub-tasks in parallel.

ii. Sub-tasks generally run on homogeneous systems and each system works on the same problem.

iii. Parallel IR uses shared memory models.

iv. Parallel IR broadcasts every request to every process.

v. Main objective is to achieve speed-up.

d. Features of Distributed IR

i. Like Parallel IR, computation model in Distributed IR divides the main task into sub-tasks and executes the sub-tasks in parallel.

ii. Distributed IR sub-tasks are run on different processing units where inter-process communication is via network protocols.

iii. Sub-tasks may run on heterogeneous systems.

iv. Distributed IR employs a procedure to select a subset of processes to broadcast a request.

v. Main objectives in Distributed IR are to achieve scalability and availability and speed, but may not be efficient.

e. Case Study on Parallel IR

i. Paper on Inverted File Partitioning Schemes in Multiple Disk Systems (IEEE transactions on Parallel and distributed systems, Vol 6, Feb 1995) by Byeong-Soo Jeong and Edward Omiecinski.

ii. Paper discusses two schemes for Parallel IR implementation.

iii. Goal of the paper is to reduce average response time by partitioning the inverted file.

iv. The paper identifies I/O time as a major cost factor in IR system.

v. It exploits the potential of I/O parallelism and balances I/O workload for better response time by partitioning and distributing files.

vi. The paper discusses two partitioning schemes: based on termid and based on document-id for inverted file systems.

f. Case Study on Distributed IR

i. Paper on Methodologies for Distributed Information Retrieval (18th international conference on Distributed Computing Systems – 1998) by Alister Moffat, Justin Zobel, Owen De Kretser and Tim Shimmin.

ii. This paper discusses three different methodologies for Distributed IR and compares their effectiveness, efficiency and response time.

iii. Three different methodologies are defined based on the global information stored at the receptionist.

Central Nothing – CN

The only global information maintained by the receptionist is a list of librarians.

Central Vocabulary – CV

Global information stored by the receptionist is the vocabularies of the sub-collections.

Central Index – CI

Receptionist has full access to the indexes of subcollections.

iv. Comparison between three different methodologies for Distributed IR (advantages and disadvantages).

Resources

Required readings for students

i. Carl Lagoze, Herbert Van de Sompel. The Open Archives Initiative: Building a Low-Barrier Interoperability Framework. JCDL, pp.54-62, First ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL'01), 2001. http://www.openarchives.org/documents/jcdl2001-oai.pdf

ii. Byeong-Soo Jeong; Omiecinski, E. Inverted File Partitioning Schemes in Multiple Disk Systems. Parallel and Distributed Systems, IEEE Transactions on Volume 6, Issue 2, Feb. 1995 Page(s):142 - 153 Digital Object Identifier 10.1109/71.342125. http://csdl.computer.org/comp/trans/td/1995/02/l0142abs.htm

iii. De Kretser, O.; Moffat, A.; Shimmin, T.; Zobel, J. Methodologies for Distributed Information Retrieval. Distributed Computing Systems, 1998. Proceedings. 18th International Conference on 26-29 May 1998 Page(s):66 – 73. Digital Object Identifier 10.1109/ICDCS.1998.679488. http://csdl.computer.org/comp/proceedings /icdcs/1998/8292/00/82920066abs.htm

Required readings for instructors

i. Van de Sompel, H., and Lagoze, C. The Open Archives Initiative Protocol for Metadata Harvesting. Protocol Version 2.0 of 2002-06-14, Document Version 2004/10/12T15:31:00Z http://www.openarchives.org/OAI/2.0/openarchivesprotocol.htm

Concept map

Note: IHMC Cmap Tools is an open source client tool to create concept maps. CmapServer enables the users to collaborate and share concept maps anywhere on the internet. Both software can be downloaded freely for educational purposes from http://cmap.ihmc.us/download/index.php

Exercises / Learning activities

Homework assignment:

a. Form student groups

i. Evaluate the interoperability of DL services using the Stanford and University of Michigan Digital Library Testbeds

b. Peer-to-Peer architecture is widely used to store digital objects. So, considering a peer-to-peer structure which enables a digital object repository, such as ADEPT, what kind of digital library service should be implemented?

In-class Activities

a. In class, break students into groups of 3~4

i. Each group discusses what kind of services their digital library should provided. Does the digital library act as data provider or service provider?

ii. Discusses what kind of metadata format and subject schema will provide, and what update frequency.

iii. Discusses choice of kind of digital library architecture and the components to construct these services.

iv. Discusses what kind of protocols will be implemented and why?

v. The syntax of the underlying database is abstracted in Z39.50, what are the advantages and disadvantages of that?

Evaluation of learning outcomes

In their answers to the discussion questions, students demonstrate an understanding of

a. Different DL protocols

b. What role does DL protocols acted in DL architecture?

c. What the requirement for IR protocols?

Evaluate the DL harvesting/providing service which student created.

Glossary

a. Dublin Core (DC): A shorthand term for the Dublin Core Metadata Element Set, the primary work of the Dublin Core Metadata Initiative (DCMI); a set of metadata elements intended to be generic and universally useful for object description.

b. Interoperability: The ability of two or more systems or components to exchange information and to use the information that has been exchanged.

c. Metadata: Data about data; structured description of information objects; used to aid the understanding, administration, and use of information objects.

d. Resource discovery: Identifying previously unidentified (to the user) data objects

e. CORBA: The Common Object Requesting Broker Architecture is a standard defined by the Object Management Group (OMG) that enables software components written in multiple computer languages and running on multiple computers to work together.

Additional useful links

None

Contributors

a. Developers:

i. Ajeet Singh (Virginia Tech)

ii. Yinlin Chen (Virginia Tech)

iii. Srinivasa Santhanam (Virginia Tech)

iv. Weihua Zhu (Virginia Tech)

b. Evaluators:

i. Edward A. Fox (Virginia Tech)