Thesaurus (information retrieval)

This article is about thesauri in the context of information retrieval. Such a thesaurus is a controlled vocabulary expanded with relations of broader, narrower and related terms, serving subject indexing and vocabulary control. A controlled vocabulary is a collection of sets of words and phrases where one item in each set is marked as a preferred term and the other items, usually synonyms, are marked as non-preferred terms.

A thesaurus is sometimes classified as one type of knowledge organization system (KOS); other types include ontologies and classifications.

A good way to learn from this article is to follow the links to example entries in the various thesauri below and so get an impression of actual practice. For many learners, this works much better than learning from abstract descriptions.

Thesaurus entry features

Features of an entry of a thesaurus for information retrieval, where an entry corresponds to a concept, entity or subject:

  • The preferred term denoting the concept or entity. Also called a descriptor.
  • A set of non-preferred terms denoting the concept or entity. Also called non-descriptors. These are often labeled "use for" or "used for" (UF).
  • A broader term (BT). Sometimes called a broader concept. This relationship covers three distinct relationships: generic (subclass-of), instance-of, and part-whole, as per ANSI/NISO Z39.19-2005. Some thesauri take a more relaxed or indefinite approach, e.g. connecting price to price policy via BT, which fits none of the three relationships mentioned. The 1971 UNESCO guidelines envision the abbreviations BTG (generic) and BTP (partitive); ANSI/NISO Z39.19-2005 adds BTI (instance).
  • A set of narrower terms (NT). Sometimes called narrower concepts. This is the inverse of BT, and therefore also covers three distinct relationships. The 1971 UNESCO guidelines envision the abbreviations NTG (generic) and NTP (partitive); ANSI/NISO Z39.19-2005 adds NTI (instance).
  • A set of related terms (RT). Sometimes called related concepts. This relationship is described in more detail in section 8.4, Associative Relationships, of ANSI/NISO Z39.19-2005.
  • Note or scope note, defining the term or restricting its scope.
  • Definition. This is less often used; usually, a definition is placed into a scope note. Used e.g. in NCI Thesaurus, NCI Metathesaurus and PACTOLS thesaurus.
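
The entry features listed above can be pictured as a small record type. What follows is a minimal sketch in Python, not the data model of any particular thesaurus: the class name ThesaurusEntry, its field names, the function format_entry and the sample entry for "cats" are all invented for illustration, and broader terms are held in a list on the assumption that an entry may have more than one.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ThesaurusEntry:
    """One entry, corresponding to one concept, entity or subject (hypothetical model)."""
    preferred_term: str                                            # descriptor
    non_preferred_terms: List[str] = field(default_factory=list)  # UF
    broader_terms: List[str] = field(default_factory=list)        # BT (list is an assumption)
    narrower_terms: List[str] = field(default_factory=list)       # NT
    related_terms: List[str] = field(default_factory=list)        # RT
    scope_note: Optional[str] = None                               # SN
    definition: Optional[str] = None                               # less often used


def format_entry(entry: ThesaurusEntry) -> str:
    """Render an entry as a tagged text display using SN, UF, BT, NT and RT."""
    lines = [entry.preferred_term]
    if entry.scope_note:
        lines.append(f"  SN {entry.scope_note}")
    tagged = (
        ("UF", entry.non_preferred_terms),
        ("BT", entry.broader_terms),
        ("NT", entry.narrower_terms),
        ("RT", entry.related_terms),
    )
    for tag, terms in tagged:
        for term in terms:
            lines.append(f"  {tag} {term}")
    return "\n".join(lines)


# Sample data invented for illustration only.
entry = ThesaurusEntry(
    preferred_term="cats",
    non_preferred_terms=["house cats"],
    broader_terms=["mammals"],
)
print(format_entry(entry))
```

The printed display puts the preferred term first and each relation on its own tagged line, which is roughly how many thesauri present an entry.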

Coverage of concepts or individual objects

A thesaurus may cover only terms referring to concepts, as does e.g. the Art & Architecture Thesaurus. Alternatively, it may cover terms referring to individual objects, that is, proper names, as does e.g. the Getty Thesaurus of Geographic Names. It may also cover both.

Names of individual objects can be connected by BT and NT relations as well, via the part-whole relationship. Thus, e.g. Barcelona has BT Catalonia.
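
Since NT is the inverse of BT, NT links do not have to be recorded separately; they can be derived from the BT links. The following Python sketch shows one way to do that, assuming the BT links are held in a plain dictionary from a term to its broader terms. The function name derive_narrower and the second sample term, Girona, are additions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def derive_narrower(broader: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Invert a BT mapping (term -> broader terms) into an NT mapping."""
    narrower = defaultdict(list)
    for term, broader_terms in broader.items():
        for broader_term in broader_terms:
            narrower[broader_term].append(term)
    # Sort so that the result does not depend on input order.
    return {term: sorted(children) for term, children in narrower.items()}


# "Barcelona BT Catalonia" comes from the text; Girona is added for illustration.
broader = {"Barcelona": ["Catalonia"], "Girona": ["Catalonia"]}
narrower = derive_narrower(broader)
print(narrower)  # {'Catalonia': ['Barcelona', 'Girona']}
```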

Relation to authority files

Many authority files, or resources named authority files, are in fact thesauri insofar as they 1) provide a controlled vocabulary, and 2) provide BT, NT and RT relationships. That they are not named thesauri does not make them any less thesauri by definition.

Form of the preferred term

Many thesauri use a plural form for countable nouns, e.g. "cats" instead of "cat", consistent with section 6.5.1, Count Nouns, of ANSI/NISO Z39.19-2005. By contrast, Wikidata and WordNet tend to use singular forms.

The preferred terms should be nouns or noun phrases, as per section 6.4, Grammatical Forms of Terms, of ANSI/NISO Z39.19-2005. Thus, there should be e.g. an entry for "invention" rather than "invent", and for "swimming" rather than "swim". By contrast, WordNet has separate entries for nouns, adjectives and verbs, e.g. swimming, colorful, and swim.

Example English thesauri

Example non-English thesauri

Thesauri with scope notes or definitions

  • Art & Architecture Thesaurus
  • Health Sciences Descriptors (DeCS)
  • International Classification of Diseases
  • Medical Subject Headings (MeSH)
  • NCI Thesaurus (definitions)
  • NCI Metathesaurus (definitions)
  • PACTOLS thesaurus (definitions)
  • RKD thesaurus
  • USGS Thesaurus
  • WordNet
  • Wikidata

Standards for thesaurus development

Standards for thesaurus development are covered in the History section of the Wikipedia article Thesaurus (information retrieval). The latest standard listed is ISO 25964, published in two parts in 2011 and 2013.

Two standards are available in full text online, and linked from Further reading:

  • UNESCO Guidelines for the establishment and development of monolingual thesauri, 1971.
  • ANSI/NISO Z39.19-2005, Guidelines for the construction, format, and management of monolingual controlled vocabularies.

Further reading:

BARTOC

BARTOC stands for Basic Register of Thesauri, Ontologies & Classifications. It is a database of items of these kinds, that is, of knowledge organization systems (KOS). For each registered thesaurus, it contains a description, and sometimes the number of items: for the African Studies Thesaurus it reports 5,200 English descriptors, and for the Getty Art and Architecture Thesaurus around 131,000 terms.

BARTOC is used by Wikidata for authority control of entities representing knowledge organization systems.

A BARTOC record features the following items:

Further reading:

Non-English terms for thesaurus relations

What follows are non-English terms sometimes used for thesaurus relations such as BT, NT, RT, scope note, etc.

  • Czech
    • nadřazený termín: broader term
    • podřazený termín: narrower term
    • asociovaný termín: related term
  • Dutch
    • Gebruikt voor: used for
    • Ruimere term: broader term
    • Nauwere term: narrower term
  • Finnish
    • Yläkäsite: broader term
    • Alakäsitteet: narrower terms
    • Assosiatiiviset käsitteet: related terms
    • Huomautus: note
  • French
    • Employé pour: use for
    • Concept générique: broader concept
    • Concept spécifique: narrower concept
    • Concept associé: related concept
    • Terme(s) générique(s): broader term(s)
    • Terme(s) spécifique(s): narrower term(s)
    • Terme(s) associé(s): related term(s)
    • Définition: definition
  • German
    • Oberbegriffe: broader terms
    • Untergeordnet: narrower terms
    • Verwandter Begriff: related term
  • Italian
    • Usato per: use for
    • Termine apicale: top term
    • Termine più generale: broader term
    • Termine più specifico: narrower term
    • Termine associato: related term
    • Nota d'ambito: scope note
    • Definizione: definition
  • Spanish
    • Usado por: use for
    • Término genérico: broader term
    • Término específico: narrower term
    • Término relacionado: related term
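
A list like the one above can serve as a lookup table when reading thesaurus displays in several languages. The Python sketch below maps a handful of the labels listed to the abbreviations UF, BT, NT and RT. It is only an illustration: the names RELATION_LABELS and relation_code are invented here, the table covers just a subset of the labels, and real displays may differ in spelling or punctuation.

```python
import unicodedata
from typing import Optional

# A subset of the labels listed above, keyed by label text.
_RAW_LABELS = {
    "nadřazený termín": "BT",
    "podřazený termín": "NT",
    "asociovaný termín": "RT",
    "Gebruikt voor": "UF",
    "Ruimere term": "BT",
    "Nauwere term": "NT",
    "Yläkäsite": "BT",
    "Alakäsitteet": "NT",
    "Employé pour": "UF",
    "Concept générique": "BT",
    "Concept spécifique": "NT",
    "Concept associé": "RT",
    "Oberbegriffe": "BT",
    "Verwandter Begriff": "RT",
    "Usato per": "UF",
    "Término genérico": "BT",
    "Término específico": "NT",
    "Término relacionado": "RT",
}


def _normalize(label: str) -> str:
    """Trim whitespace and a trailing colon, then compare case-insensitively."""
    text = unicodedata.normalize("NFC", label).strip().rstrip(":").strip()
    return text.casefold()


RELATION_LABELS = {_normalize(label): code for label, code in _RAW_LABELS.items()}


def relation_code(label: str) -> Optional[str]:
    """Return UF, BT, NT or RT for a known label, or None if it is not in the table."""
    return RELATION_LABELS.get(_normalize(label))


print(relation_code("Ruimere term"))           # BT
print(relation_code("  Término específico: "))  # NT
print(relation_code("Huomautus"))              # None (not in this subset)
```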

Wikidata items for these relations:

Data interchange formats

One data interchange format useful for thesaurus data is the Simple Knowledge Organization System (SKOS). The format has facilities for preferred and non-preferred ("alternative") labels of concepts, hierarchical and associative relations between concepts, textual notes and definitions, and more.
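
To give an impression of how thesaurus relations look in SKOS, the following Python sketch writes one concept in the Turtle syntax using only string formatting. The SKOS properties used (skos:prefLabel, skos:altLabel, skos:broader, skos:narrower, skos:related, skos:scopeNote) are part of the SKOS vocabulary; the function name to_skos_turtle, the ex: prefix, the base address http://example.org/thesaurus/ and the concept identifiers are assumptions made for illustration. Identifiers are assumed to be simple names without spaces; a production export would more likely rely on an RDF library.

```python
from typing import Iterable, Optional

SKOS_NS = "http://www.w3.org/2004/02/skos/core#"


def to_skos_turtle(
    concept_id: str,
    pref_label: str,
    alt_labels: Iterable[str] = (),
    broader: Iterable[str] = (),
    narrower: Iterable[str] = (),
    related: Iterable[str] = (),
    scope_note: Optional[str] = None,
    lang: str = "en",
    base: str = "http://example.org/thesaurus/",  # hypothetical base address
) -> str:
    """Serialize one concept as SKOS in Turtle (illustrative sketch only)."""

    def literal(text: str) -> str:
        # Escape backslashes and double quotes for a Turtle string literal.
        escaped = text.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"@{lang}'

    statements = ["a skos:Concept", f"skos:prefLabel {literal(pref_label)}"]
    statements += [f"skos:altLabel {literal(label)}" for label in alt_labels]
    statements += [f"skos:broader ex:{other}" for other in broader]
    statements += [f"skos:narrower ex:{other}" for other in narrower]
    statements += [f"skos:related ex:{other}" for other in related]
    if scope_note:
        statements.append(f"skos:scopeNote {literal(scope_note)}")

    header = f"@prefix skos: <{SKOS_NS}> .\n@prefix ex: <{base}> .\n\n"
    body = f"ex:{concept_id}\n    " + " ;\n    ".join(statements) + " .\n"
    return header + body


# "Barcelona BT Catalonia" is the example used earlier in this article.
ttl = to_skos_turtle("barcelona", "Barcelona", broader=["catalonia"])
print(ttl)
```

Here the preferred term becomes skos:prefLabel, non-preferred terms become skos:altLabel, and BT, NT and RT become skos:broader, skos:narrower and skos:related respectively.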

Further reading:

Further reading