Choosing a Controlled Vocabulary

So you have a content standard for your metadata, but now you need to start filling out the fields using specific terms. How do you decide whether to use a controlled vocabulary and which one(s) to use?

The adoption of vocabularies for a metadata project requires some thought, and some understanding of the characteristics of the project and its data system. This guide is written for the data managers and managing scientists who must implement a data system for their project. It assumes the metadata fields have already been defined, for example by a content standard or a data model, but the detailed terms used to fill out the fields - the vocabularies - have not been identified.

Of course, it is often true that the selection of a vocabulary can influence the selection of a content standard. That is reasonable, since the two should work together. But this guide assumes the more common situation, in which a content standard or data model has been defined as the first step.

There are several factors that influence the selection of the vocabularies to use when filling out metadata fields:

  • specification of a required vocabulary by the content standard
  • characteristics of existing vocabularies
    • availability
    • quality (completeness, clarity and precision, relevance)
    • community adoption
  • support of content standard or data model for multiple vocabularies

These factors must all be considered, but often the main question is simply: Does an acceptable vocabulary exist? This guide will help you assess whether acceptable vocabularies exist, and how to evaluate the relative merits of different vocabularies.

Finding and selecting an appropriate vocabulary involves some thought and research. The alternative is to define your own vocabulary. Projects often take this approach because it can seem easier, but it ultimately reduces your project's ability to interoperate with other data sources: if projects collecting similar data use different vocabularies, their data cannot be combined or searched effectively without additional vocabulary mapping work. Well-established vocabularies with wide community review are also likely to be more fully developed and logically arranged than those developed in a more ad-hoc manner by an individual project.
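To give a sense of what "vocabulary mapping work" can involve, here is a minimal sketch in Python. The local terms, the shared terms, the record layout, and the function name are all hypothetical and chosen only for illustration; a real mapping would need review by someone who knows both vocabularies.

```python
# Hypothetical mapping from one project's local terms to a shared vocabulary.
LOCAL_TO_SHARED = {
    "temp": "sea_water_temperature",
    "sal": "sea_water_salinity",
}


def translate(records, mapping):
    """Split records into (translated, unmapped) by their 'parameter' term."""
    translated, unmapped = [], []
    for record in records:
        term = record["parameter"]
        if term in mapping:
            # Copy the record, replacing only the vocabulary term.
            translated.append({**record, "parameter": mapping[term]})
        else:
            # Terms with no mapping need manual attention.
            unmapped.append(record)
    return translated, unmapped


records = [
    {"station": "A1", "parameter": "temp"},
    {"station": "A1", "parameter": "chl"},
]
translated, unmapped = translate(records, LOCAL_TO_SHARED)
```

Every term that lands in the unmapped list is work that would not have been needed had both projects used the same vocabulary from the start.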

Conventions and Assumptions in this Guide

The fields that must be filled out with metadata can be specified by a content standard or other written specification, or by a model that defines a data system. Often both apply, as the data system model was created to fulfill the requirements of one or more content standards. In this guide, we will refer to the document that specifies the related metadata fields as the 'content standard', referring as well to any data model that serves the same purpose.

For most content standards, there are many different fields that require, or should require, vocabularies. Since most vocabularies cover only a single topic, usually multiple vocabularies are required, one (or more) for each field of the content standard. In the following discussion, we assume the context is choosing a vocabulary for a given field of the content standard.

Finally, it is assumed that the field(s) in question are best entered using terms from a vocabulary. Some textual fields are designed for ad-hoc text, and vocabularies are obviously unsuitable. Vocabularies are most suitable for those fields that have a finite number of potential terms which can be defined in advance.
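As a rough illustration of a field with a finite set of terms defined in advance, the following Python sketch checks a metadata value against a small vocabulary. The field, the term list, and the function names are assumptions made for this example, not part of any particular content standard.

```python
# Hypothetical controlled vocabulary for a "platform type" field.
PLATFORM_TYPES = frozenset({"buoy", "glider", "mooring", "research vessel"})


def normalize(term: str) -> str:
    """Lower-case and collapse whitespace so trivial variants compare equal."""
    return " ".join(term.lower().split())


def validate_field(value: str, vocabulary: frozenset) -> bool:
    """Return True if the normalized value is a term in the vocabulary."""
    return normalize(value) in vocabulary


ok = validate_field("Research  Vessel", PLATFORM_TYPES)
rejected = validate_field("submarine", PLATFORM_TYPES)
```

A free-text field such as an abstract has no comparable term list, which is why a check like this only makes sense for fields meant to hold vocabulary terms.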

Selection

How does the developer determine what vocabulary, or vocabularies, to use? Each of the following sections addresses a step in the process of selecting an appropriate controlled vocabulary for a given purpose.

A fundamental criterion, not included in this list, is whether the vocabulary meets your analytic goals. Even if a vocabulary has very high ratings in the following categories, it just may not be what you are looking for. (A user looking for a "sensor type" vocabulary to drive post-processing software will find most sensor type vocabularies useless, because not all sensors of a given type have the same post-processing characteristics.) Always consider actual entries in the vocabulary to assess whether it will serve your purpose.

First: Does the Content Standard Specify a Vocabulary?

If the content standard specifies a vocabulary to be used, then the discovery and selection process is straightforward. For example, the Directory Interchange Format (DIF) requires that the field "Parameters" be filled out from a set list of DIF science keyword categories. The only concern when a vocabulary is specified is using the appropriate version of the vocabulary per the standard - in some cases this will be a specific version, and in other cases the most recent version of the vocabulary.

Finding Available Vocabularies

There are essentially three types of vocabularies: those developed by a community as a general-use vocabulary; those that are developed by a project for its own purposes, yet are useful in other contexts; and those that may not have been developed as a metadata vocabulary per se, but can be adapted to that purpose. While the first category is the most immediately useful, the other vocabularies are also valuable.

The first place to look for vocabularies is in catalogs, reference pages, and vocabulary or ontology servers. The Marine Metadata Interoperability project provides an extensive list of vocabularies, many of which extend beyond the marine domain, and also provides an ontology service. SWEET is another source of earth science ontologies. More general references may also serve as a source of vocabularies. In the marine domain, the IODE Ocean Portal and NASA's GCMD reference many resources, including vocabularies. Broader resources like Wikipedia can provide pointers to vocabularies (and can also suggest specific terms via their own entries, if you have to create your own vocabulary).

Individual projects typically have one or more vocabularies for the project, and some, like SeaDataNet, maintain a large number of vocabulary lists. These can usually be found by following the Data link on the project web site, but a personal contact may be necessary to find or obtain the actual vocabulary. (MMI tries to represent as many of those marine and environmental vocabularies as are made available, and would appreciate notification of those that it is missing.)

For a particular domain or topic, a web search on 'topic vocabulary' may prove useful. A number of taxonomic vocabularies (i.e., species registries) are available; see the Catalogue of Life for an example list.

For science domains, like marine habitats, many vocabularies are published in individual research papers. Again, where these have come to the attention of MMI they are referenced on the site, but a literature search may prove useful.

Research libraries are also an important source of vocabularies, particularly vocabularies that have been published but not put on-line. Contact your institution's reference library for assistance.

Finally, word of mouth, and its on-line equivalent the email forum, can still be an effective source of information. If you are looking for a vocabulary in a given domain, asking experts in that domain may prove fruitful. For more general vocabulary questions, the ask@marinemetadata.org mail list often elicits useful information, or ask at one of the other metadata email lists pointed to by the site.

Assessing the Quality of a Vocabulary

Vocabularies can be evaluated according to criteria that are largely measurable. The relative weight of each criterion may vary according to individual needs. While this section attempts to present the most significant evaluation criteria, under some circumstances even the most important criteria may be irrelevant.
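One way to make the idea of weighted criteria concrete is a simple weighted average, sketched below in Python. The 0-5 scoring scale, the weights, and the scores are hypothetical; the criterion names follow the headings of this guide. This is an aid to comparison, not a substitute for judgment.

```python
def weighted_score(scores: dict, weights: dict) -> float:
    """Weighted average of per-criterion scores; a weight of 0 ignores a criterion."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("at least one criterion must have a positive weight")
    return sum(scores[name] * weight for name, weight in weights.items()) / total_weight


# Hypothetical weights reflecting one project's priorities.
weights = {"management": 3, "completeness": 2, "clarity": 2, "format": 1, "adoption": 2}

# Hypothetical scores (0 = poor, 5 = excellent) for one candidate vocabulary.
candidate = {"management": 4, "completeness": 3, "clarity": 5, "format": 2, "adoption": 4}

score = weighted_score(candidate, weights)
```

Setting a weight to zero models the case where a criterion is irrelevant to your project.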

Management - Is the Vocabulary Maintained Using Established and Robust Processes?

While other characteristics may be more apparent, the management of a vocabulary is the most important factor in whether the vocabulary will continue to be useful throughout the life of your project. It is all but certain that your project will require some terms that are not in the vocabulary. Unless you expect the vocabulary to remain a static reference, its ability to adapt to new or changed terms will determine its long-term suitability.

Factors that reflect good management practices include a vocabulary's age, the existence and transparency of change procedures, change tracking, and publication record.

Age: When was the last update? A vocabulary that has not been updated for more than a year is likely to be maintained slowly, if at all. (Of course, exceptions are possible if the vocabulary and domain are mature and unchanging, as can be the case for project vocabularies.)
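The one-year rule of thumb can be expressed as a small check, sketched here in Python. The function name and the 365-day default are assumptions for illustration, and the dates are made up.

```python
from datetime import date


def is_stale(last_updated: date, today: date, max_age_days: int = 365) -> bool:
    """Return True if the last update is more than max_age_days before today."""
    return (today - last_updated).days > max_age_days


stale = is_stale(date(2020, 1, 1), today=date(2021, 6, 1))
fresh = is_stale(date(2021, 1, 1), today=date(2021, 6, 1))
```

A stale result is a prompt to look more closely, not a verdict, given the exceptions noted above for mature vocabularies.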

Processes: Do change procedures exist? Change procedures document how the vocabulary can be modified. Typical modifications include adding terms, improving the definition (or other characteristics) of terms, and marking or deleting terms that are obsolete. The change procedures should be clearly and publicly described. Ideally they call for community feedback on proposed changes.

Transparency: Are procedures open and transparently followed? If changes occur without being visible in an open forum, it is difficult to be sure that the change procedures are being followed consistently and correctly. Lack of visibility also limits diverse inputs from the community.

Tracking: Are changes effectively tracked? Each change made to a vocabulary should be tracked, including the date, author, original requester, and the item changed. Ideally, a reference to any related materials should also be documented. Changes should be tracked at the level of individual items or records, not just at the level of whole files. Each time a change is made, the revision identifier (version number or other identifier) of any document containing the change (e.g., the file or data set) should be updated; a single revision update may incorporate multiple item changes. Any past version of the vocabulary, or of any of its terms, should be readily recoverable using either a timestamp or a revision identifier.
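The tracking practices described above can be sketched as an item-level change log that is replayed to recover any past revision. This Python example is a minimal illustration under assumed names: the Change record, its fields, and the sample entries are hypothetical and do not describe how any particular vocabulary is managed.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Change:
    revision: int               # revision identifier; several changes may share one
    changed_on: date
    author: str
    requester: str
    term: str                   # the item changed
    definition: Optional[str]   # None marks the term as retired in this revision
    reference: str = ""         # pointer to related materials, if any


def vocabulary_at(changes, revision):
    """Replay changes up to and including `revision`; return {term: definition}."""
    state = {}
    for change in sorted(changes, key=lambda c: c.revision):
        if change.revision > revision:
            break
        if change.definition is None:
            state.pop(change.term, None)
        else:
            state[change.term] = change.definition
    return state


# Hypothetical change log for a small instrument vocabulary.
LOG = [
    Change(1, date(2020, 1, 15), "curator", "project A", "CTD",
           "Conductivity, temperature, depth sensor"),
    Change(2, date(2020, 6, 1), "curator", "project B", "ADCP",
           "Acoustic Doppler current profiler"),
    Change(3, date(2021, 3, 9), "curator", "project A", "CTD",
           "Conductivity-temperature-depth profiler"),
]

revision_2 = vocabulary_at(LOG, 2)
revision_3 = vocabulary_at(LOG, 3)
```

Because the log is never edited in place, the earlier definition of a term stays recoverable after it is revised, which matters when older metadata still uses it.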

Continuity of Presence: Has the vocabulary been consistently published? Vocabularies intended for public use should be presented in a reliably accessible on-line forum. The URL (web address) for the 'most current' vocabulary should not change, nor should URLs for specific vocabulary versions. All past versions should be available; obsolete terms or definitions should remain available via archives (and not be removed from them), since previous metadata may use the obsolete terms.

Organizational Sponsorship: Although it is a relatively subjective characteristic, the nature of the organization that is maintaining a vocabulary can occasionally provide a useful clue in evaluating vocabularies. Organizations that are larger, better-funded, more permanent, and focused on good metadata practices and solutions may have an advantage here. At the same time, open source efforts that have significant community investment may have a comparably large long-term viability, since the responsibility is spread out over many individuals, organizations, and countries.

Completeness - Is It Comprehensive?

A vocabulary that covers more aspects of a topic or domain is likely to be more carefully considered than one providing fewer terms, and is more likely to contain terms of value. For example, a list of sensor manufacturers that only considers current commercial instrument vendors is unlikely to include the vendors of all of your instruments. Such a list is also unlikely to incorporate robust practices for distinguishing between multiple phases of the same company (e.g., as company takeovers and mergers occur).

Clarity and Precision - Are Terms Intuitive, Well Described, and Unambiguous?

The ideal vocabulary completely characterizes the topic the vocabulary is designed to address. Each term is clearly distinct from every other term, and the names intuitively bring to mind the concept they represent. Descriptions for each term are sufficiently clear to eliminate any uncertainty in the user's mind about whether a term is the appropriate one.

These are lofty goals. It is not always possible to meet them in a world of real language and users. The best vocabularies strive for these characteristics and identify solutions that break these principles the least.

Format - Is It Available On-Line in a Defined Format?

While many vocabularies are presented as a web page - that is, in HTML - this is a difficult format to work with computationally. At a minimum, a vocabulary should be available in delimited text or Excel format. Serious developers of controlled vocabularies will present their work in an ontology language such as OWL (Web Ontology Language) or another RDF format, so that it can be accessed online by ontology tools and downloaded for local applications.
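To show why delimited text is easier to work with than a web page, the following Python sketch loads a tab-delimited vocabulary with the standard csv module. The column names "term" and "definition" and the sample rows are assumptions for this example; real vocabulary files will have their own layout.

```python
import csv
import io

# Hypothetical tab-delimited vocabulary with a header row.
SAMPLE = (
    "term\tdefinition\n"
    "buoy\tA floating platform anchored or drifting at sea\n"
    "glider\tAn autonomous underwater vehicle driven by buoyancy changes\n"
)


def load_vocabulary(handle, delimiter="\t"):
    """Read a delimited vocabulary into a {term: definition} dictionary."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    return {row["term"].strip(): row["definition"].strip() for row in reader}


vocab = load_vocabulary(io.StringIO(SAMPLE))
```

With a file on disk, the same function could be given an open file handle in place of the in-memory sample.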

Evaluating Community Adoption

An important consideration in choosing a vocabulary is the vocabulary's level of adoption in relevant communities. Three communities to consider in this evaluation are:

  • global and regional communities
  • domain communities (e.g., research discipline or specific science domain)
  • project communities

For each community, the adoption level of a vocabulary can be assessed in non-quantitative ways. Sources of information include the vocabulary authors, managers of data systems in the community, and on-line searches (for either references to the vocabulary, or actual instances using the vocabulary).

While community adoption should not always be a dominant consideration, strong community adoption of a vocabulary can make an important difference in the value of the vocabulary (because it is easier to interoperate with other users in that community).

Advanced Semantic Relations

Vocabularies that are actually full-fledged ontologies, with detailed class-subclass relationships and defined properties, are potentially of greater long-term value. The additional knowledge embedded in sophisticated ontologies enables using them in more advanced and more automated ways. These capabilities may benefit users of your data, as they can use advanced semantic applications to evaluate and process your products.
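As a small illustration of what class-subclass relationships make possible, the Python sketch below finds every term that falls under a broader term. The hierarchy and function names are hypothetical, and a plain dictionary stands in for a real ontology, which would normally be queried with ontology tools.

```python
# Hypothetical hierarchy: each term maps to its direct parent (None at the root).
PARENT = {
    "sensor": None,
    "temperature sensor": "sensor",
    "thermistor": "temperature sensor",
    "salinity sensor": "sensor",
}


def is_a(term, ancestor, parent=PARENT):
    """True if `term` equals `ancestor` or is a direct or indirect subclass of it."""
    seen = set()
    while term is not None and term not in seen:
        if term == ancestor:
            return True
        seen.add(term)  # guard against accidental cycles in the hierarchy
        term = parent.get(term)
    return False


def narrower_terms(ancestor, parent=PARENT):
    """All terms that are strictly narrower than `ancestor`, sorted by name."""
    return sorted(t for t in parent if t != ancestor and is_a(t, ancestor, parent))


under_sensor = narrower_terms("sensor")
```

A search for "sensor" can then also return data tagged "thermistor", something a flat term list cannot support without extra work.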

Supporting Tools

Some metadata editor tools have built-in vocabulary pick-lists, making implementation of those vocabularies easier.
