Choosing a Controlled Vocabulary

This guide is written for data managers and managing scientists who must implement a data system for their project. The adoption of vocabularies for a metadata project requires understanding of the characteristics of the project and the data system in which the vocabularies will be applied.

Finding and selecting an appropriate vocabulary takes some research. Defining your own vocabulary may seem like an easier alternative, but ultimately this approach decreases your project’s ability to interoperate with other data sources.

For instance, combining or searching data from similar projects may require additional vocabulary mapping work that could be avoided by adopting an established vocabulary.

Although the selection of a vocabulary can influence the selection of a content standard so that the two can optimally work together, this guide assumes the more common situation, in which a content standard or data model has been defined as the first step. That is, it assumes the metadata fields have already been defined, but the detailed terms used to fill out the fields—the vocabularies—have not been identified.

There are several factors that influence the selection of the vocabularies to use when filling out metadata fields:

  • Specification of a required vocabulary by the content standard
  • Characteristics of existing vocabularies
    • Availability
    • Quality (completeness, clarity and precision, relevance)
    • Community adoption
  • Support of content standard or data model for multiple vocabularies

This guide will help you assess whether acceptable vocabularies exist that take these factors into consideration and will help you evaluate their relative merits.

Conventions and Assumptions in this Guide

In this guide, we will refer to the document that specifies the related metadata fields as the "content standard," referring as well to any data model that serves the same purpose.

For most content standards, there are many different fields that require, or should require, vocabularies. Since most vocabularies cover only a single topic, usually multiple vocabularies must be selected, one (or more) for each field of the content standard. In the following discussion, we assume the context is choosing a vocabulary for a given field of the content standard.

Finally, it is assumed that the fields in question are best entered using terms from a vocabulary. Some textual fields are designed for ad hoc text, for which vocabularies are unsuitable. Vocabularies are most suitable for those fields that have a finite number of potential terms that can be defined in advance.
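As a minimal sketch of what a finite, predefined set of terms makes possible, the following Python checks field entries against a vocabulary. The "platform type" vocabulary, the function name, and the case-insensitive matching rule are all assumptions made for illustration; they do not come from any particular content standard.

```python
def find_invalid_terms(values, vocabulary):
    """Return the values that are not terms in the vocabulary, in input order."""
    # Assumption: matching ignores case and surrounding whitespace.
    allowed = {term.strip().lower() for term in vocabulary}
    return [value for value in values if value.strip().lower() not in allowed]


# Hypothetical vocabulary for a "platform type" metadata field.
PLATFORM_TYPES = ["buoy", "glider", "research vessel"]

entries = ["Buoy", "glider", "submarine"]
invalid = find_invalid_terms(entries, PLATFORM_TYPES)
print(invalid)  # ['submarine']
```

A check like this is only meaningful for fields whose terms can be listed in advance; it has no useful equivalent for free-text fields.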

The Selection Process

Assessing Your Analytic Goals

Always consider actual entries in the vocabulary to assess whether it will serve your purpose. Even if a vocabulary has very high ratings in other factors, it may not meet your project’s needs. For instance, a user looking for a "sensor type" vocabulary to drive post-processing software will find most sensor type vocabularies useless because not all sensors of a given type have the same post-processing characteristics.

Conforming to the Content Standard

If your content standard specifies a vocabulary to be used, then the discovery and selection process is straightforward. For example, the Directory Interchange Format (DIF) requires that the field "Parameters" be filled out from a set list of DIF science keyword categories. The only concern when choosing a specified vocabulary is in using the appropriate version—either a specific version or the most recent version of the vocabulary.

Finding Available Vocabularies

If your content standard does not specify a vocabulary, then you will need to find one or more that fit your needs. There are essentially three types of vocabularies, the first of which is often the most useful:

  • Those developed by a community as a general-use vocabulary.
  • Those developed by a project for its own purposes, yet useful in other contexts.
  • Those that may not have been developed as a metadata vocabulary per se, but can be adapted to that purpose.

The first places to look for vocabularies are in catalogs, reference pages, and vocabulary or ontology servers. The Marine Metadata Interoperability project (MMI) provides an extensive list of vocabularies, many of which extend beyond the marine domain. MMI also provides an ontology service.

SWEET is another source of earth science ontologies. More general references may also serve as a source of vocabularies. In the marine domain, the IODE Ocean Portal and NASA's GCMD reference many resources, including vocabularies. Broader resources like Wikipedia can provide pointers to vocabularies (and can also suggest specific terms via their own entries, if you have to create your own vocabulary).

Individual projects typically have one or more vocabularies for the project, and some, like SeaDataNet, maintain a large number of vocabulary lists. These can usually be found by following the Data link on the project website, but a personal contact may be necessary to find or obtain the actual vocabulary. (MMI tries to represent as many of those marine and environmental vocabularies as possible and would appreciate notification of any missing from our list.)

For a particular domain or topic, a web search on "topic vocabulary" may prove useful. A number of taxonomic vocabularies (e.g., species registries) are available; see the Catalogue of Life for an example list.

For science domains, like marine habitats, many vocabularies are published in individual research papers. Again, where these have come to the attention of MMI they are referenced on this site, but a literature search may uncover others.

Research libraries are also an important source of vocabularies, particularly vocabularies that have been published but not put online. Contact your institution's reference library for assistance.

Finally, word of mouth, and its online equivalent, the email forum, can still be effective sources of information. For more general vocabulary questions, the ask@marinemetadata.org mailing list often elicits useful information; you can also ask on one of the other metadata mailing lists referenced on the site.

Assessing the Quality of a Vocabulary

Vocabularies can be evaluated according to criteria that are largely measurable. The relative weight of each criterion may vary according to individual needs.
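One way to make the weighting explicit is a simple weighted average, sketched below in Python. The 0 to 5 rating scale, the weights, and the two candidate vocabularies are hypothetical; the criterion names follow the headings of this section. Treat the result as an aid to comparison, not as a substitute for reading the actual entries of each vocabulary.

```python
def weighted_score(ratings, weights):
    """Combine per-criterion ratings into a single weighted average."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_sum = sum(ratings[name] * weight for name, weight in weights.items())
    return weighted_sum / total_weight


# Hypothetical weights: this project cares most about management.
weights = {"management": 3, "completeness": 2, "clarity": 2, "format": 1}

# Hypothetical ratings on a 0-5 scale for two candidate vocabularies.
candidates = {
    "vocabulary_a": {"management": 4, "completeness": 3, "clarity": 5, "format": 2},
    "vocabulary_b": {"management": 2, "completeness": 5, "clarity": 3, "format": 5},
}

ranked = sorted(
    candidates,
    key=lambda name: weighted_score(candidates[name], weights),
    reverse=True,
)
print(ranked)  # ['vocabulary_a', 'vocabulary_b']
```

Changing the weights to favor completeness or format can reverse the ranking, which is the point of stating them explicitly.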

Management: Is the Vocabulary Maintained Using Established and Robust Processes?

While other characteristics may be more apparent, the management of a vocabulary is the most important factor in whether the vocabulary will continue to be useful throughout the life of your project. Unless you expect the vocabulary to remain a static reference, its ability to adapt to new or changed terms will determine its long-term suitability.

Factors that reflect good management practices include a vocabulary’s age, the existence and transparency of change procedures, change tracking, and publication record. What to look for in these and other factors is described below.

Age: When was the last update? A vocabulary that has not been updated for more than a year is likely to be maintained slowly, if at all. Exceptions are possible if the vocabulary and domain are mature and unchanging, as can be the case for project vocabularies.
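The one-year rule of thumb can be applied mechanically once you know a vocabulary's last update date. The sketch below assumes you have obtained that date yourself (for example, from the vocabulary's change log); the function name and the 365-day default are illustrative choices.

```python
from datetime import date


def is_stale(last_updated, today, max_age_days=365):
    """Return True if the last update is more than max_age_days before today."""
    return (today - last_updated).days > max_age_days


# Hypothetical dates for two candidate vocabularies.
print(is_stale(date(2018, 1, 15), date(2019, 12, 7)))  # True
print(is_stale(date(2019, 6, 1), date(2019, 12, 7)))   # False
```

A stale result is a prompt to investigate, not a verdict: as noted above, a mature vocabulary in a stable domain may legitimately go unchanged for years.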

Processes: Do change procedures exist? Change procedures document how the vocabulary can be modified. Typical modifications include adding terms, improving the definition (or other characteristics) of terms, and marking or deleting terms that are obsolete. The change procedures should be clearly and publicly described. Ideally, they call for community feedback on proposed changes.

Transparency: Are procedures open and transparently followed? If changes occur without being visible in an open forum, it is difficult to be sure that the change procedures are being followed consistently and correctly. Lack of visibility also limits input from the community.

Tracking: Are changes effectively tracked? Each change made to a vocabulary should be tracked, including the date, author, original requester, and the item changed. Ideally, a reference to any related materials should also be recorded. Changes should be tracked at the level of individual items or records, not just at the level of whole files. Each time a change is made, the revision identifier (version number or other identifier) of any document containing the change (e.g., the file or data set) should be updated. A single revision update may incorporate multiple item changes. Any past version of the vocabulary, or of any of its terms, should be readily recoverable using either a timestamp or a revision identifier.
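To illustrate item-level tracking, here is a minimal Python sketch of a change log from which any past state of a vocabulary can be rebuilt by date. The record layout, field names, and sample terms are assumptions for illustration; a real vocabulary service would define its own schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ChangeRecord:
    changed_on: date
    author: str
    requester: str
    term: str
    new_definition: Optional[str]  # None means the term was retired
    revision: str                  # revision that published this change


def vocabulary_as_of(changes, when):
    """Rebuild the term -> definition mapping as it stood on the given date."""
    terms = {}
    for change in sorted(changes, key=lambda c: c.changed_on):
        if change.changed_on > when:
            break
        if change.new_definition is None:
            terms.pop(change.term, None)  # retired, but still kept in the log
        else:
            terms[change.term] = change.new_definition
    return terms


# Hypothetical change log; revision "1.1" bundles two item changes.
changes = [
    ChangeRecord(date(2018, 3, 1), "editor_a", "user_x", "CTD",
                 "Conductivity, temperature, depth sensor", "1.0"),
    ChangeRecord(date(2019, 5, 10), "editor_b", "user_y", "CTD",
                 "Instrument measuring conductivity, temperature, and depth", "1.1"),
    ChangeRecord(date(2019, 5, 10), "editor_b", "user_y", "ADCP",
                 "Acoustic Doppler current profiler", "1.1"),
]

print(vocabulary_as_of(changes, date(2018, 12, 31)))
```

Because the log is append-only, retired terms and superseded definitions stay recoverable, which matters when older metadata still uses them.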

Continuity of Presence: Has the vocabulary been consistently published? Vocabularies intended for public use should be presented in a reliably accessible online forum. The URL for the most current vocabulary should not change, nor should URLs for specific vocabulary versions. All past versions should be available. Obsolete terms or definitions should remain available via archives (and not be removed from them), since previous metadata may use the obsolete terms.

Organizational Sponsorship. Although it is a relatively subjective characteristic, the nature of the organization that maintains a vocabulary can occasionally provide a useful clue in evaluating vocabularies. Organizations that are larger, better-funded, more permanent, and focused on good metadata practices and solutions may have an advantage here. At the same time, open source efforts that have significant community investment may have comparable long-term viability, since the responsibility is spread out over many individuals, organizations, and countries.

Completeness: Is the Vocabulary Comprehensive?

A vocabulary that covers more aspects of a topic or domain is likely to be a better candidate than one providing fewer terms because it is more likely to contain usable terms. For example, a list of sensor manufacturers that only considers current commercial instrument vendors is unlikely to include vendors of all of your instruments. Such a limited list is also unlikely to incorporate robust practices for distinguishing between multiple phases of the same company (e.g., as company takeovers and mergers occur).

Clarity and Precision: Are Terms Intuitive, Well Described, and Unambiguous?

The ideal vocabulary completely characterizes the topic the vocabulary is designed to address. Each term is clearly distinct from every other term, and the names intuitively bring to mind the concept they represent. Descriptions for each term are sufficiently clear to eliminate any uncertainty in the user’s mind about whether a term is the appropriate one.

Format: Is the Vocabulary Available Online in a Defined Format?

While many vocabularies are presented as a web page, that is, in Hypertext Markup Language (HTML), this format is difficult to work with computationally. At a minimum, a vocabulary should be available in delimited text or Microsoft Excel format. Serious developers of controlled vocabularies will present their work in an ontology language such as OWL (Web Ontology Language) or in another RDF format, so that it can be accessed online by ontology tools and downloaded for local applications.
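To show why delimited text is a workable minimum, the sketch below loads a tab-delimited vocabulary using only the Python standard library. The column names "term" and "definition" and the sample rows are hypothetical; a published vocabulary will document its own columns.

```python
import csv
import io


def load_vocabulary(text, delimiter="\t"):
    """Parse delimited text with a header row into a term -> definition dict."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return {row["term"]: row["definition"] for row in reader}


# Hypothetical file contents; in practice, read these from a downloaded file.
SAMPLE = (
    "term\tdefinition\n"
    "buoy\tA floating platform\n"
    "glider\tAn autonomous underwater vehicle\n"
)

vocab = load_vocabulary(SAMPLE)
print(sorted(vocab))  # ['buoy', 'glider']
```

Extracting the same terms from an HTML page would require page-specific parsing that breaks whenever the page layout changes, which is the practical difficulty noted above.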

Evaluating Community Adoption

An important consideration in choosing a vocabulary is its level of adoption in relevant communities:

  • Global and regional communities
  • Domain communities (e.g., research discipline or specific science domain)
  • Project communities

For each community, the adoption level of a vocabulary can be assessed in non-quantitative ways. Sources of information include the vocabulary authors, managers of data systems in the community, and online searches for either references to the vocabulary or actual instances using the vocabulary.

While community adoption should not always be a dominant consideration, strong community adoption of a vocabulary can make an important difference in the value of the vocabulary, especially when your goal is easier interoperability with other users’ projects and systems within that community.

Advanced Semantic Relations

Vocabularies that are full-fledged ontologies, with detailed class-subclass relationships and defined properties, are potentially of greater long-term value. The additional knowledge embedded in sophisticated ontologies enables using them in more advanced and more automated ways.

Supporting Tools

Some metadata editor tools have built-in vocabulary pick-lists, making implementation of those vocabularies easier. See the Tools guide for specifics.

Suggested Citation

Graybeal, J. 2011. "Choosing a Controlled Vocabulary." In The MMI Guides: Navigating the World of Marine Metadata. http://marinemetadata.org/guides/vocabs/cvchooseimplement/cvchoosing. Accessed December 7, 2019.