A Last Resort: Developing a Local Vocabulary

Controlled Vocabulary Management

First, you must choose whether to use existing controlled vocabularies, or to implement and manage your own vocabulary. Management tasks can be avoided if you use the vocabularies managed by another organization. This is usually a good idea, as it will save time and effort and maximize sharing of terminology (see Choosing and Implementing a Controlled Vocabulary).

However, a controlled vocabulary may not exist that meets your project's needs. In that case, your group will need to create and manage a controlled vocabulary (see A Last Resort... Developing a Local Controlled Vocabulary).

Regardless of whether you choose to use an established controlled vocabulary or create your own, it is useful to understand the processes of developing and managing a controlled vocabulary:

Simplified Controlled Vocabulary Development and Management

  1. Clearly define the need for a new controlled vocabulary and determine its specific requirements. Individuals or groups that manage controlled vocabularies must meet the needs of the relevant scientific and technical communities.
  2. Using community expertise, evaluate each candidate term. Is the term widely used? Does it have appropriate meaning to the community?
  3. After a thorough review, format the controlled vocabulary. Different types of controlled vocabularies can be implemented using different formats.
  4. Register the controlled vocabulary with an appropriate organization.
  5. Use the controlled vocabulary in community projects. Solicit input from implementing organizations.
  6. Incorporate user community input to improve future versions of the controlled vocabulary.

Evolution of Controlled Vocabularies

An organization could begin with an authority file, then provide descriptions and etymology in future versions of the controlled vocabulary. This will enhance the authority file and transform it into a dictionary. Perhaps one of the implementing organizations will enrich the dictionary by submitting classifications, relationships, and axioms to the managing organization for the dictionary. What started as an authority file has now become an ontology/dictionary combination. (These terms are explained in Classification of Vocabularies.)
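The progression described above can be pictured as three small data structures. This is only an illustrative sketch: the ocean terms, the placeholder descriptions, the "is_a" relationship, and the helper name related are assumptions made for the example, not part of any managed vocabulary.

```python
# Stage 1: an authority file is simply the list of approved terms.
authority_file = ["Atlantic Ocean", "Pacific Ocean"]

# Stage 2: attaching a description to each term turns the list into a dictionary.
# The descriptions here are placeholders, not authoritative definitions.
dictionary = {
    term: {"description": f"Placeholder description of the {term}."}
    for term in authority_file
}

# Stage 3: contributed relationships move the resource towards an ontology.
# Each entry is a (subject, predicate, object) triple.
relationships = [
    ("Atlantic Ocean", "is_a", "ocean"),
    ("Pacific Ocean", "is_a", "ocean"),
]

def related(term, predicate):
    """Return every object linked to `term` by `predicate`."""
    return [obj for subj, pred, obj in relationships
            if subj == term and pred == predicate]

print(related("Atlantic Ocean", "is_a"))
```

Note that each stage adds to the previous one without changing the original term list, which is what allows the vocabulary to grow through contributions.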

The controlled vocabulary may evolve through contributions by implementing organizations, and can become a living resource that is relatively easy to update, enhance and understand.

It is preferable to use existing standards for long-term interoperability. However, if existing vocabularies are not sufficient, even with extensions, it may be necessary to create a customized controlled vocabulary, whether for new projects or for legacy data.

Considerations for Creating a Vocabulary

These considerations are valid for new projects as well as for legacy data and are explained in more detail below.

  • Identification of all terms and values as discrete content
  • Separation of embedded information
  • Clarity of units
  • Inclusion of natural terms
  • Reduction of ambiguity in definitions
  • Consistent syntactic rules
  • Grouping of terms for discovery
  • Scalability
  • Allowing for user input

Identification of all terms and values as discrete content

The starting point is creating a list of all terms and possible values. To identify the terms of the vocabulary, you need to first examine the descriptions of your assets, looking for discrete (that is, non-continuous) content. Things that are measured are usually continuous, that is, they may have a limitless number of values. Terms whose values have specific descriptions are usually discrete, and any term for which the total number of possible descriptions can be counted is likely to be discrete.

If the possible content of the metadata element is found to be discrete, then it is a likely candidate for a vocabulary. For example, if the descriptor is ocean_name, and the content is the name of the ocean, then the five ocean names could be added to the system as terms in a vocabulary.

Once you have identified the elements that contain discrete content, you must identify every value that those elements may contain. This is the list of terms for the vocabulary. Each value should have a definition that is unique to that value. Developing these definitions amounts to building a dictionary of values for the vocabulary.
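As a rough aid to this examination, the sketch below counts the distinct values found in each field of a set of asset descriptions and flags the fields with few distinct values as vocabulary candidates. The records, the function name discrete_candidates, and the 0.5 ratio threshold are all assumptions chosen for illustration; a person still has to confirm that each flagged field is genuinely discrete.

```python
from collections import Counter

# Hypothetical asset descriptions; field names and values are illustrative only.
records = [
    {"ocean_name": "Atlantic", "temperature": 12.31},
    {"ocean_name": "Pacific", "temperature": 18.02},
    {"ocean_name": "Atlantic", "temperature": 11.97},
    {"ocean_name": "Indian", "temperature": 25.40},
    {"ocean_name": "Pacific", "temperature": 17.55},
    {"ocean_name": "Atlantic", "temperature": 12.08},
]

def discrete_candidates(rows, max_distinct_ratio=0.5):
    """Return fields whose distinct values are few relative to the row count.

    The threshold is an arbitrary assumption, not a rule from the guide.
    """
    candidates = {}
    fields = {name for row in rows for name in row}
    for name in sorted(fields):
        counts = Counter(row[name] for row in rows if name in row)
        total = sum(counts.values())
        if total and len(counts) / total <= max_distinct_ratio:
            # The sorted distinct values form the draft term list.
            candidates[name] = sorted(counts)
    return candidates

candidates = discrete_candidates(records)
print(candidates)
```

With these sample records, ocean_name is flagged because its values repeat, while the measured temperature is not, since every reading differs.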

Separation of embedded information

Vocabulary terms should not include embedded information in the values. An encoded value packs several facts into its characters without explaining any of them. For example, a single value like "XT07aa" might indicate an XBT temperature from a T-7 computed using coefficient set aa. This example value contains information on the type of sensor, the model of sensor, the parameter being measured and the processing applied. Each of these pieces of information should be split out of the single value into separate terms and values.
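A minimal sketch of such a split follows. The character layout (one letter for sensor type, one for parameter, two digits for model, two letters for coefficient set) is an assumption inferred only from the "XT07aa" example, and the lookup tables and the name split_legacy_code are hypothetical.

```python
import re
from typing import NamedTuple

# Assumed layout: sensor-type letter, parameter letter, 2-digit model, 2-letter coefficient set.
LEGACY_CODE = re.compile(
    r"^(?P<sensor>[A-Z])(?P<parameter>[A-Z])(?P<model>\d{2})(?P<coeff>[a-z]{2})$"
)

# Hypothetical lookup tables covering only the example value.
SENSOR_TYPES = {"X": "XBT"}
PARAMETERS = {"T": "temperature"}
MODELS = {"07": "T-7"}

class Decoded(NamedTuple):
    sensor_type: str
    parameter: str
    sensor_model: str
    coefficient_set: str

def split_legacy_code(code):
    """Split one encoded value into separate, explicitly named values."""
    match = LEGACY_CODE.match(code)
    if match is None:
        raise ValueError(f"unrecognized code layout: {code!r}")
    try:
        return Decoded(
            sensor_type=SENSOR_TYPES[match["sensor"]],
            parameter=PARAMETERS[match["parameter"]],
            sensor_model=MODELS[match["model"]],
            coefficient_set=match["coeff"],
        )
    except KeyError as missing:
        raise ValueError(f"unknown code component: {missing}") from None

print(split_legacy_code("XT07aa"))
```

Each field of the result can then be governed by its own vocabulary (sensor types, parameters, models) rather than by one opaque code list.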

Clarity of units

Units must be unambiguous. Your usage vocabulary may or may not contain explicit units. In one approach, each term in the usage vocabulary is tied directly to a single unit (that is, one term can only have one unit). A preferred method is to allow multiple units for a single term (for example, distance can have units of meters or kilometers). By allowing multiple units you effectively introduce another type of vocabulary that your system must support: a unit vocabulary.
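One way to picture the preferred method is a separate unit vocabulary plus a table of the units each term accepts. The sketch below assumes hypothetical names (UNIT_VOCABULARY, TERM_UNITS, unit_is_allowed) and uses only the distance example from the text.

```python
# The unit vocabulary: controlled unit symbols and their names.
UNIT_VOCABULARY = {"m": "meter", "km": "kilometer"}

# Units permitted for each usage-vocabulary term (one term, several units).
TERM_UNITS = {"distance": {"m", "km"}}

def unit_is_allowed(term, unit):
    """A unit is valid only if it is controlled and permitted for the term."""
    return unit in UNIT_VOCABULARY and unit in TERM_UNITS.get(term, set())

print(unit_is_allowed("distance", "km"))
print(unit_is_allowed("distance", "ft"))
```

Keeping the units in their own table means a new unit can be added once and then permitted for any number of terms.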

Inclusion of natural terms

Whenever possible, natural terms that are commonly used within the community should be used in the vocabulary.

Reduction of ambiguity in definitions

This consideration is the counterpart of inclusion of natural terms. If terms introduce ambiguity, then consider other terms. The terms used in your vocabulary should be associated with rigorous definitions and these definitions should be unambiguous to the community using the vocabulary.

Consistent syntactic rules

The terms used in the vocabulary will be created using a set of syntactic rules that may involve capitalization, the use of underscores, or other special characters. The vocabulary must be developed with consistent application of these rules.
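Consistency is easier to maintain when the rules are checked mechanically. The sketch below assumes one possible rule set (lowercase letters and digits, words joined by single underscores, as in ocean_name); the pattern and the function names is_well_formed and normalize are illustrative, and your own rules may differ.

```python
import re

# Assumed rule: starts with a lowercase letter, words joined by single underscores.
TERM_PATTERN = re.compile(r"^[a-z][a-z0-9]*(?:_[a-z0-9]+)*$")

def is_well_formed(term):
    """Check a term against the assumed syntactic rule."""
    return TERM_PATTERN.fullmatch(term) is not None

def normalize(raw):
    """Lowercase a free-text label and join its words with underscores."""
    return re.sub(r"[\s\-]+", "_", raw.strip().lower())

print(is_well_formed("ocean_name"))
print(is_well_formed("Ocean Name"))
print(normalize("Ocean Name"))
```

Running every proposed term through the same check before it enters the vocabulary keeps later additions consistent with the original list.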

Grouping of terms for discovery

Values that are associated with the terms in the usage vocabulary may be grouped, effectively creating a discovery vocabulary. Allowing for such grouping will help in the management of both vocabularies and the discovery of terms by users. The vocabulary should be capable of accommodating this grouping with minimal impact on the management system.
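A simple way to accommodate grouping with little impact on the usage vocabulary is to hold the groups in a separate mapping. In the sketch below the terms, the group names, and the helper functions are all hypothetical and serve only to show the structure.

```python
# Usage vocabulary: the terms actually applied to data (hypothetical examples).
USAGE_TERMS = {"sea_surface_temperature", "air_temperature", "wind_speed"}

# Discovery vocabulary: broader groups that point at usage terms.
DISCOVERY_GROUPS = {
    "temperature": {"sea_surface_temperature", "air_temperature"},
    "wind": {"wind_speed"},
}

def ungrouped_or_unknown():
    """Return grouped terms that are missing from the usage vocabulary."""
    grouped = set().union(*DISCOVERY_GROUPS.values())
    return sorted(grouped - USAGE_TERMS)

def terms_in_group(group):
    """List the usage terms a user finds when searching by group."""
    return sorted(DISCOVERY_GROUPS.get(group, ()))

def groups_for_term(term):
    """List every discovery group that contains the given usage term."""
    return sorted(g for g, members in DISCOVERY_GROUPS.items() if term in members)

print(terms_in_group("temperature"))
print(groups_for_term("wind_speed"))
```

Because the groups live outside the usage vocabulary, regrouping terms for discovery does not require any change to the terms themselves.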

Scalability

Allowing for additions to a vocabulary is an important aspect of planning. The vocabulary should not be limited by the initial terms and values in the list. To avoid this, the term list must consider the general class of things that each term describes and allow for attributes to be defined beyond the immediate terms. For example, if you were studying highway traffic and defined the Number of Doors term, you might accept 2, 3, 4, or 5 as acceptable values for cars. However, you may wish to broaden the term so that it describes the more general class of vehicles, and add 0 as an acceptable value for Number of Doors to allow for motorcycles and scooters.
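The Number of Doors example can be sketched as a term definition that records the class it describes alongside its allowed values, so that broadening the class is an explicit, reviewable step. The TermDefinition class and its broadened method are assumptions made for this illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TermDefinition:
    name: str
    describes: str             # the general class of things the term applies to
    allowed_values: frozenset  # the values currently accepted

    def is_valid(self, value):
        return value in self.allowed_values

    def broadened(self, describes, extra_values):
        """Return a new definition covering a wider class and more values."""
        return TermDefinition(
            self.name, describes, self.allowed_values | frozenset(extra_values)
        )

doors_for_cars = TermDefinition("Number of Doors", "car", frozenset({2, 3, 4, 5}))
doors_for_vehicles = doors_for_cars.broadened("vehicle", {0})

print(doors_for_cars.is_valid(0))
print(doors_for_vehicles.is_valid(0))
```

The original definition is left unchanged, so data described under the narrower term remains valid after the term is broadened.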

The process of defining attributes of general classes is a good step towards developing an ontology, which is discussed in a subsequent guide.

Allowing for user input

Users need a mechanism to suggest new terms for the vocabulary, but they should not be able to add terms directly. A vocabulary is controlled to avoid confusion among terms and to avoid the introduction of errors. Additions, deletions or corrections must be managed by the person responsible for the vocabulary.
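A minimal sketch of such a mechanism follows: anyone may place a suggestion in a pending queue, but only the designated manager can move it into the vocabulary. The class name, its fields, the manager and user names, and the suggested term are hypothetical; a real system would also need to handle deletions, corrections, and review history.

```python
from dataclasses import dataclass, field

@dataclass
class ControlledVocabulary:
    manager: str
    terms: dict = field(default_factory=dict)    # term -> definition
    pending: list = field(default_factory=list)  # (term, definition, suggested_by)

    def suggest(self, term, definition, suggested_by):
        """Any user may queue a suggestion; the vocabulary itself is unchanged."""
        self.pending.append((term, definition, suggested_by))

    def approve(self, term, approved_by):
        """Only the manager may move a pending suggestion into the vocabulary."""
        if approved_by != self.manager:
            raise PermissionError(f"{approved_by} may not modify the vocabulary")
        for index, (pending_term, definition, _) in enumerate(self.pending):
            if pending_term == term:
                self.terms[pending_term] = definition
                del self.pending[index]
                return
        raise KeyError(f"no pending suggestion for {term!r}")

vocab = ControlledVocabulary(manager="vocab_manager")
vocab.suggest("southern_ocean", "Placeholder definition.", suggested_by="user_a")
vocab.approve("southern_ocean", approved_by="vocab_manager")
print(vocab.terms)
```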

Special considerations for legacy data

Creating a vocabulary for data after a project has ended requires the same considerations as for a new project. The process may also take more effort, since only archival information is available to define terms. The data custodian may have to act as data system designer, scientist, and detective to obtain all the information necessary to create an interoperable vocabulary.

The topic of creating controlled vocabularies for legacy data is explored in more detail in the guides that follow.

Suggested Citation

Isenor, A. 2011. "A Last Resort: Developing a Local Vocabulary." In The MMI Guides: Navigating the World of Marine Metadata. http://marinemetadata.org/guides/vocabs/cvdev. Accessed December 14, 2019.