The Importance of Controlled Vocabularies

Controlled vocabularies are especially useful when combined with formal metadata standards, because the terms within the controlled vocabulary can serve as the content for specific metadata elements that make up the standard. In many cases, the controlled vocabulary terms completely define the allowable content for a particular metadata element. This control helps avoid misspellings and inconsistencies in the metadata content. Moreover, because a controlled vocabulary can be incorporated into automated procedures, it offers additional capabilities in software systems. For example, in a data system a controlled vocabulary can simplify system input and contribute to quality control of that input. Input is simplified by providing users or other systems with a list of allowed entries for the specific metadata elements. Similarly, the controlled vocabulary can be used to check existing or imported metadata descriptions for consistency and correctness, including details such as spelling and hyphenation.
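The quality-control role described above can be illustrated with a minimal sketch. It assumes a hypothetical "platform" metadata element whose allowed values are held in a simple list. The vocabulary contents and the function name validate_term are illustrative assumptions, not part of any particular metadata standard.

```python
import difflib

# Hypothetical vocabulary for a "platform" metadata element; the second
# entry is a placeholder, not a real vessel.
PLATFORM_VOCABULARY = ["R/V Moana Wave", "R/V Example Vessel"]


def validate_term(value, vocabulary):
    """Check a candidate value against the vocabulary.

    Returns (is_valid, suggestions). Suggestions are the closest allowed
    terms, best match first, so a typo can be corrected rather than
    simply rejected.
    """
    if value in vocabulary:
        return True, []
    suggestions = difflib.get_close_matches(value, vocabulary, n=3, cutoff=0.6)
    return False, suggestions


# A hyphenation slip is flagged and the allowed spelling is offered.
is_valid, suggestions = validate_term("R/V Moana-Wave", PLATFORM_VOCABULARY)
```

The same function could back an input form (by listing the vocabulary as the allowed entries) or a batch check over imported metadata descriptions.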

Controlled Vocabularies as an Interoperability Aid

Translation and crosswalking can be thought of as the basis for metadata interoperability. When a metadata description created by one system can be interpreted by another system, the resource described by the metadata can be used more easily and precisely within both systems.

In spoken language, when we move from one language to another, we need to identify a word in our own language and relate it to a word in the other language. We might need to take a closer look at the word in our own language to determine exactly what it means or how it is properly used. There will be times when a word won't translate directly into a single word or phrase in the other language. We will also probably make use of grammar rules in the translation process.

In the context of metadata, a controlled vocabulary is analogous to a language in the above example. The terms in one controlled vocabulary can be translated into the terms used by a second controlled vocabulary. If the entire controlled vocabulary is translated, then all metadata descriptions that use the first controlled vocabulary can also be translated to use the second controlled vocabulary. In this way, controlled vocabularies facilitate metadata interoperability.
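As a rough sketch of this idea, a term-level translation between two vocabularies can be held in a lookup table (often called a crosswalk) and applied to any metadata description that uses the first vocabulary. The element name "platform", the record layout, and the function name translate_description are assumptions made for illustration.

```python
# Hypothetical crosswalk from the terms of one vocabulary to another.
CROSSWALK = {"MW": "R/V Moana Wave"}


def translate_description(description, element, crosswalk):
    """Return a copy of a metadata description with one element translated.

    Raises KeyError when the term has no translation, so untranslatable
    terms are surfaced instead of being passed through silently.
    """
    term = description[element]
    if term not in crosswalk:
        raise KeyError(f"No translation for {term!r}")
    translated = dict(description)  # leave the original record untouched
    translated[element] = crosswalk[term]
    return translated


# Hypothetical metadata description using the first vocabulary.
record = {"platform": "MW", "title": "example dataset"}
translated = translate_description(record, "platform", CROSSWALK)
```

Once every term in the first vocabulary has an entry in the crosswalk, every description that uses that vocabulary can be translated the same way.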

The different types of controlled vocabularies provide different levels of interoperability. Often when we move from one project to another, we need to identify the metadata descriptions that use one controlled vocabulary and relate these descriptions to another system. We might need to understand more about the terms in the initial controlled vocabulary: what they represent (glossary), how they came to be (dictionary), and which terms are similar (thesaurus, semantic network, or ontology). There will be times when a term doesn't fit neatly into the second controlled vocabulary. This is where hierarchies and classifications (subject headings, taxonomies, and ontologies) become very handy.
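One way a hierarchy can help when a term has no direct counterpart is to fall back to a broader term that does. The sketch below assumes a hypothetical broader-term table and a hypothetical crosswalk. Neither comes from a real vocabulary, and other fallback strategies are possible.

```python
# Hypothetical hierarchy: each term points to its broader term.
BROADER = {
    "R/V Moana Wave": "research vessel",
    "research vessel": "platform",
}

# Hypothetical crosswalk into a second vocabulary that has no entry
# for the individual ship.
CROSSWALK = {"research vessel": "ship"}


def map_with_fallback(term, crosswalk, broader):
    """Map a term, climbing the hierarchy until a mapped term is found.

    Returns (target_term, is_exact). is_exact is False when the match
    was made through a broader term, or when nothing matched at all.
    """
    current = term
    seen = set()  # guards against cycles in the hierarchy
    while current is not None and current not in seen:
        if current in crosswalk:
            return crosswalk[current], current == term
        seen.add(current)
        current = broader.get(current)
    return None, False


target, is_exact = map_with_fallback("R/V Moana Wave", CROSSWALK, BROADER)
```

Returning the is_exact flag lets a consuming system record that the translated term is less specific than the original.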

Example of Controlled Vocabulary Usage

Suppose three different oceanographic research projects are using various vessels or submersibles. In the worst case, we could imagine that none of these projects had a controlled vocabulary. In this case, if someone were to query the data resource to accurately locate all data associated with a particular research vessel like the R/V Moana Wave, they would need to know all the ways "R/V Moana Wave" was represented within the resource, and construct a search query for all of the variations (including the misspelled, misrepresented, and "nicknamed"). Beyond daunting, this seems nearly impossible!

In a better case, we could suppose each project generated a controlled vocabulary, as shown below.

Dictionary

Each term is articulated with an acronym. (1st entry, blue)

The acronyms are spelled out in the description. (2nd entry, yellow)

Additional information about how each term came to be is included in the etymology. (3rd entry, green)

Taxonomy

The actual terms (2nd entry, blue) are placed in a structure, according to the decade in which they were commissioned (1st entry, green).

Ontology

Actual terms (3rd entry, blue) are classified into two major classes (1st entry, green), and one subclass (2nd entry, yellow).

Notice that the vessels are connected to submersibles based on the operating institution. This is a complex interrelation, which enhances the class hierarchy.

Notice that each of these controlled vocabularies represents the same list of real-world objects (i.e., vessels or submersibles). They are presented as different types of controlled vocabularies, with different terms to represent the real-world objects and with slightly different accompanying information.

Suppose each project exposed its particular controlled vocabulary to a search engine and that translations existed between the vocabularies. The search engine could then provide a dropdown menu of platform names to expedite user searches. When a user needs to identify all data associated with the R/V Moana Wave, they could use the dropdown menu to select that particular ship. Without the translation, which in fact only exists because of the three controlled vocabularies, the user would need to know that each project represents the SOEST University of Hawaii vessel in a different way. In that case, the user would need to search for "R/V Moana Wave", "Moana Wave", and "MW", all without typographical errors!
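The search scenario above can be sketched as a query that is expanded to every equivalent term before matching. The three spellings come from the example. The record layout, the project labels, and the function names are hypothetical.

```python
# Terms that the three project vocabularies use for the same vessel.
EQUIVALENTS = [{"R/V Moana Wave", "Moana Wave", "MW"}]

# Hypothetical metadata records, one per project.
RECORDS = [
    {"project": "A", "platform": "R/V Moana Wave"},
    {"project": "B", "platform": "Moana Wave"},
    {"project": "C", "platform": "MW"},
    {"project": "C", "platform": "some other platform"},
]


def expand_query(term, equivalence_sets):
    """Return every term equivalent to the one the user selected."""
    for group in equivalence_sets:
        if term in group:
            return sorted(group)
    return [term]  # no known translation: search for the term as given


def search(records, term, equivalence_sets, element="platform"):
    """Find records whose element matches the term or any equivalent."""
    wanted = set(expand_query(term, equivalence_sets))
    return [r for r in records if r.get(element) in wanted]


# One selection from the dropdown retrieves data from all three projects.
hits = search(RECORDS, "R/V Moana Wave", EQUIVALENTS)
```

The user selects a single name, and the translation between vocabularies supplies the remaining variants.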

In addition, this example illustrates the value of adopting established controlled vocabularies, instead of developing a local vocabulary. Each of these three controlled vocabularies is a representation of the same set of real-world objects, but three different programs took the time to develop a unique controlled vocabulary. One or more of the locally developed controlled vocabularies might not be exhaustive, and not all three contain the same information. If the three programs collaborated and developed a single controlled vocabulary, this authoritative controlled vocabulary could be managed centrally. The controlled vocabulary would be more complete, and thus would be much stronger, possibly with less effort by any individual program.