Developing Vocabularies for Legacy Data

Creating a controlled vocabulary for data that already exist requires commitment to a long-term plan. The data manager must apply the considerations outlined in previous guides about developing local vocabularies, as well as take into account that some information about data collection may not be available after the project is complete. The data manager creating the vocabulary may not have been part of the original project and needs to consider the following:

Knowledge of the Field

Developing vocabularies requires knowledge of the field from which the data originated. That knowledge gives the data manager the ability and credibility to talk to the people who really know the data: the scientists who collected or produced them.

Existence of Archived Reports or Planning Documentation

For legacy data, valuable sources of information may exist in documents produced during data collection, processing, or the reporting of results. These documents should be searched for metadata relevant to the data set, keeping in mind that the planning documents for data collection may differ from the actual collection.

Division of the Data into Subsets

Dividing the data into subsets provides a manageable starting point. For example, data organized by cruise could be further subdivided into scientific topics, such as physics, chemistry, biology, or geology. Choosing the chemistry topic of one cruise would be a suitable starting place in this example, if that were the area best known by the data manager creating the vocabulary.

Examination of the Data

Data can be categorized by instrumentation, procedures, and units. Compiling lists of data types and their corresponding unit names, along with instrumentation, collection procedures, and processing procedures will form the starting point for creating local vocabularies. All allowed units must be included.
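The compilation step described above could be sketched as follows. This is only a minimal illustration under stated assumptions: the record layout, the field names (data_type, unit, instrument, procedure), and the sample values are all hypothetical and would be replaced by whatever the legacy files actually contain.

```python
from collections import defaultdict

# Hypothetical legacy records; field names and values are illustrative only.
records = [
    {"data_type": "oxygen", "unit": "ml/l", "instrument": "titrator", "procedure": "titration"},
    {"data_type": "oxygen", "unit": "mg/l", "instrument": "oxygen sensor", "procedure": "sensor reading"},
    {"data_type": "temperature", "unit": "deg C", "instrument": "thermometer", "procedure": "manual reading"},
]


def compile_inventory(records):
    """Collect every unit, instrument, and procedure seen for each data type."""
    inventory = defaultdict(lambda: {"unit": set(), "instrument": set(), "procedure": set()})
    for rec in records:
        entry = inventory[rec["data_type"]]
        for field in ("unit", "instrument", "procedure"):
            entry[field].add(rec[field])
    return dict(inventory)


inventory = compile_inventory(records)
for data_type, fields in sorted(inventory.items()):
    print(data_type, {name: sorted(values) for name, values in fields.items()})
```

Using sets keeps every distinct unit that appears, so no allowed unit is dropped from the starting list.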

Usage vocabulary

Creating a usage vocabulary will require investigation of the provenance, or history, of the data names and values associated with these names.

Confirmation of data names and values

Careful examination of data names and values will include identification of when data with different names may actually be the same, or when data with the same name may be different.

As terminology is compared, identified, and changed, this process should be documented, since it will be extremely useful in the development of a thesaurus, metadata mappings, and general documentation.
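One possible way to support this comparison, and to document each decision as it is made, is sketched below. It assumes that a short definition for each name has already been recovered from the provenance investigation; the names, definitions, and the change_log structure are hypothetical.

```python
# Hypothetical observations: (name used in the legacy file, unit, recovered definition).
observed = [
    ("O2", "ml/l", "dissolved oxygen"),
    ("OXY", "ml/l", "dissolved oxygen"),
    ("TEMP", "deg C", "water temperature"),
    ("TEMP", "deg C", "air temperature"),
]


def find_name_conflicts(observed):
    """Return (synonyms, homonyms) found among the observed names."""
    names_by_meaning = {}
    meanings_by_name = {}
    for name, unit, definition in observed:
        names_by_meaning.setdefault((definition, unit), set()).add(name)
        meanings_by_name.setdefault(name, set()).add((definition, unit))
    # Different names, same meaning: candidates for merging under one term.
    synonyms = {m: n for m, n in names_by_meaning.items() if len(n) > 1}
    # Same name, different meanings: candidates for splitting into distinct terms.
    homonyms = {n: m for n, m in meanings_by_name.items() if len(m) > 1}
    return synonyms, homonyms


synonyms, homonyms = find_name_conflicts(observed)

# Record each decision so it can later feed a thesaurus or metadata mapping.
change_log = []
for (definition, unit), names in synonyms.items():
    change_log.append({"action": "merge", "names": sorted(names), "meaning": definition, "unit": unit})
for name, meanings in homonyms.items():
    change_log.append({"action": "split", "name": name, "meanings": sorted(d for d, _ in meanings)})

for entry in change_log:
    print(entry)
```

The automated pass only flags candidates; whether two names truly describe the same data remains a judgment for someone who knows the data.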

Clarity of units

A considerable amount of complexity exists in the domain of units. Units that are abbreviated differently must be treated as different units (for example, oxygen content in mg/l is not equivalent to oxygen content in ml/l, even if the values seem similar).
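A small check along these lines can flag data types that appear with more than one unit string. It is a sketch under the assumption that units are stored as plain text next to each data type; the sample pairs are hypothetical.

```python
# Hypothetical (data type, unit string) pairs taken from legacy files.
observed_units = [
    ("oxygen", "mg/l"),
    ("oxygen", "ml/l"),
    ("salinity", "psu"),
]


def units_needing_review(observed_units):
    """Return data types recorded with more than one distinct unit string."""
    units_by_type = {}
    for data_type, unit in observed_units:
        # Compare the strings exactly; do not normalize or guess equivalence.
        units_by_type.setdefault(data_type, set()).add(unit)
    return {t: sorted(u) for t, u in units_by_type.items() if len(u) > 1}


print(units_needing_review(observed_units))
# prints {'oxygen': ['mg/l', 'ml/l']}
```

The function deliberately avoids converting between units. It only reports where a difference exists, so that each case can be resolved by checking the documentation.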

Differentiation of procedures

Research settings (for example, a chemistry or biology lab or an instrumentation development shop) are full of cases where multiple procedures exist to measure the same data type. If two different procedures or instruments measure the same parameter, then the name associated with the data should be the same in the usage vocabulary.

A procedure vocabulary will record the differences in the measurement procedures.

Procedure or Instrumentation Vocabularies

When different procedures quantify the same data type, the information detailing each procedure must be documented and maintained with the data value in a separate instrumentation/procedure vocabulary for each data type.
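The relationship between the two vocabularies might be represented as in the sketch below. The class names, field names, and sample entries are assumptions made for illustration, not a prescribed schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcedureTerm:
    """One entry in the procedure vocabulary for a single data type."""
    term: str
    description: str


@dataclass(frozen=True)
class Measurement:
    """A data value carrying both its usage term and its procedure term."""
    usage_term: str  # shared name from the usage vocabulary
    value: float
    unit: str
    procedure: ProcedureTerm  # how this particular value was obtained


# Hypothetical procedure vocabulary for the "oxygen" data type.
oxygen_procedures = {
    "titration": ProcedureTerm("titration", "chemical titration of a water sample"),
    "sensor": ProcedureTerm("sensor", "electronic sensor reading"),
}

measurements = [
    Measurement("oxygen", 6.1, "ml/l", oxygen_procedures["titration"]),
    Measurement("oxygen", 6.3, "ml/l", oxygen_procedures["sensor"]),
]

# Same usage term for both values; the procedure term records the difference.
for m in measurements:
    print(m.usage_term, m.value, m.unit, m.procedure.term)
```

Keeping the procedure term on each value means the shared usage name never hides how a measurement was made.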

Suggested Citation

Isenor, A. 2011. "Developing Vocabularies for Legacy Data." In The MMI Guides: Navigating the World of Marine Metadata. Accessed December 7, 2019.