Developing Vocabularies for Legacy Data
Creating a controlled vocabulary for data that already exist requires commitment to a long-term plan. The data manager must apply the considerations outlined in previous guides about developing local vocabularies, as well as take into account that some information about data collection may not be available after the project is complete. The data manager creating the vocabulary may not have been part of the original project and needs to consider the following:
Knowledge of the Field
Developing vocabularies requires knowledge of the field from which the data originated. This knowledge gives the data manager the ability and credibility to talk to the people who really know the data: the scientists who collected or produced it.
Existence of Archived Reports or Planning Documentation
For legacy data, valuable sources of information may exist in documents produced during data collection, processing, or the reporting of results. These documents should be searched for metadata relevant to the data set, keeping in mind that the planning documents for data collection may differ from the actual collection.
Division of the Data into Subsets
Dividing the data into smaller subsets makes the task more manageable. For example, data organized by cruise could be further subdivided into scientific topics, such as physics, chemistry, biology, or geology. Choosing the chemistry topic of one cruise would be a suitable starting place in this example, if that were the area best known by the data manager creating the vocabulary.
Examination of the Data
Data can be categorized by instrumentation, procedures, and units. Compiling lists of data types and their corresponding unit names, along with instrumentation, collection procedures, and processing procedures will form the starting point for creating local vocabularies. All allowed units must be included.
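The compilation step described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the record fields (`data_type`, `unit`, `instrument`, `collection_procedure`, `processing_procedure`) and the sample values are assumptions made up for the example.

```python
from collections import defaultdict

# Hypothetical legacy records; field names and values are illustrative only.
records = [
    {"data_type": "dissolved oxygen", "unit": "ml/l",
     "instrument": "oxygen sensor", "collection_procedure": "sensor profile",
     "processing_procedure": "bin averaging"},
    {"data_type": "dissolved oxygen", "unit": "mg/l",
     "instrument": "titration kit", "collection_procedure": "bottle sample",
     "processing_procedure": "none"},
    {"data_type": "salinity", "unit": "PSU",
     "instrument": "CTD", "collection_procedure": "sensor profile",
     "processing_procedure": "bin averaging"},
]

# Record fields that feed each list in the inventory.
FIELDS = {
    "units": "unit",
    "instruments": "instrument",
    "collection_procedures": "collection_procedure",
    "processing_procedures": "processing_procedure",
}

def compile_inventory(records):
    """Collect every unit, instrument, and procedure seen for each data type."""
    inventory = defaultdict(lambda: {key: set() for key in FIELDS})
    for rec in records:
        entry = inventory[rec["data_type"]]
        for list_name, field in FIELDS.items():
            entry[list_name].add(rec[field])
    return dict(inventory)

inventory = compile_inventory(records)
for data_type in sorted(inventory):
    print(data_type, sorted(inventory[data_type]["units"]))
```

Using sets means every unit that actually occurs in the data is kept exactly once, which gives a starting list of allowed units for each data type.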
Usage vocabulary
Creating a usage vocabulary will require investigation of the provenance, or history, of the data names and values associated with these names.
Confirmation of data names and values
Careful examination of data names and values will include identification of when data with different names may actually be the same, or when data with the same name may be different.
As terminology is compared, identified, and changed, this process should be documented, since it will be extremely useful in the development of a thesaurus, metadata mappings, and general documentation.
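One way to keep that documentation is to record each name decision as it is made. The sketch below assumes a hand-built mapping from legacy names to usage-vocabulary terms. The names in `NAME_MAP`, the function `resolve_names`, and the convention of using `None` for an unresolved name are all hypothetical.

```python
# Hypothetical mapping from names found in legacy files to usage-vocabulary
# terms. None marks a name whose meaning has not been confirmed yet.
NAME_MAP = {
    "O2": "dissolved_oxygen",
    "DO": "dissolved_oxygen",   # different name, same data as "O2"
    "temp": None,               # same name may cover different data; needs review
}

def resolve_names(legacy_names, name_map):
    """Map legacy names to vocabulary terms and log every change made."""
    resolved = {}       # legacy name -> confirmed vocabulary term
    change_log = []     # human-readable record of each renaming
    needs_review = []   # names that are unknown or still ambiguous
    for name in legacy_names:
        term = name_map.get(name)
        if term is None:
            needs_review.append(name)
            continue
        resolved[name] = term
        if term != name:
            change_log.append(f"{name} -> {term}")
    return resolved, change_log, needs_review

resolved, change_log, needs_review = resolve_names(
    ["O2", "DO", "temp", "sal"], NAME_MAP
)
print(change_log)
print(needs_review)
```

The change log can later seed thesaurus relationships and metadata mappings, while the review list identifies names that need a conversation with the original scientists.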
Clarity of units
The domain of units is considerably complex. Units that are abbreviated differently must be treated as different units: for example, oxygen content in mg/l is not equivalent to oxygen content in ml/l, even if the values seem similar.
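A simple safeguard is to treat the unit as part of the identity of a measurement, so that values in different units are never pooled by accident. The following is a minimal sketch under that assumption. The tuple layout, the sample values, and both function names are illustrative.

```python
from collections import defaultdict

# Hypothetical measurements as (term, unit, value) tuples.
measurements = [
    ("dissolved_oxygen", "mg/l", 8.1),
    ("dissolved_oxygen", "ml/l", 5.7),
    ("dissolved_oxygen", "mg/l", 7.9),
]

def group_by_term_and_unit(measurements):
    """Group values by (term, unit) so different units never share a list."""
    groups = defaultdict(list)
    for term, unit, value in measurements:
        groups[(term, unit)].append(value)
    return dict(groups)

def terms_with_mixed_units(groups):
    """Return the terms that occur with more than one unit."""
    units_per_term = defaultdict(set)
    for term, unit in groups:
        units_per_term[term].add(unit)
    return {t: sorted(u) for t, u in units_per_term.items() if len(u) > 1}

groups = group_by_term_and_unit(measurements)
print(terms_with_mixed_units(groups))
```

Terms flagged by `terms_with_mixed_units` are candidates for closer inspection before any values are compared or merged.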
Differentiation of procedures
Research settings (for example, a chemistry or biology lab or an instrumentation development shop) are full of cases where multiple procedures exist to measure the same data type. If two different procedures or instruments measure the same parameter, then the name associated with the data should be the same in the usage vocabulary.
A procedure vocabulary will record the differences in the measurement procedures.
Procedure or Instrumentation Vocabularies
When different procedures quantify the same data type, the information detailing the different procedures must be documented and maintained with the data value in a separate instrumentation/procedure vocabulary for each data type.
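The split between a shared data name and a separate procedure vocabulary might look like the sketch below. It is one possible arrangement, not a required structure: the classes `ProcedureTerm` and `Observation`, the codes `WINKLER` and `SENSOR`, and the descriptions are assumptions chosen for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ProcedureTerm:
    code: str          # short identifier stored with each data value
    data_type: str     # usage-vocabulary name this procedure measures
    description: str   # what distinguishes this procedure from the others

# Hypothetical procedure vocabulary, kept separately for each data type.
PROCEDURE_VOCAB = {
    "dissolved_oxygen": {
        "WINKLER": ProcedureTerm("WINKLER", "dissolved_oxygen",
                                 "Winkler titration of a water sample"),
        "SENSOR": ProcedureTerm("SENSOR", "dissolved_oxygen",
                                "In situ electronic oxygen sensor"),
    },
}

@dataclass(frozen=True)
class Observation:
    name: str            # same usage-vocabulary name regardless of procedure
    value: float
    unit: str
    procedure_code: str  # key into the procedure vocabulary for this name

def lookup_procedure(obs, vocab):
    """Return the procedure term recorded for an observation."""
    try:
        return vocab[obs.name][obs.procedure_code]
    except KeyError:
        raise KeyError(
            f"No procedure term {obs.procedure_code!r} for {obs.name!r}"
        ) from None

titrated = Observation("dissolved_oxygen", 5.7, "ml/l", "WINKLER")
sensed = Observation("dissolved_oxygen", 5.6, "ml/l", "SENSOR")
print(lookup_procedure(titrated, PROCEDURE_VOCAB).description)
print(lookup_procedure(sensed, PROCEDURE_VOCAB).description)
```

Both observations carry the same name, so they can be found together through the usage vocabulary, while the procedure code stored with each value preserves how it was measured.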