How to Approach Developing Vocabulary for Legacy Data

Approach this task with a long-term vision and a commitment to that vision. Doing the task properly requires, first and foremost, knowledge of the field from which the data originate. That knowledge gives you both the ability and the credibility to talk to the people who really know the data: the scientists who collected or produced it. It takes time to build.

The first thing you will need is a plan for vocabulary development. This plan may be a subcomponent of a larger plan for metadata capture and management. Here, however, we deal only with the vocabulary subcomponent.

You should start with a small subset of the data. Divide the entire data set that you will be dealing with into logical pieces (logical from your point of view). If your data asset is a collection of many data sets collected from oceanographic cruises, start with a single cruise. Then subdivide further, perhaps by topic (physics, chemistry, biology, or geology). Start with the topic you know the most about.

Now examine the data. Look for the different types of data that were collected, the instrumentation or procedures that were used, and the units in which each data type may be reported. Start by compiling a list of the names that refer to the data types, and make a list of the allowed units for each name. Then begin documenting the procedures followed to collect or process each data type (if you are lucky, documentation on the procedures will already exist). These lists form the basis of your vocabularies. For example, the list of data types is the starting point for your usage vocabulary.
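
The compilation step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the column names, units, and procedure labels are hypothetical, and `compile_lists` is a helper invented here, not part of any MMI tool.

```python
from collections import defaultdict

# Hypothetical column descriptions harvested from the files of one cruise.
# None of these names or labels come from a real data set.
observed_columns = [
    {"name": "TEMP", "unit": "degC", "procedure": "CTD profile"},
    {"name": "Temperature", "unit": "degC", "procedure": "reversing thermometer"},
    {"name": "OXY", "unit": "ml/l", "procedure": "Winkler titration"},
    {"name": "OXY", "unit": "mg/l", "procedure": "Winkler titration"},
]


def compile_lists(columns):
    """Return the name list plus the units and procedures seen for each name."""
    names = sorted({column["name"] for column in columns})
    units_by_name = defaultdict(set)
    procedures_by_name = defaultdict(set)
    for column in columns:
        units_by_name[column["name"]].add(column["unit"])
        procedures_by_name[column["name"]].add(column["procedure"])
    return names, dict(units_by_name), dict(procedures_by_name)


names, units_by_name, procedures_by_name = compile_lists(observed_columns)

# Names that appear with more than one unit deserve a closer look.
names_with_several_units = [n for n in names if len(units_by_name[n]) > 1]
```

In this toy input, `OXY` is reported in two units and `TEMP` and `Temperature` may or may not be the same quantity. These are the kinds of questions that the lists are meant to surface.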

The usage vocabulary will require some extra work on your part. Investigate the provenance, or history, of the data names and of the values associated with them. During this process, examine the various data quantities and the names affixed to them. Ask yourself whether two quantities with different names are actually the same or, alternatively, whether two quantities with the same name are actually different. Document this evolution of terminology, as it will be extremely useful in developing a thesaurus, metadata mappings, and general documentation. Think also about the different procedures used to acquire or process the data. Finally, don’t forget units and don’t underestimate them. Units are a domain of considerable complexity, and units that are abbreviated differently must be treated as different (e.g., don’t think for a second that oxygen content in mg/l is equivalent to ml/l, even if the values are similar).
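
To make the oxygen example concrete, the sketch below converts ml/l to mg/l. It assumes the molar mass of O2 (31.998 g/mol) and a molar volume of 22.391 l/mol for oxygen at standard temperature and pressure. These constants are assumptions commonly used for this conversion, not values taken from this guide, and the function name is hypothetical.

```python
# Assumed constants (not from this guide): molar mass of O2 and the molar
# volume of O2 gas at standard temperature and pressure.
O2_MOLAR_MASS_G_PER_MOL = 31.998
O2_MOLAR_VOLUME_L_PER_MOL = 22.391


def oxygen_ml_per_l_to_mg_per_l(value_ml_per_l):
    """Convert dissolved oxygen from ml/l to mg/l under the assumptions above."""
    # ml of gas -> mmol (divide by molar volume) -> mg (multiply by molar mass)
    return value_ml_per_l * O2_MOLAR_MASS_G_PER_MOL / O2_MOLAR_VOLUME_L_PER_MOL


# A hypothetical reading of 6.0 ml/l is about 8.57 mg/l, not 6.0 mg/l.
reading_ml_per_l = 6.0
reading_mg_per_l = oxygen_ml_per_l_to_mg_per_l(reading_ml_per_l)
```

Under these assumptions the two units differ by a factor of roughly 1.43, which is small enough that a mislabeled column can look plausible and large enough to corrupt any analysis that mixes the two.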

In this process, no detail is too small. The research environment is full of cases where multiple procedures exist to measure the same data type. For example, two different biological incubation setups may produce measurements of the same data quantity. These different procedures represent important metadata that needs to be associated with the data quantity. The usage vocabulary, however, needs to indicate that the same data term is being measured, while a separate vocabulary records the differences in the measurement procedures.
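
One way to picture this separation is to key each measurement to a single usage term and, independently, to a procedure term. The sketch below is a minimal illustration: every term, definition, and value in it is hypothetical, and `procedures_for_term` is a helper invented for this example.

```python
# Hypothetical usage vocabulary: one term per data quantity.
usage_vocabulary = {
    "primary_production": "Rate of carbon fixation by phytoplankton",
}

# Hypothetical procedure vocabulary: one term per measurement procedure.
procedure_vocabulary = {
    "in_situ_incubation": "Bottles incubated at the sampling depth",
    "deck_incubation": "Bottles incubated on deck under simulated light",
}

# Hypothetical measurements: the same usage term, different procedures.
measurements = [
    {"term": "primary_production", "procedure": "in_situ_incubation", "value": 12.4},
    {"term": "primary_production", "procedure": "deck_incubation", "value": 11.9},
]


def procedures_for_term(records, term):
    """List the distinct procedures used to measure one usage term."""
    return sorted({r["procedure"] for r in records if r["term"] == term})


# Every record must use terms that exist in the two vocabularies.
all_terms_known = all(
    r["term"] in usage_vocabulary and r["procedure"] in procedure_vocabulary
    for r in measurements
)
```

Keeping the two vocabularies separate means a search on the usage term finds both measurements, while the procedure term preserves the detail needed to judge whether they are comparable.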

At this point you should start to realize that your job as data custodian has morphed into a combination of data system designer, scientist, investigative police officer, and investigative news reporter.

Suggested Citation

Isenor, A. 2009. "How to Approach Developing Vocabulary for Legacy Data." In The MMI Guides: Navigating the World of Marine Metadata. http://marinemetadata.org/guides/vocabs/cvdev/cvdevlegacy/cvdevlegacyapproach. Accessed December 6, 2019.