Example Case

Sample description of the development of a vocabulary for legacy data

Now we consider the development of a vocabulary. For this particular case, we consider a legacy data rescue project.

A data rescue project is an effort directed towards recovering data that are presently inaccessible. Such an effort could result from a situation where data have been collected over many years by individual scientists. Perhaps these scientists are nearing the end of their careers, and some may have already retired. The data they have collected during their careers are now in jeopardy of being lost.

Suppose these scientists have been collecting data during their 30+ year careers. Since these scientists started in an era before common computer use, many of the data sets exist in paper form. In all likelihood, the data sets are in office filing cabinets or in the basement of the building where the scientists work. These data may not exist in electronic form. Nevertheless, the data represent a wealth of historic information and are, in fact, irreplaceable. The data may also contribute to long-term data sets, which are particularly important for understanding long-term, global trends. These data need to be rescued and placed in managed databases at the organizational and national levels.

The first step in this process is to begin collecting information from the individual scientists. The scientists, or possibly others involved in the field programs, may have documentation on data collection plans that pertain to individual data sets. There may also be log books or field journals that were used for notes during the field activity. Reports may exist that describe the actual field program that resulted in the data set. Collect examples of the data, such as paper listings or plots. Inquire as to the availability of digital data. Start to make notes on the types of data collected. These notes should include the sampling procedures or instruments used for the particular data. Keep notes on when the data were collected and a general idea of where they were collected. For oceanographic data, you might want to approximately locate the data sampling using Marsden Squares [1]. The spatial information may be useful for prioritizing the rescue effort. Determine whether the records are hardcopy or softcopy. A particularly difficult problem arises when the “data” exist as a physical sample (we won’t deal with that here). How are the data stored, and are there backup copies? Also, is there any activity currently underway to rescue these data? You will need all of this information to help you understand the data you are rescuing.
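The notes described above can be kept in a simple structured form. The following is a minimal sketch in Python, not a prescribed format: the `InventoryNote` fields and the function names are hypothetical, chosen only to mirror the items listed in the paragraph. The `ten_degree_cell` function bins a position into a 10 x 10 degree cell in the spirit of Marsden Squares, but the label it returns is an illustrative one based on the cell's south-west corner, not an official Marsden Square number.

```python
from collections import Counter
from dataclasses import dataclass
import math


@dataclass
class InventoryNote:
    """One note taken while surveying a scientist's holdings (hypothetical fields)."""
    scientist: str
    data_type: str     # the name the scientist uses, recorded as-is
    instrument: str    # sampling procedure or instrument
    first_year: int    # approximate start of collection
    last_year: int     # approximate end of collection
    lat: float         # approximate latitude, degrees north
    lon: float         # approximate longitude, degrees east
    medium: str        # e.g. "paper", "digital", "physical sample"
    has_backup: bool


def ten_degree_cell(lat: float, lon: float) -> str:
    """Label the 10 x 10 degree cell that contains a position.

    The label names the cell's south-west corner, e.g. "N40W070".
    It is an illustrative label, NOT an official Marsden Square number.
    """
    lat_floor = min(math.floor(lat / 10.0) * 10, 80)   # keep 90N in the top cell
    lon_wrapped = ((lon + 180.0) % 360.0) - 180.0      # normalise to [-180, 180)
    lon_floor = math.floor(lon_wrapped / 10.0) * 10
    ns = "N" if lat_floor >= 0 else "S"
    ew = "E" if lon_floor >= 0 else "W"
    return f"{ns}{abs(lat_floor):02d}{ew}{abs(lon_floor):03d}"


def cells_by_count(notes):
    """Return (cell, number_of_notes) pairs, busiest cell first."""
    return Counter(ten_degree_cell(n.lat, n.lon) for n in notes).most_common()
```

Counting notes per cell is only one possible way to use the spatial information when prioritizing; a real project would likely weigh other factors as well, such as the condition of the paper records or whether backup copies exist.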

The terminology you use for the collected data will be the starting point for your usage vocabulary. Initially, when you are scanning the documentation and legacy data, don’t be concerned about building a vocabulary. Rather, you should be concerned about building your own knowledge of the collected data.

After you have reviewed numerous data sets from various scientists, review your notes on the data that were collected. You will find different, but similar, terminology used for the collected data. Also review the procedures or instruments, looking for similar instruments collecting data that have been named differently. Using this type of information, revisit the scientists and attempt to clarify whether the data names you have noted as different are in fact the same (or, alternatively, whether the same names really refer to different elements).
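One way to prepare for these follow-up conversations is to group your notes mechanically before revisiting the scientists. The sketch below is an assumption about how such a review might be done, not a required procedure: it flags data names that share an instrument (possible synonyms) and names that appear with more than one instrument (possibly different elements under one name). The function names and the sample values in the usage example are hypothetical.

```python
from collections import defaultdict


def normalise(name: str) -> str:
    """Lower-case a name and collapse underscores, hyphens and extra spaces."""
    return " ".join(name.lower().replace("_", " ").replace("-", " ").split())


def candidate_synonyms(notes):
    """Group data names by instrument.

    notes: iterable of (scientist, data_name, instrument) tuples.
    Returns instruments that were recorded with more than one data name.
    """
    by_instrument = defaultdict(set)
    for _scientist, data_name, instrument in notes:
        by_instrument[normalise(instrument)].add(normalise(data_name))
    return {inst: sorted(names)
            for inst, names in by_instrument.items() if len(names) > 1}


def candidate_homonyms(notes):
    """Return data names that were recorded with more than one instrument."""
    by_name = defaultdict(set)
    for _scientist, data_name, instrument in notes:
        by_name[normalise(data_name)].add(normalise(instrument))
    return {name: sorted(insts)
            for name, insts in by_name.items() if len(insts) > 1}


# Made-up notes, for illustration only.
notes = [
    ("Scientist A", "Water Temp", "reversing thermometer"),
    ("Scientist B", "sea_temperature", "Reversing Thermometer"),
    ("Scientist C", "salinity", "titration"),
    ("Scientist D", "Salinity", "CTD"),
]
print(candidate_synonyms(notes))
print(candidate_homonyms(notes))
```

The output is only a list of questions to ask. Whether two names really mean the same thing is a judgment the scientists must confirm.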

This process will likely reduce the number of terms in your list of element types, as you will find different terms that in fact refer to the same element. With this reduced list, define the other important attributes for the terms. For example, record the date on which you formally create the term and the limits on the values associated with the term. If you are storing this information in a database, make sure you assign a unique identifier to each term. You should also note a short description of the term and a longer, more detailed description of what the term means.
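If you do store the terms in a database, the attributes listed above map naturally onto a single table. The following is a minimal sketch using Python's built-in `sqlite3` module; the table layout, column names, and helper functions are assumptions made for illustration, and the sample term and its limits are invented.

```python
import sqlite3
from datetime import date

# Hypothetical layout: one row per term, with the attributes named in the text.
SCHEMA = """
CREATE TABLE term (
    term_id    INTEGER PRIMARY KEY,   -- unique identifier for the term
    name       TEXT NOT NULL UNIQUE,  -- the agreed term
    created_on TEXT NOT NULL,         -- date the term was formally created (ISO 8601)
    min_value  REAL,                  -- lower limit on values, if any
    max_value  REAL,                  -- upper limit on values, if any
    short_desc TEXT NOT NULL,         -- short description
    long_desc  TEXT NOT NULL,         -- longer, more detailed description
    CHECK (min_value IS NULL OR max_value IS NULL OR min_value <= max_value)
)
"""


def open_vocabulary(path=":memory:"):
    """Open (or create) a vocabulary database and create the term table."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def add_term(conn, name, short_desc, long_desc,
             min_value=None, max_value=None, created_on=None):
    """Insert a term and return the identifier the database assigned to it."""
    created = (created_on or date.today()).isoformat()
    cur = conn.execute(
        "INSERT INTO term (name, created_on, min_value, max_value,"
        " short_desc, long_desc) VALUES (?, ?, ?, ?, ?, ?)",
        (name, created, min_value, max_value, short_desc, long_desc),
    )
    conn.commit()
    return cur.lastrowid


# Invented example term; the limits are placeholders, not recommended values.
conn = open_vocabulary()
term_id = add_term(
    conn, "sea temperature",
    short_desc="Temperature of sea water",
    long_desc="Temperature of sea water as recorded in the rescued field logs.",
    min_value=-2.0, max_value=40.0,
)
```

Letting the database assign the identifier and enforce unique names keeps two people from accidentally creating the same term twice.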

This list of terms forms your usage vocabulary. Now you need to consider how your terminology matches the organizational and national terminology. You also need to decide whether the organizational and national terminology meets your needs. If it does, then use the organizational or national usage vocabularies. Also, identify whether you need to suggest updates to these vocabularies.
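The comparison with an organizational or national vocabulary can start as a simple lookup. The sketch below assumes you have both vocabularies as plain lists of names and, optionally, a hand-built table of known equivalents; `match_to_reference` and all of the sample terms are hypothetical. Terms left unmatched are candidates either for keeping in your own vocabulary or for suggesting as updates to the larger one.

```python
def match_to_reference(local_terms, reference_terms, aliases=None):
    """Split local terms into those found in a reference vocabulary and the rest.

    aliases maps a local term to the reference term it is known to equal.
    Matching ignores case. Returns (matched, unmatched), where matched maps
    each local term to the reference term it corresponds to.
    """
    alias_map = {k.lower(): v.lower() for k, v in (aliases or {}).items()}
    reference = {t.lower() for t in reference_terms}
    matched, unmatched = {}, []
    for term in local_terms:
        key = term.lower()
        target = key if key in reference else alias_map.get(key)
        if target in reference:
            matched[term] = target
        else:
            unmatched.append(term)
    return matched, unmatched


# Invented terms, for illustration only.
local = ["Sea Temperature", "salinity", "secchi depth"]
reference = ["sea_water_temperature", "salinity"]
known_equivalents = {"sea temperature": "sea_water_temperature"}

matched, unmatched = match_to_reference(local, reference, known_equivalents)
print(matched)
print(unmatched)
```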

These steps should help you form the initial parts of a vocabulary. Since you are likely the person with the most knowledge of this vocabulary, you should be the person responsible for its management.

[1] Learn more about Marsden Squares.
