Vocabulary for Legacy Data: Data Rescue

Legacy data can be found in many formats and in varying condition. A data rescue project is an effort toward recovering data that are presently inaccessible.

In the hypothetical case below, we are rescuing data that were collected by multiple scientists over careers spanning more than thirty years, many of which were collected before computers were in common use. The data being rescued may be on paper, in filing cabinets stored in the basements of office buildings or similar locations.

The data represent a wealth of historical information that is irreplaceable and is now in jeopardy of being lost. The data may contribute to long-term data sets, which are particularly important for understanding global trends. These data need to be rescued and placed in managed databases at the organizational and national levels.

Initial Investigation and Data Evaluation

Applying the recommendations from previous guides, the first step is to begin collecting information from the scientists. They, or possibly those involved in the field programs, may have documentation on data collection plans that pertain to individual data sets. There may be logbooks or field journals that were used for notes during the field activity. Reports may exist that describe the actual field programs that resulted in the data sets. The data rescue project will need to undertake each of the steps below for a full understanding of the data being rescued:

  • Collect examples of the data, paper listings, or plots.
  • Identify the types of collected data; notes taken during this step should include the sampling procedures or instruments used for the particular data.
  • Keep notes on when the data were collected and a general idea as to where they were collected; spatial information may be useful for prioritizing the rescue effort (for oceanographic data, Marsden Squares may be useful to approximately locate data samplings).
  • Determine if there are hardcopy or electronic records (a particularly difficult case arises when the data exist as a physical sample).
  • Discover any backup copies.
  • Determine if there is any other activity currently underway to rescue these data.
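
The checklist above can be kept as one structured note per data set, so that unanswered items are easy to spot before returning to the scientists. The following is a minimal sketch in Python; the class name, field names, and example values are all hypothetical and are not part of the original guide.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class InventoryNote:
    """Notes for one legacy data set; None means 'not yet determined'."""
    scientist: str
    data_name: str                                 # term exactly as the scientist used it
    procedure_or_instrument: Optional[str] = None  # sampling procedure or instrument
    collection_period: Optional[str] = None        # free text, e.g. "1978 to 1983"
    approximate_location: Optional[str] = None     # free text or a grid reference
    record_form: Optional[str] = None              # "hardcopy", "electronic", "physical sample"
    backup_copies: Optional[str] = None
    other_rescue_activity: Optional[str] = None


def open_questions(note: InventoryNote) -> List[str]:
    """Return the names of checklist fields that are still unanswered."""
    checklist = [
        "procedure_or_instrument",
        "collection_period",
        "approximate_location",
        "record_form",
        "backup_copies",
        "other_rescue_activity",
    ]
    return [name for name in checklist if getattr(note, name) is None]
```

Calling `open_questions` on a partially filled note lists what still has to be asked; the data name is stored exactly as the scientist wrote it, because reconciling terminology comes later.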

The terminology used for the collected data will be the starting point for the usage vocabulary. However, when initially scanning the documentation and legacy data, the emphasis should be on building knowledge about the collected data, rather than on building a vocabulary.

Review, Clarification, and Reduction of Terms

After reviewing numerous data sets from various scientists, the next step is to review the notes taken during the first step about the data that were collected. When the input from the scientists is compared, the terminology used for the collected data is likely to be similar but not identical. The scientists may have used similar procedures or instruments to collect data but named the data differently. Based on the notes and comparisons of findings, revisiting the scientists will help to clarify whether data names that were noted as different are in fact the same (or whether names that are the same really refer to different elements).

This process will likely reduce the number of terms in the list of element types, since several different terms will often turn out to refer to the same element.
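
As an illustration of this reduction, the sketch below groups the raw terms from the notes under a single name once the scientists have confirmed that they mean the same thing. The function name, the raw terms, and the synonym mapping are hypothetical examples; in practice the mapping comes only from the follow-up conversations with the scientists, not from string similarity.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def reduce_terms(raw_terms: Iterable[str],
                 confirmed_synonyms: Dict[str, str]) -> Dict[str, List[str]]:
    """Group raw terms under the name they were confirmed to share.

    Terms without a confirmed synonym are kept under their own name.
    """
    grouped = defaultdict(set)
    for term in raw_terms:
        canonical = confirmed_synonyms.get(term, term)
        grouped[canonical].add(term)
    return {name: sorted(members) for name, members in grouped.items()}


# Hypothetical terms noted during the initial investigation.
raw = ["temp", "water temp", "T", "sal", "salinity", "depth"]

# Hypothetical equivalences confirmed by revisiting the scientists.
confirmed = {
    "temp": "water temperature",
    "water temp": "water temperature",
    "T": "water temperature",
    "sal": "salinity",
}

reduced = reduce_terms(raw, confirmed)
```

Here six raw terms reduce to three element types. The opposite case, where one name hides two different elements, cannot be handled by a mapping like this; such a term has to be split by hand into separately named entries.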

Creation of Terms and Value Lists

With this refined list, the next step is defining the other important attributes for the terms, such as the date of formal creation of the term and the limits on the values associated with the term. If the information will be stored in a database, each term will need a unique identifier, as well as a short description of the term and a longer, more detailed description of what the term means. This list of terms forms the usage vocabulary for the project.
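
A term entry with the attributes described above might be sketched as follows. This is an assumption-laden illustration, not a prescribed schema: the class, the field names, the identifier format, and the value limits used in the usage note are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class Term:
    identifier: str             # unique within the vocabulary
    name: str
    short_description: str
    long_description: str
    created: date               # date of formal creation of the term
    min_value: Optional[float] = None   # None means no lower limit
    max_value: Optional[float] = None   # None means no upper limit

    def accepts(self, value: float) -> bool:
        """Return True if the value lies within the term's limits."""
        if self.min_value is not None and value < self.min_value:
            return False
        if self.max_value is not None and value > self.max_value:
            return False
        return True


def build_vocabulary(terms: Iterable[Term]) -> Dict[str, Term]:
    """Index terms by identifier, rejecting duplicate identifiers."""
    vocabulary: Dict[str, Term] = {}
    for term in terms:
        if term.identifier in vocabulary:
            raise ValueError(f"duplicate identifier: {term.identifier}")
        vocabulary[term.identifier] = term
    return vocabulary
```

For example, a term created with `min_value=-2.5` and `max_value=40.0` (illustrative limits only) would accept a value of 12.0 and reject 99.0. Rejecting duplicate identifiers at load time mirrors the uniqueness that a database key would enforce.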

Comparison with Organizational and National Terminology

If the organizational and national terminology meets the needs of the legacy data rescue project, then the project should use the organizational and national usage vocabularies. In that case, this stage is the proper time to identify and suggest updates to the established vocabularies. If the existing vocabularies are not adequate for the rescue project, then the project must create its own local vocabulary based on the list of terms identified above.
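
A first pass at this comparison can be sketched as a simple coverage check between the project's term names and an established vocabulary. The function and the example terms are hypothetical, and matching on normalized names is only a screening step: a shared name does not guarantee a shared definition, so the descriptions still need to be compared by a person.

```python
from typing import Dict, Iterable, List


def compare_with_established(local_terms: Iterable[str],
                             established_terms: Iterable[str]) -> Dict[str, List[str]]:
    """Report which local term names appear in an established vocabulary."""
    local = {term.strip().lower() for term in local_terms}
    established = {term.strip().lower() for term in established_terms}
    return {
        "covered": sorted(local & established),   # candidates for reuse
        "missing": sorted(local - established),   # candidates for suggested updates
    }


# Hypothetical project terms and established vocabulary entries.
report = compare_with_established(
    ["Water Temperature", "salinity", "Secchi depth"],
    ["water temperature", "salinity", "dissolved oxygen"],
)
```

The "missing" list gives the project a concrete starting point, either for suggesting updates to the established vocabulary or for deciding that a local vocabulary is needed.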

Long-Term Management of the Vocabulary

If the data rescue project results in a new, local controlled vocabulary, it will need to be managed. The data manager who established the vocabulary is likely the person with the most knowledge about this vocabulary and project and therefore the best person to be responsible for its management.

Suggested Citation

Isenor, A. 2011. "Vocabulary for Legacy Data: Data Rescue." In The MMI Guides: Navigating the World of Marine Metadata. http://marinemetadata.org/guides/vocabs/cvdev/cvdevlegacy/cvdevlegacyexample. Accessed July 9, 2020.