Usage vs Discovery Vocabularies
Previously, we noted the term ‘altitude’ as describing part of the spatial position of something. We may complete the spatial description by including the terms ‘latitude’ and ‘longitude’. The term ‘latitude’ typically refers to a value that describes the north-south placement (or y-coordinate) of something on the earth (or, more generally, on a rotational ellipsoid). With the term ‘longitude’ describing the east-west placement (or x-coordinate), we can fully specify the ‘position’ of something on the earth.
Now consider a data asset that contains altitude, latitude and longitude values. We can think of the asset as being a database table, spreadsheet or text file. The asset will likely have names for the columns of numbers. The names could be ‘altitude’, ‘latitude’ and ‘longitude’. Alternatively, the names could be cryptic codes such as ALT, LAT and LONG. The terms (or names) used within the asset represent what we refer to as a usage vocabulary*.
A usage vocabulary is important when clients want to utilize the data within the asset. Software applications, or people, have to understand these terms in order to effectively access and use the data within the asset.
However, discovering the content of the asset is different from utilizing the content. In the discovery process, the usage vocabulary may or may not be useful. Where cryptic codes are used to identify the data columns, the usage vocabulary is not useful, because the search software (or the people searching) are unlikely to think of the exact codes that are used within the asset.
In this case, we introduce another vocabulary: the discovery vocabulary*. The discovery vocabulary identifies the data using terminology that is common to the subject community. Terms in the discovery vocabulary are very diverse, which makes the vocabulary itself difficult to define.
Terms in the discovery vocabulary often represent an aspect of the data asset that has a common description in the subject community. These terms can take a variety of forms.
- Terms in the discovery vocabulary may be identical to terms in the usage vocabulary. This is the situation when the data asset uses common-language terminology to identify the data. An example would be a data asset containing data values identified as ‘temperature’ or ‘salinity’. Both of these terms are part of the usage vocabulary, and since they are natural search terms, they would also be terms in the discovery vocabulary.
- Terms in the discovery vocabulary may represent groups of terms in the usage vocabulary. This is a common situation for legacy assets, where cryptic codes have been used to identify similar data from multiple sources. As an example, consider a legacy data asset that contains temperature values from sensors A, B, and C. Suppose these data are identified within the asset as ATEMP, BTEMP, and CTEMP (i.e., terms in the usage vocabulary). The discovery vocabulary term that encapsulates all three usage terms would be ‘temperature’. In this case, the ‘temperature’ term in the discovery vocabulary represents a group of terms from the usage vocabulary.
- Terms in the discovery vocabulary may represent groups of data values. In this case, the discovery vocabulary terms identify particular subgroups of the data, rather than all of the data. As an example, if the data asset contains geology data, then certain geological time periods (e.g., Mesozoic Era) may be identified in the discovery vocabulary. In physical oceanography, a discovery term may identify a particular water mass (e.g., Labrador Sea Water) that has particular physical or chemical characteristics.
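The three forms above can be sketched as simple lookups. This is only an illustrative sketch, not a prescribed implementation: the dictionary layout, the function names, and the AGE_MA column (age in millions of years) are assumptions made for the example, and the Mesozoic bounds are approximate.

```python
# Hypothetical discovery vocabulary for a legacy asset (all names illustrative).
# Each discovery term maps to one or more usage-vocabulary terms (column names).
DISCOVERY_TO_USAGE = {
    "salinity": ["salinity"],                    # identical to the usage term
    "temperature": ["ATEMP", "BTEMP", "CTEMP"],  # a group of usage terms
}

# Discovery terms that identify a subgroup of data values:
# (assumed column name, lower bound, upper bound).
# The Mesozoic Era spans roughly 252 to 66 million years ago.
VALUE_GROUPS = {
    "mesozoic era": ("AGE_MA", 66.0, 252.0),
}


def columns_for(term):
    """Return the usage-vocabulary columns behind a discovery term."""
    return DISCOVERY_TO_USAGE.get(term.lower(), [])


def rows_for(term, rows):
    """Return the rows whose values fall in the subgroup named by the term."""
    column, low, high = VALUE_GROUPS[term.lower()]
    return [row for row in rows if low <= row[column] <= high]


samples = [{"AGE_MA": 150.0}, {"AGE_MA": 30.0}, {"AGE_MA": 300.0}]
temperature_columns = columns_for("Temperature")
mesozoic_samples = rows_for("Mesozoic Era", samples)
```

In this sketch a search for ‘temperature’ resolves to the three cryptic column codes, whereas a search for one of the codes themselves finds nothing in the discovery vocabulary, which reflects the separation between discovering an asset and using it.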
Discovery vocabularies aid the client in finding the data asset, while the usage vocabulary aids in utilization of the asset. Both vocabularies can pertain to data-related topics such as parameters, platforms, sensors, geographic areas, etc.
* Both usage and discovery vocabularies are specialized forms of a controlled vocabulary.