Metadata Classifications
In developing data systems, a number of categories have been used for metadata. Unfortunately, many of them are not particularly well defined, and most cannot be used as the a priori basis for developing a data system.
The following brief write-up explains a few of these concepts, and in some cases their strengths and weaknesses. You may find that one of these concepts makes an excellent basis for handling metadata in your data system, or you may find that treating all data, including metadata, in exactly the same way works better for your system. You may also want to look at the guide on "Vocabularies: Dictionaries, Ontologies, etc." which provides definitions for classifications of vocabularies (lists of values that can be used for a metadata element).
Metadata Types
Some ways of dividing metadata are listed below. Each is described in a little more detail in the following section.
- Syntactic vs Semantic: Appearance and organization (implicitly computable), as opposed to meaning (implicitly human).
- Use vs Search: Used in presenting the data (implicitly computable), as opposed to for finding the data (implicitly human).
- Static vs Dynamic: Whether the metadata change as fast as the data change.
- State vs Persistent: Meant to describe varying conditions (the "state") of the instrument or system, as opposed to an unchanging data context.
- By Functional Category: Includes 6 different functions performed by metadata.
Comments on Classification Techniques
Syntactic vs Semantic
Syntactic metadata describe what the data "look" like and how they are organized. Semantic metadata describe what the data really "mean." Some people assert that semantic metadata are human-oriented and not machine-usable, but that is an assumption, not something the term itself requires.
Syntactic fields often include the unique variable name, data type (integer, float, etc., including sizes), and units of measurement. Note the first and last of these certainly have semantic meaning, even if their primary use is for labeling or identification.
Semantic fields are often more descriptive, such as long name, definition, comments, and copyright. Yet, most of these would be more widely useful if they were computable, with agreed-upon conventions and terminology. The increasing use of ontologies will likely push the semantic content much more into a computational realm.
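As a minimal sketch of how semantic content might become computable, the record below keeps syntactic and semantic fields side by side and adds a link to a controlled-vocabulary term. The class name, field names, and the "example:" vocabulary term are assumptions made for illustration, not part of any standard.

```python
from dataclasses import dataclass


@dataclass
class VariableMetadata:
    # Syntactic fields: how the variable looks and is organized.
    name: str
    data_type: str
    units: str
    # Semantic fields: what the variable means (free text, human-oriented).
    long_name: str = ""
    definition: str = ""
    # Hypothetical controlled-vocabulary term that makes the meaning computable.
    vocabulary_term: str = ""


def same_quantity(a: VariableMetadata, b: VariableMetadata) -> bool:
    """True if both variables point at the same non-empty vocabulary term."""
    return bool(a.vocabulary_term) and a.vocabulary_term == b.vocabulary_term


# Two variables with different names, types, and units but the same meaning.
temp_a = VariableMetadata("TEMP", "float32", "degC",
                          long_name="Water temperature",
                          vocabulary_term="example:sea_water_temperature")
temp_b = VariableMetadata("t_sea", "float64", "K",
                          long_name="Sea temp",
                          vocabulary_term="example:sea_water_temperature")
unlabeled = VariableMetadata("x", "int16", "m")
```

Here software can tell that temp_a and temp_b describe the same quantity even though every syntactic field differs, which free-text long names alone would not allow.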
Use vs Search
"Search" metadata, also known as "discovery" metadata, are similar to what a person would be interested in seeing, in order to decide if there are things of interest in a data set. An observation type such as "multibeam bathymetry" is an example of helpful search metadata, especially when managed by a system of controlled vocabularies. Search metadata might also be latitude and longitude bounds, so that a computer or a person could know if the data fall within an area of interest.
Once the data are discovered, "Use" metadata may help a computer or a person to understand or process the data. Typical use metadata would be calibration parameters.
Use metadata are sometimes thought of as synonymous with Syntactic metadata, although they are not. For example, labels provided as syntactic metadata may not be unique, which makes them useless for processing the actual data (except for labeling the data, of course).
Under typical definitions of use and search metadata, the boundary between the two is likely to be fuzzy. Some or all search metadata may be automatable, i.e., represented in ways that are "meaningful" to application software and used by that software. Indeed, this will be necessary to facilitate widespread data mining. Conversely, some use metadata will interest people searching for data, even though such metadata are oriented more toward computer applications.
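The distinction can be sketched in code under some simplifying assumptions: search metadata are an observation type plus latitude and longitude bounds, and use metadata are a hypothetical linear calibration (gain and offset). All names here are illustrative, and the overlap test ignores bounding boxes that cross the antimeridian.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    # Search (discovery) metadata.
    observation_type: str
    lat_min: float
    lat_max: float
    lon_min: float
    lon_max: float
    # Use metadata: hypothetical linear calibration parameters.
    calibration: dict = field(default_factory=dict)


def overlaps(r, lat_min, lat_max, lon_min, lon_max):
    """True if the record's bounding box intersects the area of interest."""
    return (r.lat_min <= lat_max and r.lat_max >= lat_min
            and r.lon_min <= lon_max and r.lon_max >= lon_min)


def search(records, observation_type, lat_min, lat_max, lon_min, lon_max):
    """Discovery step: filter records using search metadata only."""
    return [r for r in records
            if r.observation_type == observation_type
            and overlaps(r, lat_min, lat_max, lon_min, lon_max)]


def calibrate(record, raw_values):
    """Processing step: apply use metadata to raw values."""
    gain = record.calibration.get("gain", 1.0)
    offset = record.calibration.get("offset", 0.0)
    return [gain * v + offset for v in raw_values]


records = [
    DatasetRecord("multibeam bathymetry", 36.0, 37.0, -123.0, -122.0,
                  calibration={"gain": 2.0, "offset": 1.0}),
    DatasetRecord("multibeam bathymetry", 10.0, 11.0, 40.0, 41.0),
    DatasetRecord("ctd profile", 36.0, 37.0, -123.0, -122.0),
]
hits = search(records, "multibeam bathymetry", 36.5, 36.8, -122.5, -122.1)
calibrated = calibrate(hits[0], [1.0, 2.0])
```

Note that the bounding box is consumed by software here, which illustrates why the "implicitly human" label on search metadata is a fuzzy one.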
Static vs Dynamic
Static metadata do not change much over the life of the data they describe, even as the data evolve. Dynamic metadata are a function of the contents of the data, so as a data set evolves, the dynamic metadata change. Unfortunately, "static" metadata may be captured incorrectly or not at all, or may change in some unforeseen way, so they may in fact change, or need to be changed, after data collection.
An interesting special case involves seeding metadata before the data themselves arrive. Metadata captured "before data arrival" are implicitly static and can be associated with those data permanently, possibly as part of an automated process embedded in the data stream. Metadata captured "after data arrival" must be associated with the data afterward, which implies some other process for entering that information.
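One way to picture the difference, as a rough sketch only: static metadata are seeded once when the data set is created, while dynamic metadata are recomputed from the contents whenever they are requested. The class and field names below are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DataSet:
    # Static metadata: seeded before any data arrive.
    static: dict
    values: list = field(default_factory=list)

    def append(self, value: float) -> None:
        """Add one observation to the data set."""
        self.values.append(value)

    @property
    def dynamic(self) -> dict:
        """Dynamic metadata: derived from the current contents on each access."""
        if not self.values:
            return {"count": 0, "min": None, "max": None}
        return {"count": len(self.values),
                "min": min(self.values),
                "max": max(self.values)}


ds = DataSet(static={"instrument": "CTD-01"})  # hypothetical instrument label
before = ds.dynamic
ds.append(3.5)
ds.append(1.0)
after = ds.dynamic
```

Deriving the dynamic values on demand keeps them from drifting out of step with the data, at the cost of recomputation; a real system might cache them instead.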
State Metadata vs Persistent Metadata
State metadata capture the state of something (a system, a component, a data set) at a given time or over a time range. Persistent metadata, by contrast, describe a system, component, or data set that is unchanging. Typically there is no precise dividing line between the two, since all things may change eventually. In some systems the state changes in every data record, and for that reason is likely captured within the data stream itself.
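A small sketch of this split, assuming state is recorded as a time-ordered series of settings and that each state remains in effect until the next one is recorded. The class, method, and field names are hypothetical.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class Instrument:
    # Persistent metadata: treated as unchanging for the life of the instrument.
    persistent: dict
    # State metadata: parallel lists of start times and the state from then on.
    _times: list = field(default_factory=list)
    _states: list = field(default_factory=list)

    def record_state(self, start_time: float, state: dict) -> None:
        """Record a state taking effect at start_time (strictly increasing)."""
        if self._times and start_time <= self._times[-1]:
            raise ValueError("states must be recorded in increasing time order")
        self._times.append(start_time)
        self._states.append(state)

    def state_at(self, t: float):
        """Return the state in effect at time t, or None if none was recorded yet."""
        i = bisect_right(self._times, t) - 1
        return self._states[i] if i >= 0 else None


inst = Instrument(persistent={"serial": "SN-001"})  # hypothetical serial number
inst.record_state(0, {"gain": 1})
inst.record_state(100, {"gain": 2})
```

With this layout, inst.state_at(50) yields the first state and inst.state_at(100) the second, while inst.persistent is the same at every time.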
By Functional Category
In their 2006 paper (see References below), Ganesan Shankaranarayanan and Adir Even propose 6 types of metadata:
- Infrastructure metadata: abstracts the components of the computer systems, e.g., for system maintenance;
- Model metadata (the data dictionary): abstracts the modeling of data into entities and their relationships: conceptual, logical, and physical; it includes semantic and translation elements;
- Process metadata: information on how data is generated and the transformations it undergoes from source to target;
- Quality metadata: captures the assessment of the actual data stored in the system, including quality measurements (e.g., accuracy) and summaries of the data (e.g., total records or bytes);
- Interface (delivery and reporting) metadata: captures how the data is used;
- Administration metadata: includes information on users, security, and access privileges to data and applications.
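If a system adopts these six functions, one possible sketch is to tag each metadata field with its functional category. The enumeration mirrors the list above; the field names and their assignments are hypothetical examples, not taken from the paper.

```python
from collections import defaultdict
from enum import Enum


class MetadataFunction(Enum):
    INFRASTRUCTURE = "infrastructure"
    MODEL = "model"
    PROCESS = "process"
    QUALITY = "quality"
    INTERFACE = "interface"
    ADMINISTRATION = "administration"


# Hypothetical field names, assigned to categories for illustration only.
FIELD_CATEGORIES = {
    "server_hostname": MetadataFunction.INFRASTRUCTURE,
    "table_schema": MetadataFunction.MODEL,
    "transformation_steps": MetadataFunction.PROCESS,
    "record_count": MetadataFunction.QUALITY,
    "accuracy": MetadataFunction.QUALITY,
    "report_template": MetadataFunction.INTERFACE,
    "access_privileges": MetadataFunction.ADMINISTRATION,
}


def group_by_function(field_categories):
    """Group field names by functional category, preserving insertion order."""
    groups = defaultdict(list)
    for name, category in field_categories.items():
        groups[category].append(name)
    return dict(groups)


groups = group_by_function(FIELD_CATEGORIES)
```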
In Conclusion
The categories above represent frameworks to consider in designing a metadata-centric system. What is important in your system design is that you understand what kind of data you are dealing with, and what kinds of questions you need to answer with your metadata. You need to understand which distinctions above are likely to be important to your system, and which are not relevant.
Finally, the categories above illustrate the importance of precise terminology when collaborating on a design. Be sure that the data system's developers use "search metadata" to mean the same thing that you do. One particularly good way to ensure agreement and understanding is to list the metadata fields together with the user queries they will make it possible to answer.
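Such a list of fields and queries can even be checked mechanically. The sketch below assumes each user query is written down with the metadata fields it needs, and reports queries the current field list cannot answer. The function name, queries, and field names are made up for illustration.

```python
def missing_fields(available_fields, query_requirements):
    """Map each unanswerable query to the sorted list of fields it still needs."""
    available = set(available_fields)
    gaps = {}
    for query, required in query_requirements.items():
        missing = sorted(set(required) - available)
        if missing:
            gaps[query] = missing
    return gaps


# Hypothetical field list agreed with the developers.
fields = ["observation_type", "lat_min", "lat_max", "lon_min", "lon_max"]

# Hypothetical user queries and the fields each one depends on.
queries = {
    "find multibeam bathymetry in a region":
        ["observation_type", "lat_min", "lat_max", "lon_min", "lon_max"],
    "find data collected last year":
        ["time_start", "time_end"],
}

gaps = missing_fields(fields, queries)
```

An empty result means every listed query is covered; anything else is a concrete item to discuss with the developers.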
References
Shankaranarayanan, G. and A. Even. The Metadata Enigma. Communications of the ACM, Vol. 49, No. 2, pp. 88-94, February 2006.