On Versions
Introduction
Without having extensively surveyed the literature, these are some thoughts I've put together (as part of my OOI Cyberinfrastructure job) on what it means to talk about a data product version. Your comments are appreciated.
In this document, I call a Version a form of relationship in which one thing is the same as another in some conceptual sense, but has differences as the result of other attributes not being the same. To be specific:
Version: an instance for which certain characteristics stay the same while other characteristics change, relative to another instance or a class of instances.
A typical definition of 'version' is "a snapshot of an item at a certain point in time". As we shall see, this definition is not precise, in that it does not suggest what constitutes an 'item', and multiple versions may exist for one entity at the same point in time. Because the CI and its architecture are so general, it is necessary to be precise and yet appropriately abstract when understanding the concept of Version.
Description of Concepts for Versioning
Review of Basic Example
A data-centric example of a version: An FVCOM model produces a data set using an initial set of bathymetry data. We'll call that output version '1'. Later, a new run of the model is generated using corrected bathymetry; we call that version '2'. How does the user know that version '2' is better than version '1'? Indeed, how does he or she know it is a variation of the same data at all?
It turns out that in almost every data system, this 'version' concept is defined, and represented, entirely by convention. Larger numbers are used for more recent versions; more recent versions are assumed better until claimed otherwise; and users tell one version from another by looking at the name of the item, or at best, by a number inside the metadata. All of these practices were established for convenience within the particular system, reflecting the way its designers built it and its users use it.
The difficulty is that none of these conventions are necessarily valid from one system to the next, and because they carry implicit meaning, people will be prone to misinterpreting them. We need a way of identifying things that are conceptually alike, but different in their instances; and we need to avoid conflating our technique with time of generation or other possibly useful metadata.
It is also the case that most systems consider versions a function of time against a particular data set, but do not carefully describe the characteristics of the 'base data set', nor the differences between one instance and another of it. And as noted above, a time-based system may not allow for the possibility that multiple versions of a data set may be generated by different authors, and so 'creation time' will tell us nothing about their validity, or even their uniqueness. All of these defects must be remedied in a general-purpose system.
Proposed General View of Versions
We define Version as above: "an item for which certain characteristics stayed the same while other characteristics changed, relative to another item". From this definition, we can infer that we must have a set of data characteristics which can stay the same, even as other characteristics change. What should those characteristics be?
In fact, while it's tempting to think there is a core set of common characteristics, the actual definition will vary from one scenario to the next.
- A user who wants the latest version of the quality controlled data will not be interested in my data file with the latest observations, if those aren't quality controlled.
- A user who wants to compare versions of the calibration coefficients probably is only comparing ones that result from new calibrations; an update of the same coefficients introduced by a different staff member is then of no interest.
- If my query is for the latest version of a piece of code in a repository, the only thing that remains the same is the file name; the content may be entirely different.
Or, to think about it from the perspective of what changes: One user wants to consider model output a new version if the software has executed again, even if the resulting data is the same; another user sees a new version only when the timestamp of the data changes; and a third user only considers it a new version if the representation schema has actually changed. (See Data Versions vs Schema Versions, below.)
We must support the view of versions from the user's perspective, rather than trying to pre-define a narrow concept that constitutes a version. This means we must give every user the ability to define for themselves what they consider a new version of a resource, whether that resource is a data file, or data stream, or an instrument state.
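As a rough sketch of what "letting each user define a new version" could mean, the Python below lets each user name the characteristics they are watching, and reports a new version only when one of those characteristics changes from one arrival to the next. The function name new_versions, the metadata field names (run_id, data_time, schema) and all of the values are assumptions made for illustration; nothing here describes an existing system.

```python
def new_versions(resources, watched):
    """Return ids of resources whose watched characteristics differ
    from those of the resource that arrived immediately before."""
    result, previous = [], None
    for rid, meta in resources:
        key = tuple(meta.get(name) for name in watched)
        if key != previous:
            result.append(rid)
        previous = key
    return result


runs = [  # arrival-ordered; all field names and values are hypothetical
    ("r1", {"run_id": 1, "data_time": "t0", "schema": "NC3"}),
    ("r2", {"run_id": 2, "data_time": "t0", "schema": "NC3"}),
    ("r3", {"run_id": 3, "data_time": "t1", "schema": "NC3"}),
    ("r4", {"run_id": 4, "data_time": "t2", "schema": "NC4"}),
]

by_execution = new_versions(runs, ["run_id"])    # every run counts
by_timestamp = new_versions(runs, ["data_time"])  # only when data time moves
by_schema = new_versions(runs, ["schema"])        # only when schema changes
```

The same four arrivals yield four, three and two versions respectively, depending only on which characteristics the user chose to watch.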
How Would This Work?
Defining and Using Version Specifications
Every information resource in a system has numerous descriptive characteristics (metadata) that are associated with the resource. Any of these characteristics can be designated as 'must stay fixed' in order to create a Version Type Specification. The user can create a Version Type Specification for any resource by declaring what metadata makes the resource versionable.
Now, when a new resource arrives, each of the Version Type Specifications is checked. If the resource matches the Version Type Specification, then it can be considered a new version of that 'version type', or the class of resources of that described type.
For example, my Version Type Specification may say for something to be a new version of my "Best Data Set", the following characteristics must be met:
Creator = "GraybealJ"
Resource Type = "Data"
Resource Name = "MyBestDataSet"
Now whenever any new resource is put into the system, these three characteristics are checked. Since no one else should be able to submit data that has my name as Creator, I can be confident that this Version Type Specification will grab only the things I am interested in.
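The check described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name VersionTypeSpec, the lower-case metadata keys and the "created" timestamp value are hypothetical, and a real system would also need to authenticate the Creator field.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VersionTypeSpec:
    """Hypothetical Version Type Specification: metadata that must stay fixed."""
    name: str
    fixed: tuple  # (metadata key, required value) pairs

    def matches(self, metadata: dict) -> bool:
        # A resource is a version of this type when every fixed
        # characteristic is present with the required value.
        return all(metadata.get(key) == value for key, value in self.fixed)


best_data_set = VersionTypeSpec(
    name="Best Data Set",
    fixed=(
        ("creator", "GraybealJ"),
        ("resource_type", "Data"),
        ("resource_name", "MyBestDataSet"),
    ),
)

incoming = {
    "creator": "GraybealJ",
    "resource_type": "Data",
    "resource_name": "MyBestDataSet",
    "created": "2009-11-14T00:00:00Z",  # illustrative value; free to change
}
other = dict(incoming, creator="SomeoneElse")

is_new_version = best_data_set.matches(incoming)  # all three fields agree
is_rejected = not best_data_set.matches(other)    # creator differs
```

Any metadata not named in the specification (here, the creation time) is free to differ from one version to the next.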
One Resource, Multiple Versions
But won't this create a slew of versions, where one resource could be a new version of many different version types? Yes, and this is in fact what happens in the real world. A new model run can simultaneously be a new version of the Nov 14 prediction, a new version of the 24-hour model output, and a new version of the Tidal Bay currents prediction at the Tidal Bay web site. The next model run may be in the latter two categories, but not be a prediction for Nov 14 any more. We understand these things intuitively about each category in turn, but it can be daunting to think about them all simultaneously. (For a non-scientific example, consider a newspaper, which has both daily updates and editions within each day. Most subscribers want the former, while newspaper editors care about the latter.)
If we consistently view versions as relative to a particular set of characteristics, then the fact that one resource may represent different 'versions' to different people is of little concern. Each Version Type Specification can itself keep track of the resource updates that meet its criteria; and the most common Version Type Specifications will become references used heavily by the community, and represented to the wider community.
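To make the bookkeeping concrete, here is a minimal sketch in which each specification keeps its own list of matching resources, using the model-run example from this section. The specification contents, metadata keys and resource ids are all assumed for illustration.

```python
from collections import defaultdict

# Hypothetical specifications: name -> metadata that must stay fixed.
specs = {
    "Nov 14 prediction": {"model": "TidalBay", "target_date": "Nov 14"},
    "24-hour model output": {"model": "TidalBay", "horizon_hours": 24},
    "Tidal Bay currents prediction": {"model": "TidalBay", "variable": "currents"},
}

history = defaultdict(list)  # spec name -> resource ids, in arrival order


def register(resource_id, metadata):
    """Record the resource under every specification it satisfies."""
    matched = [
        name for name, fixed in specs.items()
        if all(metadata.get(k) == v for k, v in fixed.items())
    ]
    for name in matched:
        history[name].append(resource_id)
    return matched


run_1 = register("run-1", {"model": "TidalBay", "target_date": "Nov 14",
                           "horizon_hours": 24, "variable": "currents"})
run_2 = register("run-2", {"model": "TidalBay", "target_date": "Nov 15",
                           "horizon_hours": 24, "variable": "currents"})
```

The first run lands in all three histories; the second lands only in the 24-hour and currents histories, because it is no longer a prediction for Nov 14. The list returned by register is also what one would attach to the resource if annotating it with the versions it satisfies.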
If we want to, we can annotate any instance of a resource with the versions that it satisfies. As long as the annotation doesn't place an encumbrance on the resource instance, that mechanism will be of some interest to users and operators of the system, as a metric of which resources are being watched.
Version Numbering
Yes, but what version number should a resource get? If I can set up a Version Type Specification at any time, the idea of a version number is pretty confusing, isn't it?
If version numbering is critical to the resource watchers, it can be specified relative to the setup of the Version Type Specification (in which case the first version after enabling that Specification will be version 1). If the version requester wants to track versions since the beginning of the system, that should be possible as well, by mining all the matching entries in the data catalog. It should be possible to create a version history for any Version Type Specification at any time, even after the fact.
There is of course the potential for confusion, as similar Version Type Specifications may have different numbering sequences because of the time when they were set up, and the differences in the Specifications. The two mechanisms to limit this confusion are the requirement for unique names for the Version Type Specifications ("New CTD Data" vs "New CTD Data 1"), and the availability of unique identifiers for Version Type Specifications.
In fact, the Version Type Specification names give us a way to 'tag' our resource instances with names like "Dr. Chu's POEM Output", so that the latest data output can become "Dr. Chu's POEM Output #325". Version subscribers will be able to see these names and filter on them, and on the characteristics of the Specifications they reference. This returns the version usage to one that will be very familiar to most scientists.
Implicit in the above discussion is that numbering proceeds in the order in which versions are received by the system, a sequence which must be made determinable. Other orderings are possible but would not create persistent indices. (The later arrival of a preceding version would force renumbering of all subsequent versions.)
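The two numbering choices (from the setup of the Specification, or from the beginning of the system) can be sketched as a single function over the catalog. This assumes the catalog records an arrival sequence number for each entry; the function name version_history, the tuple layout and the creator names are hypothetical.

```python
def version_history(catalog, fixed, start=None):
    """Number matching catalog entries in arrival order.

    catalog: list of (arrival_sequence, resource_id, metadata) tuples.
    fixed:   metadata that must stay the same for the version type.
    start:   only count arrivals at or after this sequence number
             (None means count from the beginning of the catalog).
    """
    matching = sorted(
        (seq, rid) for seq, rid, meta in catalog
        if all(meta.get(k) == v for k, v in fixed.items())
        and (start is None or seq >= start)
    )
    # Version numbers start at 1 and follow arrival order.
    return {number: rid for number, (seq, rid) in enumerate(matching, start=1)}


catalog = [  # hypothetical entries
    (1, "a", {"creator": "Chu", "product": "POEM Output"}),
    (2, "b", {"creator": "Lee", "product": "POEM Output"}),
    (3, "c", {"creator": "Chu", "product": "POEM Output"}),
    (4, "d", {"creator": "Chu", "product": "POEM Output"}),
]
fixed = {"creator": "Chu", "product": "POEM Output"}

since_beginning = version_history(catalog, fixed)       # mined after the fact
since_setup = version_history(catalog, fixed, start=3)  # spec enabled at arrival 3
```

Resource "c" is version 2 in one history and version 1 in the other, which is exactly the potential confusion noted above, and the reason each numbering needs to be tied to a uniquely named Specification.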
Discussion
Newer Not Always Better
The above versioning system helps resolve the dilemma of multiple variations on a single data product. If 3 different people separately submit updates to a calculated data product, but with different bugs fixed in each update, there is a risk that the updates will be considered a sequence of improvements, rather than independent improvements. Because the submitters each provide different metadata (at least their names will be different), subscribers will be able to quickly see only the versions of interest to them, and distinguish the different versions that have arrived using the appropriate metadata. Those that want a simple algorithm can just take the most recently arrived data, while others can filter on more subtle criteria, for example the most trusted provider of changes.
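The two selection policies just mentioned might look like the following. This is a sketch only: the submitter names, resource ids and the idea of a ranked trust list are assumptions, not part of any proposed design.

```python
updates = [  # (arrival order, submitter, resource id); all values hypothetical
    (1, "alice", "product-fix-a"),
    (2, "bob", "product-fix-b"),
    (3, "carol", "product-fix-c"),
]


def most_recent(updates):
    """Simple policy: take whatever arrived last."""
    return max(updates, key=lambda u: u[0])[2]


def most_trusted(updates, trust_order):
    """Prefer the highest-ranked provider; break ties by latest arrival."""
    rank = {name: i for i, name in enumerate(trust_order)}
    candidates = [u for u in updates if u[1] in rank]
    if not candidates:
        return None
    return min(candidates, key=lambda u: (rank[u[1]], -u[0]))[2]


latest = most_recent(updates)
preferred = most_trusted(updates, ["bob", "alice"])
```

Here the simple policy picks carol's update because it arrived last, while a subscriber who trusts bob most gets bob's update even though a later one exists.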
How A Provider Specifies Version Relationships
Say you are a data provider, and you want to indicate that product B is a new version of its predecessor A. How do you do that? Trickier still, what if the next product C that you put out is also a new version of A, but is only parallel to B? (In configuration management terms, this is referred to as a 'branch'.) For example, having run a model using a standard set of forcing data, you might run it again using two variant sets; the two new outputs are parallel to each other, and both are versions of the original.
Note that here we are not describing the data sources from which a data product has been generated, like using pressure data and conductivity data to produce salinity data. In most contexts, the word 'provenance' is used to describe that kind of relationship between process inputs and the outputs that result. In this section we use the term 'data provenance' to reflect this typical meaning.
Instead, we want to know how an information product is conceptually related to another information product (apart from any data provenance relationship). At a detailed level, the metadata for the two products should explicitly describe what is the same and what is different. But what conceptual relations are possible?
Generally, the entity that produces the versions is best positioned to describe their relationships. One relationship of particular interest is 'replaces', as it indicates intended improvement between versions. (The alternative relationship might be 'is alternative to'.) Because such relationships are transitive, taken as a group they can fully describe a tree of related data product versions, even though each relationship just references 2 data product instances.
Relationships may also describe the reason for the version's existence ('latest assimilation', 'computation fixed', 'new inputs added', 'alternate algorithm'), or how it differs from the preceding version ('enhances metadata of', 'extends data'). Multiple associative relations may describe the difference between any two versions.
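A small sketch may help show how pairwise relationships add up to a tree. Products A, B and C follow the branch example in this section; product D, the triple layout and the function names are hypothetical additions for illustration. Only 'replaces' is traversed; 'is alternative to' is recorded but not followed.

```python
# Each relationship references just two product instances:
# (subject, relation, object).
relations = [
    ("B", "replaces", "A"),
    ("C", "replaces", "A"),
    ("B", "is alternative to", "C"),
    ("D", "replaces", "B"),  # hypothetical later product on B's branch
]


def successors(product, relations):
    """Every product that directly or transitively replaces `product`."""
    found, frontier = set(), [product]
    while frontier:
        current = frontier.pop()
        for subject, relation, obj in relations:
            if relation == "replaces" and obj == current and subject not in found:
                found.add(subject)
                frontier.append(subject)
    return found


def leaves(product, relations):
    """Successors that nothing else replaces: the tips of each branch."""
    replaced = {obj for _, relation, obj in relations if relation == "replaces"}
    return successors(product, relations) - replaced


all_after_a = successors("A", relations)
branch_tips = leaves("A", relations)
```

Following 'replaces' from A reaches B, C and D, and the branch tips are C and D: two current candidates rather than a single "latest" version, which is the situation a purely numeric version sequence cannot express.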
Data Versions vs Schema Versions
In our example above, we had two versions of FVCOM model data, called '1' and '2'. Because the data in each file were different, we will call these examples of 'data versions'.
Now, it so happens that this model represents its data in the FVCOM extension for NetCDF CF. Imagine that we want to upgrade the organization of this model output, to take advantage of some new structures in NetCDF 4.0. My description of the data will need to change; now the schema for representing my data has gone from NC3 to NC4, for example. (It is useful to know this when trying to understand the file; although netCDF files are self-describing as to their schema version, it's also nice to be able to tell whether an application should even try to open the file.)
Whatever file I may put my new data in, we can say that it is following a new version for the way it is organized. We will call that distinction a 'schema version'.
Incremental Versions
I may not want every version of a resource; it may change faster than I care about or can keep up with. With the concept of a Version Type Specification, I can specifically ask to get every 'n' versions for that Version Type, or a maximum of one version a day, or apply other constraints that are evaluated after a resource has been determined to be a new version.
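Both constraints can be sketched as filters applied after matching has already decided which resources are versions. The function names, the six-hour arrival spacing and the start date are assumptions chosen only to make the example concrete.

```python
from datetime import datetime, timedelta


def every_nth(versions, n):
    """Keep versions n, 2n, 3n, ... from an arrival-ordered list."""
    return versions[n - 1::n]


def at_most_one_per(versions, interval):
    """Keep a version only if `interval` has passed since the last kept one.

    versions: arrival-ordered list of (arrival_time, resource_id) pairs.
    """
    kept, last = [], None
    for arrived, rid in versions:
        if last is None or arrived - last >= interval:
            kept.append(rid)
            last = arrived
    return kept


t0 = datetime(2009, 11, 14)  # arbitrary illustrative start time
arrivals = [(t0 + timedelta(hours=6 * i), f"v{i + 1}") for i in range(8)]

thirds = every_nth([rid for _, rid in arrivals], 3)
daily = at_most_one_per(arrivals, timedelta(days=1))
```

With eight versions arriving six hours apart, the first filter delivers v3 and v6, and the second delivers v1 and v5.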
Versions vs records in a data stream? (Versions vs Subscriptions?)
If a data stream is coming from a sensor, is the new record really just the latest version of data? This can be answered by applying the mechanisms above. For some users, the 'latest version' of data from that sensor will always be the thing of interest, and so for them subscribing to that data stream is equivalent to creating a version description that keeps constant all the other characteristics of the data stream messages (aside from the data and timestamp, of course!).
Why would you create a version description, rather than just subscribe to the stream? In the end, the two concepts may be nearly identical, depending on how the system is implemented. The added feature provided by versioning is that it is essentially an annotation system, declaring something about the resource instance itself. This would not naturally be available as part of the concept of a subscription.

Versions
Here is a real-world instance of data versions. I provide it to test whether the concepts stand up to such cases.
An oceanographic vessel stops at a location, lowers a CTD and collects data about the temperature (T) and salinity (S) properties of the water column. At the same time, water bottles are fired at various depths, and water samples are collected as well as temperatures from reversing thermometers. These samples are later analysed for nutrients. Does it matter if this is considered one instance of a sampling event, or more than one (say one per instrument used)? To begin, let’s consider this to be one. Version 1 is then the data returned at sea. In this case, it would consist of T and S sampled by the CTD, plus T from the reversing thermometers and S from analyses of water samples, but not the nutrients, because this is done later. These data, Version 1, are sent from ship to shore for immediate global distribution. Often the vertical resolution and instrument precision are downgraded to save transmission bandwidth. This degraded version is Version 2.
The ship returns home and there the CTD calibration is checked as is the salinometer used to analyse S from water samples. Some changes are noted and the data corrected. As yet, there is no time for the nutrients to be analysed. This is now Version 3.
In time, nutrient values are derived from the water samples with appropriately calibrated instruments. These nutrient values are added into the data set from that one sampling event at sea. This is now Version 4.
In time, someone notices the ship’s clock was not correct but the offset is discovered and corrected to produce Version 5.
How does the Graybeal specification work to distinguish these versions? For the sake of simplicity, assume that only one person “created” the data. (This is likely not true, since the nutrient analyses are typically done by someone other than the person who looked at the T and S.) In the simple case, the “Creator”, “Resource Type” and “Resource Name” would be constant. We would need to add another attribute describing the relationship. To remove ambiguity, I assume we would want a vocabulary for this. So let’s propose the following:
Version 1 Relationship = “original”
Version 2 Relationship = “V1 degraded”
Version 3 Relationship = “V1 calibrated”
Version 4 Relationship = “V1 extended”
Version 5 Relationship = “V2 corrected”
A difficulty is in the last definition because V4 also represents V3 corrected. This is a branch point of the kind Graybeal describes.
Are these relationship terms sufficiently rich? I have doubts.
Instead of a version number, should we simply time-stamp the version? That is, replace “V1” by the appropriate time stamp when that version was created.
Note that people who receive Version 2 will likely want to get Version 5 and to “replace” Version 2, since the information content (higher resolution, higher precision, calibrated, complete suite of variables, etc.) of Version 5 is higher than that of Version 2. This means the version information must be carried in the distribution of the Version 2 data, at the cost of bandwidth.
What other attributes of a version are needed, or how can the relationships be better described?
There are more complicated variants of the above-noted case. For example, the V2 data may go through some quality assessment, as would other versions. Or another version of V5 may have depth averaging done to reduce noise. We can introduce these later, after dealing with the issues raised here.
Bob Keeley
great use case
Bob, this is a wonderful use case and your questions are spot on. Without getting into the detailed analysis at this moment (!), I think you've highlighted the importance of identifying the core relations that need to be tracked as each data collection is derived from its predecessors. There won't be a one-size-fits-all set of high-level terms that works for everyone -- different people will care about, and track, the succession in different ways. So we have to be very specific and precise about the associations.
Thanks for the material!