Nice example of data provenance

MBARI has done something I consider impressive in the scientific data management realm, and I want to toot their horn a little, so a blog entry seemed good for that. (Disclosure: I was hired in part to start up the project discussed below, and have gotten to claim part of its success despite contributing relatively little to it technically over the years.)

As indicated by this MMI news item, the American Geophysical Union has recently emphasized the value of metadata and long-term stewardship of data. Many of the long-term technical staff at MBARI, notably Mike McCann and Rich Schramm, have been emphasizing these priorities for a long time. In about 2000, as part of its Monterey Ocean Observing System (MOOS) development effort, MBARI began developing a data system that would systematically manage data from MOOS devices, in a way that emphasized metadata, provenance, and stewardship.

Not long after I arrived in 2001 to help with this project, Kevin Gomes came on board and eventually became the project's technical lead, a position he holds today. Together, the team members developed a systematic data management capability, the Shore Side Data System (SSDS), that has successfully captured data from a wide range of oceanographic assets[2]. More to the point, it has done so while capturing very rich metadata about those data sets. Considerable information about the data's origin is collected as the data is generated and managed, and that metadata is used to manage the data throughout its life cycle. Even after the raw data is stored, post-processing systems send metadata back to SSDS describing the processing that has taken place and the new data sets that result.
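To make the idea of post-processing steps reporting back a little more concrete, here is a minimal Python sketch of that kind of provenance chain. It is only an illustration, not the SSDS schema or API: the class names (DataSet, ProcessingStep), the helper functions, and the example device and data set names are all invented for this post.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ProcessingStep:
    """One processing run: what was done, and to which inputs (hypothetical)."""
    description: str
    inputs: List["DataSet"]


@dataclass
class DataSet:
    """A stored data product plus the metadata describing it (hypothetical)."""
    name: str
    metadata: dict = field(default_factory=dict)
    produced_by: Optional[ProcessingStep] = None  # None means raw captured data


def record_processing(description, inputs, output_name, metadata=None):
    """Register a derived data set together with the step that created it."""
    step = ProcessingStep(description, list(inputs))
    return DataSet(output_name, dict(metadata or {}), produced_by=step)


def lineage(data_set):
    """Walk back from a data set to its raw sources, one indented line per ancestor."""
    lines = []
    stack = [(data_set, 0)]
    while stack:
        current, depth = stack.pop()
        step = current.produced_by
        origin = step.description if step else "raw capture"
        lines.append(f"{'  ' * depth}{current.name} <- {origin}")
        if step:
            # Push in reverse so inputs are reported in their original order.
            for parent in reversed(step.inputs):
                stack.append((parent, depth + 1))
    return lines


# Invented example: a raw instrument stream and two derived products.
raw = DataSet("ctd_raw", {"device": "CTD", "deployment": "mooring-1"})
calibrated = record_processing("apply calibration", [raw], "ctd_calibrated")
hourly = record_processing("bin to hourly means", [calibrated], "ctd_hourly")

for line in lineage(hourly):
    print(line)
```

The point of the sketch is that each derived data set carries a pointer to the step that produced it, so the history of any product can be recovered by walking back to the raw capture.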

A 2008 paper[1] describes this process, and below I've captured a few of the key details. The work highlights some key aspects of effective metadata management:

  • the need for automated systems (in our case, SIAM, the Software Infrastructure and Applications for MOOS) that capture metadata as the data is captured,
  • the importance of a detailed data model that reflects reality, and
  • the fact that while metadata may be an important underlying requirement, it alone is not sufficient to impress the system users.

The following diagrams and screen shots capture a few significant system aspects.

 

The data model for the system evolved rapidly over the first two years, but more slowly after that. This data model has proven quite satisfactory over 6 years of data collection (though improvement is always possible).

 

SSDS Data Model

The SSDS model carries a lot of detail, and it takes time to decide on and implement all of it.

SSDS Provenance Details

Although the experienced user can use this web interface to navigate to just about any data product from the multiple observing systems served by SSDS, less technical users demand simpler schemes, closer to their way of working, before they are happy with a data system.

SSDS Data Browsing

 

[1] McCann, M. and Gomes, K. Oceanographic Data Provenance Tracking with the Shore Side Data System. Chapter 6 in Provenance and Annotation of Data and Processes. Springer Berlin / Heidelberg, 2008, pp. 309-322. Available at SpringerLink, fee required.

[2] Shortly after the mooring data capture began working, Brian Schlining took the lead on integrating AUV data into the system. Although the AUVs create a very different data structure, they do contain enough metadata in their logs to enable rich metadata tracking. Brian successfully created pre-processing software to convert AUV metadata to a form recognized by the SSDS.
