The Good Enough Data System
I was describing good future data practices to a panel considering ocean science needs for 2030, and was asked a good question after the talk: What are the qualities of a 'good' data system?
Hopefully our Guides on MMI give some ideas about that question, but the Guides are really focused on metadata. And they don't put it all in a neat summary.
So here are my proposals for what makes a good data system, with suitable weasel words for the wide array of projects that need data systems.
Principles and Functions of 'Good' Data Systems
- Data products openly available
  - Open license to use data
  - Ready Internet access (through standardized and simple methods)
- Save original data
  - Provide access to unaltered data
  - Quality control processes flag data rather than change it
- Data sources (provenance) identified
  - Time & location of observation
  - Well-described source (sensor or software or person)
  - Data processing applied
- Data well described
  - Data format described (syntax, including structure and organization)
  - Data variables labeled
  - Data characteristics categorized (what kind of variables; what kind of structure or model)
- Provide context for system (instruments, platforms, software) deployment and operations
  - Instrument/platform deployment information (when deployed, and on what parent platform)
  - Instrument calibration, quality assurance, test activities (before, during, after deployment)
- Scientific and organizational context
  - Observation frequency and duration (what is the observing plan?)
  - Observing team members and contacts
  - Program affiliation(s)
- Interoperable system
  - Support standard protocols and interfaces when providing data and metadata
  - Semantic-friendly (all terms come from controlled, published vocabularies)
  - Ease of federation/collaboration/data submission
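To make the "flag data rather than change it" idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `Observation` record, the `range_check` function, the flag codes, and the sample values are all hypothetical and not taken from any particular data system. The point is only that the quality control step returns a parallel list of flags, and the original values (with their time, location, and source) stay untouched.

```python
from dataclasses import dataclass
from typing import List, Sequence

# Hypothetical flag codes, chosen only for this illustration.
GOOD = 1
SUSPECT = 3


@dataclass(frozen=True)  # frozen: an observation cannot be modified after creation
class Observation:
    time: str         # time of observation (ISO 8601 string)
    latitude: float   # location of observation
    longitude: float
    value: float      # the measured value, kept exactly as recorded
    source: str       # which sensor, software, or person produced it


def range_check(observations: Sequence[Observation],
                low: float, high: float) -> List[int]:
    """Return one flag per observation; the observations are not altered."""
    return [GOOD if low <= obs.value <= high else SUSPECT
            for obs in observations]


# Made-up sample data: the second value is clearly out of range.
raw = (
    Observation("2010-07-11T00:00:00Z", 36.8, -121.9, 12.4, "ctd-01"),
    Observation("2010-07-11T00:10:00Z", 36.8, -121.9, 99.0, "ctd-01"),
)

flags = range_check(raw, low=-2.0, high=35.0)
```

A user who disagrees with the range limits can still get at the out-of-range value, because the flag lives alongside the data instead of replacing it.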
More Subtle, but Still Valuable, Characteristics
- Data is annotatable
  - System operators, public users, and other systems can annotate data
  - Annotations are also data
- Data system characteristics
  - Data system and data are persistent and sustainable
  - Appropriately configurable (based on how fast inputs change)
  - Appropriately scalable (based on how fast scope increases)
  - Appropriately distributable/reliable
    - absolute minimum: tested backups of the system and data
    - built-in redundancy in communication pipeline to save data when data (or communication) system goes down
    - for increased usage, build with multiple systems available at distributed sites (hot failover)
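The "tested backups" minimum can be sketched in a few lines of Python. This is one simple interpretation, assuming that "tested" means at least confirming the backup copy is byte-for-byte identical to the original; a real test would also try restoring the system from the backup. The function names, file names, and sample contents are hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def backup_and_verify(original: Path, backup_dir: Path) -> bool:
    """Copy a file into backup_dir, then check the copy matches the original."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    backup = backup_dir / original.name
    shutil.copy2(original, backup)  # copy2 also preserves file timestamps
    return sha256_of(original) == sha256_of(backup)


# Demonstration with a throwaway file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "observations.csv"
    data_file.write_text("time,value\n2010-07-11T00:00:00Z,12.4\n")
    verified = backup_and_verify(data_file, Path(tmp) / "backup")
```

A checksum comparison like this only catches a corrupted or incomplete copy; it says nothing about whether the backup is stored somewhere that survives the failure it is meant to protect against.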
Posted July 11th, 2010 by graybeal
