The Good Enough Data System

I was describing good future data practices to a panel considering ocean science needs for 2030, and after the talk I was asked a good question: What are the qualities of a 'good' data system?

Hopefully our Guides on MMI offer some answers to that question, but the Guides really focus on metadata, and they don't put it all in a neat summary.

So here are my proposals for what makes a good data system, with suitable weasel words for the wide array of projects that need data systems.

Principles and Functions of 'Good' Data Systems

  • Data products openly available
    • Open license to use data
    • Ready Internet access (through standardized and simple interfaces)
  • Save original data
    • Provide access to unaltered data
    • Quality control processes flag data rather than changing it
  • Data sources (provenance) identified
    • Time & location of observation
    • Well-described source (sensor or software or person)
    • Data processing applied
  • Data well described
    • Data format described (syntax, including structural organization)
    • Data variables labeled
    • Data characteristics categorized (what kind of variables; what kind of structure or model)
  • Provide context for system (instruments, platforms, software) deployment and operations
    • Instrument/platform deployment information (when deployed, and on what parent platform)
    • Instrument calibration, quality assurance, test activities (before, during, after deployment)
  • Scientific and organizational context
    • Observation frequency and duration (what is the observing plan?)
    • Observing team members and contacts
    • Program affiliation(s)
  • Interoperable system
    • Support standard protocols and interfaces when providing data, metadata
    • Semantic-friendly (all terms come from controlled, published vocabularies)
    • Ease of federation/collaboration/data submission
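
To make the "flag data rather than changing it" principle concrete, here is a minimal sketch in Python. The flag values, the Observation record, and the range_check function are all hypothetical names chosen for illustration; a real system would likely use flag conventions from its own community. The point is only that quality control produces a parallel set of flags and leaves the original values alone.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical flag values, chosen for illustration only.
GOOD = 1
BAD = 4


@dataclass(frozen=True)  # frozen: the original observation cannot be modified
class Observation:
    time: str     # ISO 8601 timestamp of the observation
    value: float  # original measurement, kept exactly as recorded


def range_check(observations: List[Observation], low: float, high: float) -> List[int]:
    """Return one flag per observation; the observations are left untouched."""
    return [GOOD if low <= o.value <= high else BAD for o in observations]


# Made-up sample values, purely to exercise the check.
raw = [
    Observation("2010-06-01T00:00:00Z", 12.4),
    Observation("2010-06-01T00:10:00Z", 99.9),  # outside the plausible range
    Observation("2010-06-01T00:20:00Z", 12.6),
]

# Flags live alongside the data instead of replacing it.
flags = range_check(raw, low=-2.0, high=40.0)
```

A user who disagrees with the range limits can still get the unaltered value of the flagged observation and apply a different check.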

More Subtle, but Still Valuable, Characteristics

  • Data is annotatable
    • System operators, public users, and other systems can annotate data
    • Annotations are also data
  • Data system characteristics
    • Data system and data are persistent and sustainable
    • Appropriately configurable (based on how fast inputs change)
    • Appropriately scalable (based on how fast scope increases)
    • Appropriately distributable/reliable
      • absolute minimum: tested backups of the system and data
      • built-in redundancy in communication pipeline to save data when data (or communication) system goes down
      • for increased usage, build with multiple systems available at distributed sites (hot failover)
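
As one possible reading of "backups that have been tested", the sketch below restores a backup file to a scratch directory and compares its checksum with the original. The function names are hypothetical, and the plain file copy is only a stand-in for whatever restore procedure a real backup tool provides.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_is_restorable(original: Path, backup: Path) -> bool:
    """Restore the backup to a scratch location and compare checksums."""
    with tempfile.TemporaryDirectory() as scratch:
        restored = Path(scratch) / original.name
        shutil.copy2(backup, restored)  # stand-in for a real restore step
        return sha256_of(restored) == sha256_of(original)
```

Running a check like this on a schedule, rather than only when disaster strikes, is what turns "we have backups" into "we have tested backups".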