Designing Your Sensor Network in 3 Easy Rules

The Problem

Say you're in charge of your lab's project to collect some environmental data. Your boss specified the sensors, or at least specified their capabilities, and expects you to put everything else together to make the data really work. How can you build this system so the data really is the Right Stuff?

There are a lot of technical decisions you'll have to make, but most of those (how to connect sensor A to driver B to data ingest/storage system C, how many sensors to buy, how much storage you'll need) you can figure out more or less as you go. What I want to share are the basic (metadata) rules to follow so that 5, 10, or 20 years from now your lab can use the data—the stuff that ends up in system C—to do credible research.

I also want to provide as many links as possible to current (2014) examples of these concepts, so you can go read more details about each. I hope to add more links over time, but in the meantime web searches and the MMI Guides can get you where you need to go.

The Rules

Rule 1: Describe with Identifiers

Before your data are created, describe them. If you're making a simple data system, the description can be simple; for a rich cyberinfrastructure like the Ocean Observatories Initiative, you can add some details (but not too many! it's hard to maintain rich knowledge here).

Describe every value (parameter) that you create, whether by measurement or processing. At a minimum, state three things about the value: its units, its origin, and its semantically meaningful name. Describe each of those using concepts from controlled vocabularies, and refer to each concept by a globally unique identifier.

  • Units: Ideally, specify units using a standard units vocabulary like UDUNITS or UCUM or, arguably most rigorously, a unit system like the SI unit system as defined in the QUDT (Quantities, Units, Dimensions and Data Types) ontologies. For example, the SI quantity kind 'area' has the unit square meter, which has its own unique identifier in the QUDT ontology.
  • Origin: Your values' origin can be described in many ways: measured or calculated? what type of sensor or process? what specific sensor or process? on what platform? platform type? from what location? is it part of a bigger structure of values (a 'feature type'), like a time series, swath, or structured array? For each of these characteristics, advanced terms and structures can define the source in ever more detail; you'll need to assess how much detail to include in the design. At a minimum, indicate the type and model of sensor or process, and any pre-defined locations of the value in X, Y, and Z dimensions.

    The key point is to document your description using explicit concepts from controlled vocabularies, either from the community or defined in your system. Sensor type vocabularies can be found in GCMD instrument keywords, or vocabularies in the MMI Ontology Registry and Repository or elsewhere; authoritative vocabularies for sensor and process models are harder to find, so you might build them yourself for your sensors.
  • Name: The name documents what the value represents and, following best practices, should be associated with a definition. There are many vocabularies for parameter names; two of the best known in environmental science are the Global Change Master Directory (GCMD) Science Parameters and the Climate and Forecast (CF) convention standard names (CF extends the earlier COARDS conventions). Such community vocabularies must balance competing design concerns: generality against detail, expertise against community engagement, speed against definition accuracy. In the end, the best practice is to use a unique identifier that identifies a concept describing the parameter, resolves to a definition of that concept, and is maintained in a persistent repository over time.
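
As a rough illustration of Rule 1, the three minimum pieces of description (name, units, origin) can be written down as a small record before any data exist. This is only a sketch: the class and field names are made up for this example, and every example.org identifier is a hypothetical stand-in for a term from a real vocabulary. The check is purely syntactic; it does not confirm that an identifier actually resolves to a definition.

```python
from dataclasses import dataclass, fields
from typing import List, Optional, Tuple
from urllib.parse import urlparse


def looks_like_uri(identifier: str) -> bool:
    """Return True if the text is an absolute http(s) URI (a syntax check only)."""
    parts = urlparse(identifier)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


@dataclass(frozen=True)
class ParameterDescription:
    """Static description of one parameter, written before any data exist."""
    name_uri: str          # what the value represents
    units_uri: str         # units term from a units vocabulary
    sensor_type_uri: str   # origin: type of sensor or process
    sensor_model_uri: str  # origin: model of sensor or process
    measured: bool         # origin: True if measured, False if calculated
    position_xyz: Optional[Tuple[float, float, float]] = None  # pre-defined X, Y, Z

    def problems(self) -> List[str]:
        """List the identifier fields that are not absolute URIs."""
        return [
            f.name
            for f in fields(self)
            if f.name.endswith("_uri") and not looks_like_uri(getattr(self, f.name))
        ]


# Hypothetical identifiers: example.org stands in for a real vocabulary server.
VOCAB = "http://example.org/vocab"
water_temperature = ParameterDescription(
    name_uri=f"{VOCAB}/parameter/sea_water_temperature",
    units_uri=f"{VOCAB}/unit/degree_Celsius",
    sensor_type_uri=f"{VOCAB}/sensor_type/thermistor",
    sensor_model_uri=f"{VOCAB}/sensor_model/acme_t100",
    measured=True,
    position_xyz=(0.0, 0.0, -5.0),
)
print(water_temperature.problems())  # [] because every identifier is a full URI
```

A description that uses bare strings such as "degC" would show up in problems(), which is a cheap way to catch shortcuts before the sensor is deployed.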

Rule 2: Timestamp and Tag

These concepts are simple, though the optimal design may take thought. Every value that is obtained must be associated with two types of information: the time when the value was obtained; and tags that connect the value to its related metadata. The related metadata includes the description from Rule 1, and other dynamic information about the location and circumstances for the value.

  • Timestamp: With the timestamp, the value can be associated with all other observations and calculations at that time, whether about the earth or about observing systems. A timestamp may be embedded with the original data values, but it is important to capture a timestamp that is extremely likely to be accurate; embedded timestamps rarely fit that description, so timestamping data values as they arrive is often the best approach. Always capture the timestamp in Coordinated Universal Time (UTC). Whenever possible, format the time in ISO 8601 format, ideally something like YYYYMMDDThhmmss.sssZ (the trailing Z marks UTC), so that neither humans nor software have to guess what it means.

    Even if the end user demands local time or sidereal time on the sensor data records, also include the Universal Time, for interoperability's sake. Using local time zones for timestamps introduces several issues: which time zone is being used? how are the time zones changed for moving platforms? and how sure are you that the time zone has been set and used correctly in generating the timestamp?
  • Description Tag: You've put the description from Rule 1 somewhere uniquely identifiable: a unique URL, a file location, or an index in a database table. Your tag needs to point to that location, so that the value can be 'looked up' and fully understood by your system. In rare cases, the documentation may be included with the value and transported with the value records; for example, when large arrays of values are collected at once and can be described as a group with relatively brief static metadata.
  • Dynamic Information Tags: This is the information that changes 'relatively often' for your data, for example location, orientation, a sensor's mode setting or status, or the parent platform. There are three ways to capture this dynamic information: (1) embedding it with the values (as timestamps are often embedded); (2) in a separate but persistent location, always valid for that stream of values (a satellite's location and orientation might always be reported in a certain data stream that is generated in parallel with the science data values); or (3) in a location that may change from time to time (when a sensor is moved from one platform to another, its position on earth may be defined by the current platform's geolocation data stream). In the first case, the description of the value stream includes the description of the dynamic information. In the second case, the description of the value stream includes the description of the location of the dynamic information. And in the third case, the value stream or the tags you append to it must include either the actual dynamic information or the current place to look for that information. (For example, in some systems the sensor driver is responsible for reporting the parent platform(s) in the sensor hierarchy, so that the sensor's current parent platform, and the orientation of the platform, can be obtained.)
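
One way to picture Rule 2 is a small function that stamps each value as it arrives and attaches its tags. This is a sketch under assumptions, not a prescribed design: the function names, the record keys, and the example.org references are all hypothetical. The time is written in ISO 8601 basic format with milliseconds and a trailing Z to mark UTC, and a time with no known time zone is rejected rather than guessed at.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, Mapping, Optional


def utc_timestamp(moment: Optional[datetime] = None) -> str:
    """Format a time as ISO 8601 basic format in UTC, with milliseconds."""
    if moment is None:
        moment = datetime.now(timezone.utc)  # arrival time from the host clock
    if moment.tzinfo is None or moment.utcoffset() is None:
        raise ValueError("refusing a naive datetime: its time zone is unknown")
    moment = moment.astimezone(timezone.utc)
    millis = moment.microsecond // 1000
    return moment.strftime("%Y%m%dT%H%M%S") + f".{millis:03d}Z"


def tag_value(
    value: float,
    description_ref: str,
    dynamic_refs: Mapping[str, str],
    arrived: Optional[datetime] = None,
) -> Dict[str, Any]:
    """Attach an arrival timestamp, a description tag, and dynamic tags to a value."""
    return {
        "value": value,
        "received_utc": utc_timestamp(arrived),
        "description": description_ref,   # points at the Rule 1 description
        "dynamic": dict(dynamic_refs),    # where to look up changing information
    }


# A reading that arrived at 07:30:15.250 local time, five hours behind UTC.
arrived = datetime(2014, 3, 5, 7, 30, 15, 250000, tzinfo=timezone(timedelta(hours=-5)))
record = tag_value(
    12.7,
    description_ref="http://example.org/descriptions/water_temperature",  # hypothetical
    dynamic_refs={"platform": "http://example.org/platforms/buoy_7"},     # hypothetical
    arrived=arrived,
)
print(record["received_utc"])  # 20140305T123015.250Z
```

In normal use the arrived argument is left out, so the timestamp comes from the receiving host's clock at the moment the value shows up.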

Rule 3: Connect and Store

Now we have to make the connections among these different entities, before storing the values for access by user-facing systems. All the metadata—timestamp, location, data description, dynamic information—must be associated with each value in a persistent and discoverable way. As you design the solutions to satisfy Rules 1 and 2, you need to think ahead to how your data ingest and storage software will use your descriptions, timestamps, and tags to make a coherent collection of information available.

You may have had to optimize some of your strategies, perhaps using short names or symbols instead of fully resolvable unique identifiers. As you connect up the data and get it ready to store, this may be a good time to unpack the optimizations, and use fully resolvable identifiers as part of your storage system. It is also a good time to add other attributes to your user-facing values, by associating them with additional contextual metadata like the project and institution that generated and served them, responsible contacts, and any other information you may have or generate on the fly.
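
The unpacking step might look something like the sketch below. Everything in it is an assumption made for illustration: the compact record keys ("v", "t", "p", "u", "pl"), the lookup table, the context fields, and the example.org identifiers. The one design choice worth noting is that an unknown short name raises an error instead of being stored as-is, so unresolvable names never reach the archive.

```python
from typing import Any, Dict, Mapping

# Hypothetical lookup table from compact on-the-wire names to full identifiers.
SHORT_TO_URI: Dict[str, str] = {
    "wtemp": "http://example.org/vocab/parameter/sea_water_temperature",
    "degC": "http://example.org/vocab/unit/degree_Celsius",
    "buoy7": "http://example.org/platforms/buoy_7",
}


def expand(short_name: str) -> str:
    """Return the full identifier for a short name, or fail loudly."""
    try:
        return SHORT_TO_URI[short_name]
    except KeyError:
        raise KeyError(f"no identifier registered for short name {short_name!r}") from None


def prepare_for_storage(compact: Mapping[str, Any], context: Mapping[str, str]) -> Dict[str, Any]:
    """Unpack short names into full identifiers and attach contextual metadata."""
    return {
        "value": compact["v"],
        "received_utc": compact["t"],
        "parameter": expand(compact["p"]),
        "units": expand(compact["u"]),
        "platform": expand(compact["pl"]),
        "context": dict(context),  # project, institution, contacts, and so on
    }


compact_record = {"v": 12.7, "t": "20140305T123015.250Z", "p": "wtemp", "u": "degC", "pl": "buoy7"}
context = {"project": "Example Bay Monitoring", "institution": "Example Lab"}  # hypothetical
stored = prepare_for_storage(compact_record, context)
print(stored["units"])  # http://example.org/vocab/unit/degree_Celsius
```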

As the user-facing software makes these values available to users, now it can apply the best practices of unique identifiers, Linked Open Data, and the semantic web, because the key metadata has been attached to the data values throughout their life cycle.

Sequence of Events

Generally, Rule 1 should be satisfied before the sensor is deployed or the data process is executed (yes, this can be done!), and Rule 2 is applied as soon as each value is created. Rule 3 gets applied later in the value's life cycle, when it is being organized for storage, and perhaps indexed and mapped to other information stores. In a fast end-to-end real-time system, where data must be immediately accessible, Rule 3 must be completed very quickly. But in most systems, local caches and back-up stores can make the connect-and-store process a little more leisurely and more tolerant of computer failures. (There's a hint: Make sure your end-to-end system includes caches or backups for the raw data, and doesn't purge them until their data are delivered downstream with confirmation.)
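
The hint about not purging too early can be sketched as a cache that drops a raw record only after the downstream system confirms receipt. The class and function names here are hypothetical, and the cache lives in memory purely to keep the example short; a real one would need to sit on disk or some other persistent store to survive a computer failure.

```python
from collections import OrderedDict
from typing import Callable, List, Tuple


class RawDataCache:
    """Holds raw records until downstream delivery has been confirmed."""

    def __init__(self) -> None:
        self._pending: "OrderedDict[int, bytes]" = OrderedDict()
        self._next_id = 0

    def add(self, raw: bytes) -> int:
        """Store a raw record and return the id used to confirm it later."""
        record_id = self._next_id
        self._next_id += 1
        self._pending[record_id] = raw
        return record_id

    def pending(self) -> List[Tuple[int, bytes]]:
        """Return a snapshot of the records not yet confirmed, oldest first."""
        return list(self._pending.items())

    def confirm(self, record_id: int) -> None:
        """Purge a record, but only once delivery has been confirmed."""
        self._pending.pop(record_id, None)


def deliver_pending(cache: RawDataCache, send: Callable[[bytes], bool]) -> int:
    """Try to send every pending record; purge only those that send() confirms."""
    delivered = 0
    for record_id, raw in cache.pending():
        if send(raw):
            cache.confirm(record_id)
            delivered += 1
    return delivered


cache = RawDataCache()
for reading in (b"reading-1", b"reading-2", b"reading-3"):
    cache.add(reading)

# Simulate a downstream system that fails to confirm one record.
first_pass = deliver_pending(cache, lambda raw: raw != b"reading-2")
print(first_pass, cache.pending())  # 2 [(1, b'reading-2')]

# A later retry succeeds, and only then is the record purged.
second_pass = deliver_pending(cache, lambda raw: True)
print(second_pass, cache.pending())  # 1 []
```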

End Results

If you design your data system so that every value is described in real time using unique identifiers, then you, your boss, and future users will reap huge interoperability and reusability benefits. You will have ensured that good descriptions remain available for a long time for all these values. With this basic set of associations readily available, other users of your data can add further enhancements later, based on your excellent records.