6. Concept of Operations

Here we illustrate typical operational sequences to achieve the goals of various system actors.

A key concept in these scenarios is that every term in a community vocabulary has a URI, or Uniform Resource Identifier. This URI may be a URL (Uniform Resource Locator, like the addresses entered in a web browser) or a URN (Uniform Resource Name, a unique string that names a resource independent of its location).
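As a hypothetical illustration (these identifiers are invented, not actual community URIs), the same vocabulary term might be identified in either form:

    URL:  http://vocab.example.org/parameter/sea_surface_temperature
    URN:  urn:x-example:vocab:parameter:sea_surface_temperature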

Data Originator Needs a Vocabulary to Describe Data

A typical Data Originator (call him Dave Oridger) starts with a data set in hand, or at least a mental concept of a desired data set. Dave will have a name in mind for each data parameter, even if only an informal one. Dave first engages with the semantic framework when he decides to share his data in a community-friendly way. To do this he must name his parameters using a vocabulary that the community can understand.

To create a vocabulary that satisfies semantic interoperability requirements, Dave starts with his list of parameters and the Community Vocabulary Builder's Term Selection Service. (He can also use the tools in interactive mode, entering one search term at a time.) Dave selects 'Begin New Vocabulary' and names it Profit Ocean Exchange. He then selects his own parameter list, which can be in any of several common formats, and clicks Read Parameters. At this point he can configure the default behavior of the Community Vocabulary Builder in one of two ways.

Configuration 1: Map from Existing Names. In the first configuration, the tools will create a vocabulary with Dave's parameter names, then help Dave create mappings from those parameter names to related community terms. This creates a publicly usable association between the two vocabularies, and Dave's utilities (using community-supplied libraries) can automatically publish Dave's data as community resources by supplying the mappings.

Configuration 2: Develop a New Set of Names. In the second configuration, the tools help Dave create a vocabulary list that primarily consists of community names. In many situations, unique identifiers for these community names can be used as the identifying name for each data item. When short or unique names are necessary (for example, when a system has multiple sensors measuring the same type of parameter), the Community Vocabulary Builder can generate appropriate local names from the community terms, creating a 'local vocabulary' that can be mapped to the community terms.

Dave selects his first vocabulary term, possibly editing it slightly if using configuration 2, and selects Search. The Term Selection Service searches the core (suggested) vocabularies for matching terms, presenting them in a pick list. Dave could have chosen to omit any of the core vocabularies, or to re-order the sequence in which they are searched. (In a future release the Term Selection Service will also let Dave select terms without having to do a search, for example using multi-faceted selection menus, or by simply cutting and pasting URIs.)

Configuration 2 Process: In configuration 2, Dave can check any terms from the resulting list and click 'Use selected term(s)'. The associated URIs are registered in the Profit Ocean Exchange data set. In this way Dave finds 75% of the terms he wants to use, and he hasn't had to make up any 'official' names yet.

To find additional terms Dave checks the "Extend term search" option and sets a slider to indicate how many of the stored ontologies to search. Ontologies are searched for terms in ranked order using an algorithm based on size, a quality metric (for example, do all terms include definitions?), age, number of updates, and community usage; specific ontologies can be included or excluded. In many cases this turns up a few more terms that match Dave's needs.
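The ranking algorithm is not specified here; the following Python sketch only illustrates the idea, with invented field names and arbitrary placeholder weights.

    from dataclasses import dataclass

    @dataclass
    class OntologyStats:
        term_count: int          # size of the ontology
        defined_fraction: float  # quality metric: share of terms with definitions (0..1)
        age_years: float         # time since first publication
        update_count: int        # number of updates
        usage_count: int         # community usage, e.g. references from mappings

    def rank_score(o: OntologyStats) -> float:
        # Larger, better-documented, more actively used ontologies rank higher;
        # the weights are illustrative assumptions, not the tool's actual values.
        return (0.2 * min(o.term_count / 1000.0, 1.0)
                + 0.3 * o.defined_fraction
                + 0.1 * min(o.age_years / 10.0, 1.0)
                + 0.1 * min(o.update_count / 50.0, 1.0)
                + 0.3 * min(o.usage_count / 100.0, 1.0))

    # The slider then limits the search to the top-N ranked ontologies:
    # ontologies.sort(key=rank_score, reverse=True); search(ontologies[:n])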

Eventually Dave has found all the available matches, but still has some terms that are not satisfied by any existing vocabulary. For these he now follows the same process that is used in configuration 1 for all terms.

Configuration 1 Process:

Dave uses the Term Creation Service to create new community-accessible terms for his vocabulary. This capability prompts for descriptors for each term (name and definition), suggests appropriate practices for each, and creates a new entry in the controlled vocabulary.

For these new terms, Dave is encouraged to create mappings to other community vocabularies. Dave can select any of his own terms, then use the Term Selection Service to identify related concepts. This time he can select multiple community terms and click a relationship, like 'Same as these terms' or 'Narrower than these terms'. This helps users of his data set, and of his vocabulary, understand the relationship of the parameter names to other, more familiar names.
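One plausible representation of these mappings is SKOS, sketched below with the rdflib library; the URIs are invented for illustration, and the actual tools may store mappings differently.

    from rdflib import Graph, URIRef
    from rdflib.namespace import SKOS

    g = Graph()
    local = URIRef("http://vocab.example.org/poe/water_temp")            # Dave's term
    community = URIRef("http://vocab.example.org/core/sea_water_temperature")

    # 'Same as these terms': assert an exact match
    g.add((local, SKOS.exactMatch, community))
    # 'Narrower than these terms' (Dave's term is more specific) would instead be:
    # g.add((local, SKOS.broadMatch, community))

    print(g.serialize(format="turtle"))   # a publicly usable mapping file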

Under configuration 2, Dave may now choose to 'Generate Custom Names', selecting the criteria (short names, no punctuation, word case, unique names, etc.) that he wishes to apply. This will create a custom vocabulary for a specific data set, and requires some additional configuration (like the number of terms that correspond to a particular community name).
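A minimal sketch of what 'Generate Custom Names' might do, assuming invented rules (a 15-character limit, lower case, numeric suffixes for uniqueness); the real tool's criteria are configurable and may differ.

    import re

    def make_local_name(community_label: str, existing: set,
                        max_len: int = 15, lower: bool = True) -> str:
        # Strip punctuation, normalize case, and truncate to the length limit.
        name = re.sub(r"[^A-Za-z0-9]+", "_", community_label).strip("_")
        if lower:
            name = name.lower()
        name = name[:max_len]
        # Ensure uniqueness, e.g. for multiple sensors measuring the same parameter.
        base, n = name, 2
        while name in existing:
            name = f"{base}_{n}"
            n += 1
        existing.add(name)
        return name

    names = set()
    print(make_local_name("Sea Surface Temperature", names))  # sea_surface_tem
    print(make_local_name("Sea Surface Temperature", names))  # sea_surface_tem_2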

Finally, Dave has a complete controlled vocabulary that he can use to identify his data. He can save a copy of this vocabulary as a local file, in several useful formats (OWL, RDF, formatted HTML, text, and CSV). But his vocabulary only becomes interoperable (his original goal) when he publishes it to the community repository. When he clicks the "Publish" button, an OWL file (an ontology containing his controlled vocabulary) is registered in the Ontology Repository, and becomes available for discovery, analysis, and re-use. When he updates this vocabulary in the future, the Ontology Repository will track changes, providing version numbers for both individual terms and the whole ontology.
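Under the hood, the publishing step might reduce to an HTTP upload like the sketch below; the endpoint URL and protocol are assumptions, not a documented repository interface.

    import requests

    # g is the rdflib Graph holding the vocabulary (see the mapping sketch above).
    owl_data = g.serialize(format="xml")   # OWL serialized as RDF/XML

    resp = requests.post(
        "http://repository.example.org/ontologies",   # hypothetical endpoint
        data=owl_data.encode("utf-8"),
        headers={"Content-Type": "application/rdf+xml"},
    )
    resp.raise_for_status()   # on success, the repository assigns version numbers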

Data Originator Applies Vocabulary to a Data Set

Dave Oridger has created a community-accessible vocabulary that describes the concepts he is measuring, but now he needs to use the vocabulary to describe his data. There are several forms in which he might apply the vocabulary in his data description:

  1. Unique short names describe each parameter being measured; the overall vocabulary is referenced as part of the data set description
  2. Both a (local) short name, and an interoperable or 'standard' name, are provided for each term.
  3. Each parameter in his data description is described in an interoperable way, for example by providing a URI.

Of course, considerable additional metadata (units, detailed description, source, and so on) may describe each parameter, but that metadata is independent of this discussion.

Where the data description capability (e.g., a content standard, or a data standard like netCDF that calls for metadata to be embedded) is aware of semantic web issues, it will typically allow (or even require) the specification of a term as a Uniform Resource Identifier, or URI. Therefore, Dave needs a URI for each term he uses to describe his parameters. These URIs are provided by the Community Vocabulary Builder, as part of building Dave's vocabulary. (In some cases the URI must be a URN, in others a URL; the Ontology Repository therefore needs to provide both for the terms in its ontologies.) In a complete and well-thought-out framework, all of these URIs should be resolvable via web services. This enables the user to learn about a particular term by appropriately resolving its URI.
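As a concrete illustration of the third form, the sketch below attaches a term URI to a netCDF variable using the netCDF4 Python library. The 'standard_name_uri' attribute name and the URI itself are hypothetical conventions, not part of the netCDF standard.

    from netCDF4 import Dataset

    ds = Dataset("profit_ocean_exchange.nc", "w")
    ds.createDimension("time", None)

    temp = ds.createVariable("sst", "f4", ("time",))  # local short name (form 1)
    temp.long_name = "sea surface temperature"        # human-readable name (form 2)
    temp.standard_name_uri = (                        # interoperable URI (form 3)
        "http://vocab.example.org/core/sea_surface_temperature")

    ds.close()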

Entering a URI (a relatively long string in most cases) for each term used to describe the data set implies considerable typing, or at a minimum cutting and pasting. A preferred alternative is to provide data set definition tools, but relatively few such tools exist compared to the extremely wide range of formats. Nonetheless, a major advantage in usability will occur when metadata entry tools are developed that include drop-down selections of terms from community and local vocabularies. (This would be an ideal demonstration area for an application developer.)

Data User Searches for Desired Data

Daisy Usenew needs to find data for a particular project. She has access to several catalogs of data sources, including registries of data streams, service registries that include data services, visual presentations of data that have links to the Data Providers, and data repositories ('Data Aggregators') that include 'ongoing' data sets. She knows the data she wants is called Sea Surface Temperature or SST, but she does not realize how many data sources do not use those terms natively.

At the simplest level of semantic interoperability, her searches can still be successful, because the aggregators of data have integrated term mapping into both their search and registration interfaces. On the registration side, the Data Aggregators have insisted that registered data sources (Data Providers) include associations to community terms for each of the local descriptive terms, or follow a notational convention when no community term exists; so the Data Aggregators have a community term for every data item that they know about. So that submissions can be validated, the mappings must be provided in an agreed way (either within the data set, as described in the previous example, or in a separate mapping file, as constructed in the first example).

On the searching side, a number of community vocabularies are used to supply the default search terms. These community vocabularies are mapped to each other in many cases. So, by using inferences, a relationship from Daisy's search terms, or any other Data User's search terms, to the community terms used by the Data Provider can usually be found.

The Data Aggregator may determine these relationships in one of two ways. It may keep a local copy of all the relevant ontologies and mapping files, and perform inferencing as a local computation, possibly in advance for each community search term. Or, more likely, it can query an ontology mediation service, providing a search term and receiving in response all the terms to which that search term can be related. In either case, some attention must be paid to caching strategies and network latency, to avoid unpleasant delays in responding to search queries.
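The second strategy might look like the sketch below: a query to a mediation service with a simple in-memory cache to soften network latency. The service URL and the JSON response shape are assumptions for illustration.

    from functools import lru_cache
    import requests

    @lru_cache(maxsize=4096)
    def related_terms(search_term: str) -> tuple:
        resp = requests.get(
            "http://mediator.example.org/relate",   # hypothetical mediation service
            params={"term": search_term},
            timeout=5,                              # keep the search UI responsive
        )
        resp.raise_for_status()
        return tuple(resp.json()["related"])        # assumed JSON response shape

    # e.g., related_terms("Sea Surface Temperature") -> URIs of all mapped terms;
    # repeated queries for popular terms are then served from the cache.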

Of course, since many terms may not indicate whether the measurement is near the surface, Daisy may have to decide between getting too much data (all the water temperature measurements) and too little (only those that are clearly mapped to a sea surface temperature concept). Larger Data Aggregators are likely to provide graduated control over the number of returns, using relationship strengths within software algorithms to decide the likelihood that a return is relevant.

At a second level, more sophisticated inferencing is possible. The concept of "sea surface temperature" typically implies something not just about what is measured (water temperature), but where it is measured (near the surface of the ocean). In many cases data providers will label such data as "water temperature", and provide a different piece of metadata to indicate the location (depth) of the measurement. A more comprehensive approach will use ontological concepts for terms like "sea surface temperature". The carefully defined ontological concept may include additional properties, like the measurement location that the term implies, that can be used to filter or select the data most likely to be of interest. This kind of inferencing can only be done when the data model in the repository (and in the Data Provider) is sophisticated enough to support all the necessary information, and so only some Data Aggregators will provide it.
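One way such inferencing could be expressed is a SPARQL query over the repository's metadata, as sketched below with rdflib; every class and property name here is hypothetical.

    from rdflib import Graph

    g = Graph().parse("observations.ttl")   # local copy of descriptive metadata

    # Select observations whose measured property is mapped, directly or
    # through a chain of exact matches, to the sea-surface-temperature concept.
    results = g.query("""
        PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
        PREFIX ex:   <http://vocab.example.org/core/>
        SELECT ?obs WHERE {
            ?obs ex:observedProperty ?p .
            ?p skos:exactMatch* ex:sea_surface_temperature .
        }
    """)
    for row in results:
        print(row.obs)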

Even more advanced data models can be utilized through inferencing. For example, a repository may recognize from a data set's descriptive metadata that the water temperature was detected by a satellite, and so 'understand' that only surface water temperatures can be measured, even though the term 'sea surface temperature' was not used.

Solution Developer Integrates Semantics into Software

The Solution Developer, Sol Devin, has the following problem: users of the system want to use their own vocabulary to specify something (e.g., a search term, or a label for their data). The Solution Developer must provide some mechanism that acts as a go-between, translating the vocabulary of the user into a vocabulary of "the community" (where the community could be local, consisting of other users and their vocabularies; or a larger external community, which may even have formal vocabularies defined).

Option 1 for Sol is to enforce the use of a 'community vocabulary'. In the search term interface, Sol can configure the application to offer only a drop-down (or auto-complete) set of terms, constrained to the vocabularies that are acceptable to, and understood by, the application. In this case, integration of semantics consists of incorporating the acceptable vocabularies directly into the application and its interfaces (see 'To Incorporate a Formal Vocabulary In An Application', below).

Option 2 involves an attempt to find terms in the acceptable vocabularies that match the intent of the user. In the simplest approach, Sol adds code to perform a text search across the acceptable vocabularies (previously incorporated as described below), optionally searching across descriptions and other metadata, and returns possible matches to the user. The user can select from any of those matches, or refine or expand the search. More complex or subtle search algorithms can be created, of course.
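A minimal sketch of that simplest approach, assuming the vocabularies have already been loaded into a plain dictionary keyed by term URI; a production implementation would add stemming, ranking, and so on.

    def search_terms(query: str, vocabularies: dict) -> list:
        # vocabularies: {uri: {"label": str, "definition": str}, ...}
        q = query.lower()
        return [uri for uri, meta in vocabularies.items()
                if q in meta.get("label", "").lower()
                or q in meta.get("definition", "").lower()]

    # matches = search_terms("surface temperature", acceptable_vocabs)
    # ...then present 'matches' to the user for selection or refinement.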

Option 3 involves using explicit relationships between the local vocabulary and the acceptable vocabularies, if these relationships have been created. If the Data Originator had at some point explicitly described the relationship between the local terms and any of the acceptable vocabularies (or other intermediate vocabularies that are in turn mapped), the Solution Developer could use these mappings directly to substitute an acceptable term for the local term.

Each of these options is possible within a stand-alone application, although the application will either have to be pre-configured with the necessary vocabularies, or will have to occasionally update its locally cached acceptable vocabularies with the most recent versions from the web (e.g., from the Semantic Mediator). But a second approach is available for software that always has web access, for example a web service: ask the Semantic Mediator to perform the desired transformation. See below for further information on this alternative.

In addition to user expectations for the application's behavior, the best approach may also reflect whether the application is being developed from scratch, or already exists. An existing application may embed limitations that prevent certain approaches, for example accessing information via the internet, or working with local copies of large controlled vocabularies. In these situations it may be easiest to change the default controlled vocabulary, or augment the existing set of options so that a simple mapping can be created from the original terms to the community vocabulary.

To Incorporate a Formal Vocabulary In An Application

To make an application use an existing formal vocabulary, the Solution Developer must have a way to select the vocabulary's terms. Locally, this can be done by reading in the desired vocabulary(ies) from local files. The files are then parsed for the desired information.

In a web-enabled environment, a vocabulary's terms can be obtained in real time from the internet. The ability to obtain a vocabulary's terms will be provided (by the Semantic Mediator) via a URL in an HTTP protocol, or with another form of web service request, that can 'select all terms for a specified vocabulary'. Metadata can optionally be requested. The returned information can then be parsed for the desired information, and the form or other interface initialized with the parsed data.
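Such a request might look like the following sketch; the URL pattern, parameters, and JSON layout are invented, since the Semantic Mediator's actual interface is not specified here.

    import requests

    def fetch_vocabulary(vocab_id: str, with_metadata: bool = False) -> list:
        resp = requests.get(
            f"http://mediator.example.org/vocabularies/{vocab_id}/terms",
            params={"metadata": str(with_metadata).lower()},
            timeout=10,
        )
        resp.raise_for_status()
        return resp.json()["terms"]   # e.g., [{"uri": ..., "label": ...}, ...]

    # terms = fetch_vocabulary("core-parameters")
    # ...then initialize the drop-down or auto-complete control with 'terms'.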

Obviously, the local input, or the HTTP or service call, and the subsequent form initialization can be performed at any appropriate time, say at application startup or upon a user command.

To Implement Semantics Using a Semantic Mediator

An on-line variation of these three options can be performed using the Semantic Mediator. The process of incorporating a formal vocabulary, as required for Option 1, is described in the previous subsection.

In option 2 (searching for term matches), Sol can configure the application or service to submit this search query to the Semantic Mediator. The Semantic Mediator can support more sophisticated matching algorithms, and search across a wider variety of terms, because of its role as a community service provider. In response to the web service invocation, the Semantic Mediator returns the requested information, which is then served to the user in the same way.

In option 3 (explicitly mapped terms), both local and acceptable vocabularies must be registered as Controlled Vocabularies in the Semantic Mediator. The query from Sol's application or service specifies the source term, and a list of acceptable Controlled Vocabularies for the destination term. The Semantic Mediator can search its entire list of inferred relationships for the best match(es), or for all matches, according to the query. The 'best match' solution is one that provides the most transparent semantic mediation experience to the user.
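An option 3 query to the Semantic Mediator might be sketched as follows, with a source term and a list of acceptable destination vocabularies; the endpoint, parameters, and response shape are assumptions.

    import requests

    resp = requests.get(
        "http://mediator.example.org/map",            # hypothetical endpoint
        params={
            "term": "http://vocab.example.org/poe/water_temp",
            "targets": "core-parameters,cf-standard-names",
            "mode": "best",                           # or "all" for every match
        },
        timeout=10,
    )
    resp.raise_for_status()
    best_match = resp.json()["matches"][0]            # assumed response shape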

Recommendation for Solution Developers

Implementation of option 3 using a Semantic Mediator provides the strongest, most current, and most feature-rich semantic mediation capability for an application. If internet access is reliably available to Sol's application or service, and the local vocabulary's terms have been explicitly mapped to any other reasonably interoperable Controlled Vocabulary, option 3 will provide excellent results to the user.

The Semantic Mediator can also work with the Solution Developer's application to provide access to multiple approaches, using direct and inferred mappings if those are viable, searches across the repository if mappings do not produce an acceptable solution, and finally insisting on a community term selection from one of the approved vocabularies. This is the most robust interoperable solution that is possible.

Information Architect Develops Community Vocabulary

(Focus on the operations key to vocabulary development: searching for related terms, creating a framework for the names, creating the names and definitions, and doing all this in the context of publishing the results for open review and update.)

Multiple Semantic Mediators Serve Same Ontology

Is it even possible? (Probably, but it may not be advisable; still, one could imagine that different services could provide different levels of administrative metadata for terms from a particular provider.)

What happens to URIs if an ontology provider (the primary owner responsible for creating the ontology) changes? (This may be resolvable by deprecating Terms and Controlled Vocabularies in the previous ontology, in favor of those in the new ontology; a particular relationship, 'is redirected to' (or 'is now served by' or 'is replaced by'), could be used to 'forward' any 'most recent version' URIs to the new Term or Controlled Vocabulary. The existing terms would continue to be served by the previous provider, or could likewise be forwarded if the new provider took them on.)

How are multiple URIs for the same term resolved as the same entity? (A relationship like 'is re-served from' or 'is a copy of' could point to the term as served by the originator of that term; users would then see that this is (a) not from the original owner, and presumably therefore not authoritative, and (b) not a unique URI for the term.)