Next Gen Web: Identifier Reality

Resource identifiers on the World Wide Web. By now we've all had the chance to study up on these identifiers, play with them, proclaim our favorite types, even build something with them. Universal, unique, uniform, resource-y, resolvable, digital, persistent, discoverable, cool—what's not to like about identifiers? Especially because they are so … purposeful.

The reality is messier. (Of course.) As MMI anticipated when it chose semantic identifiers, way back when, no single type of identifier is a magic bullet that guarantees all those nice features. An identifier type may offer a good start, but it's the systems, organizations, and people behind the identifier that ensure its success.

Disturbing and other stories

I suppose we shouldn't be too hard on lay computer users, but really, shouldn't the U.S. Supreme Court care more about web references? That the Court would cite About.com (really?) to support an opinion is curious enough, but it shouldn't be hard for a legal scholar to realize that URLs may become, well, unlocatable.

On the technical side of the equation, there are all of us experts, wrestling with the rest of the world. CrossRef asks, "DOIs unambiguously and persistently identify published, trustworthy, citable online scholarly literature. Right?" You might not be surprised to learn that in practice, DOIs are not always as great as all that. Yet this was a source of wonder to the CrossRef author, whose degree of surprise was, well, a surprise.

Thinking this stuff through isn't rocket science, and simple use cases can help us avoid surprise in our own systems. Let's take some examples.

Counter-claim use cases

To see how to analyze your own problem, let's run through a few assumptions made by promoters of identifier types, using some simple use cases and questions.

  • We can control the use (and value) of our identifiers: The web is the web, one step short of anarchy. If someone wants to destroy the usefulness of your identifier, they can do it, simply by using it extensively and inappropriately, perhaps with 10 million of their closest friends. Twitter hashtag collisions are the folksonomy variant of identifier squatting: You may not be able to control what appears at http://noaa.gov, but you could use that string as an identifier for lima beans any time you want.
  • URNs are persistent, location-independent identifiers for web resources: An oldie (URNs get less emphasis as an identifier technology these days), this claim of location independence is literally true in the sense that a URN is not tied to any HTTP host name. But take the example urn:ietf:rfc:2648. What will happen if someday the IETF changes its name? An awful lot of identifiers will suddenly have an "old" namespace, that's what.
  • PURLs/DOIs/[your identifier type here] are more persistent than URLs: A favorite canard to 'prove' this is to look at how many old URLs are resolvable today (hint: it's a miserable percentage). But why is that? Those owners didn't care about persistence, or they could have provided it (see Tim Berners-Lee's cool 1998 manifesto Cool URIs Don't Change). Every system that 'makes identifiers persistent' has to (a) build a persistent repository for all these identifiers, and (b) make it possible to look up the 'more persistent identifier' and resolve both it and any original web page that the identifier was created for. So:
    • You're still usually depending on URLs for ultimate resolution of information, and now there's a new identifier system that has to be persistent too. Mathematically, any single lookup through that chain can be no more reliable than its weakest link, because it succeeds only if both systems work (it inherits the failure modes of both).
    • You inevitably lose some resolvability, unless browsers handle your identifier type.
    I give the DOI community lots of credit here, for building huge brand awareness and market, creating a useful supporting infrastructure, and working on the obviously unresolved (sic) issues. It's a big job. Just don't assume all that work on DOI systems automatically makes them better for your purpose.
  • URLs are poor choices for identifiers because they imply resolvability: Well, first of all, if you believe that all important identifiers should resolve to something, and implement your system to resolve the URLs it creates, the implication of resolvability is a good thing. But returning to our earlier point, URLs aren't required to be resolvable, and the fact that many aren't resolvable somewhat neutralizes this argument.
  • UUIDs or [your identifier type] are always unique: By design, yes, they are. But if the UUID generator is broken or misused, or the company's processes are, you may find the same UUID on two of the widgets ACME company sent out. (I've seen it happen, with serial numbers anyway. The stories I could tell…)
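
To put rough numbers on the "less reliable" point in the PURLs/DOIs item above, here is a minimal sketch. It assumes the links in a resolution chain fail independently and that a lookup succeeds only if every link works. The function name and the availability figures are made up for illustration, not measurements of any real resolver.

```python
def chain_availability(*availabilities: float) -> float:
    """Chance that every link in a resolution chain works.

    Assumes independent failures, so the individual chances multiply.
    """
    result = 1.0
    for a in availabilities:
        if not 0.0 <= a <= 1.0:
            raise ValueError(f"availability must be between 0 and 1, got {a}")
        result *= a
    return result


# Illustrative assumptions only: the target URL works 95% of the time,
# and the "more persistent" resolver in front of it works 99% of the time.
url_alone = 0.95
resolver = 0.99

via_resolver = chain_availability(resolver, url_alone)

print(f"URL alone:        {url_alone:.2%}")
print(f"Resolver + URL:   {via_resolver:.2%}")
```

Under these assumptions the chained lookup can never beat the URL on its own. Whatever a resolver system gains you has to come from how it is run (for example, someone updating the target when a page moves), not from the identifier type.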

Enough examples. What's a poor system designer to do?

It's about you

If you're trying to evaluate what type of identifier to use for your project, the situation looks messy, and in some ways it is—especially because you can find a way to break some aspect(s) of any chosen approach. But we can simplify this first decision with a few rough statements:

  1. If you're publishing formal academic documents and you just need IDs for them, DOIs are likely the best fit.
  2. If you're publishing something else (say, data sets) and you want citable IDs, DOIs can work. But if you're publishing lots of data sets, or want the simplest possible resolution of the ID in a browser, or have NO money, a URL is entirely viable.
  3. If you're publishing anything else that you want a resolvable web identifier for, URLs can be set up to meet all your goals, with some up-front design. If you're smart enough to manage identifiers in a software system, you should be smart enough to think through some good basic practices that meet your needs.
  4. If you're creating something physical and have a computer-clueless clientele, choose your identifier to match. A UUID is impressively long, as guaranteed to be unique as anything else you can create, and can be looked up on the web in a pinch, if you've published them all where search engines can find them (read: on your public web site). Awkward to type in, but your clients aren't using computers, are they?
  5. Or, choose any of the other identifiers that you want. Just decide on the important characteristics for your identifier (from the list at the top of the page), and make sure you've designed your system, in combination with the identifier, to provide those capabilities.
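
As one small example of designing the system, and not just the identifier, to provide a capability (here, uniqueness), the sketch below uses Python's standard uuid module. The helper find_duplicates and the "copied template" scenario are hypothetical; they only illustrate how a perfectly good UUID generator can be misused by the process around it.

```python
import uuid
from collections import Counter


def find_duplicates(ids):
    """Return the identifiers that appear more than once, sorted."""
    counts = Counter(ids)
    return sorted(i for i, n in counts.items() if n > 1)


# Intended use: ask the generator for a fresh random UUID per widget.
good_batch = [str(uuid.uuid4()) for _ in range(1000)]

# Misuse (hypothetical process failure): one UUID is generated once,
# then copied onto every widget in the batch.
template_id = str(uuid.uuid4())
bad_batch = [template_id for _ in range(3)]

print("duplicates in good batch:", len(find_duplicates(good_batch)))
print("duplicates in bad batch: ", find_duplicates(bad_batch))
```

A check like this, run before the widgets ship, is cheap. The identifier type promises uniqueness only if the generator is actually called the way it was meant to be.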

Conclusion

There. It's just that easy to pick a Resource Identifier type.

Look, the identifier type you choose may not be perfect. But looking at it now, 10 years down the road from the original arguments—about life science identifiers (LSIDs), URNs, semantic identifiers, use of versioning in identifiers, or DOIs—I can tell you that the fundamental arguments (and arguers) haven't changed that much. If you don't design yourself into a corner, you'll have a migration path at least, and with luck a viable system, for a long time to come.

Addenda

  1. I haven't kept up, but I think cool URIs may really be IRIs now, for Internationalized Resource Identifiers. I don't think you have to know that though.
  2. One of the big arguments in the semantic/ontology community is about whether the identifier changes when the name changes (e.g., when we all adopt British spelling), or when the definition changes, or both, or neither, as the language evolves. I am firmly convinced that there are use cases supporting all sides. (There are a LOT of big arguments in that community. I usually end up agreeing with both sides; the real world is just not perfectly organized or persistent.) So in MMI's repository, you can take several approaches: use an identifier based on the exact spelling of a term, or on a specific version of that spelling, or even use a scrambled code like 3FE926FT instead of the name of the term.
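
The three approaches in that last addendum can be sketched roughly as follows. This is not MMI's actual URL scheme or code: the example.org namespace, the function names, and the version label format are all assumptions made up for illustration.

```python
import secrets
import string
from urllib.parse import quote

# Hypothetical namespace; a real repository would use its own base URL.
BASE = "http://example.org/vocab"


def term_based_id(term: str) -> str:
    """Identifier built from the exact spelling of the term."""
    return f"{BASE}/{quote(term, safe='')}"


def versioned_id(term: str, version: str) -> str:
    """Identifier tied to one specific version of that spelling."""
    return f"{BASE}/{quote(version, safe='')}/{quote(term, safe='')}"


def opaque_id(length: int = 8) -> str:
    """Identifier built from a random code that says nothing about the term."""
    alphabet = string.ascii_uppercase + string.digits
    code = "".join(secrets.choice(alphabet) for _ in range(length))
    return f"{BASE}/{code}"


print(term_based_id("color"))             # changes if we all adopt "colour"
print(versioned_id("color", "20240101"))  # pins one version of the spelling
print(opaque_id())                        # survives any renaming of the term
```

The trade-off is the one argued over above: the term-based form is readable but breaks when the spelling changes, the versioned form is explicit about which spelling it means, and the opaque form never needs to change but tells a human nothing.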