Permanent Identifiers for the Web

Web applications that deal with data on the web often need to specify and use URLs that are very stable. They utilize services such as purl.org to ensure that applications using their URLs will always be re-directed to a working website. These “permanent URL” redirection services operate much like a switchboard, connecting requests for information with the true location of the information on the Web. These switchboards can be reconfigured to point to a new location if the old location stops working.

How Does it Work?

If the concept sounds a bit vague, perhaps an example will help. A web author could use the following link (https://w3id.org/payswarm/v1) to refer to an important document. That link is hosted on a permanent identifier service. When a Web browser attempts to retrieve that link, it will be re-directed to the true location of the document on the Web. Currently, that location is https://payswarm.com/contexts/payswarm-v1.jsonld. If the location of the payswarm-v1.jsonld document changes at any point in the future, the only thing that needs to be updated is the re-direction entry on w3id.org. That is, all Web applications that use the https://w3id.org/payswarm/v1 URL will be transparently re-directed to the new location of the document and will continue to “Just Work™”.
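The switchboard idea can be sketched in a few lines of Python. This is only an illustration, not how w3id.org is implemented: the in-memory table and the function names `resolve` and `move_document` are made up for the example, and the `example.org` address stands in for some future location.

```python
# Hypothetical in-memory "switchboard": permanent path -> current location.
# The real service uses web server redirect rules, not a Python dict.
REDIRECTS = {
    "/payswarm/v1": "https://payswarm.com/contexts/payswarm-v1.jsonld",
}

BASE = "https://w3id.org"


def resolve(permanent_url, table=REDIRECTS):
    """Return the current target for a permanent URL, or None if unknown."""
    if not permanent_url.startswith(BASE):
        return None
    path = permanent_url[len(BASE):]
    return table.get(path)


def move_document(path, new_location, table=REDIRECTS):
    """Repoint an existing permanent path; the permanent URL never changes."""
    if path not in table:
        raise KeyError(f"no redirect registered for {path}")
    table[path] = new_location


if __name__ == "__main__":
    print(resolve("https://w3id.org/payswarm/v1"))
    # The document moves (made-up new address); only the table entry changes.
    move_document("/payswarm/v1", "https://example.org/payswarm-v1.jsonld")
    print(resolve("https://w3id.org/payswarm/v1"))
```

Applications keep using the same `https://w3id.org/payswarm/v1` URL before and after the move, which is the whole point of the indirection.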

w3id.org Launches

Permanent identifiers on the Web are an important thing to support, but until today there was no organization that would back a service for the Web to keep these sorts of permanent identifiers operating over the course of multiple decades. A number of us saw that this is a real problem and so we launched w3id.org, which is a permanent identifier service for the Web. The purpose of w3id.org is to provide a secure, permanent URL re-direction service for Web applications. This service will be run and operated by the W3C Permanent Identifier Community Group.

Specifically, the following organizations have pledged responsibility to ensure the operation of this service for the decades to come: Digital Bazaar, 3 Round Stones, OpenLink Software, Applied Testing and Technology, and Openspring. Many more organizations will join in time.

These organizations are responsible for all administrative tasks associated with operating the service. The social contract between these organizations gives each of them full access to all information required to maintain and operate the website. The agreement is set up such that a number of these companies could fail, lose interest, or become unavailable for long periods of time without negatively affecting the operation of the site.

Why Not purl.org?

While many web authors and data publishers currently use purl.org, there are a number of issues or concerns that we have about the website:

  1. The site was designed for the library community and was never intended to be used by the general Web.
  2. Requests for information or changes to the service frequently go unanswered.
  3. The site does not support HTTPS connections, which means it cannot be used to serve documents for security-sensitive industries such as medicine and finance. Requests to migrate the site to HTTPS have gone unanswered.
  4. There is no published backup or fail-over plan for the website.
  5. The site is run by a single organization, with a single part-time administrator, on a single machine. It suffers from multiple single points of failure.

w3id.org Features

The launch of the w3id.org website mitigates all of the issues outlined above with purl.org:

  1. The site is specifically designed for web developers, authors, and data publishers on the general Web. It is not tailored for any specific community.
  2. Requests for information can be sent to a public mailing list that contains multiple administrators that are accountable for answering questions publicly. All administrators have been actively involved in world standards for many years and know how to run a service at this scale.
  3. The site supports HTTPS security, which means it can be used to securely serve data for industries such as medicine and finance.
  4. Multiple organizations, with multiple administrators per organization, have full access to administer all aspects of the site and recover it from any potential failure. All important site data is in version control and is mirrored across the world on a regular basis.
  5. The site is run by a consortium of organizations that have each pledged to maintain the site for as long as possible. If a member organization fails, a new one will be found to replace the failing organization while the rest of the members ensure the smooth operation of the site.

All identifiers associated with the w3id.org website are intended to be around for as long as the Web is around. This means decades, if not centuries. If the final destination for a popular identifier used by this service fails in such a way as to be a major inconvenience or danger to the Web, the community will mirror the information for the popular identifier and set up a working redirect to restore service to the rest of the Web.

Adding a Permanent Identifier

Anyone with a GitHub account and knowledge of simple Apache redirect rules can add a permanent identifier to w3id.org by performing the following steps:

  1. Fork w3id.org on GitHub.
  2. Add a new redirect entry and commit your changes.
  3. Submit a pull request for your changes.
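The redirect entry in step 2 is an Apache redirect rule. As a rough sketch only, assuming the entry is a mod_rewrite `RewriteRule` placed in a per-directory `.htaccess` file (the repository layout is not described in this post, so treat the file placement as an assumption), a small Python helper could generate such a line. The helper name `htaccess_rule` is hypothetical.

```python
import re

# HTTP status codes that Apache's R flag can reasonably be given here.
REDIRECT_CODES = (301, 302, 303, 307, 308)


def htaccess_rule(local_path, target, status=302):
    """Build one mod_rewrite rule that redirects local_path to target.

    local_path is restricted to plain letters, digits, '_' and '-' so it
    can be dropped into the pattern without any regex escaping.
    """
    if not re.fullmatch(r"[A-Za-z0-9_-]+", local_path):
        raise ValueError("local_path must be a simple path segment")
    if status not in REDIRECT_CODES:
        raise ValueError("status must be an HTTP redirect code")
    return f"RewriteRule ^{local_path}$ {target} [R={status},L]"


if __name__ == "__main__":
    # Assumed placement: a payswarm/.htaccess file with "RewriteEngine on".
    print(htaccess_rule("v1", "https://payswarm.com/contexts/payswarm-v1.jsonld"))
```

The generated line would be committed to your fork and proposed through the pull request in step 3.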

If you wish to engage the community in discussion about this service for your Web application, please send an e-mail to the public-perma-id@w3.org mailing list. If you are interested in helping to maintain this service for the Web, please join the W3C Permanent Identifier Community Group.


Note: The letters ‘w3’ in the w3id.org domain name stand for “World Wide Web”. Other than hosting the software for the Permanent Identifier Community Group, the “World Wide Web Consortium” (W3C) is not involved in the support or management of w3id.org in any way.

10 Comments


  1. What a great service! As you aptly enumerate, a better version of the service that purl.org was never really designed to fully support.

    A quick implementation note on redirect entries for w3id.org. In your example for payswarm.com you’ve set up a 302 redirect.

    That is probably appropriate for this document. However, users of any redirect service should be aware that all major search engines interpret a 302 as a temporary redirect. The net result of this is that Google et al. will sometimes show the source rather than the destination page at that URL in search results.

    Less equivocally for those that need to be concerned about such things is that link equity is never passed via a 302 but only a 301. That is, one of the primary signals used by search engines in the ranking of resources is the quality, quantity and relevance of the links pointing to a resource; the ability of a URL (and, to a certain degree – by extension, of the domain on which it resides) to rank well for relevant keywords is impeded if important links point to that resource only by employing 302 redirects. This is not the case with a 301 redirect, which passes link equity to the target directly.

    Again, your example is an apt use of a 302. And maybe 301s simply don’t factor into the world of web applications using linked data that w3id.org seems chiefly designed to support. But developers availing themselves of w3id.org redirection should be aware of the implications of their choice of redirect type, especially as 302 is the default choice for most developers (having been conditioned to that because it’s the default choice of so many programs).

    Nope, I won’t even raise the topic of three-oh-threes. :)
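For readers who want to check which kind of redirect a rule produces, the distinction discussed in this comment can be written down as a small Python sketch. The function name `redirect_kind` and the three labels are made up for the example; the status codes come from Python's standard `http.HTTPStatus` enum.

```python
from http import HTTPStatus

# 301 and 308 tell clients that the move is permanent.
PERMANENT = {HTTPStatus.MOVED_PERMANENTLY, HTTPStatus.PERMANENT_REDIRECT}
# 302 and 307 tell clients that the move is temporary.
TEMPORARY = {HTTPStatus.FOUND, HTTPStatus.TEMPORARY_REDIRECT}


def redirect_kind(status):
    """Classify an HTTP status code by the kind of redirect it signals."""
    code = HTTPStatus(status)  # raises ValueError for unknown codes
    if code in PERMANENT:
        return "permanent"
    if code in TEMPORARY:
        return "temporary"
    if code is HTTPStatus.SEE_OTHER:  # the "three-oh-three"
        return "see-other"
    return "not-a-redirect"


if __name__ == "__main__":
    for status in (301, 302, 303, 200):
        print(status, redirect_kind(status))
```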

    • ManuSporny says: (Author)

      Thanks Aaron, excellent points. One of the other beauties of the service is that authors/developers have the choice of the particular type of re-direct they want to use, and can change it at any time.

  2. In order to be able to apply the Memento protocol to access old versions of targets of redirection, it would be nice to archive the redirection history for these permanent URIs and to support requests issued against them that have Accept-Datetime values. Responses to such requests replay the temporally appropriate archived redirect. The target of that archived redirection may reside in a web archive. That way, thanks to Memento, not only the current, but also prior versions of the target of the redirection become available via the permanent URI. I am most willing to explain this in more detail. Permanent, for me, is not only about the present and the future but also about the past.
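The replay idea in this comment can be sketched as a lookup over an archived redirect history. This is a minimal illustration, not a Memento implementation: the history entries, the `example.org` targets, and the function name `target_at` are all hypothetical, and the sketch assumes the Accept-Datetime value is an HTTP date ending in GMT.

```python
from bisect import bisect_right
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Hypothetical redirect history for one permanent URI, sorted by the time
# each target took effect.
HISTORY = [
    (datetime(2013, 1, 1, tzinfo=timezone.utc), "https://old.example.org/v1.jsonld"),
    (datetime(2015, 6, 1, tzinfo=timezone.utc), "https://new.example.org/v1.jsonld"),
]


def target_at(accept_datetime, history=HISTORY):
    """Return the redirect target that was in effect at the requested time.

    accept_datetime is an HTTP date string such as
    "Tue, 11 Mar 2014 00:00:00 GMT". Returns None if the requested time
    is earlier than the first archived redirect.
    """
    when = parsedate_to_datetime(accept_datetime)
    starts = [start for start, _ in history]
    index = bisect_right(starts, when) - 1  # last entry at or before `when`
    if index < 0:
        return None
    return history[index][1]


if __name__ == "__main__":
    print(target_at("Tue, 11 Mar 2014 00:00:00 GMT"))
    print(target_at("Fri, 01 Jan 2016 00:00:00 GMT"))
```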

    • ManuSporny says: (Author)

      Very interesting. We’d certainly be willing to try and add that sort of support to the website. Our primary goal is to ensure that the site is easy to use. So, whatever software is used on the site must be able to be modified easily, mirrored by anyone, and easy to use.

      I will be the first to admit that doing a GitHub fork, edit, and pull request isn’t very friendly. We’d love to have some custom software running on the website, but have not had the opportunity or bandwidth to write it yet. If you are interested in writing software to make re-direct management easier, we’re very interested in figuring out a way to make that happen.

  3. uo says:

    I have two questions/suggestions:

    1. How do you identify/authenticate the original redirect owner if the target URL needs to change? Worst case: the former domain is gone, so you can’t upload something to authenticate you. Do you look up the GitHub username from the first request? Well, I guess GitHub could be gone in a few years. Maybe require registration via mailing list, to compare the mail address? (in the hope that users don’t lose access to their mail addresses, which would be … utopian ;) )

    2. What is the policy for URL paths at w3id.org? Do all redirects go under /people/? What if there are cases where it can’t be decided if the target goes to a person or a company or any site etc.? What if the target changes from a personal FOAF file to a company site? And which are the “names” that can be chosen? Only a-z 0-1? Any punctuation? Umlauts? Other Unicode chars? Length? Minimum? Maximum?

    • ManuSporny says: (Author)

      How do you identify/authenticate the original redirect owner if the target URL needs to change?

      Eventually we will have software to deal with this, but for the moment, the admins will check who requested the initial addition of the redirect and then check to make sure that it’s the same person. If it’s not the same person, then they need to send a public e-mail to the public-perma-id@w3.org mailing list to request the change and explain why they’re doing it on behalf of someone else. We would also check to make sure that the target of the re-direct seems to match up with the original purpose of the re-direct, and if it doesn’t, we’d ask the requestor to explain why the change is being made. There are no hard and fast rules, as the sorts of things you need to check change based on the type of information you’re re-directing.

      If we make a mistake, we’ll know who requested the change and which admin pushed the change through. So, at least there is accountability in the current system.

      In the future, we hope to have portions of the URL space, like the one for people, automated so that people can manage their own re-directs.

      What is the policy for URL paths at w3id.org? Do all redirects go under /people/?

      Re-directs can go wherever the requestor wants to put them. There are no hard policies for URL paths. People request the path, and the admins add the path. There may be some discussion required for paths that could be in violation of some sort of intellectual property law (like claiming the URL for pepsi, or coca-cola). Any punctuation that is allowed in a URL is allowed in the re-direct. The length can be anything the requestor asks for, within reason. Having a permanent URL that is 65,535 characters long doesn’t make a whole lot of sense and will probably be rejected by the admin team. That said, if there is a good reason to have a path that long, then it will be created.

      The bottom line is that we want this service to be useful to people. So, if adding the redirect results in a net positive for the requestor and the Web at large, we will add it. If it doesn’t, then we won’t. We hope to run the site using general guiding principles and common sense rather than a strict set of rules. Does that make sense?

  4. Gregor Hagedorn says:

    I have serious doubts about this approach for two reasons:

    a) I believe we are reaching an age of webometrics. Institutions will be evaluated on how often their domains are being cited (i.e. linked in html, RDF/LOD, etc.). Just like scientists are evaluated for their work being cited. It is therefore in the interest of institutions to publish their objects and concepts under their own domains.

    b) you write “the only thing that needs to be updated is the re-direction entry on w3id.org”. Well. With the current system, the only thing that needs to be updated is the local re-direction entry of the local domain. Redirection IS a standard feature of every webserver. Why does it not happen? I believe: Lack of attention and resources. But then: why is it more likely that an institution, on modifying the URL of a resource, will inform w3id.org and w3id.org will have the resources to timely update all the redirections for the entire world? Since when is a bottleneck, both in technical access (all access would go over the server pool provided by w3id.org) and curational resources (communicating and enacting redirections) more effective than a central bottleneck?

    You correctly identify problems with purl.org to be unable to cope, but I find the argument why your service, with perhaps 5 admins will not very soon be unable to scale to the required level of handling all concept traffic on the web.

    I believe an awareness drive that ensures that all organisations take up the responsibility to manage their own identifiers with long-term perspective would be different. It would localize failure and distribute work to those parts each organisation is interested in. We need to educate the executive officers of these institutions how important this is.

    • ManuSporny says: (Author)

      Institutions will be evaluated how often their domains are being cited … It is therefore in the interest of institutions to publish their objects and concepts under their own domains.

      This happened around 1999 with Google’s Page Rank algorithm. Your future is already here. :)

      You are also making the false assumption that long-lived institutions are the only entities that will be producing permanent identifiers. In practice, there are many more individuals (programmers, authors, scientists) and short-lived organizations (standards initiatives) that need to create permanent identifiers, but don’t have the necessary infrastructure to ensure that those identifiers last for decades.

      Redirection IS a standard feature of every webserver. Why does it not happen? I believe: Lack of attention and resources.

      If lack of attention and resources is the reason, and this lack of attention and resources hasn’t changed in the past decade, why do you think it’s going to change now? I think the bottom line is that many organizations aren’t set up to provide this sort of redirection service for precisely the reasons you outline. For many organizations, it’s not a part of their primary business or motivation.

      I find the argument why your service, with perhaps 5 admins will not very soon be unable to scale to the required level of handling all concept traffic on the web.

      You are making two false assumptions here: 1) that it’s expensive to run a service like w3id.org (it isn’t), and 2) that the parts that need to scale and have a human in the loop now won’t eventually be replaced by software (they will).

      You also seem to be assuming that the w3id.org service is fixed in how it operates and the services it provides. It isn’t. Some of the best data scientists in the world are also the administrators of the site. If there are problems with scaling, we’re capable of writing software to address those scaling issues.

      We need to educate the executive officers of these institutions how important this is.

      Our practical deployment experience shows us that it’s typically a bad idea to assume that something like a data vocabulary will always exist at a particular domain. There are many factors that conspire against URLs of this nature (funding, politics, apathy, etc.). That domain would also need to implement what w3id.org has already implemented in order to provide a service as stable as we do. Doing so is well outside the operating charter of many organizations.

      In reality, the future will be a mix of what you want and what w3id.org provides. Some organizations will host their own long-term identifiers. Some people and groups will choose to use a re-direct service to ensure that if they need to change domains for their data, they can do so easily.

  5. First of all, great work! I like this initiative. I must admit I was skeptical at first, but it has grown on me.

    Compared to URL shorteners and permalink providers like purl.org, the w3id.org approach is interesting in that it has gone for a full-out encrypted HTTPS solution, with https://w3id.org/whateveryoulike being the norm. While I am happy that there’s a self-redirect for the poor souls forgetting that crucial S, using https in namespaces and ontologies has so far not been very common.

    The reasons for this avoidance could be many, one is that the vocabulary typically is public anyway, another that https typically brings in issues of certificate management, both on the server (w3id.org certificate expires next February 2013) and software clients like OWL libraries, who often are not so well-trained in root certificates and SSL (e.g. wget in Ubuntu 10.04 does not manage even with --no-check-certificate). HTTPS connections can also be slow to set up, as it is a multi-step process, thus https redirects will consistently be slower than equivalent redirections over http.

    Have you done some kind of survey or testing of which software is still happy with the https:// URIs without requiring additional magic or options, and what might blow up? I presume the situation is much improved compared to 2010 (when my outdated wget was made).

    So far the w3id.org redirections are publicly accessible on GitHub, which I think is a great thing for preservability and provenance (Even if half the internet stopped working, I could just embed the cloned repository and a host file in a virtual machine running Apache).

    In your reasoning above, you mention “The site supports HTTPS security, which means it can be used to securely serve data for industries such as medicine and finance.” Could you detail how this would be secure? I presume you are thinking of for instance https://w3id.org/companies/{stockticker} redirecting to https://www.google.co.uk/finance?client=ob&q=NASDAQ:{stockticker} (or a RESTful equivalent). The redirection rule itself is not particularly revealing, and still I can require authentication on the other side. However the benefit here is that someone looking up https://w3id.org/companies/GOOG every 3rd minute would not be publishing to everyone who happens to sit nearby with a wifi sniffer that they are considering buying more GOOG stock. They still have to trust that you, the w3id admins, are not misleading them by redirecting to a different service.

    The w3id.org admins would however have powers to find out who is requesting what. The server might be compromised or leaking logs, and additionally malicious change requests could be raised. So I think it would be important to state exactly what you mean by “secure” and to exclude the more ‘super’ kind of secure, such as having three-level authentication for managing the server, show up with a passport and your fingerprint to add a redirection rule, etc. :)

    • ManuSporny says: (Author)

      First of all, great work! I like this initiative. I must admit I was skeptical at first, but it has grown on me.

      Great, good to hear that it’s grown on you. :)

      Compared to URL shorteners and permalink providers like purl.org, the w3id.org approach is interesting in that it has gone for a full-out encrypted HTTPS solution, with https://w3id.org/whateveryoulike being the norm. While I am happy that there’s a self-redirect for the poor souls forgetting that crucial S, using https in namespaces and ontologies has so far not been very common.

      No, and neither has using https for everything. Although, I believe now we’re seeing a movement towards https for everything. Why not move our permanent URLs over to such an infrastructure as well?

      Have you done some kind of survey or testing of which software is still happy with the https:// URIs without requiring additional magic or options, and what might blow up? I presume the situation is much improved compared to 2010 (when my outdated wget was made).

      Nope, but if a Web Agent is not capable of following an HTTPS re-direct, then it’s a broken piece of software. This is 2013, and I think we should expect our software to be able to do HTTPS.

      In your reasoning above, you mention “The site supports HTTPS security, which means it can be used to securely serve data for industries such as medicine and finance.” Could you detail how this would be secure? I presume you are thinking of for instance https://w3id.org/companies/{stockticker} redirecting to https://www.google.co.uk/finance?client=ob&q=NASDAQ:{stockticker} (or a RESTful equivalent). The redirection rule itself is not particularly revealing, and still I can require authentication on the other side. However the benefit here is that someone looking up https://w3id.org/companies/GOOG every 3rd minute would not be publishing to everyone who happens to sit nearby with a wifi sniffer that they are considering buying more GOOG stock. They still have to trust that you, the w3id admins, are not misleading them by redirecting to a different service.

      No, someone with a WiFi sniffer would never know which URL you’re connecting to at w3id.org; the request is encrypted, that’s sort of the whole point of HTTPS. :)

      By secure, I mean that people can’t man-in-the-middle attack the data received from w3id.org. That is, an attacker can’t trick you into thinking that a vocabulary document resides on spoofedServerA instead of realServerA. Imagine what would happen if you could switch the meaning of ‘source’ and ‘destination’ in a financial transaction. These are the sorts of attacks that running over HTTPS can help mitigate.
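On the client side, this protection depends on the client actually verifying the server's certificate. As a minimal sketch using only Python's standard `ssl` module (the function names `verifying_context` and `describe` are made up for this example), a client can start from the default context, which requires a valid certificate and checks that it matches the host name:

```python
import ssl


def verifying_context():
    """Return a TLS client context that rejects unverified servers."""
    # create_default_context() enables certificate verification and
    # host name checking for client connections.
    return ssl.create_default_context()


def describe(context):
    """Summarize the two settings that guard against impersonation."""
    return {
        "certificate_required": context.verify_mode == ssl.CERT_REQUIRED,
        "hostname_checked": context.check_hostname,
    }


if __name__ == "__main__":
    print(describe(verifying_context()))
```

A client that disables either setting gives up the guarantee described above, since a spoofed server could then answer in place of the real one.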

      The w3id.org admins would however have powers to find out who is requesting what. The server might be compromised or leaking logs, and additionally malicious change requests could be raised. So I think it would be important to state exactly what you mean by “secure” and to exclude the more ‘super’ kind of secure, such as having three-level authentication for managing the server, show up with a passport and your fingerprint to add a redirection rule, etc.

      We’re in the early days of the service. We may require this sort of access in the future, especially for mission-critical URLs. Then again, if something is so mission critical that you need a passport and a fingerprint to change it, it probably shouldn’t be using a URL re-direction service in the first place.

      As for trusting the admins for w3id.org – yes, you have to trust them. You have to trust the server software and the administration team of any website that you choose to use for a mission critical task. We are trying to show people that we’re trustworthy by operating completely in the open with a public history of what all administrators do to the website. If people don’t trust w3id.org after doing all of that, then they should set up their own permanent identifier service for the Web. That’s the great thing about the Web, this doesn’t have to be the only permanent URL service on the Web. :)
