Identifiers in JSON-LD and RDF

TL;DR: This blog post argues that the extension of blank node identifiers in JSON-LD and RDF for the purposes of identifying predicates and naming graphs is important. It is important because it simplifies the usage of both technologies for developers. The post also provides a less-optimal solution if the RDF Working Group does not allow blank node identifiers for predicates and graph names in RDF 1.1.

We need identifiers as humans to convey complex messages. Identifiers let us refer to a certain thing by naming it in a particular way. Not only do humans need identifiers, but our computers need identifiers to refer to data in order to perform computations. It is no exaggeration to say that our very civilization depends on identifiers to manage the complexity of our daily lives, so it is no surprise that people spend a great deal of time thinking about how to identify things. This is especially true when we talk about the people that are building the software infrastructure for the Web.

The Web has a very special identifier called the Uniform Resource Locator (URL). It is probably one of the best-known identifiers in the world, mostly because everybody who has been on the Web has used one. URLs are great identifiers because they are very specific. When I give you a URL to put into your Web browser, such as the link to this blog post, I can be assured that you will see what I see. URLs are globally scoped: they’re supposed to always take you to the same place.

There is another class of identifier on the Web that is not globally scoped and is only used within a document on the Web. In English, these identifiers are used when we refer to something as “that thing”, or “this widget”. We can really only use this sort of identifier within a particular context where the people participating in the conversation understand the context. Linguists call this concept deixis. “Thing” doesn’t always refer to the same subject, but based on the proper context, we can usually understand what is being identified. Our consciousness tags the “thing” that is being talked about with a tag of sorts and then refers to that thing using this pseudo-identifier. Most of this happens unconsciously (notice how your mind unconsciously tied the use of ‘this’ in this sentence to the correct concept?).

The take-away is that there are globally-scoped identifiers like URLs, and there are also locally-scoped identifiers that require a context in order to understand what they refer to.

JSON and JSON-LD

In JSON, developers typically express data like this:

{
  "name": "Joe"
}

Note how that JSON object doesn’t have an identifier associated with it. JSON-LD creates a straightforward way of giving that object an identifier:

{
  "@context": ...,
  "@id": "http://example.com/people/joe",
  "name": "Joe"
}

Both you and I can refer to that object using http://example.com/people/joe and be sure that we’re talking about the same thing. There are times when assigning a global identifier to every piece of data that we create is not desirable. For example, it doesn’t make much sense to assign an identifier to a transient message that is a request to get a sensor reading. This is especially true if there are millions of these types of requests and we never want to refer to a request once it has been transmitted. This is why JSON-LD doesn’t force developers to assign an identifier to the objects that they express. The people who created the technology understand that not everything needs a global identifier.

Computers are less forgiving: they need identifiers for almost everything, but a great deal of that complexity can be hidden from developers. When an identifier becomes necessary in order to perform computations upon the data, the computer can usually auto-generate an identifier for the data.

RDF, Graphs, and Blank Node Identifiers

The Resource Description Framework (RDF) primarily uses an identifier called the Internationalized Resource Identifier (IRI). Where URLs are limited to a subset of ASCII characters, and so can typically only express links in Western languages, an IRI can express links in almost every language in use today, including Japanese, Tamil, Russian and Mandarin. RDF also defines a special type of identifier called a blank node identifier. This identifier is auto-generated and is locally scoped to the document. It’s an advanced concept, but one that is pretty useful when you start dealing with transient data, where creating a global identifier goes beyond the intended usage of the data. An RDF-compatible program will step in and create blank node identifiers on your behalf, but only when necessary.
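
To make the auto-generation step concrete, here is a minimal Python sketch of how a program might hand out document-local labels to objects that have no "@id". The function name assign_local_ids and the _:b0, _:b1 label format are assumptions made for illustration; they are not defined by JSON-LD or RDF, and the sketch ignores special JSON-LD keywords such as "@context".

```python
from itertools import count


def assign_local_ids(node, counter=None):
    """Return a copy of a JSON-like tree in which every object without
    an "@id" receives a document-local label (hypothetical format)."""
    if counter is None:
        counter = count()
    if isinstance(node, dict):
        labeled = {}
        if "@id" not in node:
            # No identifier supplied by the developer, so generate one.
            labeled["@id"] = f"_:b{next(counter)}"
        for key, value in node.items():
            labeled[key] = assign_local_ids(value, counter)
        return labeled
    if isinstance(node, list):
        return [assign_local_ids(item, counter) for item in node]
    return node  # strings, numbers, and so on are left untouched


document = {"name": "Joe", "knows": {"name": "Susan"}}
labeled_document = assign_local_ids(document)
```

Objects that already carry an "@id" keep it. Only the unnamed ones get a label, and those labels mean nothing outside of the document they were generated for.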

Both JSON-LD and RDF have the concepts of a Statement, a Graph, and a Dataset. A Statement consists of a subject, a predicate, and an object (for example: “Dave likes cookies”). A Graph is a collection of Statements (for example: Graph A contains all the things that Dave said and Graph B contains all the things that Mary said). A Dataset is a collection of Graphs (for example: Dataset Z contains all of the things Dave and Mary said yesterday).
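
For readers who think in code, the three concepts can be sketched as plain Python data structures. This is only an illustration: the type names mirror the prose, the plain-string terms stand in for real identifiers, and the “Mary likes tea” statement is made up for the example.

```python
from typing import Dict, NamedTuple, Set


class Statement(NamedTuple):
    subject: str
    predicate: str
    object: str


Graph = Set[Statement]      # a Graph is a collection of Statements
Dataset = Dict[str, Graph]  # a Dataset is a collection of named Graphs

graph_a: Graph = {Statement("Dave", "likes", "cookies")}  # things Dave said
graph_b: Graph = {Statement("Mary", "likes", "tea")}      # things Mary said
dataset_z: Dataset = {"Graph A": graph_a, "Graph B": graph_b}

for name, graph in dataset_z.items():
    for statement in graph:
        print(name, statement.subject, statement.predicate, statement.object)
```

Note that the keys of dataset_z are the graph names. The rest of this post is largely about what is allowed to appear in that position, and in the predicate field of a Statement.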

In JSON-LD, at present, you can use a blank node identifier for subjects, predicates, objects, and graphs. In RDF, you can only use blank node identifiers for subjects and objects. There are people, such as myself, in the RDF WG that think this is a mistake. There are people that think it’s fine. There are people that think it’s the best compromise that can be made at the moment. There is a wide field of varying opinions strewn between the various extremes.

The end result is that the current state of affairs has put us into a position where we may have to remove blank node identifier support for predicates and graphs from JSON-LD, which comes across as a fairly arbitrary limitation to those not familiar with the inner guts of RDF. Don’t get me wrong, I feel it’s a fairly arbitrary limitation. There are those in the RDF WG that don’t think it is and that may prevent JSON-LD from being able to use what I believe is a very useful construct.

Document-local Identifiers for Predicates

Why do we need blank node identifiers for predicates in JSON-LD? Let’s go back to the first example in JSON to see why:

{
  "name": "Joe"
}

The JSON above is expressing the following Statement: “There exists a thing whose name is Joe.”

The subject is “thing” (aka: a blank node) which is legal in both JSON-LD and RDF. The predicate is “name”, which doesn’t map to an IRI. This is fine as far as the JSON-LD data model is concerned because “name”, which is local to the document, can be mapped to a blank node. RDF cannot model “name” because it has no way of stating that the predicate is local to the document since it doesn’t support blank nodes for predicates. Since the predicate doesn’t map to an IRI, it can’t be modeled in RDF. Finally, “Joe” is a string used to express the object and that works in both JSON-LD and RDF.

JSON-LD supports the use of blank nodes for predicates because there are some predicates, like every key used in JSON, that are local to the document. RDF does not support the use of blank nodes for predicates and therefore cannot properly model JSON.
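
As a rough illustration of the mapping described above, the following Python sketch turns a flat JSON object into Statements, using a blank node for the subject and a blank node for each key. The function name json_to_statements and the _:bnode0 default are hypothetical, and only flat objects with simple values are handled.

```python
import json


def json_to_statements(text, subject="_:bnode0"):
    """Map each key of a flat JSON object to a document-local predicate."""
    obj = json.loads(text)
    # Every key becomes a blank node predicate because plain JSON gives us
    # no IRI to map it to.
    return [(subject, f"_:{key}", value) for key, value in obj.items()]


statements = json_to_statements('{"name": "Joe"}')
print(statements)  # [('_:bnode0', '_:name', 'Joe')]
```

The resulting Statement fits the JSON-LD data model as it stands today. The _:name in the predicate position is exactly the part that RDF does not permit.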

Document-local Identifiers for Graphs

Why do we need blank node identifiers for graphs in JSON-LD? Let’s go back again to the first example in JSON:

{
  "name": "Joe"
}

The container of this statement is a Graph. Another way of writing this in JSON-LD is this:

{
  "@context": ...,
  "@graph": {
    "name": "Joe"
  }
}

However, what happens when you have two graphs in JSON-LD, and neither one of them is the RDF default graph?

{
  "@context": ...,
  "@graph": [
    {
      "@graph": {
        "name": "Joe"
      }
    }, 
    {
      "@graph": {
        "name": "Susan"
      }
    }
  ]
}

In JSON-LD, at present, it is assumed that a blank node identifier may be used to name each graph above. Unfortunately, in RDF, the only thing that can be used to name a graph is an IRI, and a blank node identifier is not an IRI. This puts JSON-LD in an awkward position. JSON-LD can:

  1. Require that developers name every graph with an IRI, which seems like a strange demand because developers don’t have to name all subjects and objects with an IRI, or
  2. Auto-generate a regular IRI for each predicate and graph name, which seems strange because blank node identifiers exist for this very purpose (not to mention this solution won’t work in all cases, more below), or
  3. Auto-generate a special IRI for each predicate and graph name, which would basically re-invent blank node identifiers.

The Problem

The problem surfaces when you try to convert a JSON-LD document to RDF. If the RDF Working Group doesn’t allow blank node identifiers for predicates and graphs, then what do you use to identify predicates and graphs that have blank node identifiers associated with them in the JSON-LD data model? This is a feature we do want to support because there are a number of important use cases that it enables. The use cases include:

  1. Blank node predicates allow JSON to be mapped directly to the JSON-LD and RDF data models.
  2. Blank node graph names allow developers to use graphs without explicitly naming them.
  3. Blank node graph names make the RDF Dataset Normalization algorithm simpler.
  4. Blank node graph names prevent the creation of a parallel mechanism to generate and manage blank node-like identifiers.

It’s easy to see the problem exposed when performing RDF Dataset Normalization, which we need to do in order to digitally sign information expressed in JSON-LD and RDF. The rest of this post will focus on this area, as it exposes the problems with not supporting blank node identifiers for predicates and graph names. In JSON-LD, the two-graph document above could be normalized to this NQuads (subject, predicate, object, graph) representation:

_:bnode0 _:name "Joe" _:graph1 .
_:bnode1 _:name "Susan" _:graph2 .

This is illegal in RDF since you can’t have a blank node identifier in the predicate or graph position. Even if we were to use an IRI in the predicate position, the problem (of not being able to normalize “un-labeled” JSON-LD graphs like the ones in the previous section) remains.
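
The restriction can be expressed as a small check. This Python sketch assumes a quad is a 4-tuple of strings in (subject, predicate, object, graph) order and that blank nodes are written with the _: prefix, as in the lines above; the function name is made up for the example.

```python
def blank_node_violations(quad):
    """Return the positions in which RDF forbids the blank node found there."""
    subject, predicate, obj, graph = quad
    violations = []
    if predicate.startswith("_:"):
        violations.append("predicate")
    if graph is not None and graph.startswith("_:"):
        violations.append("graph")
    # Blank nodes in the subject and object positions are legal in RDF.
    return violations


quad = ("_:bnode0", "_:name", '"Joe"', "_:graph1")
print(blank_node_violations(quad))  # ['predicate', 'graph']
```

Swapping the predicate for an IRI clears the first violation but not the second, which is the point made above about un-labeled graphs.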

The Solutions

This section will cover the proposed solutions to the problem, in order from least desirable to most desirable.

Don’t allow blank node identifiers for predicates and graph names

Doing this in JSON-LD ignores the point of contention. The same line of argumentation can be applied to RDF. The point is that by forcing developers to name graphs using IRIs, we’re forcing them to do something that they don’t have to do with subjects and objects. There is no technical reason that has been presented where the use of a blank node identifier in the predicate or graph position is unworkable. Telling developers that they must name graphs using IRIs will be surprising to them, because there is no reason that the software couldn’t just handle that case for them. Requiring developers to do things that a computer can handle for them automatically is anti-developer and will harm adoption in the long run.

Generate fragment identifiers for graph names

One solution is to generate fragment identifiers for graph names. This, coupled with the base IRI, would allow the data to be expressed legally in NQuads:

_:bnode0 <http://example.com/base#name> "Joe" <http://example.com/base#graph1> .
_:bnode1 <http://example.com/base#name> "Susan" <http://example.com/base#graph2> .

The above is legal RDF. The approach is problematic when you don’t have a base IRI, such as when JSON-LD is used as a messaging protocol between two systems. In that use case, you end up with something like this:

_:bnode0 <#name> "Joe" <#graph1> .
_:bnode1 <#name> "Susan" <#graph2> .

RDF requires absolute IRIs and so the document above is illegal from an RDF perspective. The other down-side is that you have to keep track of all fragment identifiers in the output and make sure that you don’t pick fragment identifiers that are used elsewhere in the document. This is fairly easy to do, but now you’re in the position of tracking and renaming both blank node identifiers and fragment IDs. Even if this approach worked, you’d be re-inventing the blank node identifier. This approach is unworkable for systems like PaySwarm that use transient JSON-LD messages across a REST API; there is no base IRI in this use case.
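
The base IRI dependency is easy to demonstrate with Python’s standard urllib.parse module. This is a minimal sketch: the helper names are made up, and checking for a scheme is only a rough stand-in for a full absolute-IRI test.

```python
from urllib.parse import urljoin, urlsplit


def resolve(base, fragment):
    """Resolve a generated fragment identifier against the document base."""
    return urljoin(base, fragment)


def is_absolute(iri):
    """Rough test: an absolute IRI must at least have a scheme."""
    return urlsplit(iri).scheme != ""


with_base = resolve("http://example.com/base", "#graph1")
without_base = resolve("", "#graph1")  # e.g. a transient message with no base

print(with_base, is_absolute(with_base))        # http://example.com/base#graph1 True
print(without_base, is_absolute(without_base))  # #graph1 False
```

Without a base there is nothing to resolve against, so the generated name stays relative and the output is not legal RDF.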

Skolemize to create identifiers for graph names

Another approach is skolemization, which is just a fancy way of saying: generate a unique IRI for the blank node when expressing it as RDF. The output would look something like this:

_:bnode0 <http://blue.example.com/.well-known/genid/2938570348579834> "Joe" <http://blue.example.com/.well-known/genid/348570293572375> .
_:bnode1 <http://blue.example.com/.well-known/genid/2938570348579834> "Susan" <http://blue.example.com/.well-known/genid/49057394572309457> .

This would be just fine if there were only one application reading and consuming data. However, when we are talking about RDF Dataset Normalization, there are cases where two applications must read and independently verify the representation of a particular IRI. One scenario that illustrates the problem fairly nicely is the blind verification scenario. In this scenario, two applications de-reference an IRI to fetch a JSON-LD document. Each application must perform RDF Dataset Normalization and generate a hash of that normalization to see if they retrieved the same data. Based on a strict reading of the skolemization rules, Application A would generate this:

_:bnode0 <http://blue.example.com/.well-known/genid/2938570348579834> "Joe" <http://blue.example.com/.well-known/genid/348570293572375> .
_:bnode1 <http://blue.example.com/.well-known/genid/2938570348579834> "Susan" <http://blue.example.com/.well-known/genid/49057394572309457> .

and Application B would generate this:

_:bnode0 <http://red.example.com/.well-known/genid/J8Sfei8f792Fd3> "Joe" <http://red.example.com/.well-known/genid/j28cY82Pa88> .
_:bnode1 <http://red.example.com/.well-known/genid/J8Sfei8f792Fd3> "Susan" <http://red.example.com/.well-known/genid/k83FyUuwo89DF> .

Note how the two graphs would never hash to the same value because the Skolem IRIs are completely different. The RDF Dataset Normalization algorithm would have no way of knowing which IRIs are blank node stand-ins and which ones are legitimate IRIs. You could say that publishers are required to assign the skolemized IRIs to the data they publish, but that ignores the point of contention, which is that you don’t want to force developers to create identifiers for things that they don’t care to identify. You could argue that the publishing system could generate these IRIs, but then you’re still creating a global identifier for something that is specifically meant to be a document-scoped identifier.
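
The mismatch can be reproduced in a few lines of Python. This is a sketch under stated assumptions: the skolemize and digest helpers are hypothetical, the identifiers are minted with uuid4, and sorting and hashing the lines is only a stand-in for the real RDF Dataset Normalization algorithm.

```python
import hashlib
import uuid

QUADS = [
    ("_:bnode0", "_:name", '"Joe"', "_:graph1"),
    ("_:bnode1", "_:name", '"Susan"', "_:graph2"),
]


def skolemize(quads, authority):
    """Replace blank nodes in the predicate and graph positions with
    freshly minted Skolem IRIs under the given authority."""
    minted = {}  # blank node label -> Skolem IRI, reused within one run

    def skolem(term):
        if not term.startswith("_:"):
            return term
        if term not in minted:
            minted[term] = f"<http://{authority}/.well-known/genid/{uuid.uuid4().hex}>"
        return minted[term]

    return [(s, skolem(p), o, skolem(g)) for s, p, o, g in quads]


def digest(quads):
    """Hash a sorted line-per-quad serialization (not real normalization)."""
    lines = sorted(" ".join(quad) + " .\n" for quad in quads)
    return hashlib.sha256("".join(lines).encode("utf-8")).hexdigest()


app_a = skolemize(QUADS, "blue.example.com")
app_b = skolemize(QUADS, "red.example.com")
print(digest(app_a) == digest(app_b))  # False
```

Both applications started from the same document, yet each one minted its own identifiers, so the hashes disagree.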

A more lax reading of the Skolemization language might allow one to create a special type of Skolem IRI that could be detected by the RDF Dataset Normalization algorithm. For example, let’s say that since JSON-LD is the one that is creating these IRIs before they go out to the RDF Dataset Normalization Algorithm, we use the tag IRI scheme. The output would look like this for Application A:

_:bnode0 <tag:w3.org,2013:dsid:345> "Joe" <tag:w3.org,2013:dsid:254> .
_:bnode1 <tag:w3.org,2013:dsid:345> "Susan" <tag:w3.org,2013:dsid:363> .

and this for Application B:

_:bnode0 <tag:w3.org,2013:dsid:a> "Joe" <tag:w3.org,2013:dsid:b> .
_:bnode1 <tag:w3.org,2013:dsid:a> "Susan" <tag:w3.org,2013:dsid:c> .

The solution still doesn’t work, but we could add another step to the RDF Dataset Normalization algorithm that would allow it to rename any IRI starting with tag:w3.org,2013:. Keep in mind that this is exactly the same thing that we do with blank nodes, and it’s effectively duplicating that functionality. The algorithm would allow us to generate something like this for both applications doing a blind verification:

_:bnode0 <tag:w3.org,2013:dsid:predicate-1> "Joe" <tag:w3.org,2013:dsid:graph-1> .
_:bnode1 <tag:w3.org,2013:dsid:predicate-1> "Susan" <tag:w3.org,2013:dsid:graph-2> .
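
Here is a minimal Python sketch of such a renaming step. It is not the RDF Dataset Normalization algorithm: the relabel function is hypothetical, and it numbers the special IRIs in order of first appearance, which assumes that both applications see the quads in the same order. A real algorithm cannot rely on that.

```python
PREFIX = "tag:w3.org,2013:dsid:"

APP_A = [
    ("_:bnode0", "<tag:w3.org,2013:dsid:345>", '"Joe"', "<tag:w3.org,2013:dsid:254>"),
    ("_:bnode1", "<tag:w3.org,2013:dsid:345>", '"Susan"', "<tag:w3.org,2013:dsid:363>"),
]
APP_B = [
    ("_:bnode0", "<tag:w3.org,2013:dsid:a>", '"Joe"', "<tag:w3.org,2013:dsid:b>"),
    ("_:bnode1", "<tag:w3.org,2013:dsid:a>", '"Susan"', "<tag:w3.org,2013:dsid:c>"),
]


def relabel(quads):
    """Rename every special tag IRI to predicate-N or graph-N."""
    names = {}
    counters = {"predicate": 0, "graph": 0}

    def canonical(term, kind):
        if not term.startswith("<" + PREFIX):
            return term  # legitimate IRIs and literals are left alone
        if term not in names:
            counters[kind] += 1
            names[term] = f"<{PREFIX}{kind}-{counters[kind]}>"
        return names[term]

    return [(s, canonical(p, "predicate"), o, canonical(g, "graph"))
            for s, p, o, g in quads]


print(relabel(APP_A) == relabel(APP_B))  # True
```

Notice that the helper is doing for tag IRIs what a normalization algorithm already has to do for blank node labels.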

This solution does violate one strong suggestion in the Skolemization section:

“Systems wishing to do this should mint a new, globally unique IRI (a Skolem IRI) for each blank node so replaced.”

The IRI generated is definitely not globally unique, as there will be many tag:w3.org,2013:dsid:graph-1s in the world, each associated with data that is completely different. This approach also goes against something else in Skolemization that states:

“This transformation does not appreciably change the meaning of an RDF graph.”

It’s true that using tag IRIs doesn’t change the meaning of the graph when you assume that the document will never find its way into a database. However, once you place the document in a database, it certainly creates the possibility of collisions in applications that are not aware of the special-ness of IRIs starting with tag:w3.org,2013:dsid:. The data is fine taken by itself, but a disaster when merged with other data. We would have to put a warning in some specification for systems to make sure to rename the incoming tag:w3.org,2013:dsid: IRIs to something that is unique to the storage subsystem. Keep in mind that this is exactly what is done when importing blank node identifiers into a storage subsystem. So, we’ve more-or-less re-invented blank node identifiers at this point.
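
A sketch of what that renaming on import might look like, in Python. Everything here is an assumption made for illustration: the store is a plain dict from graph name to a set of triples, the import_quads function and its batch argument are made up, and the urn:example:store: naming scheme is a placeholder for whatever a storage subsystem would actually use.

```python
PREFIX = "tag:w3.org,2013:dsid:"


def import_quads(store, quads, batch):
    """Add quads to the store, renaming document-scoped tag IRIs to names
    that are unique to this import batch."""

    def localize(term):
        if term.startswith("<" + PREFIX):
            suffix = term[len("<" + PREFIX):-1]
            return f"<urn:example:store:{batch}:{suffix}>"
        return term

    for s, p, o, g in quads:
        store.setdefault(localize(g), set()).add((s, localize(p), o))


doc_1 = [("_:bnode0", f"<{PREFIX}predicate-1>", '"Joe"', f"<{PREFIX}graph-1>")]
doc_2 = [("_:bnode0", f"<{PREFIX}predicate-1>", '"Susan"', f"<{PREFIX}graph-1>")]

store = {}
import_quads(store, doc_1, batch=1)
import_quads(store, doc_2, batch=2)
print(len(store))  # 2
```

Without the localize step, both documents would land in the same graph-1 even though they have nothing to do with each other. A complete importer would also have to rename the blank node subjects, which this sketch leaves alone.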

Allow blank node identifiers for graph names

This leads us to the question of why not just extend RDF to allow blank node identifiers for predicates and graph names? Ideally, that’s what I would like to see happen in the future as it places the least burden on developers, and allows RDF to easily model JSON. The responses from the RDF WG are varied. These are all of the arguments against the idea that I have heard so far:

There are other ways to solve the problem, like fragment identifiers and skolemization, than introducing blank nodes for predicates and graph names.

Fragment identifiers don’t work, as demonstrated above. There is really only one workable solution based on a very lax reading of skolemization, and as demonstrated above, even the best skolemization solution re-invents the concept of a blank node.

There are other use cases that are blocked by the introduction of blank node identifiers into the predicate and graph name position.

While this has been asserted, it is still unclear exactly what those use cases are.

Adding blank node identifiers for predicates and graph names will break legacy applications.

If blank nodes for predicates and graph names were illegal before, wouldn’t legacy applications reject that sort of input? The argument that there are bugs in legacy applications that make them not robust against this type of input is valid, but should that prevent the right solution from being adopted? There has been no technical reason put forward for why blank nodes for predicates or graph names cannot work, other than that software bugs prevent it.

The PaySwarm work has chosen to model the data in a very strange way.

The people that have been working on RDFa, JSON-LD, and the Web Payments specifications for the past 5 years have spent a great deal of time attempting to model the data in the simplest way possible, and in a way that is accessible to developers that aren’t familiar with RDF. Whether or not it may seem strange is arguable since this response is usually leveled by people not familiar with the Web Payments work. This blog post outlines a variety of use cases where the use of a blank node for predicates and graph naming is necessary. Stating that the use cases are invalid ignores the point of contention.

If we allow blank nodes to be used when naming graphs, then those blank nodes should denote the graph.

At present, RDF states that a graph named using an IRI may denote the graph or it may not denote the graph. This is a fancy way of saying that the IRI that is used for the graph name may be an identifier for something completely different (like a person), but de-referencing the IRI over the Web results in a graph about cars. I personally think that is a very dangerous concept to formalize in RDF, but there are others that have strong opinions to the contrary. The chances of this being changed in RDF 1.1 are next to none.

Others have argued that while that may be the case for IRIs, it doesn’t have to be the case for blank nodes that are used to name graphs. In this case, we can just state that the blank node denotes the graph because it couldn’t possibly be used for anything else since the identifier is local to the document. This makes a great deal of sense, but it is different from how an IRI is used to name a graph and that difference is concerning to a number of people in the RDF Working Group.

However, that is not an argument to disallow blank nodes from being used for predicates and graph names. The group could still allow blank nodes to be used for this purpose while stating that they may or may not be used to denote the graph.

The RDF Working Group does not have enough time left in its charter to make a change this big.

While this may be true, not making a decision on this is causing more work for the people working on JSON-LD and RDF Dataset Normalization. Having the tag:w3.org,2013:dsid: identifier scheme is also going to make many RDF-based applications more complex in the long run, resulting in a great deal more work than just allowing blank nodes for predicates and graph names.

Conclusion

I have a feeling that the RDF Working Group is not going to do the right thing on this one due to the time pressure of completing the work that they’ve taken on. The group has already requested, and has been granted, a charter extension. Another extension is highly unlikely, so the group wants to get everything wrapped up. This discussion could take several weeks to settle. That said, the solution that will most likely be adopted (a special tag-based skolem IRI) will cause months of work for people living in the JSON-LD and RDF ecosystem. The best solution in the long run would be to solve this problem now.

If blank node identifiers for predicates and graphs are rejected, here is the proposal that I think will move us forward while causing an acceptable amount of damage down the road:

  1. JSON-LD continues to support blank node identifiers for use as predicates and graph names.
  2. When converting JSON-LD to RDF, a special, relabelable IRI prefix will be used for blank nodes in the predicate and graph name position of the form tag:w3.org,2013:dsid:
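
The second step of the proposal could look roughly like this in Python. It is a sketch only: the to_rdf_quad function is hypothetical, quads are assumed to be 4-tuples of strings, and reusing the blank node label as the suffix of the tag IRI is my own simplification.

```python
DSID_PREFIX = "tag:w3.org,2013:dsid:"


def to_rdf_quad(quad):
    """Swap blank nodes in the predicate and graph positions for
    relabelable tag IRIs, leaving subjects and objects untouched."""

    def stand_in(term):
        if term.startswith("_:"):
            return f"<{DSID_PREFIX}{term[2:]}>"
        return term

    s, p, o, g = quad
    return (s, stand_in(p), o, stand_in(g))


rdf_quad = to_rdf_quad(("_:bnode0", "_:name", '"Joe"', "_:graph1"))
print(" ".join(rdf_quad) + " .")
# _:bnode0 <tag:w3.org,2013:dsid:name> "Joe" <tag:w3.org,2013:dsid:graph1> .
```

Because every stand-in shares the same prefix, a later normalization step can recognize these IRIs and relabel them, which is what makes the prefix “relabelable”.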

Thanks to Dave Longley for proofing this blog post and providing various corrections.

5 Comments

  1. great article!

    in skolemization option i see no mention of UUID which to my understanding would at least solve problem of global uniqueness

    • ManuSporny says: (Author)

      UUIDs wouldn’t solve the issue as far as I can tell. Two applications would need a way of generating the same UUID for the same data, the question is, how do you do this? UUIDs are just a subset of skolemization IDs. The problems w/ skolemization are outlined above.

      • True! Now I notice that you intended http://blue.example.com/.well-known/genid/348570293572375 and other /genid/* as unique identifiers… I should have followed the skolemization hyperlink right on first read, especially since I asked a question related to it!

        While reading it the first time, i looked at it from perspective of deserializing data from JSON-LD document, saving it in graph store, then retrieving data back and serializing it again as JSON-LD document. In this case generating unique Skolem IRIs looks like reasonable way to create internal identifiers just for the needs of graph store. One possible issue which pops in my mind – storing the same document multiple times could result in creating multiple copies of those blank nodes…
        When I think about minting Skolem IRIs, I don’t find the idea of using well-known IRIs with the http(s) scheme convincing. Since someone already decided that they have meaning only in the original context, why change this decision and expose them directly over http and out of the original context? Using another scheme, and making them not opaque and recognized as just internal skolems for blank nodes, sounds more appealing to me, at least at this very moment.

        I mentioned UUID since I have impression that sometimes people take ‘all or nothing’ approach. Where all = mint http(s), permanent, dereferenceable and cool IRIs, and nothing = just use blank node giving it no identifier at all. Maybe taking middle path and while publishing data, identifying what one would find of little significance with “urn:uuid:f81d4fae-7dec-11d0-a765-00a0c91e6bf6” (not-dereferenceable seems proper here) would make certain things, like your example of checking hash of normalized data, more straightforward? Still I see your point of saying: “You could argue that the publishing system could generate these IRIs, but then you’re still creating a global identifier for something that is specifically meant to be a document-scoped identifier.” At this moment I understand that to compare those hashes you need to use very exact algorithm to create ‘Skolem IRIs’ for blank nodes and you will end up with non unique ones. With those pros and cons I still would prefer to mint UUID IRIs for data which I would like that someone can verify, especially after normalizing it and not simply using hash of original document.

        Could you maybe add some hyperlinks in your article to relevant discussions? At this moment I just see quotes without broader context of where someone stated it and . I also would like to look at it in context of some real world use cases and most of all hear what people who implement graph stores have to say about allowing blank node identifiers everywhere!

        Also link to http://www.w3.org/TR/rdf11-concepts/#section-generalized-rdf seems relevant… as well last paragraph in first note of http://www.w3.org/TR/rdf11-concepts/#section-dataset

        Once again, thank you for this great write up :) I wonder why no one commented on it before?…

        BTW I haven’t noticed notification about your reply in my mailbox, not sure if it got lost somewhere on my side or your system don’t send notifications about replies?
