
The Origins of JSON-LD

Full Disclosure: I am one of the primary creators of JSON-LD, lead editor on the JSON-LD 1.0 specification, and chair of the JSON-LD Community Group. These are my personal opinions and not the opinions of the W3C, JSON-LD Community Group, or my company.

JSON-LD became an official Web Standard last week. This is after exactly 100 teleconferences, each typically lasting an hour and a half, all fully transparent, with text minutes and recorded audio for every call. There were 218+ issues addressed, 2,071+ source code commits, and 3,102+ emails that went through the JSON-LD Community Group. The journey was a fairly smooth one with only a few jarring bumps along the road. The specification is already deployed in production by companies like Google, the BBC, HealthData.gov, Yandex, Yahoo!, and Microsoft. There is a quickly growing list of other companies that are incorporating JSON-LD, but that’s the future. This blog post is more about the past: where did JSON-LD come from? Who created it, and why?

I love origin stories. When I was in my teens and early twenties, the only origin stories I liked to read about were of the comic and anime variety. Spiderman, great origin story. Superman, less so, but entertaining. Nausicaä, brilliant. Major Motoko Kusanagi, nuanced. Spawn, dark. Those connections with fictional characters fade over time as you come to understand that the real world has more interesting origin stories. Interesting because they touch the lives of billions of people, and since I’m a technologist, some of my favorite origin stories today consist of finding out the personal stories behind how a particular technology came to be. The Web has a particularly riveting origin story. These stories are hard to find because they’re rarely written about, so this is my attempt at documenting how JSON-LD came to be and the handful of people that got it to where it is today.

The Origins of JSON-LD

When you’re asked to draft the press pieces on the launch of new world standards, you have two lists of people in your head. The first is the “all-inclusive list”, which is every person that uttered so much as a word that resulted in a change to the specification. That list is typically very long, so you end up saying something like “We’d like to thank all of the people that provided input to the JSON-LD specification, including the JSON-LD Community, RDF Working Group, and individuals who took the time to send in comments and improve the specification.” With that statement, you are sincere and cover all of your bases, but you feel like you’re doing an injustice to the people without whom the work would never have survived.

The all-inclusive list is very important; the people on it helped refine the technology to the point that everyone could achieve consensus on it being something that is world class. However, 90% of the back-breaking work to get the specification to the point that everyone else could comment on it is typically undertaken by 4-5 people. It’s a thankless and largely unpaid job, and this is how the Web is built. It’s those people that I’d like to thank while exploring the origins of JSON-LD.

Inception

JSON-LD started around late 2008 as the work on RDFa 1.0 was wrapping up. We were under pressure from Microformats and Microdata, which we were also heavily involved in, to come up with a good way of programming against RDFa data. At around the same time, my company was struggling with the representation of data for the Web Payments work. We had already made the switch to JSON a few years prior and were storing that data in MySQL, mostly because MongoDB didn’t exist yet. We were having a hard time translating the RDFa we were ingesting (products for sale, pricing information, etc.) into something that worked well in JSON. Around then, Mark Birbeck, one of the creators of RDFa, and I were thinking about making something RDFa-like for JSON. Mark had proposed a syntax for something called RDFj, which I thought had legs, but which Mark didn’t necessarily have the time to pursue.

The Hard Grind

After exchanging a few emails with Mark about the topic over the course of 2009, and letting the idea stew for a while, I wrote up a quick proposal for a specification and passed it by Dave Longley, Digital Bazaar’s CTO. We kicked the idea around a bit more and, in May of 2010, published the first working draft of JSON-LD. While Mark was instrumental in injecting the first set of foundational ideas into JSON-LD, Dave Longley would become the key technical mind behind how to make JSON-LD work for web programmers.

At that time, JSON-LD had a pretty big problem. You can represent data in JSON-LD in a myriad of different ways, making it hard to tell if two JSON-LD documents are the same or not. This was an important problem to Digital Bazaar because we were trying to figure out how to create product listings, digital receipts, and contracts using JSON-LD. We had to be able to tell if two product listings were the same, and we had to figure out a way to serialize the data so that products and their associated prices could be listed on the Web in a decentralized way. This meant digital signatures, and you have to be able to create a canonical/normalized form for your data if you want to be able to digitally sign it.

Dave Longley invented the JSON-LD API, JSON-LD Framing, and JSON-LD Graph Normalization to tackle these canonicalization/normalization issues and did the first four implementations of the specification in C++, JavaScript, PHP, and Python. The JSON-LD Graph Normalization problem itself took roughly 3 months of concentrated 70+ hour work weeks and dozens of iterations by Dave Longley to produce an algorithm that would work. To this day, I remain convinced that there are only a handful of people on this planet with a mind that is capable of solving those problems. He was the first and only one that cracked those problems. It requires a sort of raw intelligence, persistence, and ability to constantly re-evaluate the problem solving approach you’re undertaking in a way that is exceedingly rare.

Dave and I continued to refine JSON-LD, with him working on the API and me working on the syntax for the better part of 2010 and early 2011. When MongoDB started really taking off in 2010, the final piece just clicked into place. We had the makings of a Linked Data technology stack that would work for web developers.

Toward Stability

Around April 2011, we launched the JSON-LD Community Group and started our public push to try and put the specification on a standards track at the World Wide Web Consortium (W3C). It is at this point that Gregg Kellogg joined us to help refine the rough edges of the specification and provide his input. For those of you that don’t know Gregg, I know of no other person that has done complete implementations of the entire stack of Semantic Web technologies. He has Ruby implementations of quad stores, TURTLE, N3, NQuads, SPARQL engines, RDFa, JSON-LD, etc. If it’s associated with the Semantic Web in any way, he’s probably implemented it. His depth of knowledge of RDF-based technologies is unmatched and he focused that knowledge on JSON-LD to help us hone it to what it is today. Gregg helped us with key concepts, specification editing, implementations, tests, and a variety of input that left its mark on JSON-LD.

Markus Lanthaler also joined us around the same time (2011) that Gregg did. The story of how Markus got involved with the work is probably my favorite way of explaining how the standards process should work. Markus started giving us input while a masters student at Technische Universität Graz. He didn’t have a background in standards, he didn’t know anything about the W3C process or specification editing, he was as green as one can be with respect to standards creation. We all start where he did, but I don’t know of many people that became as influential as quickly as Markus did.

Markus started by commenting on the specification on the mailing list, then quickly started joining calls. He’d raise issues and track them, he started on his PHP implementation, then started making minor edits to the specifications, then major edits, until earning our trust to become lead specification editor for the JSON-LD API specification and one of the editors for the JSON-LD Syntax specification. There was no deliberate process we used to make him lead editor; it just sort of happened based on all the hard work he was putting in, which is the way it should be. He went through a growth curve that normally takes most people 5 years in about a year and a half, and it happened exactly how it should happen in a meritocracy. He earned it and impressed us all in the process.

The Final Stretch

Of special mention as well is Niklas Lindström, who joined us starting in 2012 on almost every JSON-LD teleconference and provided key input to the specifications. Aside from being incredibly smart and talented, Niklas is particularly gifted in his ability to find a balanced technical solution that moves a group forward when it is deadlocked on a particular decision. Paul Kuykendall joined us toward the very end of the JSON-LD work in early 2013 and provided fresh eyes on what we were working on. Aside from being very level-headed, Paul helped us understand what was important to web developers and what wasn’t toward the end of the process. It’s hard to find perspective as work wraps up on a standard, and luckily Paul joined us at exactly the right moment to provide that insight.

There were literally hundreds of people that provided input on the specification throughout the years, and I’m very appreciative of that input. However, without this core of 4-6 people, JSON-LD would never have had a chance. I will never be able to find the words to express how deeply appreciative I am to Dave, Markus, Gregg, Niklas and Paul, who did the work on a primarily volunteer basis. At this moment in time, the Web is at the core of the way humankind communicates, and the most ardent protectors of this public good create standards to ensure that the Web continues to serve all of us. It boils my blood to know that they will go largely unrewarded by society for creating something that will benefit hundreds of millions of people, but that’s another post for another time.

The next post in this series tells the story of how JSON-LD was nearly eliminated on several occasions by its critics and proponents while on its journey toward a web standard.

The Downward Spiral of Microdata

Full disclosure: I’m the chair of the RDFa Working Group and have been heavily involved during the RDFa and Microdata standardization initiatives. I am biased, but also understand all of the nuanced decisions that were made during the creation of both specifications.

Support for the Microdata API has just been removed from WebKit (Apple Safari). Support for the Microdata API was also removed from Blink (Google Chrome) a few months ago. This means that Apple Safari and Google Chrome will no longer support the Microdata API. Removing the feature from these browsers also points to a likely future for Microdata: less and less support.

In addition, this discussion on the Blink developer list demonstrates that there isn’t anyone to pick up the work of maintaining the Microdata implementation. Microdata has also been ripped out of the main HTML5 specification at the W3C, with the caveat that the Microdata specification will only continue “if editorial resources can be found”. Translation: if an editor doesn’t step up to edit the Microdata specification, Microdata is dead at W3C. It just takes someone to raise their hand to volunteer, so why is it that out of a group of hundreds of people, no one has stepped up to maintain, create a test suite for, and push the Microdata specification forward?

A number of observers have been surprised by these events, but for those that have been involved in the month-to-month conversation around Microdata, it makes complete sense. Microdata doesn’t have an active community supporting it. It never really did. For a Web specification to be successful, it needs an active community around it that is willing to do the hard work of building and maintaining the technology. RDFa has that in spades, Microdata does not.

Microdata was, primarily, a shot across the bow at RDFa. The warning worked because the RDFa community reacted by creating RDFa Lite, which matches Microdata feature-for-feature, while also supporting things that Microdata is incapable of doing. The existence of RDFa Lite left the HTML Working Group in an awkward position. Publishing two specifications that did the exact same thing in almost the exact same way is a position that no standards organization wants to be in. At that point, it became a race to see which community could create the developer tools and support the web developers who were marking up pages.

Microdata, to this day, still doesn’t have a specification editor, an active community, a solid test suite, or any of the other things that are necessary to become a world class technology. To be clear, I’m not saying Microdata is dying (4 million out of 329 million domains use it), just that not having these basic things in place will be very problematic for the future of Microdata.

To put that in perspective, HTML5+RDFa 1.1 will become an official W3C Recommendation (world standard) next Thursday. There was overwhelming support from the W3C member companies to publish it as a world standard. There have been multiple specification editors for RDFa throughout the years, there are hundreds of active people in the community integrating RDFa into pages across the Web, there are 7 implementations of RDFa in a variety of programming languages, there is a mailing list, a website, and an IRC channel dedicated to answering questions for people learning RDFa, and there is a test suite with 800 tests covering RDFa in 6 markup languages (HTML4, HTML5, XML, SVG, XHTML1 and XHTML5). If you want to build a solution on a solid technology, with a solid community and solid implementations, RDFa is that solution.

JSON-LD is the Bee’s Knees

Full disclosure: I’m one of the primary authors and editors of the JSON-LD specification. I am also the chair of the group that created JSON-LD and have been an active participant in a number of Linked Data initiatives: RDFa (chair, author, editor), JSON-LD (chair, co-creator), Microdata (primary opponent), and Microformats (member, haudio and hvideo microformat editor). I’m biased, but also well informed.

JSON-LD has been getting a great deal of good press lately. It was adopted by Google, Yahoo, Yandex, and Microsoft for use in schema.org. The PaySwarm universal payment protocol is based on it. It was also integrated with Google’s Gmail service and the open social networking folks have also started integrating it into the Activity Streams 2.0 work.

All of these positive adoption stories are precisely why Shane Becker’s post arguing that JSON-LD is an Unneeded Spec was so surprising. If you haven’t read it yet, you may want to, as the rest of this post will dissect the arguments he makes in his post (it’s a pretty quick 5 minute read). The post is a broad-brush opinion piece based on a number of factual errors and misinformed opinions. I’d like to clear up these errors in this blog post and underscore some of the reasons JSON-LD exists and how it has been developed.

A theatrical interpretation of the “JSON-LD is Unneeded” blog post

Shane starts with this claim:

Today I learned about a proposed spec called JSON-LD. The “LD” is for linked data (Linked Data™ in the Uppercase “S” Semantic Web sense).

When I started writing the original JSON-LD specification, one of the goals was to try and merge lessons learned in the Microformats community with lessons learned during the development of RDFa and Microdata. This meant figuring out a way to marry the lowercase semantic web with the uppercase Semantic Web in a way that was friendly to developers. For developers that didn’t care about the uppercase Semantic Web, JSON-LD would still provide a very useful data structure to program against. In fact, Microformats, which are the poster-child for the lowercase semantic web, were supported by JSON-LD from day one.

Shane’s article is misinformed with respect to the assertion that JSON-LD is solely for the uppercase Semantic Web. JSON-LD is mostly for the lowercase semantic web, the one that developers can use to make their applications exchange and merge data with other applications more easily. JSON-LD is also for the uppercase Semantic Web, the one that researchers and large enterprises are using to build systems like IBM’s Watson supercomputer, search crawlers, Gmail, and open social networking systems.

Linked data. Web sites. Standards. Machine readable.
Cool. All of those sound good to me. But they all sound familiar, like we’ve already done this before. In fact, we have.


We haven’t done something like JSON-LD before. I wish we had because we wouldn’t have had to spend all that time doing research and development to create the technology. When writing about technology, it is important to understand the basics of a technology stack before claiming that we’ve “done this before”. An astute reader will notice that at no point in Shane’s article is any text from the JSON-LD specification quoted, just the very basic introductory material on the landing page of the website. More on this below.

Linked data
That’s just the web, right? I mean, we’ve had the <a href> tag since literally the beginning of HTML / The Web. It’s for linking documents. Documents are a representation of data.

Speaking as someone that has been very involved in the Microformats and RDFa communities, yes, it’s true that the document-based Web can be used to publish Linked Data. The problem is that the standard way of expressing a followable link to another piece of data did not carry over to the data-based Web. That is, most JSON-based APIs don’t have a standard way of encoding a hyperlink.

The other implied assertion in the statement above is that the document-based Web is all we need. If this were true, sending HTML documents to Web applications would be all we needed. Web developers know that this isn’t the case today for a number of obvious reasons. We send JSON data back and forth on the Web when we need to program against things like Facebook, Google, or Twitter’s services. JSON is a very useful data format for machine-to-machine data exchange. The problem is that JSON has no standard way of expressing many of the things we take for granted on the document-based Web, like links, data types (such as times and dates), and a variety of other very useful features for the data-based Web. This is one of the problems that JSON-LD addresses.
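
To make that concrete, here is a minimal sketch of how JSON-LD layers those missing features onto plain JSON (the terms and vocabulary IRIs chosen for the @context are illustrative, not a recommended vocabulary):

        {
          "@context": {
            "homepage": {
              "@id": "http://xmlns.com/foaf/0.1/homepage",
              "@type": "@id"
            },
            "modified": {
              "@id": "http://purl.org/dc/terms/modified",
              "@type": "http://www.w3.org/2001/XMLSchema#dateTime"
            }
          },
          "@id": "http://example.com/people/joe",
          "homepage": "http://example.com/joe/blog",
          "modified": "2013-09-16T14:30:00Z"
        }

The "@type": "@id" declaration marks the value of homepage as a link that a consumer can follow, and the xsd:dateTime coercion marks modified as a date and time rather than an opaque string. Neither of those statements can be made in plain JSON.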

Web sites
If it’s not wrapped in HTML and viewable in a browser, is it really a website? JSON isn’t very useful in the browser by itself. It’s not style-able. It’s not very human-readable. And worst of all, it’s not clickable.

Websites are composed of many parts. It’s a weak argument to say that if a site is mainly composed of data that isn’t in HTML, and isn’t viewable in a browser, that it’s not a real website. The vast majority of websites like Twitter and Facebook are composed of data and API calls with a relatively thin varnish of HTML on top. JSON is the primary way that applications interact with these and other data-driven websites. It’s almost guaranteed these days that any company that has a popular API uses JSON in their Web service protocol.

Shane’s argument here is pretty confused. It assumes that the primary use of JSON-LD is to express data in an HTML page. Sure, JSON-LD can do that, but focusing on that brush stroke is missing the big picture. The big picture is that JSON-LD allows applications that use it to share data and interoperate in a way that is not possible with regular JSON, and it’s especially useful when used in conjunction with a Web service or a document-based database like MongoDB or CouchDB.

Standards based
To their credit, JSON-LD did license their website content Creative Commons CC0 Public Domain. But, the spec itself isn’t. It’s using (what seems to be) a W3C boilerplate copyright / license. Copyright © 2010-2013 W3C® (MIT, ERCIM, Keio, Beihang), All Rights Reserved. W3C liability, trademark and document use rules apply.


Nope. The JSON-LD specification has been released under a Creative Commons Attribution 3.0 license multiple times in the past, and it will be released under a Creative Commons license again, most probably CC0. The JSON-LD specification was developed in a W3C Community Group using a Creative Commons license and then released to be published as a Web standard via W3C using their W3C Community Final Specification Agreement (FSA), which allows the community to fork the specification at any point in time and publish it under a different license.

When you publish a document through the W3C, they have their own copyright, license, and patent policy associated with the document being published. There is a legal process in place at W3C that asserts that companies can implement W3C published standards in a patent and royalty-free way. You don’t get that with CC0, in fact, you don’t get any such vetting of the technology or any level of patent and royalty protection.

What we have with JSON-LD is better than what is proposed in Shane’s blog post. You get all of the benefits of having W3C member companies vet the technology for technical and patent issues while also being able to fork the specification at any point in the future and publish it under a license of your choosing as long as you state where the spec came from.

Machine readable
Ah… “machine readable”. Every couple of years the current trend of what machine readable data should look like changes (XML/JSON, RSS/Atom, xml-rpc/SOAP, rest/WS-*). Every time, there are the same promises. This will solve our problems. It won’t change. It’ll be supported forever. Interoperability. And every time, they break their promises. Today’s empires, tomorrow’s ashes.


At no point has any core designer of JSON-LD claimed 1) that JSON-LD will “solve our problems” (or even your particular problem), 2) that it won’t change, and 3) that it will be supported forever. These are straw-man arguments. The current consensus of the group is that JSON-LD is best suited to a particular class of problems and that some developers will have no need for it. JSON-LD is guaranteed to change in the future to keep pace with what we learn in the field, and we will strive for backward compatibility for features that are widely used. Without modification, standardized technologies have a shelf life of around 10 years, 20-30 if they’re great. The designers of JSON-LD understand that, like the Web, JSON-LD is just another grand experiment. If it’s useful, it’ll stick around for a while, if it isn’t, it’ll fade into history. I know of no great software developer or systems designer that has ever made these three claims and been serious about it.

We do think that JSON-LD will help Web applications interoperate better than they do with plain ol’ JSON. For an explanation of how, there is a nice video introducing JSON-LD.

With respect to the “Today’s empires, tomorrow’s ashes” cynicism, we’ve already seen a preview of the sort of advances that Web-based machine-readable data can unleash. Google, Yahoo!, Microsoft, Yandex, and Facebook all use a variety of machine-readable data technologies that have only recently been standardized. These technologies allow for faster, more accurate, and richer search results. They are also the driving technology for software systems like Watson. These systems exist because there are people plugging away at the hard problem of machine readable data in spite of cynicism directed at past failures. Those failures aren’t ashes, they’re the bedrock of tomorrow’s breakthroughs.

Instead of reinventing the everything (over and over again), let’s use what’s already there and what already works. In the case of linked data on the web, that’s html web pages with clickable links between them.

Microformats, Microdata, and RDFa do not work well for data-based Web services. Using Linked Data with data-based Web services is one of the primary reasons that JSON-LD was created.

For open standards, open license are a deal breaker. No license is more open than Creative Commons CC0 Public Domain + OWFa. (See also the Mozilla wiki about standards/license, for more.) There’s a growing list of standards that are already using CC0+OWFa.

I think there might be a typo here, but if not, I don’t understand why open licenses are a deal breaker for open standards, especially things like the W3C FSA or the Creative Commons licenses we’ve published the JSON-LD spec under. Additionally, CC0 + OWFa might be neat. Shane’s article was the first time that I had heard of OWFa, and I’d be a proponent for pushing it in the group if it granted more freedom to the people using and developing JSON-LD than the current set of agreements we have in place. After looking over the legal text of the OWFa, I can’t see what CC0 + OWFa buys us over CC0 + W3C patent attribution. If someone would like to make these benefits clear, I could take a proposal to switch to CC0 + OWFa to the JSON-LD Community Group and see if there is interest in using that license in the future.

No process is more open than a publicly editable wiki.

A counter-point to publicly accessible forums

Publicly editable wikis are notorious for edit wars; they are not a panacea. Just because you have a wiki does not mean you have an open community. For example, the Microformats community was notorious for having a different class of unelected admins that would meet in San Francisco and make decisions about the operation of the community. This seemingly innocuous practice would creep its way into the culture and technical discussion on a regular basis, leading to community members being banned from time to time. Similarly, Wikipedia has had numerous issues with publicly editable wikis and the behavior of their admins.

Depending on how you define “open”, there are a number of processes that are far more open than a publicly editable wiki. For example, the JSON-LD specification development process is completely open to the public, based on meritocracy, and is consensus-driven. The mailing list is open. The bug tracker is open. We have weekly design teleconferences where all the audio is recorded and minuted. We have these teleconferences to this day and will continue to have them into the future because we make transparency a priority. JSON-LD, as far as I know, is the first such specification in the world developed where all the previously described operating guidelines are standard practice.

(Mailing lists are toxic.)

A community is as toxic as its organizational structure enables it to be. The JSON-LD community is based on meritocracy, consensus, and has operated in a very transparent manner since the beginning (open meetings, all calls are recorded and minuted, anyone can contribute to the spec, etc.). This has, unsurprisingly, resulted in a very pleasant and supportive community. That said, there is no perfect communication medium. They’re all lossy and they all have their benefits and drawbacks. Sometimes, when you combine multiple communication channels as a part of how your community operates, you get better outcomes.

Finally, for machine readable data, nothing has been more widely adopted by publishers and consumers than microformats. As of June 2012, microformats represents about 70% of all of the structured data on the web. And of that ~70%, the vast majority was h-card and xfn. (All RDFa is about 25% and microdata is a distant third.)

Microformats are good if all you need to do is publish your basic contact and social information on the Web. If you want to publish detailed product information, financial data, medical data, or address other more complex scenarios, Microformats won’t help you. There have been no new Microformats released in the last 5 years and the mailing list traffic has been almost non-existent for around 5 years. From what I can tell, most everyone has moved on to RDFa, Microdata, or JSON-LD.

There are a few people working on Microformats 2, but I haven’t seen it provide anything that is not already provided by existing solutions, which have the added benefit of being W3C standards or of being backed by major companies like Google, Facebook, Yahoo!, Microsoft, and Yandex.

Maybe it’s because of the ease of publishing microformats. Maybe it’s the open process for developing the standards. Maybe it’s because microformats don’t require any additions to HTML. (Both RDFa and microdata required the use of additional attributes or XML namespaces.) Whatever the reason, microformats has the most uptake. So, why do people keep trying to reinvent what microformats is already doing well?

People aren’t reinventing what Microformats are already doing well, they’re attempting to address problems that Microformats do not solve.

For example, one of the reasons that Google adopted JSON-LD is because markup was much easier in JSON-LD than it was in Microformats, as evidenced by the example below:

Back to JSON-LD. The “Simple Example” listed on the homepage is a person object representing John Lennon. His birthday and wife are also listed on the object.

        {
          "@context": "http://json-ld.org/contexts/person.jsonld",
          "@id": "http://dbpedia.org/resource/John_Lennon",
          "name": "John Lennon",
          "born": "1940-10-09",
          "spouse": "http://dbpedia.org/resource/Cynthia_Lennon"
        }

I look at this and see what should have been HTML with microformats (h-card and xfn). This is actually a perfect use case for h-card and xfn: a person and their relationship to another person. Here’s how it could’ve been marked up instead.

        <div class="h-card">
          <a href="http://dbpedia.org/resource/John_Lennon" class="u-url u-uid p-name">John Lennon</a>
          <time class="dt-bday" datetime="1940-10-09">October 9<sup>th</sup>, 1940</time>
          <a rel="spouse" href="http://dbpedia.org/resource/Cynthia_Lennon">Cynthia Lennon</a>.
        </div>

I’m willing to bet that most people familiar with JSON will find the JSON-LD markup far easier to understand and get right than the Microformats-based equivalent. In addition, sending the Microformats markup to a REST-based Web service would be very strange. Alternatively, sending the JSON-LD markup to a REST-based Web service would be far more natural for a modern day Web developer.
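
To make the Web service point concrete, here is a rough sketch of what sending that JSON-LD to a REST-based service could look like (the host and endpoint are hypothetical; application/ld+json is the registered media type for JSON-LD):

        POST /people HTTP/1.1
        Host: api.example.com
        Content-Type: application/ld+json

        {
          "@context": "http://json-ld.org/contexts/person.jsonld",
          "@id": "http://dbpedia.org/resource/John_Lennon",
          "name": "John Lennon",
          "born": "1940-10-09",
          "spouse": "http://dbpedia.org/resource/Cynthia_Lennon"
        }

The receiving application can treat the payload as ordinary JSON, or expand it against the context to get unambiguous, globally meaningful property IRIs.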

This HTML can be easily understood by machine parsers and humans parsers. Microformats 2 parsers already exists for: JavaScript (in the browser), Node.js, PHP and Ruby. HTML + microformats2 means that machines can read your linked data from your website and so can humans. It means that you don’t need an “API” that is something other than your website.

You have been able to do the same thing, and much more, using RDFa and Microdata for far longer (since 2006) than you have been able to do it in Microformats 2. Let’s be clear, there is no significant advantage to using Microformats 2 over RDFa or Microdata. In fact, there are a number of disadvantages for using Microformats 2 at this point, like little to no support from the search companies, very little software tooling, and an anemic community (of which I am a member) for starters. Additionally, HTML + Microformats 2 does not address the Web service API issue at all.

Please don’t waste time and energy reinventing all of the wheels. Instead, please use what already works and what works the webby way.


Do not miss the irony of this statement. RDFa has been doing what Microformats 2 does today since 2006, and it’s a Web standard. Even if you don’t like RDFa 1.0, RDFa 1.1, RDFa Lite 1.1, and Microdata all came before Microformats 2. To assert that wheels should not be reinvented, and then to promote Microformats 2, which was created long after a number of well-established solutions already existed, is quite a strange position to take.

Conclusion

JSON-LD was created by people that have been directly involved in the Linked Data, lowercase semantic web, uppercase Semantic Web, Microformats, Microdata, and RDFa work. It has proven to be useful to them. There are a number of very large technology companies that have adopted JSON-LD, further underscoring its utility. Expect more big announcements in the next six months. The JSON-LD specifications have been developed in a radically open and transparent way, and the document copyright and licensing provisions are equally open. I hope that this blog post has helped correct most of the misinformation in Shane Becker’s blog post.

Most importantly, cynicism will not solve the problems that we face on the Web today. Hard work will, and there are very few communities that I know of that work harder and more harmoniously than the excellent volunteers in the JSON-LD community.

If you would like to learn more about Linked Data, a good video introduction exists. If you want to learn more about JSON-LD, there is a good video introduction to that as well.

Technical Analysis of 2012 MozPay API Message Format

The W3C Web Payments group is currently analyzing a new API for performing payments via web browsers and other devices connected to the web. This blog post is a technical analysis of the MozPay API with a specific focus on the payment protocol and its use of JOSE (JSON Object Signing and Encryption). The first part of the analysis takes the approach of examining the data structures used today in the MozPay API and compares them against what is possible via PaySwarm. The second part of the analysis examines the use of JOSE to achieve the use case and security requirements of the MozPay API and compares the solution to JSON-LD, which is the mechanism used to achieve the use case and security requirements of the PaySwarm specification.

Before we start, it’s useful to have an example of what the current MozPay payment initiation message looks like. This message is generated by a MozPay Payment Provider and given to the browser to initiate a native purchase process:

jwt.encode({
  "iss": APPLICATION_KEY,
  "aud": "marketplace.firefox.com",
  "typ": "mozilla/payments/pay/v1",
  "iat": 1337357297,
  "exp": 1337360897,
  "request": {
    "id": "915c07fc-87df-46e5-9513-45cb6e504e39",
    "pricePoint": 1,
    "name": "Magical Unicorn",
    "description": "Adventure Game item",
    "icons": {
      "64": "https://yourapp.com/img/icon-64.png",
      "128": "https://yourapp.com/img/icon-128.png"
    },
    "productData": "user_id=1234&my_session_id=XYZ",
    "postbackURL": "https://yourapp.com/payments/postback",
    "chargebackURL": "https://yourapp.com/payments/chargeback"
  }
}, APPLICATION_SECRET)

The message is effectively a JSON Web Token. I say effectively because it seems like it breaks the JWT spec in subtle ways, but it may be that I’m misreading the JWT spec.

There are a number of issues with the message that we’ve had to deal with when creating the set of PaySwarm specifications. It’s important that we call those issues out first to get an understanding of the basic concerns with the MozPay API as it stands today. The comments below use the JWT code above as a reference point.

Unnecessarily Cryptic JSON Keys

...
  "iss": APPLICATION_KEY,
  "aud": "marketplace.firefox.com",
  "typ": "mozilla/payments/pay/v1",
  "iat": 1337357297,
  "exp": 1337360897,
...

This is more of an issue with the JOSE specs than it is the MozPay API. I can’t think of a good line of argumentation to shorten things like ‘issuer’ to ‘iss’ and ‘type’ to ‘typ’ (seriously :) , the ‘e’ was too much?). It comes off as 1980s protocol design, trying to save bits on the wire. Making code less readable by trying to save characters in a human-readable message format works against the notion that the format should be readable by a human. I had to look up what iss, aud, iat, and exp meant. The only reason that I could come up with for using such terse entries was that the JOSE designers were attempting to avoid conflicts with existing data in JWT claims objects. If this was the case, they should have used a prefix like “@” or “$”, or placed the data in a container value associated with a key like ‘claims’.

PaySwarm always attempts to use terminology that doesn’t require you to go and look at the specification to figure out basic things. For example, it uses creator for iss (issuer), validFrom for iat (issued at), and validUntil for exp (expire time).
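
As a rough sketch of the difference, the same claims could be expressed like this (the values are illustrative, not a normative PaySwarm message):

{
  "creator": "https://marketplace.example.com/i/application-1",
  "validFrom": "2013-11-05T13:15:30Z",
  "validUntil": "2013-11-05T14:15:30Z"
}

A developer debugging a failed purchase can read that message without a trip back to the specification.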

iss and APPLICATION_KEY

...
  "iss": APPLICATION_KEY,
...

The MozPay API specification does not require the APPLICATION_KEY to be a URL. Since it’s not a URL, it’s not discoverable. The application key is also specific to each Marketplace, which means that one Marketplace could use a UUID, another could use a URL, and so on. If the system is intended to be decentralized and interoperable, the APPLICATION_KEY should either be dereferenceable on the public Web without coordination with any particular entity, or a format for the key should be outlined in the specification.

All identities and keys used in digital signatures in PaySwarm use URLs for the identifiers that must contain key information in some sort of machine-readable format (RDFa and JSON-LD, for now). This means that 1) they’re Web-native, 2) they can be dereferenced, and 3) when they’re dereferenced, a client can extract useful data from the document retrieved.
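
For example, dereferencing a key URL could return a machine-readable JSON-LD document along these lines (an illustrative sketch; the context URL and property names are placeholders, not the normative PaySwarm vocabulary):

{
  "@context": "https://example.com/contexts/security.jsonld",
  "@id": "https://marketplace.example.com/i/application-1/keys/1",
  "owner": "https://marketplace.example.com/i/application-1",
  "publicKeyPem": "-----BEGIN PUBLIC KEY-----\n...\n-----END PUBLIC KEY-----"
}

Any system that receives a signed message can follow the key URL, retrieve the public key, and verify the signature without out-of-band coordination.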

Audience

...
  "aud": "marketplace.firefox.com",
...

It’s not clear what the aud parameter is used for in the MozPay API, other than to identify the marketplace.

Issued At and Expiration Time

...
  "iat": 1337357297,
  "exp": 1337360897,
...

The iat (issued at) and exp (expiration time) values are encoded as the number of seconds since January 1st, 1970. These are not very human-readable and make debugging issues with purchases more difficult than they need to be.

PaySwarm uses the W3C Date/Time format, whose values are human-readable strings that are also easy for machines to process. For example, November 5th, 2013 at 1:15:30 PM (Zulu / Universal Time) is encoded as 2013-11-05T13:15:30Z.

The Request

...
  "request": {
    "id": "915c07fc-87df-46e5-9513-45cb6e504e39",
    "pricePoint": 1,
    "name": "Magical Unicorn",
...

This object in the MozPay API is a description of the thing that is to be sold. Technically, it’s not really a request; the outer object is the request. There is a bit of a conflation of terminology here that should probably be fixed at some point.

In PaySwarm, the contents of the MozPay request value is called an Asset. An asset is a description of the thing that is to be sold.

Request ID

...
{
  "request": {
    "id": "915c07fc-87df-46e5-9513-45cb6e504e39",
...

The MozPay API encodes the request ID as a universally unique identifier (UUID). The major downside to this approach is that other applications can’t find the information on the Web to 1) discover more about the item being sold, 2) discuss the item being sold by referring to it by a universal ID, 3) feed it to a system that can read data published at the identifier address, and 4) index it for the purposes of searching.

The PaySwarm specifications use a URL as the identifier for assets and publish machine-readable data at the asset location so that other systems can discover more information about the item being sold, refer to the item being sold in discussions (like reviews of the item), start a purchase by referencing the URL, and index the item being sold so that it may be utilized in price-comparison and search engines.

Price Point

...
  "request": {
...
    "pricePoint": 1,
...

The pricePoint for the item being sold is currently a whole number. This is problematic because prices are usually decimal numbers with a fractional part and an associated currency.

PaySwarm publishes its pricing information in a currency-agnostic way that is compatible with all known monetary systems. Some of these systems include USD, EUR, JPY, RMB, Bitcoin, Brixton Pound, Bernal Bucks, Ven, and a variety of other alternative currencies. The amount is specified as a decimal with a fraction and a currency URL. A URL is utilized for the currency because PaySwarm allows arbitrary currencies to be created and managed externally to the PaySwarm system.
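
A hypothetical sketch of that shape looks like this (the property names are illustrative, not the exact PaySwarm vocabulary):

{
  "asset": "https://yourapp.com/assets/magical-unicorn",
  "amount": "0.99",
  "currency": "https://currencies.example.com/usd"
}

Expressing the amount as a string avoids floating-point rounding surprises, and the currency URL can identify USD just as easily as an alternative currency defined entirely outside of the PaySwarm system.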

Icons

...
  "request": {
...
    "icons": {
      "64": "https://yourapp.com/img/icon-64.png",
      "128": "https://yourapp.com/img/icon-128.png"
    },
...

Icon data is currently modeled in a way that is useful to developers by indexing the information as a square pixel size for the icon. This allows developers to access the data like so: icons.64 or icons.128. Values are image URLs, which is the right choice.

PaySwarm uses JSON-LD and can support this sort of data layout through a feature called data indexing. Another approach is to just have an array of objects for icons, which would allow us to include extended information about the icons. For example:

...
  "request": {
...
    "icon": [{"size": 64, "id": "https://yourapp.com/img/icon-64.png", "label": "Magical Unicorn"}, ...]
...
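
For reference, the data indexing feature mentioned above looks roughly like this in JSON-LD (the vocabulary IRI is illustrative); it preserves the developer-friendly icons.64 / icons.128 access pattern while still mapping the values to a real property:

{
  "@context": {
    "icons": {
      "@id": "https://yourapp.com/vocab#icon",
      "@container": "@index",
      "@type": "@id"
    }
  },
  "icons": {
    "64": "https://yourapp.com/img/icon-64.png",
    "128": "https://yourapp.com/img/icon-128.png"
  }
}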

Product Data

...
  "request": {
...
    "productData": "user_id=1234&my_session_id=XYZ",
...

If the payment technology we’re working on is going to be useful to society at large, we have to allow richer descriptions of products. For example, model numbers, rich markup descriptions, pictures, ratings, colors, and licensing terms are all important parts of a product description. The value needs to be larger than a 256 byte string and needs to support decentralized extensibility. For example, Home Depot should be able to list UPC numbers and internal reference numbers in the asset description and the payment protocol should preserve that extra information, placing it into digital receipts.

PaySwarm uses JSON-LD and thus supports decentralized extensibility for product data. This means that any vendor may express information about the asset in JSON-LD and it will be preserved in all digital contracts and digital receipts. This allows the asset and digital receipt format to be used as a platform that can be built on top of by innovative retailers. It also increases data fidelity by allowing far more detailed markup of asset information than what is currently allowed via the MozPay API.
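
As a sketch of what that extensibility looks like in practice (the context URL and vendor vocabulary below are made up for illustration), a vendor could describe an asset like this, and every property would survive into the digital contract and digital receipt:

{
  "@context": [
    "https://example.com/contexts/payswarm.jsonld",
    {"hd": "https://homedepot.example.com/vocab#"}
  ],
  "@id": "https://homedepot.example.com/assets/cordless-drill-18v",
  "title": "18V Cordless Drill",
  "hd:upc": "012345678905",
  "hd:internalReferenceNumber": "HD-4421-0098"
}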

Postback URL

...
  "request": {
...
    "postbackURL": "https://yourapp.com/payments/postback",
...

The postback URL is a pretty universal concept among Web-based payment systems. The payment processor needs a URL endpoint that the result of the purchase can be sent to. The postback URL serves this purpose.

PaySwarm has a similar concept, but just lists it in the request URL as ‘callback’.

Chargeback URL

...
  "request": {
...
    "chargebackURL": "https://yourapp.com/payments/chargeback"
...

The chargeback URL is a URL endpoint that is called whenever a refund is issued for a purchased item. It’s not clear if the vendor has a say in whether or not this should be allowed for a particular item. For example, what happens when a purchase is performed for a physical good? Should chargebacks be easy to do for those sorts of items?

PaySwarm does not build chargebacks into the core protocol. It lets the merchant request the digital receipt of the sale to figure out if the sale has been invalidated. It seems like a good idea to have a notification mechanism built into the core protocol. We’ll need more discussion on this to figure out how to correctly handle vendor-approved refunds and customer-requested chargebacks.

Conclusion

There are a number of improvements that could be made to the basic MozPay API that would enable more use cases to be supported in the future while keeping the level of complexity close to what it currently is. The second part of this analysis will examine the JSON Object Signing and Encryption (JOSE) technology stack and determine if there is a simpler solution that could be leveraged to satisfy the digital signature requirements set forth by the MozPay API.

[UPDATE: The second part of this analysis is now available]

Identifiers in JSON-LD and RDF

TL;DR: This blog post argues that the extension of blank node identifiers in JSON-LD and RDF for the purposes of identifying predicates and naming graphs is important. It is important because it simplifies the usage of both technologies for developers. The post also provides a less-optimal solution if the RDF Working Group does not allow blank node identifiers for predicates and graph names in RDF 1.1.

We need identifiers as humans to convey complex messages. Identifiers let us refer to a certain thing by naming it in a particular way. Not only do humans need identifiers, but our computers need identifiers to refer to data in order to perform computations. It is no exaggeration to say that our very civilization depends on identifiers to manage the complexity of our daily lives, so it is no surprise that people spend a great deal of time thinking about how to identify things. This is especially true when we talk about the people that are building the software infrastructure for the Web.

The Web has a very special identifier called the Uniform Resource Locator (URL). It is probably one of the best known identifiers in the world, mostly because everybody that has been on the Web has used one. URLs are great identifiers because they are very specific. When I give you a URL to put into your Web browser, such as the link to this blog post, I can be assured that when you put the URL into your browser, you will see what I see. URLs are globally scoped; they’re supposed to always take you to the same place.

There is another class of identifier on the Web that is not globally scoped and is only used within a document on the Web. In English, these identifiers are used when we refer to something as “that thing”, or “this widget”. We can really only use this sort of identifier within a particular context where the people participating in the conversation understand the context. Linguists call this concept deixis. “Thing” doesn’t always refer to the same subject, but based on the proper context, we can usually understand what is being identified. Our consciousness tags the “thing” that is being talked about with a tag of sorts and then refers to that thing using this pseudo-identifier. Most of this happens unconsciously (notice how your mind unconsciously tied the use of ‘this’ in this sentence to the correct concept?).

The take-away is that there are globally-scoped identifiers like URLs, and there are also locally-scoped identifiers that require a context in order to understand what they refer to.

JSON and JSON-LD

In JSON, developers typically express data like this:

{
  "name": "Joe"
}

Note how that JSON object doesn’t have an identifier associated with it. JSON-LD creates a straight-forward way of giving that object an identifier:

{
  "@context": ...,
  "@id": "http://example.com/people/joe",
  "name": "Joe"
}

Both you and I can refer to that object using http://example.com/people/joe and be sure that we’re talking about the same thing. There are times when assigning a global identifier to every piece of data that we create is not desired. For example, it doesn’t make much sense to assign an identifier to a transient message that is a request to get a sensor reading. This is especially true if there are millions of these types of requests and we never want to refer to the request once it has been transmitted. This is why JSON-LD doesn’t force developers to assign an identifier to the objects that they express. The people that created the technology understand that not everything needs a global identifier.

Computers are less forgiving; they need identifiers for almost everything, but a great deal of that complexity can be hidden from developers. When an identifier becomes necessary in order to perform computations upon the data, the computer can usually auto-generate an identifier for the data.
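
For example, if a JSON-LD processor is handed the document below (the foaf IRI is used here just to give "name" a global meaning for the sake of the sketch), it will mint a document-local identifier for the unnamed object on its own:

{
  "@context": {"name": "http://xmlns.com/foaf/0.1/name"},
  "name": "Joe"
}

Converted to RDF, the processor emits something like _:b0 <http://xmlns.com/foaf/0.1/name> "Joe" ., where _:b0 is an auto-generated blank node identifier that the developer never had to choose.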

RDF, Graphs, and Blank Node Identifiers

The Resource Description Framework (RDF) primarily uses an identifier called the Internationalized Resource Identifier (IRI). Where URLs can typically only express links in Western languages, an IRI can express links in almost every language in use today including Japanese, Tamil, Russian and Mandarin. RDF also defines a special type of identifier called a blank node identifier. This identifier is auto-generated and is locally scoped to the document. It’s an advanced concept, but is one that is pretty useful when you start dealing with transient data, where creating a global identifier goes beyond the intended usage of the data. An RDF-compatible program will step in and create blank node identifiers on your behalf, but only when necessary.

Both JSON-LD and RDF have the concept of a Statement, Graph, and a Dataset. A Statement consists of a subject, predicate, and an object (for example: “Dave likes cookies”). A Graph is a collection of Statements (for example: Graph A contains all the things that Dave said and Graph B contains all the things that Mary said). A Dataset is a collection of Graphs (for example: Dataset Z contains all of the things Dave and Mary said yesterday).

In JSON-LD, at present, you can use a blank node identifier for subjects, predicates, objects, and graphs. In RDF, you can only use blank node identifiers for subjects and objects. There are people, such as myself, in the RDF WG that think this is a mistake. There are people that think it’s fine. There are people that think it’s the best compromise that can be made at the moment. There is a wide field of varying opinions strewn between the various extremes.

The end result is that the current state of affairs has put us into a position where we may have to remove blank node identifier support for predicates and graphs from JSON-LD, which comes across as a fairly arbitrary limitation to those not familiar with the inner guts of RDF. Don’t get me wrong, I feel it’s a fairly arbitrary limitation. There are those in the RDF WG that don’t think it is, and that may prevent JSON-LD from being able to use what I believe is a very useful construct.

Document-local Identifiers for Predicates

Why do we need blank node identifiers for predicates in JSON-LD? Let’s go back to the first example in JSON to see why:

{
  "name": "Joe"
}

The JSON above is expressing the following Statement: “There exists a thing whose name is Joe.”

The subject is “thing” (aka: a blank node) which is legal in both JSON-LD and RDF. The predicate is “name”, which doesn’t map to an IRI. This is fine as far as the JSON-LD data model is concerned because “name”, which is local to the document, can be mapped to a blank node. RDF cannot model “name” because it has no way of stating that the predicate is local to the document since it doesn’t support blank nodes for predicates. Since the predicate doesn’t map to an IRI, it can’t be modeled in RDF. Finally, “Joe” is a string used to express the object and that works in both JSON-LD and RDF.

JSON-LD supports the use of blank nodes for predicates because there are some predicates, like every key used in JSON, that are local to the document. RDF does not support the use of blank nodes for predicates and therefore cannot properly model JSON.

Document-local Identifiers for Graphs

Why do we need blank node identifiers for graphs in JSON-LD? Let’s go back again to the first example in JSON:

{
  "name": "Joe"
}

The container of this statement is a Graph. Another way of writing this in JSON-LD is this:

{
  "@context": ...,
  "@graph": {
    "name": "Joe"
  }
}

However, what happens when you have two graphs in JSON-LD, and neither one of them is the RDF default graph?

{
  "@context": ...,
  "@graph": [
    {
      "@graph": {
        "name": "Joe"
      }
    }, 
    {
      "@graph": {
        "name": "Susan"
      }
    }
  ]
}

In JSON-LD, at present, it is assumed that a blank node identifier may be used to name each graph above. Unfortunately, in RDF, the only thing that can be used to name a graph is an IRI, and a blank node identifier is not an IRI. This puts JSON-LD in an awkward position; either JSON-LD can:

  1. Require that developers name every graph with an IRI, which seems like a strange demand because developers don’t have to name all subjects and objects with an IRI, or
  2. JSON-LD can auto-generate a regular IRI for each predicate and graph name, which seems strange because blank node identifiers exist for this very purpose (not to mention this solution won’t work in all cases, more below), or
  3. JSON-LD can auto-generate a special IRI for each predicate and graph name, which would basically re-invent blank node identifiers.

The Problem

The problem surfaces when you try to convert a JSON-LD document to RDF. If the RDF Working Group doesn’t allow blank node identifiers for predicates and graphs, then what do you use to identify predicates and graphs that have blank node identifiers associated with them in the JSON-LD data model? This is a feature we do want to support because there are a number of important use cases that it enables. The use cases include:

  1. Blank node predicates allow JSON to be mapped directly to the JSON-LD and RDF data models.
  2. Blank node graph names allow developers to use graphs without explicitly naming them.
  3. Blank node graph names make the RDF Dataset Normalization algorithm simpler.
  4. Blank node graph names prevent the creation of a parallel mechanism to generate and manage blank node-like identifiers.

It’s easy to see the problem exposed when performing RDF Dataset Normalization, which we need to do in order to digitally sign information expressed in JSON-LD and RDF. The rest of this post will focus on this area, as it exposes the problems with not supporting blank node identifiers for predicates and graph names. In JSON-LD, the two-graph document above could be normalized to this NQuads (subject, predicate, object, graph) representation:

_:bnode0 _:name "Joe" _:graph1 .
_:bnode1 _:name "Susan" _:graph2 .

This is illegal in RDF since you can’t have a blank node identifier in the predicate or graph position. Even if we were to use an IRI in the predicate position, the problem (of not being able to normalize “un-labeled” JSON-LD graphs like the ones in the previous section) remains.

The Solutions

This section will cover the proposed solutions to the problem in order from least desirable to most desirable.

Don’t allow blank node identifiers for predicates and graph names

Doing this in JSON-LD ignores the point of contention. The same line of argumentation can be applied to RDF. The point is that by forcing developers to name graphs using IRIs, we’re forcing them to do something that they don’t have to do with subjects and objects. There is no technical reason that has been presented where the use of a blank node identifier in the predicate or graph position is unworkable. Telling developers that they must name graphs using IRIs will be surprising to them, because there is no reason that the software couldn’t just handle that case for them. Requiring developers to do things that a computer can handle for them automatically is anti-developer and will harm adoption in the long run.

Generate fragment identifiers for graph names

One solution is to generate fragment identifiers for graph names. This, coupled with the base IRI would allow the data to be expressed legally in NQuads:

_:bnode0 <http://example.com/base#name> "Joe" <http://example.com/base#graph1> .
_:bnode1 <http://example.com/base#name> "Susan" <http://example.com/base#graph2> .

The above is legal RDF. The approach is problematic when you don’t have a base IRI, such as when JSON-LD is used as a messaging protocol between two systems. In that use case, you end up with something like this:

_:bnode0 <#name> "Joe" <#graph1> .
_:bnode1 <#name> "Susan" <#graph2> .

RDF requires absolute IRIs and so the document above is illegal from an RDF perspective. The other down-side is that you have to keep track of all fragment identifiers in the output and make sure that you don’t pick fragment identifiers that are used elsewhere in the document. This is fairly easy to do, but now you’re in the position of tracking and renaming both blank node identifiers and fragment IDs. Even if this approach worked, you’d be re-inventing the blank node identifier. This approach is unworkable for systems like PaySwarm that use transient JSON-LD messages across a REST API; there is no base IRI in this use case.

Skolemize to create identifiers for graph names

Another approach is skolemization, which is just a fancy way of saying: generate a unique IRI for the blank node when expressing it as RDF. The output would look something like this:

_:bnode0 <http://blue.example.com/.well-known/genid/2938570348579834> "Joe" <http://blue.example.com/.well-known/genid/348570293572375> .
_:bnode1 <http://blue.example.com/.well-known/genid/2938570348579834> "Susan" <http://blue.example.com/.well-known/genid/49057394572309457> .
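
To make the mechanics concrete, here is a minimal Python sketch of what a strict skolemizing step amounts to. The data and the authority are hypothetical, and this is an illustration of the general technique rather than any particular implementation: each blank node label appearing in a position that RDF disallows is replaced with a freshly minted genid IRI.

import uuid

def skolemize_disallowed_positions(quads, authority="http://blue.example.com"):
    # For each quad (subject, predicate, object, graph name), replace blank
    # node labels in the positions RDF disallows (predicate and graph name)
    # with freshly minted Skolem IRIs. Blank nodes in the subject position
    # are left alone, since RDF already permits them there.
    minted = {}  # blank node label -> Skolem IRI
    def mint(term):
        if term.startswith("_:"):
            if term not in minted:
                minted[term] = f"<{authority}/.well-known/genid/{uuid.uuid4().hex}>"
            return minted[term]
        return term
    return [(s, mint(p), o, mint(g)) for (s, p, o, g) in quads]

# Hypothetical stand-in for the two un-named graphs discussed earlier.
quads = [
    ("_:bnode0", "_:name", '"Joe"', "_:graph1"),
    ("_:bnode1", "_:name", '"Susan"', "_:graph2"),
]
for s, p, o, g in skolemize_disallowed_positions(quads):
    print(s, p, o, g, ".")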

This would be just fine if there were only one application reading and consuming the data. However, when we are talking about RDF Dataset Normalization, there are cases where two applications must read and independently verify the representation of a particular IRI. One scenario that illustrates the problem fairly nicely is the blind verification scenario. In this scenario, two applications de-reference an IRI to fetch a JSON-LD document. Each application must perform RDF Dataset Normalization and generate a hash of that normalization to see if they retrieved the same data. Based on a strict reading of the skolemization rules, Application A would generate this:

_:bnode0 <http://blue.example.com/.well-known/genid/2938570348579834> "Joe" <http://blue.example.com/.well-known/genid/348570293572375> .
_:bnode1 <http://blue.example.com/.well-known/genid/2938570348579834> "Susan" <http://blue.example.com/.well-known/genid/49057394572309457> .

and Application B would generate this:

_:bnode0 <http://red.example.com/.well-known/genid/J8Sfei8f792Fd3> "Joe" <http://red.example.com/.well-known/genid/j28cY82Pa88> .
_:bnode1 <http://red.example.com/.well-known/genid/J8Sfei8f792Fd3> "Susan" <http://red.example.com/.well-known/genid/k83FyUuwo89DF> .

Note how the two graphs would never hash to the same value because the Skolem IRIs are completely different. The RDF Dataset Normalization algorithm would have no way of knowing which IRIs are blank node stand-ins and which ones are legitimate IRIs. You could say that publishers are required to assign the skolemized IRIs to the data they publish, but that ignores the point of contention, which is that you don’t want to force developers to create identifiers for things that they don’t care to identify. You could argue that the publishing system could generate these IRIs, but then you’re still creating a global identifier for something that is specifically meant to be a document-scoped identifier.
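
To see why the hashes can never match, here is a deliberately simplified Python sketch of the blind verification step. The naive_canonicalize function below is a stand-in for the real RDF Dataset Normalization algorithm, not a faithful implementation; the only point it illustrates is that blank node labels get relabeled to canonical values while IRIs, including Skolem IRIs, pass through untouched.

import hashlib

def naive_canonicalize(quads):
    # Relabel blank nodes in the order they are first seen, then sort the
    # serialized quads. The real algorithm assigns canonical labels far more
    # carefully; this sketch only shows the role that relabeling plays.
    labels = {}
    def canon(term):
        if term.startswith("_:"):
            labels.setdefault(term, f"_:c{len(labels)}")
            return labels[term]
        return term  # IRIs and literals, including Skolem IRIs, are untouched
    return "\n".join(sorted(" ".join(canon(t) for t in q) + " ." for q in quads))

def dataset_hash(quads):
    return hashlib.sha256(naive_canonicalize(quads).encode("utf-8")).hexdigest()

# Hypothetical output of a strict skolemization as seen by two applications:
# the data is the same, but the minted Skolem IRIs are not.
app_a = [("_:b0", "<http://blue.example.com/.well-known/genid/1>", '"Joe"',
          "<http://blue.example.com/.well-known/genid/2>")]
app_b = [("_:b0", "<http://red.example.com/.well-known/genid/x>", '"Joe"',
          "<http://red.example.com/.well-known/genid/y>")]
print(dataset_hash(app_a) == dataset_hash(app_b))  # False -- verification fails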

A more lax reading of the Skolemization language might allow one to create a special type of Skolem IRI that could be detected by the RDF Dataset Normalization algorithm. For example, let’s say that, since the JSON-LD processor is the one creating these IRIs before they are handed to the RDF Dataset Normalization algorithm, we use the tag IRI scheme. The output would look like this for Application A:

_:bnode0 <tag:w3.org,2013:dsid:345> "Joe" <tag:w3.org,2013:dsid:254> .
_:bnode1 <tag:w3.org,2013:dsid:345> "Susan" <tag:w3.org,2013:dsid:363> .

and this for Application B:

_:bnode0 <tag:w3.org,2013:dsid:a> "Joe" <tag:w3.org,2013:dsid:b> .
_:bnode1 <tag:w3.org,2013:dsid:a> "Susan" <tag:w3.org,2013:dsid:c> .

The solution still doesn’t work on its own, but we could add another step to the RDF Dataset Normalization algorithm that would allow it to rename any IRI starting with tag:w3.org,2013:. Keep in mind that this is exactly the same thing that we do with blank nodes, and it’s effectively duplicating that functionality. The extra step would allow us to generate something like this for both applications doing a blind verification:

_:bnode0 <tag:w3.org,2013:dsid:predicate-1> "Joe" <tag:w3.org,2013:dsid:graph-1> .
_:bnode1 <tag:w3.org,2013:dsid:predicate-1> "Susan" <tag:w3.org,2013:dsid:graph-2> .

This solution does violate one strong suggestion in the Skolemization section:

Systems wishing to do this should mint a new, globally unique IRI (a Skolem IRI) for each blank node so replaced.

The generated IRI is definitely not globally unique, as there will be many tag:w3.org,2013:dsid:graph-1 IRIs in the world, each associated with completely different data. This approach also goes against something else in the Skolemization section, which states:

This transformation does not appreciably change the meaning of an RDF graph.

It’s true that using tag IRIs doesn’t change the meaning of the graph when you assume that the document will never find its way into a database. However, once you place the document in a database, it certainly creates the possibility of collisions in applications that are not aware of the special-ness of IRIs starting with tag:w3.org,2013:dsid:. The data is fine taken by itself, but a disaster when merged with other data. We would have to put a warning in some specification for systems to make sure to rename the incoming tag:w3.org,2013:dsid: IRIs to something that is unique to the storage subsystem. Keep in mind that this is exactly what is done when importing blank node identifiers into a storage subsystem. So, we’ve more-or-less re-invented blank node identifiers at this point.
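
To make the parallel explicit, here is a rough Python sketch (all of the names are hypothetical) of the extra import step a storage subsystem would need. It is the same relabeling dance that stores already perform for incoming blank node labels.

import itertools

DSID_PREFIX = "<tag:w3.org,2013:dsid:"

def import_quads(quads, local_ids):
    # Rename incoming document-scoped 'dsid' IRIs to identifiers that are
    # unique to this storage subsystem -- exactly what stores already do for
    # incoming blank node labels. The urn:store:local: scheme is illustrative.
    renamed = {}
    def localize(term):
        if term.startswith(DSID_PREFIX):
            if term not in renamed:
                renamed[term] = f"<urn:store:local:{next(local_ids)}>"
            return renamed[term]
        return term
    return [tuple(localize(t) for t in q) for q in quads]

# Two separate documents can both say <tag:w3.org,2013:dsid:graph-1> and mean
# completely different graphs, so the store has to keep them apart.
local_ids = itertools.count(1)
doc_one = [("_:b0", "<tag:w3.org,2013:dsid:predicate-1>", '"Joe"', "<tag:w3.org,2013:dsid:graph-1>")]
doc_two = [("_:b1", "<tag:w3.org,2013:dsid:predicate-1>", '"Susan"', "<tag:w3.org,2013:dsid:graph-1>")]
print(import_quads(doc_one, local_ids))
print(import_quads(doc_two, local_ids))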

Allow blank node identifiers for graph names

This leads us to the question: why not just extend RDF to allow blank node identifiers for predicates and graph names? Ideally, that’s what I would like to see happen in the future, as it places the least burden on developers and allows RDF to easily model JSON. The responses from the RDF WG are varied. These are all of the current arguments against it that I have heard:

There are other ways to solve the problem, like fragment identifiers and skolemization, than introducing blank nodes for predicates and graph names.

Fragment identifiers don’t work, as demonstrated above. There is really only one workable solution based on a very lax reading of skolemization, and as demonstrated above, even the best skolemization solution re-invents the concept of a blank node.

There are other use cases that are blocked by the introduction of blank node identifiers into the predicate and graph name position.

While this has been asserted, it is still unclear exactly what those use cases are.

Adding blank node identifiers for predicates and graph names will break legacy applications.

If blank nodes for predicates and graph names were illegal before, wouldn’t legacy applications reject that sort of input? The argument that there are bugs in legacy applications that make them not robust against this type of input is valid, but should that prevent the right solution from being adopted? There has been no technical reason put forward for why blank nodes for predicates or graph names cannot work, other than that software bugs prevent it.

The PaySwarm work has chosen to model the data in a very strange way.

The people that have been working on RDFa, JSON-LD, and the Web Payments specifications for the past 5 years have spent a great deal of time attempting to model the data in the simplest way possible, and in a way that is accessible to developers that aren’t familiar with RDF. Whether or not it seems strange is arguable, since this response is usually leveled by people not familiar with the Web Payments work. This blog post outlines a variety of use cases where the use of a blank node for predicates and graph naming is necessary. Stating that the use cases are invalid ignores the point of contention.

If we allow blank nodes to be used when naming graphs, then those blank nodes should denote the graph.

At present, RDF states that a graph named using an IRI may denote the graph or it may not. This is a fancy way of saying that the IRI used for the graph name may be an identifier for something completely different (like a person), even though de-referencing that IRI over the Web results in a graph about cars. I personally think that is a very dangerous concept to formalize in RDF, but there are others that have strong opinions to the contrary. The chances of this being changed in RDF 1.1 are next to none.

Others have argued that while that may be the case for IRIs, it doesn’t have to be the case for blank nodes that are used to name graphs. In this case, we can just state that the blank node denotes the graph because it couldn’t possibly be used for anything else since the identifier is local to the document. This makes a great deal of sense, but it is different from how an IRI is used to name a graph and that difference is concerning to a number of people in the RDF Working Group.

However, that is not an argument to disallow blank nodes from being used for predicates and graph names. The group could still allow blank nodes to be used for this purpose while stating that they may or may not be used to denote the graph.

The RDF Working Group does not have enough time left in its charter to make a change this big.

While this may be true, not making a decision on this is causing more work for the people working on JSON-LD and RDF Dataset Normalization. Having the tag:w3.org,2013:dsid: identifier scheme is also going to make many RDF-based applications more complex in the long run, resulting in a great deal more work than just allowing blank nodes for predicates and graph names.

Conclusion

I have a feeling that the RDF Working Group is not going to do the right thing on this one due to the time pressure of completing the work that they’ve taken on. The group has already requested, and has been granted, a charter extension. Another extension is highly unlikely, so the group wants to get everything wrapped up. This discussion could take several weeks to settle. That said, the solution that will most likely be adopted (a special tag-based skolem IRI) will cause months of work for people living in the JSON-LD and RDF ecosystem. The best solution in the long run would be to solve this problem now.

If blank node identifiers for predicates and graphs are rejected, here is the proposal that I think will move us forward while causing an acceptable amount of damage down the road:

  1. JSON-LD continues to support blank node identifiers for use as predicates and graph names.
  2. When converting JSON-LD to RDF, a special, relabelable IRI prefix of the form tag:w3.org,2013:dsid: will be used for blank nodes in the predicate and graph name position (a rough sketch of this conversion is shown below).
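
As a rough illustration of step 2, here is a minimal Python sketch of that conversion. The suffix scheme is illustrative rather than normative; reusing the blank node label as the document-scoped suffix is just one possible choice.

DSID_BASE = "tag:w3.org,2013:dsid:"

def to_rdf_term(term, position):
    # Blank nodes stay blank nodes in the subject and object positions; in
    # the predicate and graph name positions they are rewritten to special,
    # relabelable 'dsid' IRIs.
    if term.startswith("_:") and position in ("predicate", "graph"):
        return f"<{DSID_BASE}{term[2:]}>"
    return term

def jsonld_quads_to_rdf(quads):
    return [(to_rdf_term(s, "subject"), to_rdf_term(p, "predicate"),
             o, to_rdf_term(g, "graph")) for (s, p, o, g) in quads]

# The two-graph example from earlier in this post:
quads = [
    ("_:bnode0", "_:name", '"Joe"', "_:graph1"),
    ("_:bnode1", "_:name", '"Susan"', "_:graph2"),
]
for q in jsonld_quads_to_rdf(quads):
    print(" ".join(q), ".")
# _:bnode0 <tag:w3.org,2013:dsid:name> "Joe" <tag:w3.org,2013:dsid:graph1> .
# _:bnode1 <tag:w3.org,2013:dsid:name> "Susan" <tag:w3.org,2013:dsid:graph2> .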

Thanks to Dave Longley for proofing this blog post and providing various corrections.

Objection to Microdata Candidate Recommendation

Full disclosure: I’m the current chair of the standards group at the World Wide Web Consortium that created the newest version of RDFa, editor of the HTML5+RDFa 1.1 and RDFa Lite 1.1 specifications, and I’m also a member of the HTML Working Group.

Edit: 2012-12-01 – Updated the article to rephrase some things, and include rationale and counter-arguments at the bottom in preparation for the HTML WG poll on the matter.

The HTML Working Group at the W3C is currently trying to decide if it should transition the Microdata specification to the next stage in the standardization process. There has been a call for consensus to transition the spec to the Candidate Recommendation stage. The problem is that we already have a set of specifications that are official W3C Recommendations that do what Microdata does and more. RDFa 1.1 became an official W3C Recommendation last summer. From a standards perspective, this is a mistake and sends a confused signal to Web developers. Officially supporting two specifications that do almost exactly the same thing in almost exactly the same way is, ultimately, a failure to standardize.

The fact that RDFa already does what Microdata does has been elaborated upon before:

Mythical Differences: RDFa Lite vs. Microdata
An Uber-comparison of RDFa, Microdata, and Microformats

Here’s the problem in a nutshell: the W3C is thinking of ratifying two separate specifications that accomplish the same thing in basically the same way. The functionality of RDFa, which is already a W3C Recommendation, overlaps Microdata by a large margin. In fact, RDFa Lite 1.1 was developed as a drop-in replacement for Microdata. The full version of RDFa can also do a number of things that Microdata cannot, such as datatyping, associating more than one type per object, embeddability in languages other than HTML, the ability to easily publish and mix vocabularies, etc.

Microdata would have easily been dead in the water had it not been for two simple facts: 1) the editor of the specification works at Google, and 2) Google pushed Microdata as the markup language for schema.org before also accepting RDFa markup. The first enabled Google and the editor to work on schema.org without signalling to the public that it was creating a competitor to Facebook’s Open Graph Protocol. The second gave Microdata enough of a jump start to establish a foothold for schema.org markup. There have been a number of studies showing that Microdata’s sole use case (99% of Microdata markup) is the markup of schema.org terms. Microdata is not widely used outside of that context; we now have data to back up what we had predicted would happen when schema.org made its initial announcement of Microdata-only support. Note that schema.org now supports both RDFa and Microdata.

It is typically a bad idea to have two formats published by the same organization that do the same thing. It leads to Web developer confusion surrounding which format to use. One of the goals of Web standards is to reduce, or preferably eliminate, the confusion surrounding the correct technology decision to make. The HTML Working Group and the W3C are failing miserably on this front. There is more confusion today about picking Microdata or RDFa because they accomplish the same thing in effectively the same way. The only reason both exist is politics.

If we step back and look at the technical arguments, there is no compelling reason that Microdata should be a W3C Recommendation. There is no compelling reason to have two specifications that do the same thing in basically the same way. Therefore, as a member of the HTML Working Group (not as a chair or editor of RDFa) I object to the publication of Microdata as a Candidate Recommendation.

Note that this is not a W3C formal objection. This is an informal objection to publish Microdata along the Recommendation track. This objection will not become an official W3C formal objection if the HTML Working Group holds a poll to gather consensus around whether Microdata should proceed along the Recommendation publication track. I believe the publication of a W3C Note will continue to allow Google to support Microdata in schema.org, but will hopefully correct the confused message that the W3C has been sending to Web developers regarding RDFa and Microdata. We don’t need two specifications that do almost exactly the same thing.

The message sent by the W3C needs to be very clear: There is one recommendation for doing structured data markup in HTML. That recommendation is RDFa. It addresses all of the use cases that have been put forth by the general Web community, and it’s ready for broad adoption and implementation today.

If you agree with this blog post, make sure to let the HTML Working Group know that you do not think that the W3C should ratify two specifications that do almost exactly the same thing in almost exactly the same way. Now is the time to speak up!

Summary of Facts and Arguments

Below is a summary of arguments presented as a basis for publishing Microdata along the W3C Note track:

  1. RDFa 1.1 is already a ratified Web standard as of June 7th 2012 and absorbed almost every Microdata feature before it became official. If the majority of the differences between RDFa and Microdata boil down to different attribute names (property vs. itemprop), then the two solutions have effectively converged on syntax and W3C should not ratify two solutions that do effectively the same thing in almost exactly the same way.
  2. RDFa is supported by all of the major search crawlers, including Google (and schema.org), Microsoft, Yahoo!, Yandex, and Facebook. Microdata is not supported by Facebook.
  3. RDFa Lite 1.1 is feature-equivalent to Microdata. Over 99% of Microdata markup can be expressed easily in RDFa Lite 1.1. Converting from Microdata to RDFa Lite is as simple as a search and replace of the Microdata attributes with RDFa Lite attributes. Conversely, Microdata does not support a number of the more advanced RDFa features, like being able to tell the difference between feet and meters.
  4. You can mix vocabularies with RDFa Lite 1.1, supporting both schema.org and Facebook’s Open Graph Protocol (OGP) using a single markup language. You don’t have to learn Microdata for schema.org and RDFa for Facebook – just use RDFa for both.
  5. The creator of the Microdata specification doesn’t like Microdata. When people are not passionate about the solutions that they create, the desire to work on those solutions and continue to improve upon them is muted. The RDFa community is passionate about the technology that it has created together and has strived to make it better since the standardization of RDFa 1.0 back in 2008.
  6. RDFa Lite 1.1 is fully upward-compatible with RDFa 1.1, allowing you to seamlessly migrate to a more feature-rich language as your Linked Data needs grow. Microdata does not support any of the more advanced features provided by RDFa 1.1.
  7. RDFa deployment is broader than Microdata. RDFa deployment continues to grow at a rapid pace.
  8. The economic damage generated by publishing both RDFa and Microdata along the Recommendation track should not be underestimated. W3C should try to provide clear direction in an attempt to reduce the economic waste that a “let the market sort it out among two nearly identical solutions” strategy will generate. At some point, the market will figure out that both solutions are nearly identical, but only after publishing and building massive amounts of content and tooling for both.
  9. The W3C Technical Architecture Group (TAG), which is responsible for ensuring that the core architecture of the Web is sound, has raised their concern about the publication of both Microdata and RDFa as recommendations. After the W3C TAG raised their concerns, the RDFa Working Group created RDFa Lite 1.1 to be a near feature-equivalent replacement for Microdata that was also backwards-compatible with RDFa 1.0.
  10. Publishing a standard that does almost exactly the same thing as an existing standard in almost exactly the same way is a failure to standardize.

Counter-arguments and Rebuttals

[This is a] classic case of monopolistic anti-competitive protectionism.

No, this is an objection to publishing two specifications that do almost exactly the same thing in almost exactly the same way along the W3C Recommendation publication track. Protectionism would have asked that all work on Microdata be stopped and the work scuttled. The proposed resolution does not block anybody from using Microdata, nor does it try to stop or block the Microdata work from happening in the HTML WG. The objection asks that the W3C decide what the best path forward for Web developers is based on a fairly complicated set of predicted outcomes. This is not an easy decision. The objection is intended to ensure that the HTML Working Group has this discussion before we proceed to Candidate Recommendation with Microdata.

<manu1> I'd like the W3C to work as well, and I think publishing two specs that accomplish basically 
        the same thing in basically the same way shows breakage.
<annevk> Bit late for that. XDM vs DOM, XPath vs Selectors, XSL-FO vs CSS, XSLT vs XQuery, 
         XQuery vs XQueryX, RDF/XML vs Turtle, XForms vs Web Forms 2.0, 
         XHTML 1.0 vs HTML 4.01, XML 1.0 4th Edition vs XML 1.0 5th Edition, 
         XML 1.0 vs XML 1.1, etc.

[link to full conversation]

While W3C does have a history of publishing competing specifications, there have been features in each competing specification that were compelling enough to warrant the publication of both standards. For example, XHTML 1.0 provided a standard set of rules for validating documents that was aligned with XML, as well as a decentralized extension mechanism, neither of which HTML 4.01 provided. Those two major features were viewed as compelling enough to publish both specifications as Recommendations via W3C.

For authors, the differences between RDFa and Microdata are so small that, for 99% of documents in the wild, you can convert a Microdata document to an RDFa Lite 1.1 document with a simple search and replace of attribute names. That demonstrates that the syntaxes for both languages are different only in the names of the HTML attributes, and that does not seem like a very compelling reason to publish both specifications as Recommendations.

Microdata’s processing algorithm is vastly simpler, which makes the data extracted more reliable and, when something does go wrong, makes it easier for 1) users to debug their own data, and 2) easier for me to debug it if they can’t figure it out on their own.

Microdata’s processing algorithm is simpler for two major reasons:

The complexity of implementing a processor has little bearing on how easy it is for developers to author documents. For example, XHTML 1.0 had a simpler processing model, which made the extracted data more reliable and, when something went wrong, easier to debug. However, HTML5 supports more use cases and recovers from errors where it can, which made it more popular with Web developers in the long run.

Additionally, authors of Microdata and RDFa should be using tools like RDFa Play to debug their markup. This is true for any Web technology. We debug our HTML, JavaScript, and CSS by loading it into a browser and bringing up the debugging tools. This is no different for Microdata and RDFa. If you want to make sure your markup does what you want, make sure to verify it by using a tool and not by trying to memorize the processing rules and running them through your head.

For what it is worth, I personally think RDFa is generally a technically better solution. But as Marcos says, “so what”? Our job at W3C is to make standards for the technology the market decides to use.

If we think one of these technologies is a technically better solution than the other one, we should signal that realization at some level. The most basic thing we could do is to make one an official Recommendation, and the other a Note. I also agree that our job at W3C is to make standards that the technology market decides to use, but clearly this particular case isn’t that cut-and-dried. Schema.org’s only option in the beginning was to use Microdata, and since authors didn’t want to risk not showing up in the search engines, they used Microdata. This forced the market to go in one direction.

This discussion would be in a different place had Google kept the playing field level. That is not to say that Google didn’t have good reasons for making the decisions that they did at the time, but those reasons influenced the development of RDFa, and RDFa Lite 1.1 was the result. The differences between Microdata and RDFa have been removed and a new question is in front of us: given two almost identical technologies, should the W3C publish two specifications that do almost exactly the same thing in almost exactly the same way?

… the [HTML] Working Group explicitly decided not to pick a winner between HTML Microdata and HTML+RDFa

The question before the HTML WG at the time was whether or not to split Microdata out of the HTML5 specification. The HTML Working Group did not discuss whether the publishing track for the Microdata document should be the W3C Note track or the W3C Recommendation track. At the time the decision was made, RDFa Lite 1.1 did not exist and was not a W3C Recommendation, nor did the RDFa and Microdata functionality overlap as greatly as it does now. Additionally, the HTML WG decision at that time states the following under the “Revisiting the issue” section:

“If Microdata and RDFa converge in syntax…”

Microdata and RDFa have effectively converged in syntax. Since Microdata can be interpreted as RDFa with a simple search-and-replace of attributes, the two languages now differ only in their attribute names. The proposal is not to have work on Microdata stopped. Let work on Microdata proceed in this group, but let it proceed on the W3C Note publication track.

Closing Statements

I felt uneasy raising this issue because it’s a touchy and painful subject for everyone involved. Even if the discussion is painful, it is a healthy one for a standardization body to have from time to time. What I wanted was for the HTML Working Group to have this discussion. If the upcoming poll finds that the consensus of the HTML Working Group is to continue with the Microdata specification along the Recommendation track, I will not pursue a W3C Formal Objection. I will respect whatever decision the HTML Working Group makes as I trust the Chairs of that group, the process that they’ve put in place, and the aggregate opinion of the members in that group. After all, that is how the standardization process is supposed to work and I’m thankful to be a part of it.

The Problem with RDF and Nuclear Power

Full disclosure: I am the chair of the RDFa Working Group, the JSON-LD Community Group, a member of the RDF Working Group, as well as other Semantic Web initiatives. I believe in this stuff, but am critical about the path we’ve been taking for a while now.

The Resource Description Framework (a model for publishing data on the Web) has a horrible public perception problem, akin to how many people in the USA view nuclear power. The coal industry campaigned quite aggressively to implant the notion that nuclear power was not as safe as coal. Couple this public misinformation campaign with a few nuclear-power-related catastrophes and it is no surprise that the current public perception toward nuclear power can be summarized as: “Not in my back yard”. Never mind that, per terawatt, nuclear power generation has killed far fewer people since its inception than coal. Never mind that it is one of the more viable power sources if we gaze hundreds of years into Earth’s future, especially with the recent renewed interest in Liquid Fluoride Thorium Reactors. When we look toward the future, the path is clear, but public perception is preventing us from proceeding down that path at the rate that we need to in order to prevent more damage to the Earth.

RDF shares a number of these similarities with nuclear power. RDF is one of the best data modeling mechanisms that humanity has created. Looking into the future, there is no equally-powerful, viable alternative. So, why has progress been slow on this very exciting technology? There was no public mis-information campaign, so where did this negative view of RDF come from?

In short, RDF/XML was the Semantic Web’s Three Mile Island incident. When it was released, developers confused RDF/XML (bad) with the RDF data model (good). There weren’t enough people and time to counteract the negative press that RDF was receiving as a result of RDF/XML, and thus we are where we are today because of this negative perception of RDF. Even Wikipedia’s page on the matter seems to imply that RDF/XML is RDF. Some purveyors of RDF think that the public perception problem isn’t that bad. I think that when developers hear RDF, they think: “Not in my back yard”.

The solution to this predicament: Stop mentioning RDF and the Semantic Web. Focus on tools for developers. Do more dogfooding.

To explain why we should adopt this strategy, we can look to Tesla for inspiration. Elon Musk, co-founder of PayPal and now the CEO of Tesla Motors, recently announced the Tesla Supercharger project. At a high level, the project accomplishes the following jaw-dropping things:

  1. It creates a network of charging stations for electric cars that are capable of charging a Tesla in less than 30 minutes.
  2. The charging stations are solar powered and generate more electricity than the cars use, feeding the excess power into the local power grid.
  3. The charging stations are free to use for any person that owns a Tesla vehicle.
  4. The charging stations are operational and available today.

This means that, in 4-5 years, any owner of a Tesla vehicle will be able to drive anywhere in the USA, for free, powered by the sun. No person in their right mind (with the money) would pass up that offer. No fossil fuel-based company will ever be able to provide “free”, clean energy. This is the sort of proposition we, the RDF/Linked Data/Semantic Web community, need to make; I think we can re-position ourselves to do just that.

Here is what the RDF and Linked Data community can learn from Tesla:

  1. The message shouldn’t be about the technology. It should be about the problems we have today and a concrete solution on how to address those problems.
  2. Demonstrate real value. Stop talking about the beauty of RDF, theoretical value, or design. Deliver production-ready, open-source software tools.
  3. Build a network of believers by spending more of your time working with Web developers and open-source projects to convince them to publish Linked Data. Dogfood our work.

Here is how we’ve applied these lessons to the JSON-LD work:

  1. We don’t mention RDF in the specification, unless absolutely necessary, and in many cases it isn’t necessary. RDF is plumbing, it’s in the background, and developers don’t need to know about it to use JSON-LD.
  2. We purposefully built production-ready tools for JSON-LD from day one; a playground, multiple production-ready implementations, and a JavaScript implementation of the browser-based API.
  3. We are working with Wikidata, Wikimedia, Drupal, the Web Payments and Read Write Web groups at W3C, and a number of other private clients to ensure that we’re providing real value and dogfooding our work.

Ultimately, RDF and the Semantic Web are of no interest to Web developers. They also have a really negative public perception problem. We should stop talking about them. Let’s shift the focus to be on Linked Data, explaining the problems that Web developers face today, and concrete, demonstrable solutions to those problems.

Note: This post isn’t meant as a slight against any one person or group. I was just working on the JSON-LD spec, aggressively removing prose discussing RDF, and the analogy popped into my head. This blog post was an exercise in organizing my thoughts on the matter.

HTML5 and RDFa 1.1

Full disclosure: I’m the chair of the newly re-chartered RDFa Working Group at the W3C as well as a member of the HTML WG.

The newly re-chartered RDFa Working Group at the W3C published a First Public Working Draft of HTML5+RDFa 1.1 today. This might be confusing to those of you that have been following the RDFa specifications. Keep in mind that HTML5+RDFa 1.1 is different from XHTML+RDFa 1.1, RDFa Core 1.1, and RDFa Lite 1.1 (which are official specs at this point). This is specifically about HTML5 and RDFa 1.1. The HTML5+RDFa 1.1 spec reached Last Call (aka: almost done) status at W3C via the HTML Working Group last year. So, why are we doing this now and what does it mean for the future of RDFa in HTML5?

Here’s the issue: the document was being unnecessarily held up by the HTML5 specification. In the most favorable scenario, HTML5 is expected to become an official standard in 2014. RDFa Core 1.1 became an official standard in June 2012. Per the W3C process, HTML5+RDFa 1.1 would have had to wait until 2014 to become an official W3C specification, even though it would be ready to go in a few months from now. W3C policy states that all specs that your spec depends on must reach the official spec status before your spec becomes official. Since HTML5+RDFa 1.1 is a language profile for RDFa 1.1 that is layered on top of HTML5, it had no choice but to wait for HTML5 to become official. Boo.

Thankfully the chairs of the HTML WG, RDFa WG, and W3C staff found an alternate path forward for HTML5+RDFa 1.1. Since the specification doesn’t depend on any “at risk” features in HTML5, and since all of the features that RDFa 1.1 uses in HTML5 have been implemented in all of the Web browsers, there is very little chance that those features will be removed in the future. This means that HTML5+RDFa 1.1 could become an official W3C specification before HTML5 reaches that status. So, that’s what we’re going to try to do. Here’s the plan:

  1. Get approval from W3C member companies to re-charter the RDFa WG to take over publishing responsibility of HTML5+RDFa 1.1. [Done]
  2. Publish the HTML5+RDFa 1.1 specification under the newly re-chartered RDFa WG. [Done]
  3. Start the clock on a new patent exclusion period and resolve issues. Wait a minimum of 6 months to go to W3C Candidate Recommendation (feature freeze) status, due to patent policy requirements.
  4. Fast-track to an official W3C specification (test suite is already done, inter-operable implementations are already done).

There are a few minor issues that still need to be ironed out, but the RDFa WG is on the job and those issues will get resolved in the next month or two. If everything goes according to plan, we should be able to publish HTML5+RDFa 1.1 as an official W3C standard in 7-9 months. That’s good for RDFa, good for Web Developers, and good for the Web.

Mythical Differences: RDFa Lite vs. Microdata

Full disclosure: I’m the current chair of the standards group at the World Wide Web Consortium that created the newest version of RDFa.

RDFa 1.1 became an official Web specification last month. Google started supporting RDFa in Google Rich Snippets some time ago and has recently announced that they will support RDFa Lite for schema.org as well. These announcements have led to a weekly increase in the number of times the following question is asked by Web developers on Twitter and Google+:

“What should I implement on my website? Microdata or RDFa?”

This blog post attempts to answer the question once and for all. It dispels some of the myths around the Microdata vs. RDFa debate and outlines how the two languages evolved to solve the same problem in almost exactly the same way.

 

Here’s the short answer for those of you that don’t have the time to read this entire blog post: Use RDFa Lite – it does everything important that Microdata does, it’s an official standard, and has the strongest deployment of the two.

Functionally Equivalent

Microdata was initially designed as a simple subset of RDFa and Microformats, primarily focusing on the core features of RDFa. Unfortunately, when this was done, the choice was made to break compatibility with RDFa and effectively fork the specification. Conversely, RDFa Lite highlights the same subset of RDFa that Microdata chose, but does so in a way that does not break backwards compatibility with RDFa. This was done on purpose, so that Web developers wouldn’t have a hard decision in front of them.

RDFa Lite contains all of the simplicity of Microdata coupled with the extensibility of and compatibility with RDFa. This is an important point that is often lost in the debate – there is no solid technical reason for choosing Microdata over RDFa Lite anymore. There may have been a year ago, but RDFa Lite made a few tweaks in such a way as to achieve feature-parity with Microdata today while being able to do much more than Microdata if you ever need the flexibility. If you don’t want to code yourself into a corner – use RDFa Lite.

To examine why RDFa Lite is a better choice, let’s take a look at the markup attributes for Microdata and the functionally equivalent ones provided by RDFa Lite:

Microdata 1.0 | RDFa Lite 1.1 | Purpose
itemid | resource | Used to identify the exact thing that is being described using a URL, such as a specific person, event, or place.
itemprop | property | Used to identify a property of the thing being described, such as a name, date, or location.
itemscope | not needed | Used to signal that a new thing is being described.
itemtype | typeof | Used to identify the type of thing being described, such as a person, event, or place.
itemref | not needed | Used to copy-paste a piece of data and associate it with multiple things.
not supported | vocab | Used to specify a default vocabulary that contains terms that are used by markup.
not supported | prefix | Used to mix different vocabularies in the same document, like ones provided by Facebook, Google, and open source projects.

As you can see above, both languages have exactly the same number of attributes. There are nuanced differences on what each attribute allows one to do, but Web developers only need to remember one thing from this blog post: Over 99% of all Microdata markup in the wild can be expressed in RDFa Lite just as easily. This is a provable fact – replace all Microdata attributes with the equivalent RDFa Lite attributes, add vocab="http://schema.org/" to the markup block, and you’re done.
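
For the curious, that claim is easy to test mechanically. The following is a rough, regular-expression-based Python sketch of the search-and-replace described above; it is for illustration only and is not a robust HTML rewriter.

import re

# Microdata attribute -> RDFa Lite attribute (from the table above).
ATTRIBUTE_MAP = {
    "itemid": "resource",
    "itemprop": "property",
    "itemtype": "typeof",
}

def microdata_to_rdfa_lite(html):
    # Rename the mapped attributes and drop itemscope; the author then adds
    # vocab="http://schema.org/" to the enclosing markup block.
    for md_attr, rdfa_attr in ATTRIBUTE_MAP.items():
        html = re.sub(rf"\b{md_attr}(?==)", rdfa_attr, html)
    return re.sub(r"\s*\bitemscope\b", "", html)

example = ('<div itemscope itemtype="http://schema.org/Product">'
           '<span itemprop="name">Dell UltraSharp 30" LCD Monitor</span></div>')
print(microdata_to_rdfa_lite(example))
# <div typeof="http://schema.org/Product"><span property="name">...</span></div>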

At this point, you may be asking yourself why the two languages are so similar. There are almost 8 years of history here, but to summarize: RDFa was created around the 2004 time frame; Microdata came much later and used RDFa as a design template. Microdata chose a subset of the original RDFa design to support, but did so in an incompatible way. RDFa Lite then highlighted the same subset of functionality that Microdata did, but in a way that is backwards compatible with RDFa. RDFa Lite did this while keeping the flexibility of the original RDFa intact.

That leaves us where we are today – with two languages, Microdata and RDFa Lite, that accomplish the same things using the same markup patterns. The reason both exist is a very long story involving politics, egos, and a fair amount of dysfunction between various standards groups – all of which has no bearing on the actual functionality of either language. The bottom line is that we now have two languages that do almost exactly the same thing. One of them, RDFa Lite 1.1, is currently an official standard. The other one, Microdata, probably won’t become a standard until 2014.

Markup Similarity

The biggest deployment of Microdata on the Web is for implementing the schema.org vocabulary by Google. Recently, with the release of RDFa Lite 1.1, Google has announced their intent to “officially” support RDFa as well. To see what this means for Web developers, let’s take a look at some markup. Here is a side-by-side comparison of two markup examples – one in Microdata and another in RDFa Lite 1.1:

Microdata 1.0:

<div itemscope itemtype="http://schema.org/Product">
  <img itemprop="image" src="dell-30in-lcd.jpg" />
  <span itemprop="name">Dell UltraSharp 30" LCD Monitor</span>
</div>

RDFa Lite 1.1:

<div vocab="http://schema.org/" typeof="Product">
  <img property="image" src="dell-30in-lcd.jpg" />
  <span property="name">Dell UltraSharp 30" LCD Monitor</span>
</div>

If the markup above looks similar to you, that was no accident. RDFa Lite 1.1 is designed to function as a drop-in replacement for Microdata.

The Bits that Don’t Matter

Only two features of Microdata aren’t supported by RDFa Lite: itemref and itemscope. Regarding itemref, the RDFa Working Group discussed the addition of that attribute and, upon reviewing Microdata markup in the wild, saw almost no use of itemref in production code. The schema.org examples steer clear of using itemref as well, so it was fairly clear that itemref is, and will continue to be, an unused feature of Microdata. The itemscope attribute is redundant in RDFa Lite and is thus unnecessary.

5 Reasons

For those of you that still are not convinced, here are the top five reasons that you should pick RDFa Lite 1.1 over Microdata:

  1. RDFa is supported by all of the major search crawlers, including Google (and schema.org), Microsoft, Yahoo!, Yandex, and Facebook. Microdata is not supported by Facebook.
  2. RDFa Lite 1.1 is feature-equivalent to Microdata. Over 99% of Microdata markup can be expressed easily in RDFa Lite 1.1. Converting from Microdata to RDFa Lite is as simple as a search and replace of the Microdata attributes with RDFa Lite attributes. Conversely, Microdata does not support a number of the more advanced RDFa features, like being able to tell the difference between feet and meters.
  3. You can mix vocabularies with RDFa Lite 1.1, supporting both schema.org and Facebook’s Open Graph Protocol (OGP) using a single markup language. You don’t have to learn Microdata for schema.org and RDFa for Facebook – just use RDFa for both.
  4. RDFa Lite 1.1 is fully upward-compatible with RDFa 1.1, allowing you to seamlessly migrate to a more feature-rich language as your Linked Data needs grow. Microdata does not support any of the more advanced features provided by RDFa 1.1.
  5. RDFa deployment is greater than Microdata. RDFa deployment continues to grow at a rapid pace.

Hopefully the reasons above are enough to convince most Web developers that RDFa Lite is the best bet for expressing Linked Data in web pages, boosting your search engine result page rank, and ensuring that you’re future-proofing your website as your data markup needs grow over the next several years. If they’re not, please leave a comment below explaining why you’re still not convinced.

If you’d like to learn more about RDFa, try the rdfa.info website. If you’d like to see more RDFa Lite examples and play around with the live RDFa editor, check out RDFa Play.

Thanks to Tattoo Tabatha for the artwork in this blog piece.

Blindingly Fast RDFa 1.1 Processing

The fastest RDFa processor in the world just got a big update – librdfa 1.1 has just been released! librdfa is a SAX-based RDFa processor, written in pure C – which makes it very portable to a variety of different software and hardware architectures. It’s also tiny and fast – the binary is smaller than this web page (around 47KB), and it’s capable of extracting roughly 5,000 triples per second per CPU core from an HTML or XML document. If you use Raptor or the Redland libraries, you use librdfa.

The timing for this release coincides with the push for a full standard at W3C for RDFa 1.1. The RDFa 1.1 specification has been in feature-freeze for over a month and is proceeding to W3C vote to finalize it as an officially recognized standard. There are now 5 fully conforming implementations for RDFa in a variety of languages – librdfa in C, PyRDFa in Python, RDF::RDFa in Ruby, Green Turtle in JavaScript, and clj-rdfa in Clojure.

It took about a month of spare-time hacking on librdfa to update it to support RDFa 1.1. It has also been given a new back-end document processor: a migration from libexpat to libxml2 was performed in order to better support processing of badly authored HTML documents as well as well-formed XML documents. Support for all of the new features in RDFa 1.1 has been added, including the @vocab, @prefix, and @inlist attributes. Full support for RDFa Lite 1.1 has also been included. A great deal of time was also put into making sure that there were absolutely no memory leaks or pointer issues across all 700+ tests in the RDFa 1.1 Test Suite. There is still some work that needs to be done to add HTML5 @datetime attribute support and fix xml:base processing in SVG files, but that’s fairly small stuff that will be implemented over the next month or two.

Many thanks to Daniel Richard G., who updated the build system to be more cross-platform and pure C compliant on a variety of different architectures. Also thanks to Dave Longley who fixed the very last memory leak, which turned out to be a massive pain to find and resolve. This version of librdfa is ready for production use for processing all XML+RDFa and XHTML+RDFa documents. This version also supports both RDFa 1.0 and RDFa 1.1, as well as RDFa Lite 1.1. While support for HTML5+RDFa is 95% of the way there, I expect that it will be 100% in the next month or two.

Google Indexing RDFa 1.0 + schema.org Markup

Full disclosure: I am the chair of the RDF Web Applications Working Group at the World Wide Web Consortium – RDFa is one of the technologies that we’re working on.

Google is building a gigantic Knowledge Graph that will change search forever. The purpose of the graph is to understand the conceptual “things” on a web page and produce better search results for the world. Clearly, the people and companies that end up in this Knowledge Graph first will have a huge competitive advantage over those that do not. So, what can you do today to increase your organization’s chances of ending up in this Knowledge Graph, and thus ending up higher in the Search Engine Result Pages (SERPs)?

One possible approach is to mark your pages up with RDFa and schema.org. "But wait," you might ask, "schema.org doesn't support RDFa, does it?" While schema.org launched with only Microdata support, Google has said that they will support RDFa 1.1 Lite, which is slated to become an official specification in the next couple of months.

However, that doesn’t mean that the Google engineers are sitting still while the RDFa 1.1 spec moves toward official standard status. RDFa 1.0 became an official specification in October 2008. Many people have been wondering if Google would start indexing RDFa 1.0 + schema.org markup while we wait for RDFa 1.1 to become official. We have just discovered that Google is not only indexing schema.org expressed as RDFa 1.0, but they’re enhancing search result listings based on data gleaned from schema.org markup expressed as RDFa 1.0!

Here’s what it looks like in the live Google search results:

[Image: Enhanced Google search result showing event information]
The image above shows a live, enhanced Google search result with event information extracted from the RDFa 1.0 + schema.org data, including date and location of the event.

[Image: Enhanced Google search result showing recipe preparation time information]
The image above shows a live, enhanced Google search result with recipe preparation time information, also extracted from the RDFa 1.0 + schema.org data that was on the page.

[Image: Enhanced Google search result showing detailed event information with click-able links]
The image above shows a live, enhanced Google search result with very detailed event information gleaned from the RDFa 1.0 + schema.org data, including date, location and links to the individual event pages.

Looking at the source code for the pages above, a few things become apparent:

  1. All of the pages contain a mixture of RDFa 1.0 + schema.org markup. There is no Microformats or Microdata markup used to express the data shown in the live search listings. The RDFa 1.0 + schema.org data is definitely being used in live search listing displays.
  2. The Drupal schema.org module seems to be used for all of the pages, so if you use Drupal, you will probably want to install that module if you want the benefit of enhanced Google search listings.
  3. The search and social companies are serious about indexing RDFa content, which means that you may want to get serious about adding it into your pages before your competitors do.

Google isn’t the only company that is building a giant global graph of knowledge. Last year, Facebook launched a similar initiative called the Open Graph, which is also built on RDFa. The end result of all of this work is better search listings, more relevant social interactions, and a more unified way of expressing "things" in Web pages using RDFa.

Does your website talk about any of the following things: Applications, Authors, Events, Movies, Music, People, Products, Recipes, Reviews, and/or TV Episodes? If so, you should probably be expressing that structured data as RDFa so that both Facebook and Google can give you better visibility over those that don’t in the coming years. You can get started by viewing the RDFa schema.org examples or reading more about Facebook’s Open Graph markup. If you don’t know anything about RDFa, you may want to start with the RDFa Lite document, or the RDFa Primer.

Many thanks to Stéphane Corlosquet for spotting this and creating the Drupal 7 schema.org module. Also thanks to Chris Olafson for spotting that RDFa 1.0 + schema.org markup is now consistently being displayed in live Google search results.

Searching for Microformats, RDFa, and Microdata Usage in the Wild

A few weeks ago, we announced the launch of the Data Driven Standards Community Group at the World Wide Web Consortium (W3C). The focus is on researching, analyzing and publicly documenting current usage patterns on the Internet. Inspired by the Microformats Process, the goal of this group is to enlighten standards development with real-world data. This group will collect and report data from large Web crawls, produce detailed reports on protocol usage across the Internet, document yearly changes in usage patterns and promote findings that demonstrate that the current direction of a particular specification should be changed based on publicly available data. All data, research, and analysis will be made publicly available to ensure the scientific rigor of the findings. The group will be a collection of search engine companies, academic researchers, hobbyists, protocol designers and specification editors in search of data that will guide the Internet toward a brighter future.

We had launched the group with the intent of regularly analyzing the Common Crawl data set. The goal of Common Crawl is to build and maintain an open crawl of the web that can be used by researchers, educators and innovators. The crawl currently contains roughly 40TB of compressed data, around 5 billion web pages, and is hosted on Amazon’s S3 service. To analyze the data, you have to write a small piece of analysis software that is then applied to all of the data using Amazon’s Elastic Map Reduce service.

I spent a few hours a couple of nights ago and wrote the analysis software, which is available as open source on github. This blog post won’t go into how the software was written, but rather the methodology and data that resulted from the analysis. There were three goals that I had in mind when performing this trial run:

  • Quickly hack something together to see if Microformats, RDFa and Microdata analysis was feasible.
  • Estimate the cost of performing a full analysis.
  • See if the data correlates with the Yahoo! study or the Web Data Commons project.

Methodology

The analysis software was executed against a very small subset of the Common Crawl data set. The directory that was analyzed (/commoncrawl-crawl-002/2010/01/07/18/) contained 1,273 ARC files, each weighing in at roughly 100MB, for around 124GB of data processed. It took 8 EC2 machines a total of 14 hours and 23 minutes to process the data, for a grand total of 120 CPU hours utilized.

The analysis software streams each file from disk, decompresses it and breaks each file into the data that was retrieved from a particular URL. The file is checked to ensure that it is an HTML or XHTML file; if it isn’t, it is skipped. If the file is an XHTML or HTML file, an HTML4 DOM is constructed from the file using a very forgiving tag soup parser. At that point, CSS selectors are executed on the resulting DOM to search for HTML elements that contain attributes for each language. For example, the CSS selector "[property]" is executed to retrieve a count of all RDFa property attributes on the page. The same was performed for Microdata and Microformats. You can see the exact CSS queries used in the source file for the markup language detector.
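
For a rough idea of what that detection step looks like, here is a small Python sketch using BeautifulSoup’s CSS selector support. The selectors listed here are illustrative only; the exact queries used for the study live in the linked source file.

# Requires: pip install beautifulsoup4
from bs4 import BeautifulSoup

# Illustrative selectors only; the real detector uses its own set of queries.
SELECTORS = {
    "rdfa": ["[property]", "[typeof]", "[about]", "[prefix]", "[vocab]"],
    "microdata": ["[itemprop]", "[itemscope]", "[itemtype]"],
    "microformats": ["[class~=vcard]", "[class~=hentry]", "[rel~=tag]"],
}

def count_markup(html):
    # Parse with a forgiving parser, then count attribute occurrences per
    # syntax, roughly mirroring the approach described above.
    soup = BeautifulSoup(html, "html.parser")
    return {
        syntax: sum(len(soup.select(sel)) for sel in selectors)
        for syntax, selectors in SELECTORS.items()
    }

sample = ('<p vocab="http://schema.org/" typeof="Person">'
          '<span property="name">Manu Sporny</span></p>')
print(count_markup(sample))  # {'rdfa': 3, 'microdata': 0, 'microformats': 0}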

Findings

Here are the types of documents that we found in the sample set:

Document Type | Count | Percentage
HTML or XHTML | 10,598,873 | 100%
Microformats | 14,881 | 0.14%
RDFa | 4,726 | 0.045%
Microdata* | 0 | 0%
* The sample size was clearly too small since there were reports of Microdata out in the wild before this point in time in 2010.

The numbers above clearly deviate from both the Yahoo! study and the Web Data Commons project. The problem with our data set was that it was probably too small to really tell us anything useful, so please don’t use the numbers in this blog post for anything of importance.

The analysis software also counted the RDFa 1.1 attributes:

RDFa Attribute | Count
property | 3,746
about | 3,671
resource | 833
typeof | 302
datatype | 44
prefix | 31
vocab | 1

The property, about, resource, typeof, and datatype attributes have a usage pattern that is not very surprising. I didn’t check for combinations of attributes like property and content on the same element due to time constraints. I only had one night to figure out how to write the software, write it and run it. This sort of co-attribute detection should be included in future analysis of the data. What was surprising was that the prefix and vocab attributes were used somewhere out there before the features were introduced into RDFa 1.1, but not to the degree that it would be of concern to the people designing the RDFa 1.1 language.

The Good and the Bad

The good news is that it does not take a great deal of effort to write a data analysis tool and run it against the Common Crawl data set. I’ve published both our methodology and findings such that anybody could re-create them if they so desired. So, this is good for open Web Science initiatives everywhere.

However, there is bad news. It cost around $12.46 USD to run the test using Amazon’s Elastic Map Reduce system. The Common Crawl site states that they believe it would cost roughly $150 to process the entire data set, but my calculations show a very different picture when you start doing non-trivial analysis. Keep in mind that only 124GB of the total 40TB of data was processed. That is, only about 0.31% of the data set was processed for $12.46. To process the entire Common Crawl corpus, it would cost around $4,020 USD. That is clearly far more than any individual would want to spend, but still very much within the reach of small companies and research groups.
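
The extrapolation itself is simple back-of-the-envelope arithmetic, treating 40TB as roughly 40,000GB:

cost_of_trial = 12.46      # USD for the trial run
data_processed_gb = 124    # GB processed in the trial
full_corpus_gb = 40_000    # ~40TB, as a round approximation

fraction_processed = data_processed_gb / full_corpus_gb
estimated_full_cost = cost_of_trial / fraction_processed

print(f"{fraction_processed:.2%} of the corpus processed")          # ~0.31%
print(f"estimated full-corpus cost: ${estimated_full_cost:,.0f}")   # ~$4,019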

Funding a full analysis of the entire Common Crawl dataset seemed within reach, but after discovering what the price would be, I’m having second thoughts about performing the full analysis without a few other companies or individuals pitching in to cover the costs.

Potential Ways Forward

We may have run the analysis in a way that caused the price to far exceed what was predicted by the Common Crawl folks. I will be following up with them to see if there is a trick to reducing the cost of the EC2 instances.

One option would be to bid a very low price for Amazon EC2 Spot Instances. The downside is that processing would happen only when nobody else was willing to bid the price we would, and therefore the processing job could take weeks. Another approach would have us use regular expressions to process each document instead of building an in-memory HTML DOM for it. Regular expressions would be able to detect RDFa, Microdata, and Microformats using far less CPU than the DOM-based approach. Yet another approach would have an individual or company with $4K to spend on this research project fund the analysis of the full data set.

Overall, I’m excited that doing this sort of analysis is becoming available to those of us without access to Google or Facebook-like resources. It is only a matter of time before we will be able to do a full analysis on the Common Crawl data set. If you are reading this and think you can help fund this work, please leave a comment on this blog, or e-mail me directly at: msporny@digitalbazaar.com.

Web Data Commons Launches

Some interesting numbers were just published regarding Microformats, RDFa and Microdata adoption as of October 2010 (fifteen months ago). The source of the data is the new CommonCrawl dataset, which is being analyzed by the Web Data Commons project. They sampled 1% of the 40 Terabyte data set (1.3 million pages) and came up with the following number of total statements (triples) made by pages in the sample set:

Markup Format | Statements
Microformats | 30,706,071
RDFa | 1,047,250
Microdata | 17,890
Total | 31,771,211

Based on this preliminary data, of the structured data on the Web, 96.6% was Microformats, 3.2% was RDFa, and 0.05% was Microdata. Microformats is the clear winner in October 2010, with the vast majority of the data consisting of markup of people (hCard) and their relationships with one another (XFN). I also did a quick calculation on the percentage of the 1.3 million URLs that contain Microformats, RDFa and Microdata markup:

Format | Percentage of Pages
Microformats | 88.9%
RDFa | 12.1%
Microdata | 0.09%

These findings deviate wildly from the findings by Yahoo around the same time. Additionally, the claim that 88.9% of all pages on the Web contain Microformats markup, even though I’d love to see that happen, is wishful thinking.

There are a few things that could have caused these numbers to be off. The first is that the Web Data Commons’ parsers are generating false positives or negatives, resulting in bad statement counts. A quick check of the data, which they released in full, will reveal if this is true. The other cause could be that the Yahoo study was flawed in the same way, but we may never know if that is true because they will probably never release their data set or parsers for public viewing. By looking at the RDFa usage numbers (3.2% for the Yahoo study vs. 12.1% for Web Data Commons) and the Microformats usage numbers (roughly 5% for the Yahoo study vs. 88.9% for Web Data Commons), the Web Data Commons numbers seem far more suspect. Data publishing in HTML is taking off, but it’s not that popular yet.

I would be wary of doing anything with these preliminary findings until the Web Data Commons folks release something more final. Nevertheless, it is interesting as a data point and I’m looking forward to the full analysis that these researchers will do in the coming months.

RDFa 1.1 Lite

Summary: RDFa 1.1 Lite is a simple subset of RDFa consisting of the following attributes: vocab, typeof, property, rel, about and prefix.

During the schema.org workshop, a proposal was put forth by RDFa’s resident hero, Ben Adida, for a stripped down version of RDFa 1.1, called RDFa 1.1 Lite. The RDFa syntax is often criticized as having too much functionality, leaving first-time authors confused about the more advanced features. This lighter version of RDFa will help authors easily jump into the Linked Data world. The goal was to create a very minimal subset that will work for 80% of the folks out there doing simple markup for things like search engines.

vocab, typeof and property

RDFa, like Microformats and Microdata, allows us to talk about things on the Web. Typically, when we talk about a thing, we use a particular vocabulary to do so. So, if you wanted to talk about People, the vocabulary that you would use would specify terms like name and telephone number. When we want to mark up things on the Web, we need to do something very similar: specify which Web vocabulary we are going to use. Here is a simple example that specifies the vocabulary that we intend to use to mark up things in the paragraph:

<p vocab="http://schema.org/">
   My name is Manu Sporny and you can give me a ring via 1-800-555-0155.
</p>

As you will note above, we have specified that we’re going to be using the vocabulary that can be found at the http://schema.org/ Web address. This is a vocabulary that has been released by Google, Microsoft and Yahoo! to talk about common things on the Web that Search Engines care about – things like People, Places, Reviews, Recipes, and Events. Once we have specified the vocabulary, we need to specify the type of the thing that we’re talking about. In this particular case, we’re talking about a Person.

<p vocab="http://schema.org/" typeof="Person">
   My name is Manu Sporny and you can give me a ring via 1-800-555-0155.
</p>

Now all we need to do is specify which properties of that person we want to point out to the search engine. In the following example, we mark up the person’s name and phone number:

<p vocab="http://schema.org/" typeof="Person">
   My name is 
   <span property="name">Manu Sporny</span> 
   and you can give me a ring via
   <span property="telephone">1-800-555-0155</span>.
</p>

Now, when somebody types in “phone number for Manu Sporny” into a search engine, the search engine can more reliably answer the question directly, or point the person searching to a more relevant Web page.

rel

At times, two things on the Web may be related to one another in a specific way. For example, the current page may describe a thing that has a picture of it somewhere else on the Web.

<p vocab="http://schema.org/" typeof="Person">
   My name is 
   <span property="name">Manu Sporny</span> 
   and you can give me a ring via
   <span property="telephone">1-800-555-0155</span>.
   <img rel="image" src="http://manu.sporny.org/images/manu.png" />
</p>

The example above links the Person on the page to the image elsewhere on the Web using the “image” relationship. A search engine will now be able to divine that the Person on the page is depicted by the image that is linked to in the page.

about

If you want people to link to things on your page, you can identify the thing using a hash and a name. For example:

<p vocab="http://schema.org/" about="#manu" typeof="Person">
   My name is 
   <span property="name">Manu Sporny</span> 
   and you can give me a ring via
   <span property="telephone">1-800-555-0155</span>.
   <img rel="image" src="http://manu.sporny.org/images/manu.png" />
</p>

So, if we assume that the markup above can be found at http://example.org/people, then the identifier for the thing is the address, plus the value in the about attribute. Therefore, the identifier for the thing on the page would be: http://example.org/people#manu. This feature is similar to, but not exactly like, the id attribute in HTML.

prefix

In some cases, a vocabulary may not have all of the terms an author needs when describing their thing. The last feature in RDFa 1.1 Lite that some authors might need is the ability to specify more than one vocabulary. For example, if we are describing a Person and we need to specify that they have a blog, we could do something like the following:

<p vocab="http://schema.org/" prefix="foaf: http://xmlns.com/foaf/0.1/" about="#manu" typeof="Person">
   My name is 
   <span property="name">Manu Sporny</span> 
   and you can give me a ring via
   <span property="telephone">1-800-555-0155</span>.
   <img rel="image" src="http://manu.sporny.org/images/manu.png" />
   I have a <a rel="foaf:weblog" href="http://manu.sporny.org/">blog</a>.
</p>

The example assigns a short-hand prefix to the Friend-of-a-Friend vocabulary, foaf, and uses that prefix to specify the weblog vocabulary term. Since schema.org doesn’t have a clear way of expressing a person’s blog location, we can depend on FOAF to get the job done.

One of the other nice things about RDFa 1.1 (and RDFa 1.1 Lite) is that a number of useful and popular prefixes are pre-defined, so you can skip declaring them altogether and just use the prefixes.
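For example, foaf happens to be one of the prefixes that is pre-declared in the RDFa 1.1 initial context, so the markup above could arguably drop the prefix attribute entirely. A minimal sketch, assuming the processor loads the standard initial context:

<p vocab="http://schema.org/" about="#manu" typeof="Person">
   My name is <span property="name">Manu Sporny</span> and
   I have a <a rel="foaf:weblog" href="http://manu.sporny.org/">blog</a>.
</p>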

Simplicity

That’s it – that is RDFa 1.1 Lite. It consists of six simple attributes: vocab, typeof, property, rel, about, and prefix. RDFa 1.1 Lite is completely upwards compatible with the full set of RDFa 1.1 attributes.

The Others

The RDFa 1.1 attributes that have been left out of RDFa 1.1 Lite are: content, datatype, resource, and rev. The great thing about RDFa 1.1 Lite is that you can always choose to use one or more of those advanced attributes, and a conformant RDFa 1.1 processor will still pick them up. That means you don’t have to do anything special to switch back and forth between RDFa 1.1 Lite and the full version of RDFa 1.1.
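For example, an author could attach a machine-readable date to reader-friendly text by reaching for the content attribute, while leaving the rest of their RDFa 1.1 Lite markup untouched. A small sketch, using the schema.org BlogPosting type and a made-up date:

<p vocab="http://schema.org/" typeof="BlogPosting">
   <span property="headline">RDFa 1.1 Lite</span>, last updated on
   <span property="dateModified" content="2011-11-18">the 18th of November</span>.
</p>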

Standardizing Payment Links

Why online tipping has failed.

TL;DR – Standardizing Payment Links for the Web is not enough – we must also focus on listing and transacting assets that provide value to people.

Today was a busy day for PaySwarm/Bitcoin integration. We had a very productive discussion about the PaySwarm use cases, which includes supporting Bitcoin as an alternative currency to government-backed currencies. I also had a very interesting discussion with Amir Taaki, who is one of the primary developers behind Bitcoin, about standardization of a Bitcoin IRI scheme for the Internet. Between those two meetings, Dan Brickley asked an interesting question:

danbri Sept 23rd 2011 12:36pm: @manusporny any thoughts re bitcoin/foaf, vs general ways of describing online payability? @graingert @melvincarvalho

This question comes up often during the PaySwarm work – we’ve been grappling with it for a number of years.

Payment Links Solutions

Payment links have quite a history on the Internet. People have been trying to address payment via the Web for over a decade now with many failures and lessons learned. The answer to the question really boils down to what you’re trying to do with the payment. I’ll first answer what I think Dan was asking: “Do you think we should add a bitcoin term to the Friend of a Friend Vocabulary? Should we think about generalizing this to all types of payment online?”

First, I think that adding a bitcoin term to the FOAF Vocabulary would be helpful for Bitcoin, but a bit short-sighted. This is typically what people wanting to be paid by Bitcoin addresses do today:

Support more articles like this by donating via Bitcoin: 1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa

If you wanted to donate, you would copy/paste the crazy gobbledey-gook text starting with 1A1z and dump that into your Bitcoin client. One could easily make something like this machine-readable via HTML+RDFa to say that they can be paid at a particular Bitcoin address. To make the HTML above machine-readable, one could do the following:

<div about="#dan-brickley">
Support more articles like this by donating via Bitcoin: 
   <span property="foaf:bitcoin">1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa</span>
</div>

However, that wouldn’t trigger a Bitcoin client to be launched when clicked. The browser would have to know to do something with that data and we’re many years away from that happening. So, using some sort of new Bitcoin IRI scheme that was discussed today might be a better short-term solution:

<div about="#me">
Support more articles like this by 
   <a rel="foaf:tipjar" href="bitcoin:1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa?amount=0.25">donating via Bitcoin</a>.
</div>

This is a pretty typical approach to an online donation system or tipjar. You see PayPal buttons like this on a number of websites today. There are a few problems with it:

  • What happens when there are multiple Bitcoin addresses? How does a Web Browser automatically choose which one to use? It would be nice if we could integrate payment directly into the browser, but if we do that, we need more information associated with the Bitcoin address.
  • What if we want to use other payment systems? The second approach is better because it’s not specific to Bitcoin – it uses an IRI – but can we be more generalized than that? Requiring the creation of a new IRI scheme for every new payment protocol seems like overkill.
  • How should this be used to actually transact a digital good? Is it only good for tipjars? How does this work in a social setting – that is, do most people tip online?

The answer to the first question can be straightforward. A browser can’t automatically choose which Bitcoin address to use unless there is more machine-readable information in the page about each Bitcoin address, or unless you can follow-your-nose to the Bitcoin address. You can’t do the latter yet with Bitcoin, so the former is the only option. For example, if the reason for payment were outlined for each Bitcoin address in the page, an informative UI could be displayed for the person browsing the page.
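As a rough sketch of what that markup might look like, each address could carry a short human-readable description alongside the payment link. The second address below is a placeholder, and the foaf:tipjar/rdfs:label pairing is only illustrative – it is not a proposed payments vocabulary:

<div about="#me">
   Support this site by donating to the
   <a rel="foaf:tipjar" href="bitcoin:1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa">
      <span property="rdfs:label">article writing fund</span></a> or to the
   <a rel="foaf:tipjar" href="bitcoin:PLACEHOLDERSECONDADDRESS">
      <span property="rdfs:label">server hosting fund</span></a>.
</div>

A browser that understood this markup could present the two labels side by side and let the person pick which address to pay.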

As the sketch above shows, this is pretty easy to do with HTML+RDFa, though it does require slightly more markup to associate descriptions with Bitcoin addresses. However, what if we want to move beyond tips? Just describing a payment endpoint is often confusing to people who want to pay for a specific good. Browsers or Bitcoin software may need to know more about the transaction to produce a reasonable summary or receipt for the person browsing, and that can’t be done with the markup of a single link.

The second question is a bit more difficult to answer. It would be short-sighted to just have a vocabulary term for Bitcoin. What happens if Bitcoin fails for some reason in the future? Should FOAF also add a term for Ven payments? What about PaySwarm payments? The FOAF vocabulary already has a term for a tipjar, so why not just use that coupled with a special IRI for the payment method? What may be better is a new term for “preferred payment IRI” – maybe “foaf:financialAccount” could work? Or maybe we should add a new term to the Commerce Vocabulary?

Depending on a payment protocol-specific IRI would require every payment method on the Web to register a new Internet protocol scheme. This makes the barrier to entry for new payment protocols pretty high. There is no reason why many of these payment mechanisms cannot be built on top of HTTP or other existing Internet standards, like SIP. However, if we have multiple payment protocols that run over HTTP, how can we differentiate one from another? Perhaps each payment mechanism needs its own vocabulary term, for example this for Bitcoin:

<a rel="bitcoin:payment" href="bitcoin:1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa?amount=0.25">tip me</a>.

and this for Ven:

<a rel="ven:payment" href="https://www.vencurrency.com/confirm/transfer/?request_key=8df6e4c0240365e425d6cf4839e5266e">tip me</a>.

and this for PaySwarm:

<a rel="ps:payment" href="https://dev.payswarm.com/i/manu/accounts/tips">tip me</a>.

The key thing to remember with all of these payment protocols is that what you do with the given IRI is different in each case. That is, the payment protocol matters, so we may not want to generalize across the first two approaches even though we do want a generalized solution. PaySwarm is a little different, in that it is currency agnostic. The standard aims to enable payments in Bitcoin, Ven, Bernal Bucks, or any current or future currency. So, one could just specify a person to pay via PaySwarm, like so:

<a rel="ps:payment" href="https://dev.payswarm.com/i/manu">tip me</a>.

The financial account could be selected automatically based on the type of payment currency. If the person is transmitting Bitcoins, a selection of target Bitcoin accounts could be automatically discovered by retrieving the contents of the URL above and extracting all accounts with a currency type of “Bitcoin”. So, that may be a good technical solution, but that doesn’t mean it is a good social solution. The hard problem remains – most people don’t tip online.

Tipping is a Niche Solution

Sure, there are solutions like Flattr and PayPal donations. These solutions will always be niche because the transactions they enable are socially awkward. People like paying for refined goods – they only give on very rare occasions, usually when the payment is for a specific good. Even tipping wait staff at a restaurant isn’t the same as tipping on a website. When you tip wait staff, you are reimbursing them for the time and courtesy that they provided you. You are paying them for something that is scarce, for something that has value to you.

Now, think of how often you tip online versus how often you actually buy things online. A good summary of why people rarely tip online can be found on Gregory Rader’s blog – first in why people have a hard time paying for unrefined goods and then in why tips and donations rarely work for small websites. The core of what I took away from Greg’s articles is that, generally speaking, asking for tips on a small website is easily dismissible if done correctly and incredibly awkward if done incorrectly. You’re not getting the same sort of individual attention from a website as you do when you tip at a restaurant. You are far less likely to tip online than you are during a face-to-face encounter at a restaurant. Anonymity protects you from feeling bad about not tipping online.

People have a much easier time making a payment for something of perceived value online, even if it is virtual. These goods include songs, shares in a for-profit project, a pre-release for a short film, an item in a game, or even remotely buying someone a coffee in exchange for a future article that one may write. In order to do this, however, we must be able to express the payment in a less abstract form than just a simple Payment Link. It helps to be able to describe an asset that is being transacted so that there is less confusion about why the transaction is happening. Describing a transaction in detail also helps make the browser UIs more compelling, which results in a greater degree of trust that you’re not being scammed when you decide to enter into a transaction with a website.

Refined Payments

So, if we have a solution for Payment Links on the Web, we need to make sure that they:

  1. Are capable of expressing that they are for something of refined value, even if virtual.
  2. Are machine-readable and can be described in great detail.

The Web has a fairly dismal track record of tipping for content – people expect most unrefined content to be free. So, applying plain old Payment Links to that problem will probably not have the effect that most people expect it will have. The problem isn’t with ease of payment – the problem is a deeper social issue of paying for unrefined content. The solution is to be able to describe what is being transacted in far more detail, marked up in a form that is machine-readable and currency agnostic.

Expressing a PaySwarm Asset and Listing in a page, with Bitcoin or Ven or US Dollars as the transaction currency, is one such approach that meets these criteria. The major drawback is that expressing this information on a page is far more complicated than just expressing a Payment Link. So, perhaps we need both PaySwarm and Payment Links, but we should recognize that the problem space for Payment Links is much more socially complex than it may seem at first.

To answer Dan Brickley’s question more directly: I don’t think FOAF should add a “bitcoin” vocabulary term. Perhaps it should add something like “financialAccount”. However, once that term has been added, exactly what problem do you hope to solve with that addition?

An Uber-comparison of RDFa, Microdata and Microformats

Full disclosure: I am the current Chair of the group at the World Wide Web Consortium that created RDFa. That said, all of this is my personal opinion. I am not speaking on behalf of the W3C or my company, Digital Bazaar. This is just my personal take on the recent events that are unfolding. If you would like to keep up with these events as they happen, you can follow me on Twitter.

There has been a recent discussion at the World Wide Web Consortium (W3C) about the state of RDFa, Microdata and Microformats. The Technical Architecture Group (TAG) is concerned about the W3C publishing two specifications that achieve effectively the same thing in incompatible ways. They are suggesting that both RDFa 1.1 and Microdata, in their current state, should not proceed as official specifications until they become more compatible with one another. The W3C intends to launch a quick examination of the situation to determine whether or not there is room for convergence of these technologies.

To those that are not following this stuff closely, it can be difficult to understand all of the technical reasons this issue has been raised. This post attempts to clarify those technical issues by providing an easy-to-read list of similarities and differences between RDFa, Microdata and Microformats. A simple table summarizing all features across each structured data syntax is listed below. Each feature is linked to a brief explanation of the feature toward the bottom of the page.

Thanks to Jeni Tennison for doing a separate technical analysis for the W3C TAG. This article builds upon her hard work, but has been heavily modified and thus should not be considered as her thoughts on the matter. Writing this article was a fairly large undertaking and there are bound to be issues with parts of the article. Please let me know if there are errors by commenting on the post and I will do my best to fix them and clarify when necessary.

Structured Data in a Nutshell

Note: This post frequently uses the term IRI. For those not familiar with the term IRI, it means “Internationalized Resource Identifier” which is basically a fancy way of saying “a URL that allows western language characters as well as characters from any language in the world, such as Arabic, Japanese Katakana, Chinese ideograms, etc”. The URL in the location bar in your browser is a valid IRI.

Feature | RDFa 1.1 | Microdata 1.0 | Microformats 1.0
Relative Complexity | High | Medium | Low
Data Model | Graph | Tree | Tree
Item optionally identified by IRI | Yes | Yes | No
Item type optionally specified by IRI | Yes | Yes | No
Item properties specified by IRI | Yes | Yes | No
Multiple objects per page | Yes | Yes | Yes
Overlapping objects | Yes | Yes | No
Plain Text properties | Yes | Yes | Yes
IRI properties | Yes | Yes* | No
Typed Literal properties | Yes | No | No
XML Literal properties | Yes | No | No
Language tagging | Yes | Yes | Inconsistent
Override text and IRI content | Yes | No | Text only
Clear mapping to RDF | Yes | Problematic | No
Target Languages | 8 (XHTML1, HTML4, HTML5, XHTML5, XML, SVG, ePub, OpenDocument) | 2 (HTML5, XHTML5) | 4 (XHTML1, HTML4, HTML5, XHTML5)
New Attributes | 8 (about, datatype, profile, prefix, property, resource, typeof, vocab) | 5 (itemid, itemprop, itemref, itemscope, itemtype) | 0
Re-used Attributes | 5 (content, href, rel, rev, src) | 5 (content, src, href, data, datetime) | 4 (class, title, rel, href)
Multiple IRI types per object | Yes | RDF only | No
Multiple statements per element | Yes | No | Yes
“Locally scoped” vocabulary terms | Yes, via vocab | Yes, via itemscope | No
Item Chaining | Yes | Basic | No
Transclusion | No | Yes | Yes, via include pattern
Compact IRIs | Yes | No | No
Prefix rebinding | Yes | No | No
Vocabulary Mashups | Yes | No | No
HTML5 time element support | Not yet | Yes | No
Different attributes for different property types | Yes (property for text, rel/rev for URLs, resource/content for overrides) | No | Yes (class for text, rel for URLs)
Transform to JSON | Yes (RDFa API) | Yes (Parser and Microdata DOM API) | No
DOM API | Yes | Yes | No
Unified Parser | Yes | Yes | No

Relative Complexity

Relative Complexity is a fuzzy measure of how difficult it is to achieve mastery of a particular structured data syntax. Microformats is by far the easiest to pick up and use. Microdata is a big step up and a bit more complex. RDFa is the most complex to master. There are design trade-offs: the simpler the syntax, the fewer structured data markup scenarios it supports; the more complex the syntax, the more markup scenarios it supports, but at the cost of making the syntax more difficult for Web developers to master.

Data Model

The Web is a graph of information. There are nodes (web pages) and edges (links) that connect all of the information together. RDFa uses a graph to model the Web. Microdata and Microformats use a special subset of a graph called a rooted graph, or tree. There are benefits and drawbacks to each approach.

Item optionally identified by IRI

Being able to identify an item on the Web is very useful. If we weren’t able to identify web pages in a universal way, the Web wouldn’t exist as it does today. That is, we couldn’t send someone a link, have them open it and find the same information that we found. The same concept applies to “things” described in Web pages. If we identify these things with IRIs, it becomes easier to be specific about the “thing” we’re talking about.

RDFa example:

<div about="http://example.com/people#manu">...

Microdata example:

<div itemscope itemtype="http://example.com/types/Person" itemid="http://example.com/people#manu">...

Microformats example:

Not supported

Item type optionally specified by IRI

The ability to identify the type of an item on the Web is useful. In Object Oriented Programming (OOP) parlance, this is the concept of a Class. Using an IRI to specify the type of an item lets us universally identify that type on the Web. Instead of a machine having to guess whether an item of type “Person” specified on a Web page is the same type that is familiar to it, we can instead give the item a type of http://example.org/types/Person. Giving the item an IRI type allows us to be sure that two machines are using the same type information.

RDFa example:

<div typeof="http://example.com/types/Person">...

Microdata example:

<div itemscope itemtype="http://example.com/types/Person">...

Microformats example:

Not supported

Item properties specified by IRI

The ability to identify a property, also known as a vocabulary term, associated with an item on the Web is useful. In Object Oriented Programming (OOP) parlance, this is the concept of a member variable. Using an IRI to specify the property of an item lets us universally identify that property on the Web. Instead of a machine having to guess whether a property of type “name” specified on a Web page is the same property that is familiar to it, we can instead refer to the property using an IRI, like http://example.org/terms/name. Giving the property an IRI allows us to be sure that two machines are using the same vocabulary term in a program.

RDFa example:

<span property="http://example.org/terms/name">Manu Sporny</span>

Microdata example:

<span itemprop="http://example.org/terms/name">Manu Sporny</span>

Microformats example:

Not supported

Multiple objects per page

Web pages often describe multiple “things” on a page. The ability to express this information as structured data is a natural extension of a Web page.

RDFa example:

<div about="#person1">...</div>
...
<div about="#person2">...</div>

Microdata example:

<div itemscope itemtype="http://example.com/types/Person" itemid="#person1">...</div>
...
<div itemscope itemtype="http://example.com/types/Person" itemid="#person2">...</div>

Microformats example:

<div class="vcard">...</div>
...
<div class="vcard">...</div>

Overlapping objects

At times, the HTML markup on a page will contain two pieces of overlapping information. For example, two people may be marked up on a web page. Ensuring that the structured data syntax is able to specify which person is being described by the HTML is important because the syntax should not force a Web developer to change the layout of their page.

RDFa example:

<div about="#person1">... Information about Person 1 ...
   <div about="#person2">...</div> ... Information about Person 2 ...
</div>

Microdata example:

<div itemscope itemtype="http://example.com/types/Person" itemid="#person1">
   ... Information about Person 1 ...
   <div itemscope itemtype="http://example.com/types/Person" itemid="#person2">...</div>
      ... Information about Person 2 ...
</div>

Microformats example:

Not supported

Plain Text properties

Most item attributes, such as a person’s name, can be expressed using plain text. It is important that these text attributes can be picked up from the page.

RDFa example:

<span property="name">Manu Sporny</span>

Microdata example:

<span itemprop="name">Manu Sporny</span>

Microformats example:

<span class="fn">Manu Sporny</span>

IRI properties

At times it is important to differentiate between an IRI and plain text. For example, the text string sip:msporny@digitalbazaar.com could be a text string or it could be a valid IRI. While the ability to differentiate may seem trivial, guessing what a valid IRI is and isn’t will never be future proof. It is helpful to be able to understand if a value is a piece of text or an IRI in the data model.

RDFa example:

<a rel="license" href="http://creativecommons.org/licenses/by-sa/3.0/">CC-AT-SA-3.0</a>

While Microdata does allow one to differentiate between IRIs and strings in the syntax, the JSON-based serialization converts all IRIs to string values. This is problematic because it is impossible to differentiate between a string that looks like an IRI and an actual IRI in the JSON serialization. IRI properties are preserved correctly in the RDF serialization of Microdata.

Microdata example:

<a itemprop="license" href="http://creativecommons.org/licenses/by-sa/3.0/">CC-AT-SA-3.0</a>
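To illustrate the concern, the JSON serialization of the Microdata item above would come out roughly along these lines, with the license IRI reduced to an ordinary string (the exact shape depends on the version of the Microdata-to-JSON algorithm a parser implements):

{
  "items": [{
    "properties": {
      "license": [ "http://creativecommons.org/licenses/by-sa/3.0/" ]
    }
  }]
}

Nothing in the JSON marks the value as an IRI rather than a string that merely looks like one.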

While Microformats allow you to use IRI information, there is no official data model or mapping to RDF or JSON. Everything is treated as a text string and application logic must be written to determine if a particular data item is meant to be an IRI or text. So, while the markup below is valid – the IRI will be expressed as a text string, not an IRI.

Microformats example:

<a rel="license" href="http://creativecommons.org/licenses/by-sa/3.0/">CC-AT-SA-3.0</a>

Typed Literal properties

Typed literals allow you to express typing information about a property. This is important when you need to specify things like units of measure, or specific kinds of numbers, in a way that doesn’t depend on understanding the language that the value is written in. For example: Is “+353872206327” an integer or a phone number? Is “.1E-1” a float or a text string? Is “false” a boolean value or part of a sentence? Another example concerns measurements like the kilogram, a unit of weight measurement that can be displayed in a variety of different ways around the world. Being able to express this unit of measurement in structured data in a language-neutral and measurement-neutral way makes it easier for machines to understand the unit of measurement without having to understand the underlying language.

RDFa example:

<span property="measure:weight" datatype="measure:kilograms">40</span> килограммов

Microdata example:

Not supported

Microformats example:

Not supported

XML Literal properties

XML Literals are used for properties that contain markup, such as the content of a blog post, SVG or MathML markup that should be preserved in the final output of the structured data parser. This is useful when you want to preserve all markup.

[Figure: the quadratic formula]

The formula above is expressed like so in RDFa and MathML:

<span property="math:formula" datatype="rdf:XMLLiteral">
<math mode="display" xmlns="http://www.w3.org/1998/Math/MathML">
  <mrow>
    <mi>x</mi>
    <mo>=</mo>
    <mfrac>
      <mrow>
        <mo form="prefix">−<!-- &minus; --></mo>
        <mi>b</mi>
        <mo>±<!-- &plusmn; --></mo>
        <msqrt>
          <msup>
            <mi>b</mi>
            <mn>2</mn>
          </msup>
          <mo>−<!-- &minus; --></mo>
          <mn>4</mn>
          <mo>&#x2062;<!-- &InvisibleTimes; --></mo>
          <mi>a</mi>
          <mo>&#x2062;<!-- &InvisibleTimes; --></mo>
          <mi>c</mi>
        </msqrt>
      </mrow>
      <mrow>
        <mn>2</mn>
        <mo>&#x2062;<!-- &InvisibleTimes; --></mo>
        <mi>a</mi>
      </mrow>
    </mfrac>
  </mrow>
</math>
</span>

Microdata example:

Not supported

Microformats example:

Not supported

Language tagging

The ability to specify language information for plain text is important when pulling data in from the Web. At times, words that are spelled the same in western character sets can mean very different things. For example, the word “chat” in English (to have a conversation) has a very different meaning from the word “chat” (cat) in French.

RDFa example:

<span property="name" lang="en">Manu Sporny</span>

Microdata example:

<span itemprop="name" lang="en">Manu Sporny</span>

Language information support is only on a per-microformat basis. Some Microformats do not make any statements about supporting multiple language tags.

Microformats example:

<span class="fn" lang="en">Manu Sporny</span>

Override text and IRI content

At times, the text content in the page is not what you want the machine to extract when reading the structured data. It is important to have a way to override both the text content, and the URL content in an element.

RDFa example:

<span property="candles" content="14">fourteen</span>
...
<a rel="homepage" href="http://example.org/short-url" 
      resource="http://example.org/2011/path-to-real-url">My Homepage</a>

Microdata example:

Not supported

Microformats only supports overriding text content in an element.

Microformats example:

<abbr class="candles" title="14">fourteen</abbr>

Clear mapping to RDF

The Resource Description Framework, or RDF, has been the standard model for the Semantic Web for over a decade. At times it can be overkill for simple structured data projects, but it is often necessary for the more involved or advanced structured data use cases. There is a fairly large, well-developed set of tools for RDF. It is beneficial if the structured data mechanism has a clear way of mapping its syntax to the RDF data model in a way that is useful to the existing set of RDF processing tools.

Since RDFa is built on RDF, the mapping to RDF is well specified. While it is possible to map Microformats to RDF, there is no standard way of doing so. Microdata does map to RDF, but there are a few bugs that are of concern. Namely, Microdata auto-generates RDF property URLs in a way that is not useful to many of the existing RDF processing tools. The issues that have raised objections in the past relate to the usefulness of/centralization of/dereferenceability of the generated IRIs. It has been argued that the IRIs designated for properties in Microdata are problematic as-is and need to be changed. The following example demonstrates how properties in RDFa map to easy-to-understand URLs:

<section vocab="http://schema.org/" typeof="Person">
   <h1 property="name">John Doe</h1>
</section>

which results in the following IRI for the “name” property in RDFa:

http://schema.org/name

This URI is not centrally controlled. It fits in well with the RDF stack. De-referencing the URI leads to a location that is under the vocabulary maintainer’s control. The Microdata mapping to RDF is a bit less straightforward:

<section itemscope itemtype="http://schema.org/Person">
   <h1 itemprop="name">John Doe</h1>
</section>

The following URI is generated for the “name” property in Microdata:

http://www.w3.org/1999/xhtml/microdata#http%3A%2F%2Fschema.org%2FPerson%23%3Aname

This URI is centrally controlled. It requires extensive mapping to be useful for most RDF stacks. De-referencing the URI leads to a location not under the vocabulary maintainer’s control.

Target Languages

Most structured data languages are meant to express data in a variety of different languages. RDFa is designed and is officially specified to work in a variety of different languages including HTML5, XHTML1, HTML4, SVG, ePub and OpenOffice Document Format. Microdata was built and specified for HTML5. Microformats re-uses attributes in HTML that have been in use for over a decade.

Having a structured data syntax support as many Web document formats as possible is good for the web because it reduces the tooling necessary to support structured data on the Web.

New Attributes

The complexity of a structured data syntax can be viewed, in part, by how many attributes a Web developer needs to understand to properly use the language. New attributes, while providing new functionality, do increase the cognitive load on the Web developer.

Re-used Attributes

All of the structured data languages re-use a subset of attributes that contain information important to structured data on the Web. There is a delicate balance between re-using too many attributes and creating new attributes.

Multiple IRI types per item

Web developers need to be able to specify that an item on a page is associated with more than one type. That is, a business can be both an “AutoPartsStore” and a “RepairShop”.

RDFa example:

<div typeof="AutoPartsStore RepairShop">...

In Microdata, you can only express multiple types for a single object using itemid to tie the information together and then only see the result in the RDF output. The DOM API would generate two separate items for the markup below, while the RDF output would generate only one item.

Microdata example:

<div itemscope itemid="#fixit" itemtype="http://example.com/types/AutoPartsStore">...</div>
<meta itemscope itemid="#fixit" itemtype="http://example.com/types/RepairShop" />

Microformats example:

Not supported

Multiple statements per element

It is advantageous to use as much of the existing information in an HTML document as possible. At times, one element can contain more than a single piece of structured data. For example, a link can contain both the name of a person as well as a link to their homepage. A structured data syntax should re-use as much of this information as possible.

RDFa example:

<a rel="homepage" href="http://manu.sporny.org/" property="name">Manu Sporny</a>

Microdata example:

Not supported

Microformats example:

<a rel="homepage" href="http://manu.sporny.org/" class="fn">Manu Sporny</a>

“Locally scoped” vocabulary terms

Locally scoped vocabulary terms allow you to create new vocabulary terms on-the-fly that are picked up by the structured data parsers. The use case for this is questionable, as it is considered good practice to have a vocabulary that allows any person or machine to dereference the URL and find out more about the vocabulary term.

RDFa example:

<div vocab="http://schema.org/" typeof="Person">
   <span property="favoriteSquash">Butternut Squash</span>
</div>

Microdata example:

<div itemscope itemtype="http://schema.org/Person">
   <span itemprop="favoriteSquash">Butternut Squash</span>
</div>

Microformats example:

Not supported

Item Chaining

Chaining allows the object of a particular statement to become the subject of the next statement. It is often useful when relating multiple items to a single item or when linking multiple items, like social networks, together. For example, “Manu knows Ivan who knows Sandro who knows Mike”.

RDFa example:

<div about="#manu" rel="knows">
   <div about="#ivan" rel="knows">
      <div about="#sandro" rel="knows">
         <div about="#mike">
         ...
</div>

Microdata supports basic chaining, but doesn’t support hanging-rels or reverse chaining.

Microdata example:

<div itemscope itemid="#manu" itemtype="http://schema.org/Person">
   <div itemscope itemid="#ivan" itemprop="knows">
      <div itemscope itemid="#sandro" itemprop="knows">
         <div itemscope itemid="#mike" itemprop="knows">
         </div>
      </div>
   </div>
</div>

It is questionable whether or not Microformats even supports basic chaining. If somebody has a good chaining example for Microformats, please let me know and I’ll put it below.

Microformats example:

No examples of chaining.

Transclusion

Transclusion allows a Web author to specify a set of properties once in a page, such as a business address, and re-use those properties for multiple items in the page. RDFa only allows this to be done by reference, not by making a copy. Microdata and Microformats both allow transclusion by reference as well as by copy.

RDFa example:

Transclusion by copy not supported.

Microdata example:

<span itemscope itemtype="http://microformats.org/profile/hcard" 
      itemref="home"><span itemprop="fn">Jack</span></span>
<span itemscope itemtype="http://microformats.org/profile/hcard" 
      itemref="home"><span itemprop="fn">Jill</span></span>
<span id="home" itemprop="adr" itemscope><span 
      itemprop="street-address">Bottom of the Hill</span></span>

Microformats example:

<span class="vcard">
  <span class="fn n" id="james-hcard-name">
    <span class="given-name">James</span> <span class="family-name">Levine</span>
  </span>
</span>
...
<span class="vcard">
 <object class="include" data="#james-hcard-name"></object>
 <span class="org">SimplyHired</span>
 <span class="title">Microformat Brainstormer</span>
</span>

Compact IRIs

Compact IRIs allow Web developers to compress URLs so that they are easier to author. This allows more compact markup and reduces errors because it is no longer necessary to type out full URLs.

RDFa example:

<div prefix="dc: http://purl.org/dc/terms/">
...
   <span property="dc:title">...
   <span property="dc:creator">...
   <span property="dc:abstract">...
</div>

Microdata example:

Not supported

Microformats example:

Not supported

Prefix rebinding

Enabling prefix declaration and rebinding supports decentralized vocabulary development and management. Prefix rebinding allows Web developers to create vocabularies that are specific to their domain of expertise and use them in a way that is inter-operable with other RDFa processors. Microdata and Microformats do not specify a prefix declaration and rebinding mechanism. Microdata does allow custom vocabularies using the itemtype attribute and therefore does support decentralized vocabulary development, but not decentralized vocabulary management, unless full IRIs are used to express the vocabulary terms.

RDFa example:

<div prefix="dc: http://purl.org/dc/terms/">
...
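To show the rebinding itself, a nested element can re-declare the same prefix against a different IRI, and the innermost declaration wins for that subtree. A sketch, with a made-up vocabulary at example.org:

<div prefix="dc: http://purl.org/dc/terms/">
   <span property="dc:title">maps to http://purl.org/dc/terms/title</span>
   <div prefix="dc: http://example.org/my-vocab#">
      <span property="dc:title">maps to http://example.org/my-vocab#title</span>
   </div>
</div>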

Microdata example:

Not supported

Microformats example:

Not supported

Vocabulary Mashups

Enabling multiple Web vocabularies to be mashed together into simple vocabulary terms is useful when creating application specific “vocabulary profiles”. Using a vocabulary profile, these simple vocabulary terms can be re-mapped to full vocabulary term IRIs which is useful to Web developers that need to simplify markup for a particular business unit, but ensure that the data generated maps to the correct Web vocabularies when used on the open Web.

For example, assume that a Web developer wants to map the vocabulary term “name” to “http://schema.org/name”, and “nickname” to “http://xmlns.com/foaf/0.1/nick”, and “hangout” to “http://example.com/myvocab#homebase”. These mappings could be accomplished in a simple-to-use vocabulary profile like so:

RDFa example:

<div profile="http://example.com/my-rdfa-profile">
...
   <span property="name">...
   <span property="nickname">...
   <span property="hangout">...
</div>

Microdata example:

Not supported

Microformats example:

Not supported

HTML5 time element support

There is a new element in HTML5 called time. This element is used to express human-readable dates and times and also contains a machine-readable value. This element was created as a response to the difficulty that the Microdata community was having when marking up dates and times. The only specification that makes use of the element currently is the Microdata specification. However, there is currently an issue logged against HTML5+RDFa that requests the inclusion of this element so that RDFa processors may understand it. Microformats do not use this element yet, partly because it does not exist in HTML4.

RDFa example:

Not supported

Microdata example:

<time datetime="2011-06-25" pubdate>June 25th 2011</time>

Microformats example:

Not supported

Different attributes for different property types

There is a design trade-off in structured data languages. As the number of statements that a single element can express increases, so does the number of attributes used to express statements. As the number of ways that an element’s value can be overridden increases, so does the number of attributes used to perform the override. Microdata keeps things simple by allowing only one statement to be made per element. Microformats allows class for text, rel for IRIs and title to override text content. RDFa uses the property attribute for text, rel and rev to specify URLs, and resource and content to override IRI and text content, respectively.
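To make that concrete, here is a small RDFa sketch that uses property for plain text, rel for an IRI, and content to override the human-readable text (the schema.org terms and the values are just illustrative):

<div vocab="http://schema.org/" typeof="Event">
   <span property="name">RDFa Meetup</span> –
   <a rel="url" href="http://example.com/rdfa-meetup">details</a> –
   starts <span property="startDate" content="2011-06-25">on the 25th of June</span>.
</div>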

Transform to JSON

JSON is a heavily used data transport format on the Web. It fits nicely into programming environments, so it is beneficial if a structured data syntax can be easily transformed into JSON. Microdata has a native mapping from the parser output to JSON, as well as a DOM API that allows items to be retrieved from the page. The RDFa API provides a mechanism to retrieve data from a page and then serialize that data to JSON.
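For example, the Microdata Person markup shown earlier in this post would serialize to JSON roughly along these lines (the exact shape of the “type” entry has varied between drafts of the conversion algorithm):

{
  "items": [{
    "type": [ "http://schema.org/Person" ],
    "properties": {
      "name": [ "John Doe" ]
    }
  }]
}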

DOM API

The ability to extract and utilize structured data from a web page in a browser setting is useful for improving interfaces and interactive applications. Microdata provides a simple Microdata DOM API for retrieving items from a web page. RDFa provides a more comprehensive RDFa DOM API for retrieving structured data from a web page. Microformats do not provide an API for extracting structured data from a web page.

Unified Parser

Having a solid set of tooling for handling structured data is important. One of the most important pieces of that tooling is the set of parsers that process Web documents and extract structured data from them. Both RDFa and Microdata have a unified parser specification, which makes it easier to create inter-operable tools. Microformats require that separate parsers be created for each data format. This may change with the Microformats 2 work, but for now, there is no unified parser specification for Microformats.

Closing

This document will be updated as errors or omissions are found. It can be considered an up-to-date comparison between RDFa, Microdata and Microformats as of June 2011. A follow-up blog post will explain how these structured data languages could be combined into a single structured data language for the Web, achieving the W3C TAG’s goal for unification of the syntaxes used to express structured data on the Web.

Microformats 2 and RDFa Collaboration

During the recent schema.org kerfuffle, Tantek Çelik and I found ourselves agreeing with each other on the fundamentals of how a Web vocabulary should be developed. Like any technology standard meant for the world to use, we hoped that it would be developed transparently and scientifically. Tantek asked me to review the new Microformats 2 work and I thought it would be interesting to see what they’ve been up to recently.

I’ve been a contributing member of the Microformats community for some time, having participated in the design work for the hAudio, hVideo, hMedia, hProduct, hRecipe, currency, collection and measurement Microformats, among others. I’ve documented the process, commented on inconsistencies in the community, been critical of the confusing spec-creation steps, raised governance and technical issues, pushed the community to more clearly address patent and copyright concerns as well as admit that the lack of a unified parsing model is holding Microformats back. I have been harsh about how the community was run, but continued to participate because there were a number of redeeming qualities in the Microformats movement.

All of the frustration with the various inconsistencies, the administrators, and the lack of progress led me to take a hiatus from the community. I think many others in the community felt this frustration around the same time, as you can see the discussion volume drop from an average of 125 messages per month to an average of 10 per month, where it remains to this day. When I took a leave from the Microformats work, I joined the RDFa Working Group at the W3C, where I now chair the group that created RDFa. In 2007, my company was working on expressing music on the Web as structured data and RDFa seemed like a much better way to do it, so we shifted our focus to RDFa and distributed vocabulary development. Fast forward to today and both PaySwarm and MusicBrainz publish all of their data as RDFa. However, with the recent launch of schema.org, an interesting question was pushed into the public view once again: What is the best way to develop a Web vocabulary for structured data in HTML if millions of people are going to depend on it?

The Microformats 2 work attempts to address a number of concerns that have been raised in the community over the past several years. Most of these issues were logged during a period of peak activity in the community, between 2007 and 2009, during the development of the hAudio, hVideo, hMedia, hProduct, collection and measurement Microformats. Here’s a quick breakdown of my initial thoughts on the Microformats 2 work:

The Good

There are a number of really great things proposed for Microformats 2 that could breathe new life into the community.

  • Unified parsing model – Microformats 2 has it – this is one of the best changes to the new direction.
  • Flat set of properties – All Microformats are treated as objects with a flat set of properties. This maps to JSON nicely and is another move in the right direction.
  • Hungarian prefixing – All Microformats 2.0 markup will now have an h-* prefix for the Microformat, a p-* prefix for string properties, a u-* prefix for URLs, and a d-* prefix for datetimes.
  • Vendor extensions – I hope this catches on – it allows a path toward experimentation, which we desperately needed for the PaySwarm work. The Microformats community has a saying, “Pave the cowpaths”. This philosophy effectively boils down to ensuring that standards are rooted in existing practice. However, you can’t pave cowpaths that aren’t there yet. Typically, innovation requires the first cow to start making the cowpath. It would be nice to have an open community that you can innovate within – this could provide that mechanism. Moo.
  • Separation of Syntax from Vocabularies – Tantek mentioned that the Microformats 2 work would separate vocabularies from syntax. I couldn’t find that statement on the page, but I think it would be great to do that. I’ve always believed that the real contribution of the Microformats community to the Web was in the development of well-researched Web vocabularies. We now have syntaxes that are capable of expressing Microformats: RDFa and Microdata. Why do we need yet another syntax? The part of this new Microformats 2 reboot I’m most interested in participating in is the vocabulary part. Specifically, porting all of the Microformats Vocabularies over to RDFa 1.1 Profiles. The markup would be almost exactly the same as what is proposed on the Microformats 2 wiki page (example below).

Meh

Some of the changes to Microformats aren’t really necessary, nor do I think that they will result in stronger uptake of Microformats.

  • Root Class Name Only – Microformats aren’t that difficult to publish. Simplifying them down to a single root class name will probably not result in much additional uptake, or in data that is more interesting or helpful.
  • “hcard” instead of “vcard” – Yes, it was a point of confusion. I don’t think it really prevented people from implementing Microformats.

The Bad

Some of the most important things that the Microformats community needs to change are not addressed. I’d like to see them addressed before assuming that new work done in that community will have a lasting impact:

  • The Administrators – One of the strongest criticisms by the community has always been the status of the self-appointed leaders. They do a good job most of the time, but having a mechanism where the community elects the leaders and administrators would get us closer to a meritocracy. Not allowing the community to govern itself shows that you don’t trust the membership of the community. If you don’t trust us, how can we trust you? If there is a “you” and a “them”, then it becomes easy to have a “you versus them” situation. The Microformats community could learn a great deal from the Debian community in this respect.
  • The Process – I had previously complained that it was not very clear what you needed to do to clear each hurdle in the Microformats process. This seems to have been clarified with the new Microformats 2 work. I’m still concerned that too much is left in the hands of the “leaders”. There was a great deal of what I felt was “moving the goalposts” when developing hAudio. The process kept changing. If the process keeps changing, it can mean that all of your hard work may not end up making it to the “official” Microformats standard stage. So, I remain suspicious of the process as long as the community has no power over who gets to change the process and when.
  • Open Innovation – How does one innovate in the Microformats community? That is, how do we have an open discussion about the Commerce, Signature and PaySwarm Web vocabularies in the Microformats community? We’re trying to solve a real-world problem – Universal Payment on the Web. We need to have an open discussion about the Web vocabularies used to accomplish this goal. How can we have this discussion in the Microformats community?
  • Collaboration – How can the RDFa community, Microdata folks and the Microformats community work together? I’d really like all of us to work together. I’ve been trying to make this happen for several years now, each attempt met with varied levels of failure. Our continued track record of not reaching out and working with one another on a regular basis is damaging structured data adoption on the Web – and each community feels as if they are blame-less for the current state of affairs. “If only they’d listen to us, we wouldn’t be in this mess!”. Schema.org is just one signal that all of us need to come together and work on a unified way forward.

Working Together

So, how do we collaborate on this? We have added Microformats-like features to RDFa over the past few years because we wanted RDFa 1.1 markup to be just as easy as Microformats markup. This example is used on the Microformats 2 page:

<h1 class="h-card">
 <span class="p-fn">
  <span class="p-given-name">Chris</span>
  <abbr class="p-additional-name">R.</abbr>
  <span class="p-family-name">Messina</span>
 </span>
</h1>

The markup above can be easily expressed in RDFa 1.1, using RDFa Profiles like so:

<h1 typeof="hcard">
 <span property="fn">
  <span property="given-name">Chris</span>
  <abbr property="additional-name">R.</abbr>
  <span property="family-name">Messina</span>
 </span>
</h1>

This is useful to the Microformats 2 work because every RDFa 1.1 compliant parser could easily become a compliant Microformats 2 parser. Food for thought.

Let’s try to work together on this. As a first step, I think that the RDFa community could easily generate RDFa Profiles for Microformats. This would give people the ability to use Microformats either in the Microformats 2 syntax, or in RDFa 1.1 syntax. That would drive further adoption of the Microformats vocabularies – which would be great for both communities. How can we make this happen?

Thanks to DL and DIL for reviewing this post.

5 RDFa Features Inspired by Microdata and Microformats

Full disclosure: I am the current Chair of the group at the World Wide Web Consortium that created RDFa. That said, this is my personal blog – I am not speaking on behalf of the W3C, RDFa Working Group, RDF Web Apps Working Group or my company, Digital Bazaar.

I’ve seen a few comments by Web authors and developers like this over the past several years:

“as a web developer, I have to say … w3c was neglecting web developers with rdfa for last X years.” — Andraz Tori

Every time that statement is made in my presence, I attempt to calmly explain that this is not true. Sometimes it’s a bad experience that the person has had with a standards body, but most of the time the commenter doesn’t understand how the Internet and the Web are built. Here’s the explanation I typically give:

The RDFa Working Group cares very deeply about what Web developers have to say. All RDFa Working Group meetings are publicly recorded and available, anyone can join the public mailing list and contribute, and we have a public issue tracker. There is nothing to stop anyone from participating and contributing. If people demonstrate deep knowledge of structured data and contribute frequently, they’re usually asked to join the Working Group as Invited Experts. We are required to address all public input – you cannot get a Web/Internet spec until you do that. If you don’t prove that you have addressed all public input, you don’t get an official spec – it’s as simple as that.

The reason that we take public input so seriously is that we want to create a standard that works for the greatest number of people while keeping the complexity of the specification to a manageable level. That is, when forced to decide between the two, we put Web publishers and developers first – and parser implementers second.

You don’t need to know much about the history of Microformats, RDFa and Microdata to understand this post. Microformats and RDFa came about at roughly the same time, around 2004. RDFa has had a number of its features inspired by Microformats. Microdata started off as direct modifications to RDFa that removed some of the features that the RDFa folks felt were necessary. RDFa has also pulled in a number of newer features from Microdata. The rest of the article describes what these features are, where they came from, and why we included them.

1. Profiles and Terms

I’ve spent a good deal of time in the Microformats community. When I was asked to join the RDFa Working Group, a great deal of that thinking came along with me. Luckily, many of the others in the RDFa Working Group shared much of this thinking about Microformats. One of the most striking features of Microformats is the simplicity of the markup and the vocabularies. These Web vocabularies are typically expressed using Profiles. Here is the list of Microformats profiles.

We wanted to provide the same sort of simple Markup in RDFa 1.1, so we introduced the concept of Terms and RDFa Profiles. This feature allows you to use Microformats-like markup in RDFa 1.1:

<body profile="http://microformats.org/profile/hcard">
...
<div typeof="vcard">
    <span property="fn">Tantek Çelik</span> is known on Twitter as <span property="nickname">t</span>.
</div>
</body>

2. Absolute IRIs

RDFa 1.0 allowed people to compact IRIs so that fewer mistakes would be made when typing in a whole bunch of property names. The Microdata folks felt that compact IRIs are problematic because prefixes can be re-bound to different values. People carelessly copying and pasting source code could accidentally mess up what a chunk of HTML is supposed to mean if they forget to declare the prefixes, or declare them differently. Absolute IRI support in all RDFa 1.1 attributes was added to address this concern. If Web developers are generating code that they expect people to cut-and-paste, and they don’t like CURIEs, they can use absolute IRIs instead. This means that the following markup:

<div prefix="dc: http://purl.org/dc/terms/">
   <h2 property="dc:title">My Blog Post</h2> by 
   <h3 property="dc:creator">Alice</h3>
   ...
</div>

Can be written like this, instead:

<div>
   <h2 property="http://purl.org/dc/terms/title">My Blog Post</h2> by 
   <h3 property="http://purl.org/dc/terms/creator">Alice</h3>
   ...
</div>

The markup above doesn’t need to have prefixes declared, nor is it susceptible to some types of careless cut-and-pasting.

3. The @vocab Attribute

Microdata and Microformats are clever in the way that you don’t need to use CURIEs or URIs to express properties. Unfortunately, for the RDFa folks, those properties are not very good Semantic Web identifiers because they are not dereferenceable. That is, a human could not stick a shortened vocabulary term from Microdata into a Web browser and find out what that term is all about. A machine could not follow the Microdata vocabulary term URL and hope to find anything useful at the end of it. The ability to follow any URL and find out more about it is often referred to as “follow-your-nose”, and is an important part of the design of RDFa.

The RDFa 1.1 work focused on pulling this feature over from Microdata’s itemtype attribute, but also ensuring that it would work for follow-your-nose. The following markup demonstrates how an RDFa 1.1 processor can use Microdata-like markup when using a single vocabulary, but still support follow-your-nose:

<div vocab="http://schema.org/">
   <ul>
      <li typeof="Person">
        <a rel="url" href="http://example.com/bob/">Bob</a>
      </li>
      <li typeof="Person">
        <a rel="url" href="http://example.com/eve/">Eve</a>
      </li>
      <li typeof="Person">
        <a rel="url" href="http://example.com/manu/">Manu</a>
      </li>
   </ul>
</div>

If we take the http://schema.org/Person term, we can plug that into a Web browser and find out more about the vocabulary term. Unfortunately, schema.org doesn’t provide a machine-readable version of their vocabulary. For an example of a human-and-machine readable vocabulary, please see http://purl.org/media/audio.
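
As an aside, the expansion an RDFa 1.1 processor performs for a term used with vocab is nothing more than concatenation – here is a minimal sketch of the idea (illustrative only, not the actual processor code):

// minimal sketch: a bare term used with vocab expands to vocab + term,
// which is the URL you can then follow your nose to
function expandVocabTerm(vocab, term)
{
   return vocab + term;
}

expandVocabTerm("http://schema.org/", "Person");
// -> "http://schema.org/Person"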

4. Web Apps API

Web developers typically don’t want to be bothered with the document markup when they are programming. The Microdata specification provides a DOM API in order to read items into JavaScript objects so that structured data in the page can be processed by Web Applications. This was clearly one of the key differentiators of Microdata in the beginning, and seemed to be a feature that many Web developers were excited about. Of particular note was that Microformats historically have not had a clear generic parsing model or an API, which may have held back their adoption in Web Applications. These two shortcomings are being actively discussed in the microformats-2 work.

The RDFa Working Group paid close attention to these developments, learned from them, and finally concluded that an RDFa DOM API was necessary in order to make the use of RDFa for Web Developers easier. For example, to find out all of the subjects on the page that contain a name, one need only do something like this:

thingsWithNames = document.data.getSubjects("foaf:name");

To get all of the names associated with a particular thing, a Web developer could do this:

var thingNames = document.data.getValues(thing, "foaf:name");
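
Putting those two calls together, here is a rough sketch of what a small Web App might do with the draft API. It is illustrative only, since the exact return types were still being worked out at the time – I’m assuming both calls return plain arrays:

// illustrative only: print every name found in the page's RDFa
// (assumes getSubjects() and getValues() both return plain arrays)
var subjects = document.data.getSubjects("foaf:name");
for(var i = 0; i < subjects.length; i++)
{
   var names = document.data.getValues(subjects[i], "foaf:name");
   for(var j = 0; j < names.length; j++)
   {
      console.log(subjects[i] + " has the name: " + names[j]);
   }
}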

5. Projections/JSON-mapping

Everyone loves JSON. It is a simple data format that is incredibly expressive, compact and maps easily to JavaScript, Python, Ruby and many other scripting languages. Microdata has a native mapping from markup on the page to a JavaScript object and JSON serialization. The RDFa Working Group saw this as a powerful feature, but also thought that Web Developers should have the ability to map objects to whatever layout made the most sense to them. The concept of a Projection was proposed and now closely mirrors all of the benefits provided by the Microdata-to-JSON mapping, along with giving developers the added benefit of freely “projecting” objects from structured data in a Web page.

For example, developers could get all people on the page like so:

var people = document.data.getProjections("rdf:type", "foaf:Person");

or they could build specific objects, and access the object’s members like so:

var albert = document.data.getProjection("#albert", {"name": "foaf:name"});
var name = albert.name;

This feature is detailed in the RDFa API right now, but may become more generalized and apply to any structured data language like Microformats or Microdata.

Closing Thoughts

The RDFa Working Group cares very deeply about what Web developers have to say. All three syntaxes for structured data on the Web today have cross-pollinated with one another – that’s a good thing. We feel that with RDFa 1.1, we took some of the best features of Microdata and Microformats and made them better. We provide functionality in a way that allows Web Developers to use as few or as many of these features as they so desire. We continue to listen and improve RDFa 1.1 in order to make it an effective tool for Web authors, publishers and developers. After all, one of the goals of the RDFa Working Group is to discover and standardize what the Web community wants – to make authoring and using RDFa content easier.

Thanks to DL, MB, DB, and DIL for reviewing the post and providing feedback and change suggestions.

The False Choice of Schema.org

Full disclosure: I am the current Chair of the group at the World Wide Web Consortium that created RDFa. That said, all of this is my personal opinion – I am not speaking on behalf of the W3C or my company, Digital Bazaar. I am biased, but also have been around long enough to know when freedom of choice on the Web is being threatened.

Some of you may have heard that Microsoft, Google and Yahoo have just released a new uber-vocabulary for the Web. As the site explains, if you use schema.org, you will get a better-looking search listing on Bing, Google and Yahoo. While this may sound good on the surface, it is very bad news for choice on the Web. There are a few points that I’d like to make in this post:

  1. RDFa and Microdata markup are similar for the schema.org use cases – they should both be supported.
  2. Microdata doesn’t scale as easily as RDFa – early successes will be followed by stagnation and vocabulary lock-in.
  3. All of us have the power to change this as the Web community – let’s do that. We will release a plan shortly.

The schema.org site makes it appear as if you must pick sides and use Microdata if you want preferential treatment. This is a false choice! They even state that you cannot use RDFa, Microdata and Microformats on the same page as it will confuse their parsers – forcing Web designers to exclusively use Microdata or be lost in the morass of search listings [Edit: Google has since retracted this statement.]. The entire Web community should decide which features should be supported – not just Microsoft or Google or Yahoo. We must not let the rug be pulled out from under us; we must band together and make our voices heard. We must make it very clear that we want to use what is best for us, not what a few people think is best for three large corporations.

Google and Yahoo already support Microformats, Microdata and RDFa in their advanced search services (Google Rich Snippets and Yahoo Search). So, why is it that we cannot continue to use what has been working for our organizations? Of the three, RDFa supports far more communities and is currently used far more heavily than Microdata. So, what possible reasons could they have to now exclude RDFa? Why exclude Microformats?

The patent licensing section alone sent shivers down my spine, but the most glaring concern is the reasoning to use Microdata.

Complexity

Q: Why microdata? Why not RDFa or microformats?
Focusing on microdata was a pragmatic decision. Supporting multiple syntaxes makes documentation for webmasters more complex and introduces more overhead in terms of defining new formats.

Being pragmatic is about balance. It’s true that supporting multiple syntaxes makes documentation and management more complex, so reduction can be good, but at what cost?

RDFa is extensible and very expressive, but the substantial complexity of the language has contributed to slower adoption.

Yes, it is extensible and very expressive. However, I don’t buy the “more complex” argument at all for the schema.org use case. The RDFa 1.1 community has been extremely focused on Web developer feedback and simplifying the markup. For example, take this Microdata snippet from schema.org:

<div itemscope itemtype="http://schema.org/CreativeWork">
   <img itemprop="image" src="videogame.jpg" />
   <span itemprop="name">Resistance 3: Fall of Man</span>
   by <span itemprop="author">Sony</span>,
   Platform: Playstation 3
   Rated:<span itemprop="contentRating">Mature</span>
</div>

and compare against the RDFa 1.1 equivalent:

<div vocab="http://schema.org/" typeof="CreativeWork">
   <span rel="image"><img src="videogame.jpg" /></span>
   <span property="name">Resistance 3: Fall of Man</span>
   by <span property="author">Sony</span>,
   Platform: Playstation 3
   Rated:<span property="contentRating">Mature</span>
</div>

The complexity difference between the two languages for the simple use cases is negligible. Make no mistake – there are politics being played here and we will eventually get to the bottom of this. When you get to more advanced use cases, such as mixing vocabularies, RDFa really shines. In Microdata, your choice in vocabulary is exclusive. In RDFa, your choice in vocabulary is inclusive. That is, you can mix-and-match vocabularies that suit your organization far more easily in RDFa than you can in Microdata. Vocabulary mixing will become far more prevalent as structured data grows on the Web.

Some have argued that some of the more involved features are complex, but the counter argument has always been: Well, don’t use those features. Those features aren’t just there to be purely complex – they were specifically requested by the Web community when building RDFa. Microdata is lacking many of those community-requested features, which does make it simpler, but it also makes it so that it doesn’t solve the problems that the “complex” features were designed for. RDFa is designed to solve a wider range of problems than just those of the search companies. Yes, complexity is bad – but so is cutting features that the Web community has specifically requested and needs to make structured data on the Web everything that it can be.

Adoption

RDFa is extensible and very expressive, but the substantial complexity of the language has contributed to slower adoption.

The “slower adoption” statement is pure bunk. Of Microformats, RDFa and Microdata, RDFa is the only one that has experienced triple-digit growth over the last year – 510% growth, to be exact. There are no such figures for Microdata. If you are going to claim that something has slow adoption, then you have to measure it against something else. Where is the public, hard data to demonstrate that Microdata is growing faster than Microformats and RDFa? Both the Microformats and RDFa communities have provided hard numbers in a public forum. I suspect that these numbers have not been published for Microdata because they do not exist. If the numbers do exist, they should be made public so that we may check the veracity of this claim.

So we are left guessing: slower adoption compared to what? Since when did triple-digit growth figures become not good enough? With claims that run counter to publicly available hard data, it seems as if something fishy is going on here. These numbers will probably not matter in the long run. If Google, Microsoft and Yahoo all said that you need to embed their proprietary markup in pages, people would do it if it meant higher search ranking. The adoption rate of any markup would increase if Google, Microsoft and Yahoo mandated it. That doesn’t mean that it would result in something that is better for the Web.

The False Choice

We will also be monitoring the web for RDFa and microformats adoption and if they pick up, we will look into supporting these syntaxes.

Since Google, Microsoft and Yahoo have said that the new schema.org vocabulary expressed in RDFa isn’t supported, people won’t use it. There are no RDFa examples on the schema.org entity pages, and it’s not even clear whether they will index RDFa that expresses schema.org structured data. They’ve created a catch-22 situation. RDFa and Microformats adoption for schema.org will not pick up because they go out of their way not to support it. Even if they were to support extracting the schema.org vocabulary from RDFa, I don’t know how much more RDFa and Microformats would have to “pick up” to qualify. If triple-digit growth isn’t enough, then what is?

Microformats were created in an open and community-driven way. RDFa was created in an open and community-driven way. Schema.org was not, and if it catches on, expect it to scale poorly over the long term and to increase vocabulary lock-in to the major search companies. Which are you going to choose – Facebook’s Like button markup, or Google/Microsoft/Yahoo’s Microdata markup? You are being put into the position of choosing one of those exclusively.

We, the publishers, developers and authors of the Web, have the power to change this. We need to make it clear that we want to be able to express structured data in whatever language we choose. We create the content that Google, Microsoft and Yahoo index; it is a two-way conversation. Google and Microsoft do not tell us what is worth indexing – that is our choice, our freedom to decide.

Action

Don’t let this freedom be taken away from us and from the rest of the Web. Schema.org is the work of only a handful of people under the guise of three very large companies. It is not the community of thousands of Web Developers that RDFa and Microformats relied upon to build truly open standards. This is not how we do things on the Web.

The feedback form for schema.org is below. Let them know that you want RDFa supported for schema.org as a first-class language. Tell them that you want Microformats to continue to be supported if you use them. Let them know that you want to see data backing up the claim that Microdata is the best and only choice. Let them know that you want the vocabularies provided on the schema.org site to go through a public review process. Ask them why they aren’t reusing the good work done by the Microformats community or the many Web vocabulary authors that have already put years into creating solid Web vocabularies. Let them know that you don’t think that a handful of people should decide what will be used by hundreds of millions of people. We should be a part of this decision – let them know that.

We’re getting a plan of action together for those that care about freedom of choice on the Web. I’ll tweet a call to action via @manusporny when it is ready, roughly 1-2 weeks from now.

Thanks to B, T, M, D, and D for reviewing this post and suggesting changes.

Linked JSON: RDF for the Masses

There are times when we can see ourselves doing things that will be successful, and then there are times when we can see ourselves screwing it all up. I’ve just witnessed the latter in the RDF Working Group at the World Wide Web Consortium and thought that it may help to do a post-mortem on what went wrong. This was a social failure, not a technical one. Unlike technical failures, social failures are so much more complicated – so, let’s see if we can find out what went wrong.

Background

I spend a great deal of my time trying to convince technology leaders at large companies like Google, the New York Times, Facebook, Twitter, Sony, Universal and Warner Brothers to choose a common path forward that will help the Web flourish. Most of that time is spent at the World Wide Web Consortium (W3C), in standards working groups, trying to predict and build the future of the Web. I’m currently the Chair of the RDF Web Applications Working Group, formerly known as the RDFa Working Group. My participation covers many different working groups at the W3C: RDFa, HTML5, RDF, WebID, Web Apps, Social Web, Semantic Web Coordination, and a few others. The hope is that all of these groups are building technologies that will actually make all of our lives easier – especially for those that create and build the Web.

The Pull of Linked Data

There is a big push on the Web right now to publish data in an inter-operable way. RDFa is a good example of this new push to get as much Linked Data out there as possible. Our latest work in the RDF Working Group was to try and find a way to bring Linked Data to JSON. That is, we were given the task of figuring out a way to get companies like Google, Yahoo!, The New York Times, Facebook and Twitter to publish their data in a standards-compliant format that the rest of the world could use. We’ve already convinced some of these large companies to publish their data in RDFa. This was a huge win for the Web, but it was only a fraction of the interesting data out there. The rest of it is locked up in Web Services – in volumes of JSON data that are passed back and forth via JSON-REST APIs every day.

Wouldn’t it be great if we had something like RDFa for JSON? A way for a standard software stack to extract globally meaningful objects from Web Services? In fact, that is what JSON-LD was designed to do. There are also a number of other JSON formats that could be read not only as JSON, but as RDF. If we could get the world to start publishing their JSON data as Linked Data, we would have more transparency and more inter-operable systems. The rate at which we re-use data from other JSON-based systems would grow by leaps and bounds.

This was the charge of the RDF Working Group, and at the Face-to-Face meeting a little over a week ago, we failed miserably to deliver on that promise.

Failure Timeline

Here is a quick run-down of what happened:

  • March 2010: Work starts on JSON-LD – focusing on an easy-to-use, stripped down version of Linked Data for Web Developers. The work builds on previous work done by lots of smart people across the Web.
  • Summer 2010: A W3C RDF Workshop finds that there is a deep desire in the community for a JSON-based RDF format.
  • January 2011: The RDF Working Group starts up and begins analyzing 10 different RDF-in-JSON format proposals. There is general confusion in the group as to the exact community we’re attempting to address. Some think it’s people that are already using RDF/graph stores and SPARQL; others believe we are attempting to bring independent Web developers into the world of Linked Data. I was of the latter mindset – we don’t need to convince people that are already using RDF to keep using RDF.
  • March 2011: Arguments continue about which features of JSON we’ll use and whether we are just creating another triple-based serialization for RDF or an easier-to-use form of Linked Data in JSON.
  • April 2011: At the RDF Face-to-Face, a show of hands decides to place the JSON work intended for independent Web Developers on the back burner for a year or more. The reason was that there was no consensus that we were solving a problem that needed to be solved.

Before I get into what went wrong: I don’t intend any of this to be bashing anyone in the RDF Working Group. They’re all good people that want to do good things for the Web. Many of them have put years of work into RDF – they want to see it succeed. They are also very smart people – they are the world’s leading experts in this stuff. There were no politics or back-room dealings. The criticism is more about the group dynamic – why we failed to deliver what some of us saw as our primary directive in the group.

What Went Wrong?

How did we go from knowing that people wanted to get Linked Data out of JSON to deciding to back-burner the work on providing just that to the people that build the Web? I pondered what went wrong for about a week and came up with the following list:

  • I failed to gather support and evidence that people wanted to get Linked Data out of JSON. I place most of the blame on myself for not educating the group before the decision needed to be made. I wouldn’t be saying this if the vote was close, but when it came time to show who supported the work – out of a group of 20-some-odd people, only two raised their hands. One of those people was me. I should have spent more time getting the larger companies to weigh in on the matter. I should have had more documentation and evidence ready on why the world needed to get Linked Data out of JSON. I should have had more one-on-one conversations with the people that I could see struggling with why we needed Linked Data for JSON. I assumed that it was obvious that the world needed this and that assumption came back to kick our collective asses.
  • A lack of Web App developers in the RDF Working Group helped compound the problem stated above. Most of the group didn’t understand why just serializing triples to JSON wasn’t good enough, as most of them had APIs to make sense of the triples. They were also not convinced that we needed to bring Web App developers into the RDF community. RDF is already successful, right? Wrong. Every RDF serialization format is losing out to JSON when it comes to data interchange – not by a little, but by a staggering margin. The RDF community is pathetically tiny compared to the Web App development community. The people around the world that use JSON as their primary data serialization format easily outnumber those using RDF 100-fold. I’m convinced that there is a problem. I don’t think that the majority of the traditional RDF community thinks that there is a problem.
  • Lacking a common vision will kill a community. It has been said that standards groups should not innovate, but instead they should standardize solutions that are already out in the marketplace. There are days where I believe this – the TURTLE work has been easy to move forward in the RDF Working Group. There are also days where I know this is not true. Standards groups can be fantastic innovators – just look at the WHATWG, CSS, RDFa, Web Applications, and HTML5 Working Groups. At the heart of the matter is whether or not a group has a common vision. If you don’t have a common vision, you go nowhere. We didn’t have a common vision for the Linked Data in JSON work.
  • Only one company in the group was depending on the technology to be completed in order to ship a product. That company was Digital Bazaar, for the PaySwarm work. None of the other companies really have any skin in the game. Sure, some of them would like to see something developed, but they’re not dependent on it. One recipe for disaster is to get a group of people together to work on something with hardly any negative consequence for failure.
  • I pushed JSON-LD too hard when discussing the various possibilities. I pushed it because I thought it was the best solution, and still do. I think my sense of urgency came across as being too pushy and authoritarian. This strategy, if you could call it that, backfired. Rather than open up a debate on the proper Linked Data JSON format, it seemed as if some people refused to have any sort of debate on the formats and instead chose to debate which community we were attempting to address in order to slow down the decision process until they could catch up with the state of all of the serialization formats.
  • Old school RDF people compose the majority of the RDF Working Group. It’s hard to pinpoint, but I saw what I could only describe as an “old world” mentality in the RDF Working Group. Browser-based APIs and development weren’t that important to them. They had functioning graph storage engines, operational SPARQL query engines, and PhDs to solve all of the hard problems that they may find in their everyday usage of RDF. Independent Web developers rarely have all of these advantages – many of them have none of these advantages. Many Web developers only have a browser, JavaScript, some server side code and JSON.parse() for performing data serialization and deserialization. JSON coupled with REST is simple, fast, stable and works for most everything we do. To solve 80% of our problems, there is no need for the added complexity that the “old school” RDF crowd brings to the table.
  • The RDF Working Group didn’t do their homework. We are all busy, I get that. However, even after two months, it was painfully clear that many in the group had not taken the time to understand the proposals on the table in any amount of depth. In some cases, I’m convinced that some did not even look at the proposals before passing judgement on whether or not the solution was sound.
  • Experts tend to over-analyze and cripple themselves and their colleagues with all of the potential failure scenarios. There were assertions in the group at times that, while they had some basis in validity, were not constructive and came across as typical academic nay-saying. It is easier to find reasons why a particular direction will not succeed when you’re an expert. This nay-saying was very active in the RDF Working Group. We didn’t have a group that was saying “Yes, we can make this happen.” Instead, we had a minority that set the tone for the group by repeating “I don’t know if this’ll work, let’s not do it.”

I think the RDF Working Group has lost its way – we have forgotten the end goal of enabling everyone on the Web to use Linked Data. We have chosen to deal with the easier problems instead of taking the biggest problem (adoption) seriously. There are many rational arguments to be made about why we’re not doing the work – none of those reasons are going to help spread Linked Data outside the modestly sized community that it enjoys at the moment. We need to get Web App developers using Linked Data – JSON is one way to accomplish that goal. It is a shame that we’re passing up this opportunity.

All is Not Lost

The RDF Working Group is only working on one interesting thing right now – and that’s how to represent multiple graphs in the various RDF serializations. Call them Named Graphs, or Graph Literals, or something else – but at least we’re taking that bull by the horns. As for the rest of the work that the RDF Working Group plans to do – it’s uninspired. I joined the group hoping to breathe some new life into RDF – make it exciting and useful to JavaScript developers. Instead, we ended up spending most of our time polishing things that are already working for most people. Don’t get me wrong, it’s good that some of these things are being polished – but it’s not going to impact RDF adoption rates in any significant way.

All is not lost. We decided to create a public “Linked Data in JSON” mailing list (not activated yet) where the people that would like to see something come of Linked Data in JSON could continue the work. We’re already revving JSON-LD and updating it to reflect issues that we discovered over the past several months. That’s where I’ll be spending most of my effort on Linked Data in JSON from now on – the RDF Working Group has demonstrated that we can’t accomplish the goal of growing the Linked Data community there.

Basing Design Decisions on “Those Six Guys”

As I explained in yesterday’s post, one of the core features of RDFa is under attack in the HTML Working Group. I use the phrase “under attack” loosely because I can’t imagine that the Chairs for the HTML Working Group are going to remove CURIEs based on Ian Hickson’s (the editor of the HTML5 specification) current proposal to eviscerate RDFa. But hey, stranger things have happened. I’m concerned about this possibility because I’m also the current Chair of the RDFa Working Group. Removing CURIEs would break all of the currently deployed RDFa content out there (430+ million web pages), so I doubt it’ll be removed. Personally, I think that the case against CURIEs is so weak that it is laughable… or cry-able. Honestly, how it affects me switches from day to day – working on standards does that to you; it makes you manic. Today, it’s laughter – so let’s laugh a bit, while we can.

Ground Rules

Fair warning: This post is probably going to get a bit ranty from this point on. I try not to rant publicly too often because it can be construed as petty whining or, at worst, used as a reason to divide communities and people. So, know that this rant is mostly petty whining with a sprinkling of irony. It’s hard to understand what’s going on here and not have a good laugh about it. This rant is mostly about a few key individuals in the Web standards community and how, at times, fantastically ridiculous things can emerge from multi-year conflicts between them. Nerd fight!

I also don’t want this to come across as a slam against Ian Hickson. I respect the guy because he’s moving HTML5 forward and he’s pissed off enough people along the way to make people passionate about Web standards again. I like that people get excited about HTML5, even though many of them don’t know what in the hell it means. Hell, even I don’t know what it means other than it’s supposed to cure this nasty rash I picked up in Nicaragua.

Back to our favorite non-benevolent, benevolent dictator. I don’t appreciate many of the political tactics and doublespeak Ian uses in the name of moving the Web forward, but I tend to not care about most of the crap he pulls unless it causes the good people in the standards community grief. There’s even a really fun (NOT SAFE FOR WORK) website that follows the many antics of Ian Hickson and friends. So, hats off to Ian for getting shit done.

I Do Science. Now You Can Too!

Now, let’s have some fun. One of the arguments that Ian has been making ever since we pushed for RDFa to go into HTML5 goes something like this:

“In a usability study for microdata, it was discovered that authors in fact have no difficulty dealing with straight URLs rather than shortening them with prefixes.”

The usability study alluded to was done by Google and was used to determine a few of the features that one can find in Microdata (which is a competing specification to RDFa in HTML5). When it was performed, Ian was quick to point out that Google had run a study showing that Web developers had no issue typing out full URLs, but no data was released for many months. Some of the reasons cited were privacy concerns, it being an internal Google study, “I know better than you”, etc.

However, that didn’t stop Ian from referencing the usability study when discussing what he deemed to be faults in RDFa. In fact, the blog post referenced by the change proposal (which was put together fourteen days ago) to remove CURIEs from RDFa was the first time that I had ever seen the raw data, or the number of people that had taken part in the study. I passed it by others in the community and it was the first time that they had seen the data as well. There are many, many people that track this stuff and the fact that this was posted in October 2009 and we are just now seeing the data is… well, I just can’t explain how the entire RDFa community missed this vital piece of information that we have been looking for for the past two years. Anyway, there it is – all the data on the thorough set of tests run by Google, across potentially hundreds of participants, all for figuring out how key features of Microdata would work.

Bah – Confidence, Schmonfidence.

One of the first things you look for when somebody alludes to a “scientific study” is the number of participants. The other thing is the “confidence level”, which is how sure you can be that what you find generalizes to the population you’re testing. A quick calculation would show us that if we had 100,000 people in the world that write HTML by hand on occasion, and we wanted to be at least 90% confident in what we find, we would need roughly 383 people to participate in the study.
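
For the curious, the back-of-the-envelope calculation looks roughly like this – the standard sample-size formula with a finite population correction. The margin of error isn’t stated above, so the 5% value below is just a common default, not a figure from the study:

// rough sketch: Cochran's sample-size formula plus a finite population
// correction; the z-score encodes the confidence level (1.645 for 90%,
// 1.96 for 95%) and the 5% margin of error is an assumed default
function requiredSampleSize(population, zScore, marginOfError)
{
   var p = 0.5; // worst-case proportion, which maximizes the sample size
   var n0 = (zScore * zScore * p * (1 - p)) / (marginOfError * marginOfError);
   return Math.ceil(n0 / (1 + ((n0 - 1) / population)));
}

requiredSampleSize(100000, 1.96, 0.05);  // roughly 383 at 95% confidence
requiredSampleSize(100000, 1.645, 0.05); // roughly 270 at 90% confidence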

So, I scan the study and find them… all six of them, who will henceforth be known as those six guys. Note: I’m using the gender-neutral form of guys so as to not be a sexist asshole.

Those six guys are far fewer than the 383 people that you would need for a decent scientific study. Design decisions were made for Microdata based on those six guys. There were supposed to be seven people partaking in the test, but one of them was a no-show. Perhaps it was because they knew some basic statistics and understood that the study was a waste of their time; perhaps it was because they didn’t want to feel the pressure of making a design decision for the billions of people in the world that use the Web. Who knows!

Sure that I had missed something, I read through the study again. I re-checked my calculations on sample size, attempted to figure out what the confidence level for those six guys could be when applied to smaller populations, everything was pointing to something fantastically wrong having happened. Here we were, the RDFa Community, having to defend ourselves against an unknown Google study over the past several years where we believed everything had been done with the mythical exacting precision honed on every problem to cross Google’s path. There was some sort of deep Google A/B testing that was applied to this study, of this, we were certain.

Take a wild guess at how confident you can be when, out of a population of 100,000 people, you sample only six of them. By sampling those six guys, your minimum confidence level is 54%. That’s almost equivalent to the confidence you get by flipping a fucking coin. Those six guys don’t represent science, they represent random chance. You could put together a test with three donkeys and five chinchillas and be more confident about your findings than the Microdata Usability Study. I suggested the donkey-chinchilla metric to the engineering team at our company and, like most of my brilliant ideas, they chose to ignore it, or me. It’s difficult to tell when people refuse to make eye contact with you.

One important key to success is self-confidence. Another is not being wrong.

Ian goes on to draw conclusions from those six guys:

“One thing we weren’t trying to test but which I was happy to see is that people really don’t have any problems dealing with URLs as property names. In fact, they didn’t even complain about URLs being long, which reassured me that Microdata’s lack of URL shortening mechanisms is probably not an issue.”

Wait a sec. “people”? “they”? Who in the hell are we talking about here? Are we talking about those six guys? I’m pretty sure that’s who we’re talking about, not the general Web developer population. I showed this to some other folks that have a grasp of college-level statistics and it was fun to see them wince and then watch as their heads, figuratively, imploded. One absolutely should not draw any conclusions from the horribly flawed Microdata Usability Study. Ian had asked me a few weeks ago why we hadn’t done a usability study like the Microdata one and, not having known the number of participants in that study, I replied that we just didn’t have the resources that Google did to carry out such a comprehensive study.

But here we are – a brave new world of “scientific inquiry”! So, I headed off with a spring in my step, determined to do as good a job as the Google Microdata study. I e-mailed, IMed, phoned and talked my way through until I had responses from 12 people, twice as many as the Google study! I asked them whether they found URIs hard to type and whether they found CURIEs useful. BIAS ALERT: These are all people that have deployed or are successfully deploying RDFa as a part of a product. The answer in all cases was unequivocal: “Don’t you have anything better to do with your time? Why in the hell are you wasting my time with these questions? Of course CURIEs are necessary to ease authoring, isn’t it obvious?”

So, take that Microdata Usability study’s 54% confidence level! Based on my in-depth study, I’m at least 65% confident that CURIEs are useful.

Yes, I’ve done hallway testing and yes, I’m aware of things like the Nielsen/Landauer formulas for usability testing. Yes, I think that usability testing is very important – when done correctly and with a complete set of alternatives. A single test with six people does not qualify. So, what’s the lesson here? Oh yes – it’s that anecdotal evidence and studies based on those six guys are worth as much as the time it takes to slap them together. That is, they’re next to worthless.

What is worth something is hard data that is statistically significant – such as the fact that there are at least 430 million web pages containing RDFa and CURIEs today. That there are currently 23,913 RDFa-enabled Drupal 7 sites using CURIEs right now, a number that will grow to 350,000+ sites in two years. That Google, Yahoo, Bing, Facebook, Flickr, Overstock.com, Best Buy, Tesco, Newsweek, O’Reilly, The Public Library of Science, the US White House, and the UK Government, among tens of thousands of other websites, are successfully using RDFa and CURIEs.

Remove CURIE support from HTML5+RDFa and most of those sites’ metadata will go dark. I’m laughing about the prospect of that today, but only because it seems laughable. If CURIEs are removed from HTML+RDFa, I cannot imagine the shit-storm that’s going to rain down on the RDFa Working Group and the HTML Working Group. Haha. *sob*

The Case for CURIEs

Should we provide a way for Web developers to shorten URLs? This question is at the core of a super-geeky argument spanning three years about how we can make the Web better. This URL shortening technology was quietly released with RDFa in 2008 and is known as Compact URI Expression, also known as the CURIE.

We, the Linked Data community, thought that this would be handy for Web developers that want to add things like Google Rich Snippet markup to their web pages. Rich snippets basically allow Web developers to mark up things like movies, people, places and events on their pages so that a search engine can find them and display better search listings for people looking for those particular movies, people, places or events. Basically, RDFa helps people find your web page more easily because the search engines can now better understand what’s on your Web page.

There are over 430 million Web pages that use RDFa today, and based on Drupal 7’s release numbers alone (Drupal 7 includes RDFa by default), there will be over 350,000 websites publishing RDFa within two years. So, RDFa and CURIEs are already out there, they’re being used successfully, and they’re helping search engines better classify the Web.

It may come as a surprise to you, then, that the HTML Working Group (the people that manage the HTML5 standard) is currently entertaining a proposal to remove CURIEs from HTML+RDFa. Wait, what!? You read that correctly – this would break most of those 430 million web pages out there. Based on the way HTML5 works, if a Web server tells your web browser that it’s sending a text/html document, the browser is supposed to use HTML5 to interpret the document. If HTML5+RDFa doesn’t have CURIE support, none of the CURIEs in those pages are going to be recognized. All web pages that are currently deployed, using CURIEs correctly, and being served as text/html (which is most of them, by the way) will break.

What are CURIEs and why are they useful?

CURIEs are a pretty simple concept. Let’s say that you want to talk about a set of geographic coordinates and you want a search engine to find them easily in your Web page. The example below talks about geographic coordinates for Blacksburg, Virginia, USA. Let’s say that anytime someone searches for something near “Blacksburg” in a search engine, you also want your page to show up in the search results. You would have to tag those coordinates with something like the following in your HTML:

<div about="#bburg">
   <span property="latitude">37.229N</span>
   <span property="longitude">-80.414W</span>
</div>

Unfortunately, it’s not as simple as the HTML code above because the search engine doesn’t know for certain that you mean the same “latitude” and “longitude” that it’s looking for. To solve this problem, the Web uses URLs to uniquely identify these terms, so the HTML code ends up looking like this:

<div about="#bburg">
   <span property="http://www.w3.org/2003/01/geo/wgs84_pos#latitude">37.229N</span>
   <span property="http://www.w3.org/2003/01/geo/wgs84_pos#longitude">-80.414W</span>
</div>

Now the thing that sucks about the code above is that you have to type those ridiculously long URLs into your HTML every time you want to talk about latitude and longitude. This is exactly what you have to do in Microdata today. There is a better way in RDFa – by using CURIEs, we can shorten what we have to type like so:

<div prefix="geo: http://www.w3.org/2003/01/geo/wgs84_pos#" about="#bburg">
   <span property="geo:latitude">37.229N</span>
   <span property="geo:longitude">-80.414W</span>
</div>

Notice how we define a prefix called geo and then we use that prefix for both latitude and longitude. Now, we didn’t save a great deal of typing above, but imagine if you had to type out 100 or 200 of these types of URLs during the week. How often do you think you would mistype the long form of the URL vs. the CURIE? How difficult would it be to spot errors in the long form of the URL? How much extra typing would you have to do for the long form of the URL? CURIEs make your life easier by reducing the mistakes that you might make when typing out the long form URL and they also save Web developers time when deploying RDFa pages.
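
To make the mechanism concrete, here is a minimal sketch of the prefix expansion that a CURIE-aware processor performs – illustrative only, not the actual parsing rules from the specification:

// minimal sketch: expand a CURIE by replacing the prefix with its
// declared URL and appending the reference that follows the colon
function expandCurie(prefixes, curie)
{
   var colon = curie.indexOf(":");
   var prefix = curie.substring(0, colon);
   var reference = curie.substring(colon + 1);
   if(prefixes[prefix] !== undefined)
   {
      return prefixes[prefix] + reference;
   }
   // no matching prefix was declared, so leave the value alone
   return curie;
}

var prefixes = {"geo": "http://www.w3.org/2003/01/geo/wgs84_pos#"};
expandCurie(prefixes, "geo:latitude");
// -> "http://www.w3.org/2003/01/geo/wgs84_pos#latitude"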

In fact, most everyone that we know that uses and deploys RDFa really, really likes CURIEs. Add that to the millions of pages on the Web that use CURIEs correctly and the growing support for them (by deployment numbers) and you would think that this new, useful technology is a done deal.

So, Who Thinks That CURIEs Are Bad?

Ian Hickson, the lead editor of the HTML5 specification, really doesn’t like CURIEs and is leading the charge to remove them from HTML+RDFa. You can read his reasoning here, but in a nutshell, here are his arguments and some quick rebuttals to each point:

  • “People might carelessly copy-paste RDFa markup.” – People can do this with HTML5, JavaScript, and XML too, quite easily. Just because people may be careless is not a reason to rip out a technology that is currently being used correctly. Besides, there are tools to tell you when something goes wrong with your RDFa.
  • “CURIEs are too difficult for people to understand.” – I have a hard time believing that Web developers are so thick that they can’t understand how to use a CURIE. Web developers are smart and most of them get stuff right, if not the first time, soon thereafter. Sure, some people will get it wrong at first, but that’s a part of the learning process. I would imagine that many Web developers’ first real-world HTML page would be riddled with issues after the first cut – that’s why we have tools to let us know when things go wrong.
  • “Other technologies don’t use prefixes.” – CSS, C++, JavaScript, XHTML – all of these use re-bindable variables (aka: prefixes). But let’s assume that they didn’t, that still doesn’t mean that prefixes are bad. That’s like saying – your locomotion contraption has this new fangled thing called a “wheel” – none of our horses or riding livestock has this mechanism – therefore it must be bad.
  • “People may forget to define prefixes.” – Yes, and I may forget to put my pants on before I go out of the house. I’ll find out I’ve made a mistake soon enough because there are real-world consequences for making mistakes like this (especially in the winter-time). If someone forgets to define a prefix, their data won’t show up and their search ranking will stay low. That is, if they don’t first use the tools given to them to make sure that they did the right thing.
  • “CURIEs are unnecessary, people don’t have a problem typing out full URLs.” – I don’t know about you, but I have a very big problem remembering this URL: http://www.w3.org/2003/01/geo/wgs84_pos#latitude vs. the CURIE for that URL – geo:latitude. It was this last one, particularly, that pegged my irony meter. I’ll explain below.

Now, I’m not saying that these dangers are not out there. They certainly are – language design is always a balance between risk and reward, trade-offs in design purity vs. trade-offs in practicality. The people that are building and improving RDFa believe that we have struck the right balance and we should keep CURIEs, even if there is a chance that someone, somewhere will mess it up at some point.

However, I’d like to point out one fatal flaw in the argument that full URLs are not difficult to work with and that CURIEs are not necessary – it’s based upon some “research” that Ian points to in his last point. It’ll be covered in a blog post that I’ll post tomorrow.

Editorial: The follow-up blog post discussing the irony of the anti-CURIE stance is now available.

JSON-LD: Cycle Breaking and Object Framing

We’ve been doing a bit of research at Digital Bazaar on how to best meld graph-based object models with what most developers are familiar with these days – JSON-based object programming (aka: associative-array based object models). We want to enable developers to use the same data models that they use in JavaScript today, but to work with arbitrary graph data.

This is an issue that we think is at the heart of why RDF has not caught on as a general data model – the data is very difficult to work with in programming languages. There is no native data structure that is easy to work with without a complex set of APIs.

When a JavaScript author gets JSON-LD from a remote source, the graph that the JSON-LD expresses can take a number of different but valid forms. That is, the information expressed by the graph can be identical, but each graph can be structured differently.

Think of these two statements:

The Q library contains book X.
Book X is contained in the Q library.

The information that is expressed in both sentences is exactly the same, but the structure of each sentence is different. Structure is very important when programming. When you write code, you expect the structure of your data to not change.

However, when we program using graphs, the structure is almost always unknown, so a mechanism to impose a structure is required in order to help the programmer be more productive.

The way the graph is represented is entirely dependent on the algorithm used to normalize and the algorithm used to break cycles in the graph. Consider the following example, which is a graph with three top-level objects – a library, a book and a chapter. Each of the items is related to one another, thus the graph can be expressed in JSON-LD in a number of different ways:

{
   "#":
   {
      "dc": "http://purl.org/dc/elements/1.1/",
      "ex": "http://example.org/vocab#"
   },
   "@":
   [
      {
         "@": "http://example.org/test#library",
         "a": "ex:Library",
         "ex:contains":  ""
      },
      {
         "@": "",
         "a": "ex:Book",
         "dc:contributor": "Writer",
         "dc:title": "My Book",
         "ex:contains": ""
      },
      {
         "@": "http://example.org/test#chapter",
         "a": "ex:Chapter",
         "dc:description": "Fun",
         "dc:title": "Chapter One"
      }
   ]
}

The JSON-LD graph above could also be represented like so:

{
   "#":
   {
      "dc": "http://purl.org/dc/elements/1.1/",
      "ex": "http://example.org/vocab#"
   },
   "@": "http://example.org/test#library",
   "a": "ex:Library",
   "ex:contains":
   {
      "@": "",
      "a": "ex:Book",
      "dc:contributor": "Writer",
      "dc:title": "My Book",
      "ex:contains":
      {
         "@": "http://example.org/test#chapter",
         "a": "ex:Chapter",
         "dc:description": "Fun",
         "dc:title": "Chapter One"
      }
   }
}

Both of the examples above express the exact same information, but the graph structure is very different. If a developer can receive both of the objects from a remote source, how do they ensure that they only have to write one code path to deal with both examples?

That is, how can a developer reliably write the following code:

// print all of the books and their corresponding chapters
var library = jsonld.toObject(jsonLdText);
for(var bookIndex = 0; bookIndex < library["ex:contains"].length;
    bookIndex++)
{
   var book = library["ex:contains"][bookIndex];
   var bookTitle = book["dc:title"];
   for(var chapterIndex = 0; chapterIndex < book["ex:contains"].length;
       chapterIndex++)
   {
      var chapter = book["ex:contains"][chapterIndex];
      var chapterTitle = chapter["dc:title"];
      console.log("Book: " + bookTitle + " Chapter: " + chapterTitle);
   }
}

The answer boils down to ensuring that the data structure that is built for the developer from the JSON-LD is framed in a way that makes property access predictable. That is, the developer provides a structure that MUST be filled out by the JSON-LD API. The working title for this mechanism is "Cycle Breaking and Object Framing", since both mechanisms must be operable in order to solve this problem.

The developer would specify a Frame for their language-native object like the following:

{
   "#": {"ex": "http://example.org/vocab#"},
   "a": "ex:Library",
   "ex:contains":
   {
      "a": "ex:Book",
      "ex:contains":
      {
         "a": "ex:Chapter"
      }
   }
}

The object frame above asserts that the developer expects to get a library containing one or more books containing one or more chapters returned to them. This ensures that the data is structured in a way that is predictable and only one code path is necessary to work with graphs that can take multiple forms. The API call that they would use would look something like this:

var library = jsonld.toObject(jsonLdText, objectFrame);
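
To make the idea more concrete, here is a toy sketch of type-based framing – it only matches on the "a" (type) values and ignores identifiers, arrays, cycles and missing data, all of which the real algorithm under discussion has to handle:

// toy sketch: walk the frame, find a flattened object whose type ("a")
// matches each level of the frame, and embed the matches to produce a
// predictable, nested structure
function frameByType(flatObjects, frame)
{
   // find the first flattened object whose type matches this frame level
   var match = null;
   for(var i = 0; i < flatObjects.length; i++)
   {
      if(flatObjects[i]["a"] === frame["a"])
      {
         match = flatObjects[i];
         break;
      }
   }
   if(match === null)
   {
      return null;
   }

   // shallow-copy the match so that the input objects are not modified
   var framed = {};
   for(var key in match)
   {
      framed[key] = match[key];
   }

   // recurse into each sub-frame and embed the framed result
   for(var prop in frame)
   {
      if(prop !== "a" && prop !== "#" && typeof frame[prop] === "object")
      {
         framed[prop] = frameByType(flatObjects, frame[prop]);
      }
   }
   return framed;
}

// e.g. frameByType(flattenedDoc["@"], objectFrame) for the examples above,
// where flattenedDoc is the first (flat) JSON-LD example in this post

Running the flattened example from the top of this post through this sketch with the frame above yields a nested structure much like the second example, which is the point: the developer gets one predictable shape to program against, no matter how the graph arrived.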

The discussion on this particular issue is continued on the JSON-LD mailing list.