The Need for Data-Driven Standards

Summary: The way we create standards used by Web designers and authors (e.g. HTML, CSS, RDFa) needs to draw on more publicly-available data about how each standard is actually used in the field. The Data-Driven Standards Community Group at the W3C is being created to accomplish this goal – please get an account and join the group.

Over the past month, two significant events have demonstrated that the way we design languages for the Web could be improved. The most recent was the swift removal of the <time> element from HTML5, followed by its even swifter re-introduction. The other was a claim by Google that Web authors were getting a very specific type of RDFa markup wrong 30% of the time, which ran counter to the RDFa Community's experience. Neither side backed up its point with publicly-available usage data. Nevertheless, the RDFa Community decided to introduce RDFa Lite on the assumption that Google's private analysis drew the correct conclusions, with the understanding that the group would verify the claims, somehow, before RDFa Lite became an official standard.

Here is what is wrong with the current state of affairs: No public data or analysis methodologies were presented by people on either side of the debate, and that’s just bad science.

Does It Do What You Want It to Do?

How do you use science to design a Web standard such as HTML or RDFa? Let’s look, first, at the kinds of technologies that we employ on the Web.

A dizzying array of technologies were just leveraged to show this Web page to you. To take bits off of a hard drive and blast them toward you over the Web at nearly the speed of light is an amazing demonstration of human progress over the last century. The more you know about how the Web fits together, the more amazing it is that it works with such dependability – the same way for billions of people around the world, each and every day.

There are really two sides to the question of how well the Web “works”. The first side questions how well it works from a technical standpoint. That is, does the page even get to you? What is the failure rate of the hardware and software relaying the information to you? The other side asks the question of how easy it was for someone to author the page in the first place. The first side has to do more with back-end technologies, the second with front-end technologies. It is the design of these front-end technologies that this blog post will be discussing today.

Let’s take a look at the technologies that went into delivering this page to you and try to put them into two broad categories: back-end and front-end.

Here are the two (incomplete) lists of technologies that are typically used to get a Web page to you:

Back-end Technologies: Ethernet, 802.11, TCP/IP, HTTP, TrueType, JavaScript*, PHP*
Front-end Technologies: HTML, CSS, RDFa, Microdata, Microformats

* It is debatable whether JavaScript and PHP belong in the front-end category, but since you can write a Web page without them, let’s keep things simple and ignore them for now.

Back-end technologies, such as TCP/IP, tend to be more prescriptive and thus easier to test. The technology either works reliably, or it doesn’t. There is very little wiggle room in most back-end technology specifications. They are fairly strict in what they expect as input and output.

Front-end technologies, such as HTML and RDFa, tend to be more expressive and thus much more difficult to test. That is, the designers of a language know the intent behind particular elements and attributes, but that intent can be misinterpreted by the people who use the technology. Much like the English language can be butchered by people that don’t write good, the same principle applies to front-end technologies. An example of this is the rev attribute in HTML – experts know its purpose, but it has traditionally not been used correctly (or at all) by Web authors.
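With crawl data in hand, even a question like "how do authors actually use rev?" becomes a small scripting exercise. The sketch below is a minimal illustration in Python using the standard library's HTML parser; the sample pages are invented stand-ins for real crawl output, not actual data:

```python
from collections import Counter
from html.parser import HTMLParser

class RevTally(HTMLParser):
    """Tallies values of the rev attribute on <a> and <link> elements."""
    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag in ("a", "link"):
            for name, value in attrs:
                if name == "rev" and value:
                    self.counts[value.strip().lower()] += 1

def tally_rev(html_documents):
    """Counts rev attribute values across a corpus of HTML documents."""
    tally = RevTally()
    for doc in html_documents:
        tally.feed(doc)
    return tally.counts

# Hypothetical pages standing in for real crawl data:
pages = [
    '<a href="/manu" rev="made">page author</a>',
    '<link rev="canonical" href="http://example.com/x">',
    '<a href="/home" rel="home">home</a>',
]
print(tally_rev(pages))  # Counter({'made': 1, 'canonical': 1})
```

Run over a real crawl, a tally like this would show at a glance whether rev is used at all, and whether the values in the wild match the values the specification intended.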

So, how do we make sure that people are using the front-end technologies in the way that they were intended to be used?

Data Leads the Way

Many front-end technology standards, like HTML and RDFa, are frustrating to standardize because the language designers rarely have a full picture of how the technology will be used in the future. There is always an intent to how the technology should be used, but how it is used in the field can deviate wildly from the intent.

During standards development, it is common to have a gallery of people yelling “You’re doing it wrong!” from the sidelines. More frustrating for everyone involved, some of them may be right, but there is no way to tell which ones are and which ones are not. This is one of the places where the scientific process can help us. Data-driven science has a successful track record of answering questions that are difficult for language designers, as individuals with biases, to answer on their own. Data can shed light on a situation when your community of authors cannot.

While the first draft of a front-end language won’t be able to fully employ data-driven design, most Web standards go through numerous revisions. It is during the design of those later revisions that good usage data from the Web can steer the language in a better direction.

Unfortunately, good data is exactly what is missing from most of the front-end technology standardization work that all of us do. The whole <time> element fiasco could have been avoided if the editor of the HTML5 specification had just pointed to a set of public data that showed, conclusively, that very few people were using the element. The same assertion holds true for the changes to the property attribute in RDFa. If Google could have just pointed us to some solid, publicly-available data, it would have been easy for us to make the decision to extend the property attribute. Neither happened because we just don’t have the infrastructure necessary to do good data-driven design, and that’s what we intend to change.

Doing a Web-scale Crawl

The problem with getting good usage data on Web technologies is that none of us have the crawling infrastructure that Google or Microsoft have built over the years. A simple solution would be to leverage that large corporate infrastructure to continuously monitor the Web for important changes that impact Web authors. We have tried to get data from the large search companies over the years, but getting data from large corporations is problematic for at least three reasons. First, there are legal hurdles that people both inside and outside the organization must overcome before any data can be published publicly, and clearing those hurdles often takes months. Second, some companies see the data as a competitive advantage and are unwilling to publish it. Third, the raw data and the methodology are not always released – often only the findings are – which puts the public in the awkward position of having to trust that a corporation has its best interests in mind.

Thankfully, new custom search services have recently launched that allow us to do Web-scale crawls. We have an opportunity now to create excellent, timely crawl data that can be published publicly. One of these new services is 80legs, which performs full, customized crawls of the Web. The other is Common Crawl, a non-profit service for researchers that indexes roughly 5 billion Web pages. These two places are where we are going to start asking the questions that we should have been asking all along.

What Are We Looking For?

To kick-start the work, there is interest in answering the following questions:

  1. How many Web pages are using the <time> element?
  2. How many Web pages are using ARIA accessibility attributes?
  3. How many Web pages are using the <article> and <aside> elements?
  4. How many sites are using OGP vs. markup?
  5. How many Web pages are using the RDFa property attribute incorrectly?
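Most of the questions above reduce to counting once crawl data is available. As a hedged sketch of the methodology – in Python, with a couple of invented sample pages in place of real crawl output – the following counts how many pages use each feature at least once. (Detecting *incorrect* use of the RDFa property attribute, as in question 5, would need deeper analysis; this sketch only measures presence.)

```python
from collections import Counter
from html.parser import HTMLParser

class FeatureFinder(HTMLParser):
    """Records which features of interest appear in a single page."""
    TAGS = {"time", "article", "aside"}

    def __init__(self):
        super().__init__()
        self.found = set()

    def handle_starttag(self, tag, attrs):
        if tag in self.TAGS:
            self.found.add("<%s>" % tag)
        for name, _value in attrs:
            if name == "role" or name.startswith("aria-"):
                self.found.add("aria")
            elif name == "property":
                self.found.add("rdfa-property")

def survey(pages):
    """For each feature, count the pages that use it at least once."""
    totals = Counter()
    for html in pages:
        finder = FeatureFinder()
        finder.feed(html)
        totals.update(finder.found)
    return totals

# Hypothetical pages standing in for real crawl data:
pages = [
    '<article><time datetime="2011-11-11">today</time></article>',
    '<div role="navigation" aria-label="Main"><aside>hi</aside></div>',
    '<span property="dc:title">A Title</span>',
]
print(survey(pages))
```

Per-page deduplication (the `found` set) matters here: the questions ask how many *pages* use a feature, not how many times it appears.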

Getting answers to these questions will allow the front-end technology developers to make more educated decisions about the standards that all of us will end up using. More importantly, having somewhere that we can all go and ask these questions is vitally important to the standards that will drive the future of the Web.

The Data-Driven Standards Community Group

I propose that we start a Data-Driven Standards Community Group at the W3C. The Data-Driven Standards Community Group will focus on researching, analyzing, and publicly documenting current usage patterns on the Internet. Inspired by the Microformats Process, the goal of this group is to inform standards development with real-world data. The group will collect and report data from large Web crawls, produce detailed reports on protocol usage across the Internet, document yearly changes in usage patterns, and promote findings that demonstrate, based on publicly available data, that the current direction of a particular specification should change. All data, research, and analysis will be made publicly available to ensure the scientific rigor of the findings. The group will be a collection of search engine companies, academic researchers, hobbyists, Web authors, protocol designers, and specification editors in search of data that will guide the Internet toward a brighter future.

If you support this initiative, please go to the W3C Community Groups page and join the group to show your support.
