Web Data Commons Launches

Some interesting numbers were just published regarding Microformats, RDFa and Microdata adoption as of October 2010 (fifteen months ago). The source of the data is the new CommonCrawl dataset, which is being analyzed by the Web Data Commons project. They sampled 1% of the 40-terabyte data set (1.3 million pages) and counted the following total number of statements (triples) made by pages in the sample set:

Markup Format    Statements
Microformats     30,706,071
RDFa              1,047,250
Microdata            17,890
Total            31,771,211

Based on this preliminary data, of the structured data statements found in the sample, about 96.6% were Microformats, 3.3% were RDFa, and 0.06% were Microdata. Microformats is the clear winner as of October 2010, with the vast majority of the data consisting of markup of people (hCard) and their relationships with one another (XFN). I also did a quick calculation of the percentage of the 1.3 million URLs that contain Microformats, RDFa and Microdata markup:

Format           Percentage of Pages
Microformats     88.9%
RDFa             12.1%
Microdata         0.09%
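
For anyone who wants to double-check the statement shares, they follow directly from the counts in the first table. Here is a minimal Python sketch of that arithmetic; the dictionary and function names are my own and have nothing to do with the Web Data Commons tooling. Rounded rather than truncated, the shares come out to about 96.6%, 3.3% and 0.06%.

```python
# Statement (triple) counts from the 1% sample, copied from the first table.
statements = {
    "Microformats": 30_706_071,
    "RDFa": 1_047_250,
    "Microdata": 17_890,
}

# Sum of the three formats; this should match the table's Total row.
total = sum(statements.values())


def share(count: int, total: int) -> float:
    """Return count as a percentage of total."""
    return 100.0 * count / total


# Percentage of all statements contributed by each format.
shares = {fmt: share(n, total) for fmt, n in statements.items()}

for fmt, pct in shares.items():
    print(f"{fmt:<12} {statements[fmt]:>12,} {pct:6.2f}%")
print(f"{'Total':<12} {total:>12,}")
```

The same calculation cannot be repeated for the page percentages here, because the per-format page counts are not listed in this post.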

These findings deviate wildly from what Yahoo found around the same time. Additionally, the claim that 88.9% of all pages on the Web contain Microformats markup is wishful thinking, much as I’d love to see that happen.

Two things could have caused these numbers to be off. The first is that the Web Data Commons parsers are generating false positives or negatives, resulting in bad statement counts. A quick check of the data, which they released in full, will reveal whether this is true. The second is that the Yahoo study was flawed in the same way, but we may never know, because Yahoo will probably never release its data set or parsers for public viewing. Comparing the RDFa usage numbers (3.2% for the Yahoo study vs. 12.1% for Web Data Commons) and the Microformats usage numbers (roughly 5% for the Yahoo study vs. 88.9% for Web Data Commons), I find the Web Data Commons numbers far more suspect. Data publishing in HTML is taking off, but it’s not that popular yet.
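
To put the gap between the two studies in perspective, here is a rough sketch of the ratios between their figures. The inputs are the percentages quoted above; the Yahoo Microformats figure is only "roughly 5%", so treat the result as an order of magnitude rather than a precise number. The variable names are mine.

```python
# Share of pages containing each format, as quoted in this post.
yahoo = {"RDFa": 3.2, "Microformats": 5.0}  # Microformats is "roughly 5%"
wdc = {"RDFa": 12.1, "Microformats": 88.9}

# How many times larger the Web Data Commons figure is than Yahoo's.
ratios = {fmt: wdc[fmt] / yahoo[fmt] for fmt in yahoo}

for fmt, ratio in ratios.items():
    print(f"{fmt}: {wdc[fmt]}% vs. {yahoo[fmt]}% ({ratio:.1f}x)")
```

That works out to a bit under four times Yahoo’s RDFa figure and roughly eighteen times its Microformats figure, which is why the Microformats number in particular looks off.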

I would be wary of doing anything with these preliminary findings until the Web Data Commons folks release something more final. Nevertheless, it is interesting as a data point, and I’m looking forward to the full analysis that these researchers will do in the coming months.

2 Comments

  1. Dan Brickley says:

    Great to see numbers grounded in public datasets. However re Microdata it’s worth stressing that there is now a *lot* of the stuff out there, largely due to its use in schema.org dating from June 2011. The CommonCrawl collection analysed here (as you say) predates this by some months. Until schema.org, in my experience Microdata was only of interest to standards nerds such as ourselves.

    I mention this not to frame things in terms of which format is ‘winning’ some notional race; rather to emphasise that there’s enough of each out there now that we’d collectively do well to find ways of abstracting over the differences. The work at http://www.w3.org/blog/SW/2012/01/12/drafts-published-by-the-w3c-html-data-task-force-html-data-guide-and-microdata-to-rdf-transform/ is a great start there, but hopefully we’ll also see front-end .js libraries that also happily accept a variety of encodings, protecting developers from low-level differences.

  2. Note this comment:
    “..currently adoption is in the order of thousands of sites and billions of pages now ..”

    From:
    http://ontolog.cim3.net/cgi-bin/wiki.pl?ConferenceCall_2011_12_01 .

    I think they’ve stopped counting re. Microdata due to effects of mass Schema.org adoption :-)
