Some interesting numbers were just published regarding Microformats, RDFa and Microdata adoption as of October 2010 (fifteen months ago). The source of the data is the new CommonCrawl dataset, which is being analyzed by the Web Data Commons project. They sampled 1% of the 40 Terabyte data set (1.3 million pages) and came up with the following number of total statements (triples) made by pages in the sample set:
Based on this preliminary data, of the structured data on the Web: 96.6% of it was Microformats, 3.2% of it was RDFa, and 0.05% of it was Microdata. Microformats is the clear winner in October 2010, with the vast majority of the data consisting of markup of people (hCard) and their relationships with one another (xfn). I also did a quick calculation on percentage of the 1.3 million URLs that contain Microformats, RDFa and Microdata markup:
|Format||Percentage of Pages|
These findings deviate wildly from the findings by Yahoo around the same time. Additionally, the claim that 88.9% of all pages on the Web contain Microformats markup, even though I’d love to see that happen, is wishful thinking.
There are a few things that could have caused these numbers to be off. The first is that the Web Data Commons’ parsers are generating false positives or negatives, resulting in bad statement counts. A quick check of the data, which they released in full, will reveal if this is true. The other cause could be that the Yahoo study was flawed in the same way, but we may never know if that is true because they will probably never release their data set or parsers for public viewing. By looking at the RDFa usage numbers (3.2% for the Yahoo study vs. 12.1% for Web Data Commons) and the Microformats usage numbers (roughly 5% for the Yahoo study vs. 88.9% for Web Data Commons), the Web Data Commons numbers seem far more suspect. Data publishing in HTML is taking off, but it’s not that popular yet.
I would be wary of doing anything with these preliminary findings until the Web Data Commons folks release something more final. Nevertheless, it is interesting as a data point and I’m looking forward toward the full analysis that these researchers do in the coming months.