A few weeks ago, we announced the launch of the Data Driven Standards Community Group at the World Wide Web Consortium (W3C). The focus is on researching, analyzing and publicly documenting current usage patterns on the Internet. Inspired by the Microformats Process, the goal of this group is to enlighten standards development with real-world data. This group will collect and report data from large Web crawls, produce detailed reports on protocol usage across the Internet, document yearly changes in usage patterns and promote findings that demonstrate that the current direction of a particular specification should be changed based on publicly available data. All data, research, and analysis will be made publicly available to ensure the scientific rigor of the findings. The group will be a collection of search engine companies, academic researchers, hobbyists, protocol designers and specification editors in search of data that will guide the Internet toward a brighter future.
We launched the group with the intent of regularly analyzing the Common Crawl data set. The goal of Common Crawl is to build and maintain an open crawl of the web that can be used by researchers, educators and innovators. The crawl currently contains roughly 40TB of compressed data, around 5 billion web pages, and is hosted on Amazon’s S3 service. To analyze the data, you write a small piece of analysis software that is then applied to all of the data using Amazon’s Elastic MapReduce service.
I spent a few hours a couple of nights ago and wrote the analysis software, which is available as open source on GitHub. This blog post won’t go into how the software was written; it covers the methodology and the data that resulted from the analysis. There were three goals that I had in mind when performing this trial run:
- Quickly hack something together to see if Microformats, RDFa and Microdata analysis was feasible.
- Estimate the cost of performing a full analysis.
- See if the data correlates with the Yahoo! study or the Web Data Commons project.
The analysis software was executed against a very small subset of the Common Crawl data set. The directory that was analyzed (/commoncrawl-crawl-002/2010/01/07/18/) contained 1,273 ARC files, each weighing in at around 100MB, for around 124GB of data processed. It took 8 EC2 machines a total of 14 hours and 23 minutes to process the data, for a grand total of 120 CPU hours utilized.
The analysis software streams each file from disk, decompresses it and breaks each file into the data that was retrieved from a particular URL. Each retrieved document is checked to ensure that it is an HTML or XHTML file; if it isn’t, it is skipped. If it is, an HTML4 DOM is constructed from the file using a very forgiving tag soup parser. At that point, CSS selectors are executed on the resulting DOM to search for HTML elements that contain attributes for each language. For example, the CSS selector “[property]” is executed to retrieve a count of all RDFa property attributes on the page. The same was performed for Microdata and Microformats. You can see the exact CSS queries used in the source file for the markup language detector.
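To make the counting step concrete, here is a minimal sketch in Python. It is not the detector from the repository: it uses the standard library's `html.parser` instead of a tag soup DOM with CSS selectors, and it counts start tags that carry a given attribute, which is what an attribute-presence selector such as `[property]` would return. The attribute and class lists, and the names `MarkupCounter` and `count_markup`, are assumptions made for illustration; the real queries are the ones in the detector's source file.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical lists for illustration only; the actual detector may
# query a different set of attributes and class names.
RDFA_ATTRS = {"property", "typeof", "about", "resource", "vocab", "prefix", "datatype"}
MICRODATA_ATTRS = {"itemscope", "itemtype", "itemprop", "itemid", "itemref"}
MICROFORMAT_CLASSES = {"vcard", "vevent", "hentry", "hreview"}


class MarkupCounter(HTMLParser):
    """Counts start tags that carry structured-data attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases attribute names; valueless attributes
        # such as itemscope arrive with a value of None.
        for name, value in attrs:
            if name in RDFA_ATTRS:
                self.counts["rdfa:" + name] += 1
            elif name in MICRODATA_ATTRS:
                self.counts["microdata:" + name] += 1
            elif name == "class" and value:
                for cls in value.split():
                    if cls in MICROFORMAT_CLASSES:
                        self.counts["microformats:" + cls] += 1


def count_markup(html):
    """Return a Counter of structured-data attributes found in one page."""
    parser = MarkupCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


sample = """
<div vocab="http://schema.org/" typeof="Person">
  <span property="name">Alice</span>
  <span property="jobTitle">Editor</span>
</div>
<div itemscope itemtype="http://schema.org/Person">
  <span itemprop="name">Bob</span>
</div>
<div class="vcard"><span class="fn">Carol</span></div>
"""

counts = count_markup(sample)
for key in sorted(counts):
    print(key, counts[key])
```

Summing these per-page counters across every document in an ARC file would give per-attribute totals of the kind reported in this post.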
Here are the types of documents that we found in the sample set:
|HTML or XHTML||10,598,873||100%|
The numbers above clearly deviate from both the Yahoo! study and the Web Data Commons project. The problem with our data set was that it was probably too small to really tell us anything useful, so please don’t use the numbers in this blog post for anything of importance.
The analysis software also counted the RDFa 1.1 attributes:
datatype attributes have a usage pattern that is not very surprising. I didn’t check for combinations of attributes like
content on the same element due to time constraints. I only had one night to figure out how to write the software, write it and run it. This sort of co-attribute detection should be included in future analysis of the data. What was surprising was that the
vocab attributes were used somewhere out there before the features were introduced into RDFa 1.1, but not to the degree that it would be of concern to the people designing the RDFa 1.1 language.
The Good and the Bad
The good news is that it does not take a great deal of effort to write a data analysis tool and run it against the Common Crawl data set. I’ve published both our methodology and findings such that anybody could re-create them if they so desired. So, this is good for open Web Science initiatives everywhere.
However, there is bad news. It cost around $12.46 USD to run the test using Amazon’s Elastic MapReduce system. The Common Crawl site states that they believe that it would cost roughly $150 to process the entire data set, but my calculations show a very different picture when you start doing non-trivial analysis. Keep in mind that only 124GB of a total 40TB of data was processed. That is, only about 0.31% of the data set was processed for $12.46. At that rate, processing the entire Common Crawl corpus would cost around $4,020 USD. That is clearly far more than any individual would want to spend, but still very much within the reach of small companies and research groups.
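The extrapolation can be reproduced with a few lines of Python. This sketch assumes that cost scales linearly with the amount of data processed and that 1TB is counted as 1,000GB, which is the convention that matches the 0.31% figure; the variable names are purely illustrative.

```python
# Figures taken from the trial run described in this post.
SAMPLE_GB = 124          # data processed in the trial run
CORPUS_TB = 40           # approximate size of the full corpus
SAMPLE_COST_USD = 12.46  # cost of the trial run

# Assumption: decimal units (1TB = 1,000GB).
GB_PER_TB = 1000

corpus_gb = CORPUS_TB * GB_PER_TB
fraction = SAMPLE_GB / corpus_gb

# Assumption: cost grows linearly with the amount of data processed.
cost_per_gb = SAMPLE_COST_USD / SAMPLE_GB
full_cost = cost_per_gb * corpus_gb

print(f"fraction processed: {fraction:.2%}")   # 0.31%
print(f"cost per GB:        ${cost_per_gb:.2f}")  # $0.10
print(f"full corpus cost:   ${full_cost:,.0f}")   # $4,019
```

The linear assumption is the weak point: a cheaper instance type, spot pricing, or a lighter parser would change the cost per GB and therefore the total.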
Funding a full analysis of the entire Common Crawl dataset seemed within reach, but after discovering what the price would be, I’m having second thoughts about performing the full analysis without a few other companies or individuals pitching in to cover the costs.
Potential Ways Forward
We may have run the analysis in a way that caused the price to far exceed what was predicted by the Common Crawl folks. I will be following up with them to see if there is a trick to reducing the cost of the EC2 instances.
One option would be to bid a very low price for Amazon EC2 Spot Instances. The downside is that processing would happen only when nobody else was willing to outbid us, so the processing job could take weeks. Another approach would have us use regular expressions to process each document instead of building an in-memory HTML DOM for it. Regular expressions would be able to detect RDFa, Microdata and Microformats using far less CPU than the DOM-based approach. Yet another approach would have an individual or company with $4K to spend on this research project fund the analysis of the full data set.
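As a rough idea of what the regular expression approach could look like, here is a minimal Python sketch. Nothing like this exists in the analysis software yet; the patterns, the attribute and class names they look for, and the `detect_markup` helper are all assumptions made for illustration.

```python
import re

# Hypothetical patterns: each one looks for an opening tag that carries
# one of a few well-known attributes or class names. The lists are
# illustrative and far from complete.
PATTERNS = {
    "rdfa": re.compile(
        r"<[a-z][^>]*\s(?:property|typeof|vocab)\s*=", re.IGNORECASE
    ),
    "microdata": re.compile(
        r"<[a-z][^>]*\s(?:itemscope|itemprop|itemtype)\b", re.IGNORECASE
    ),
    "microformats": re.compile(
        r"""<[a-z][^>]*\sclass\s*=\s*["'][^"']*\b(?:vcard|vevent|hentry)\b""",
        re.IGNORECASE,
    ),
}


def detect_markup(html):
    """Report which markup languages appear to be used in a raw page."""
    return {name: bool(p.search(html)) for name, p in PATTERNS.items()}


rdfa_page = '<div vocab="http://schema.org/" typeof="Person"></div>'
microdata_page = '<div itemscope itemtype="http://schema.org/Person"></div>'
hcard_page = '<p class="intro vcard">Carol</p>'
plain_page = '<p data-property="x">no structured data here</p>'

print(detect_markup(rdfa_page))
print(detect_markup(plain_page))
```

The trade-off is accuracy: these patterns scan raw text, so markup that appears inside comments or scripts would be counted as well, and unusual quoting can be missed. Any numbers produced this way would need to be checked against the DOM-based results on a sample before being trusted.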
Overall, I’m excited that doing this sort of analysis is becoming available to those of us without access to Google or Facebook-like resources. It is only a matter of time before we will be able to do a full analysis on the Common Crawl data set. If you are reading this and think you can help fund this work, please leave a comment on this blog, or e-mail me directly at: firstname.lastname@example.org.