JSON-LD Best Practice: Context Caching

An important aspect of systems design is understanding the trade-offs you are making in your system. These trade-offs are influenced by a variety of factors: latency, safety, expressiveness, throughput, correctness, redundancy, etc. Systems, even ones that do effectively the same thing, prioritize their trade-offs differently. There is rarely “one perfect solution” to any problem.

It has been asserted that Unconstrained JSON-LD Performance Is Bad for API Specs. In that article, Dr. Chuck Severance asserts that JSON-LD parsing is 2,000 times more costly in real time and 70 times more costly in CPU time than pure JSON processing. Sounds bad, right? So, let’s unpack these claims and see where the journey takes us.

TL;DR: Don’t ever put a system that uses JSON-LD into production without thinking about your JSON-LD context caching strategy. If you want to use JSON-LD, your strategy should probably be one of two things: cache common contexts and do JSON-LD processing, or skip JSON-LD processing but enforce an input format on your clients, via something like JSON Schema, so they’re still submitting valid JSON-LD.

The Performance Test Suite

After reading the article yesterday, Dave Longley (the co-inventor of JSON-LD and the creator of the jsonld.js and php-json-ld libraries) put together a test suite that effectively re-creates the test that Dr. Severance ran. It processes a JSON-LD Schema.org Person object in a variety of ways. We chose to change the object because the one that Dr. Severance chose is not necessarily a common use of JSON-LD (due to the extensive use of CURIEs) and we wanted a more realistic example (that uses terms). The suite first tests pure JSON processing, then JSON-LD processing with a cached schema.org context, and then JSON-LD processing with an uncached schema.org context. We ran the tests using the largely unoptimized JSON-LD processors written in PHP and Node vs. a fully optimized JSON processor written in C. The tests used PHP 5.6, PHP 7.0, and node.js 4.2.

So, with Dr. Severance’s data in hand and our shiny new test suite, let’s do some science!

The Results

The raw output of the test suite is available, but we’ve transformed the data into a few pretty pictures below.

The first graph shows the wall time performance hit using basic JSON processing as the baseline (shown as 1x). It then compares that against JSON-LD processing using a cached context and JSON-LD processing using an uncached context. Wall time in this case means the time taken if you were to start a stop watch from the start of the test to the end of the test. Take a look at the longest bars in this graph:

As we can see, there is a significant performance hit any way you look at it. Wall time spent on processing JSON-LD with an uncached context in PHP 5 is 7,551 times slower than plain JSON processing! That’s terrible! Why would anyone choose to use JSON-LD with such a massive performance hit!

Even when you take out the time spent just sitting around, the CPU cost (for running the network code) is still pretty terrible. Take a look at the longest bars in this graph:

CPU processing time for JSON vs. JSON-LD with an uncached context in PHP 5 is 260x slower. For PHP 7 it’s 239x slower. For Node 4.2, it’s 140x slower. Sounds pretty dismal, right? JSON-LD is a performance hog… but hold up, let’s examine why this is happening.

JSON-LD adds meaning to your data. The way it does this is by associating your data with something called a JSON-LD context that has to be downloaded by the JSON-LD processor and applied to the JSON data. The context allows a system to formally apply a set of rules to data to determine whether a remote system and your local system are “speaking the same language”. It removes ambiguity from your data so that you know that when the remote system says “homepage” and your system says “homepage”, they mean the same thing. Downloading things across a network is an expensive process, orders of magnitude slower than having something loaded in a CPU’s cache and executed without ever having to leave home sweet silicon home.

So, what happens when you tell the program to go out to the network and fetch a document from the Internet for every iteration of a 1,000-iteration for loop? Your program takes forever to execute because it spends most of its time in network code and waiting for I/O from the remote site. So, this is lesson number 1. JSON-LD is not magic. Things that are slow (because of physics) are still slow in JSON-LD.
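To make that loop concrete, here is a minimal sketch in Node. It is not the test suite described above, and it does not use the jsonld.js API. The `fetchContext` function is a hypothetical stand-in that only counts how often the "network" would have been hit, and `withCache` is an illustrative wrapper that remembers each context by URL.

```javascript
// Hypothetical stand-in for a network fetch: it only counts calls and
// returns a tiny context, so the sketch runs without touching the network.
let networkFetches = 0;
function fetchContext(url) {
  networkFetches += 1;
  return { "@context": { name: "http://schema.org/name" } };
}

// Wrap any loader so each URL is fetched once and then served from memory.
function withCache(loader) {
  const cache = new Map();
  return function cachedLoader(url) {
    if (!cache.has(url)) {
      cache.set(url, loader(url));
    }
    return cache.get(url);
  };
}

// Ask for the same context on every iteration and report how many
// fetches that cost.
function run(loader, iterations) {
  networkFetches = 0;
  for (let i = 0; i < iterations; i++) {
    loader("http://schema.org/");
  }
  return networkFetches;
}

const uncachedFetches = run(fetchContext, 1000);
const cachedFetches = run(withCache(fetchContext), 1000);
console.log({ uncachedFetches, cachedFetches });
```

The uncached loop pays for 1,000 fetches, while the wrapped loader pays for one and answers the other 999 from memory.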

Best Practice: JSON-LD Context Caching

Accessing things across a network of any kind is expensive, which is why there are caches. There are primary, secondary, and tertiary caches in our CPUs, there are caches in our memory controllers, there are caches on our storage devices, there are caches on our network cards, there are caches in our routers, and yes, there are even caches in our JSON-LD processors. Use those caches because they provide a huge performance gain. Let’s look at that graph again, and see how much of a performance gain we get by using the caches (look at the second longest bars in both graphs):

CPU processing time for JSON vs. JSON-LD with a cached context in PHP 5 is 67x slower. For PHP 7 it’s 35x slower. For Node 4.2, it’s 18x slower. To pick the worst case, 67x slower in CPU time (using a cached JSON-LD context) is way better than the 260x CPU hit, or the 7,551x wall-time hit, of an uncached context. That said, 67x slower still sounds really scary. So, let’s dig a bit deeper and put some processing time numbers (in milliseconds) behind these figures:

These numbers are less scary. In the common worst case, where we’re using a cached context in the slowest programming language tested, it will take 2ms to do JSON-LD processing per CPU core. If you have 8 cores, you can process 8 JSON-LD API requests in 2ms. It’s true that 2ms is an order of magnitude slower than just pure JSON processing, but the question is: is it worth it to you? Is gaining all of the benefits of using JSON-LD for your industry and your application worth 2ms per request?
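The arithmetic behind that claim can be worked through as a rough sketch. The per-request times are the cached-context figures quoted in this article. The assumptions that requests are purely CPU-bound and spread evenly across 8 cores are simplifications for illustration only.

```javascript
// Per-request JSON-LD processing times with a cached context, as quoted in
// this article (milliseconds on one core).
const msPerRequest = { "PHP 5": 2, "PHP 7": 0.7, "Node 4": 1 };

// Assumes requests are CPU-bound and spread evenly across cores.
function requestsPerSecond(ms, cores) {
  return Math.floor((1000 / ms) * cores);
}

const cores = 8;
const throughput = {};
for (const [runtime, ms] of Object.entries(msPerRequest)) {
  throughput[runtime] = requestsPerSecond(ms, cores);
}
console.log(throughput);
```

Under those assumptions, the 2ms worst case works out to 500 requests per second per core, or 4,000 per second on an 8-core machine.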

If the answer is no, and you really need to shave off 2ms from your API response times, but you still want to use JSON-LD – don’t do JSON-LD processing. You can always delay processing until later by just ensuring that your client is delivering valid JSON-LD; all you need to do that is apply a JSON Schema to the incoming data. This effectively pushes JSON-LD processing off to the API client, which has 2ms to spare. If you’re building any sort of serious API, you’re going to be validating incoming data anyway and you can’t get around that JSON Schema processing cost.
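A sketch of what "apply a JSON Schema to the incoming data" could look like follows. The schema, the Person fields, and the context URL are hypothetical examples, and the tiny `validate` function handles only the few keywords used here. It stands in for a real JSON Schema validator and is not any particular library’s API.

```javascript
// A hypothetical schema for the Person payloads an API might accept. It pins
// the context to one known URL, so the server never has to dereference it.
const personSchema = {
  type: "object",
  required: ["@context", "@type", "name"],
  properties: {
    "@context": { const: "http://schema.org/" },
    "@type": { const: "Person" },
    name: { type: "string" },
    url: { type: "string" },
  },
  additionalProperties: false,
};

const has = (obj, key) => Object.prototype.hasOwnProperty.call(obj, key);

// Tiny checker covering only the keywords used above; a stand-in for a real
// JSON Schema validator, not a replacement for one.
function validate(schema, value, path = "$") {
  const errors = [];
  if (has(schema, "const") && value !== schema.const) {
    errors.push(`${path}: expected ${JSON.stringify(schema.const)}`);
  }
  if (schema.type === "string" && typeof value !== "string") {
    errors.push(`${path}: expected a string`);
  }
  if (schema.type === "object") {
    if (value === null || typeof value !== "object" || Array.isArray(value)) {
      return [`${path}: expected an object`];
    }
    const props = schema.properties || {};
    for (const key of schema.required || []) {
      if (!has(value, key)) errors.push(`${path}: missing "${key}"`);
    }
    for (const [key, child] of Object.entries(value)) {
      if (has(props, key)) {
        errors.push(...validate(props[key], child, `${path}.${key}`));
      } else if (schema.additionalProperties === false) {
        errors.push(`${path}: unexpected "${key}"`);
      }
    }
  }
  return errors;
}

const good = { "@context": "http://schema.org/", "@type": "Person", name: "Jane Doe" };
const bad = { "@context": "http://example.com/other", "@type": "Person", name: 42 };

console.log(validate(personSchema, good)); // empty list: accepted
console.log(validate(personSchema, bad)); // one error per violated rule
```

Because the schema fixes `@context` to a single known value, anything that passes is still JSON-LD that a client or a later batch job can process in full, while the server itself does no context lookups.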

I’ve never had a discussion with someone where 2 milliseconds was the deal breaker between JSON-LD processing and not doing it. There are many things in software systems that eat up more than 2 milliseconds, but JSON-LD still gives you the choice of doing the processing at the server, pushing that responsibility off to the client, or a number of other approaches that provide different trade-offs.

But Dr. Severance said…

There are a few parting thoughts in Unconstrained JSON-LD Performance Is Bad for API Specs that I’d be remiss in not addressing.

JSON-LD evangelists will talk about “caching” – this of course is an irrelevant argument because virtually all of the shared hosting PHP servers do not allow caching so at least in PHP the “caching fixes this” is a useless argument. Any normal PHP application in real production environments will be forced to re-retrieve and re-parse the context documents on every request / response cycle.

phpFastCache exists; use it. If for some reason you can’t, and I know of no reason you couldn’t, cache the context by writing it to disk and retrieving it from disk. Most modern operating systems will optimize this down to a very fast read from memory. If you can’t write to disk in your shared PHP hosting environment, switch to a provider that allows it (most of them do).
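The write-to-disk fallback is simple enough to sketch. The example is in Node rather than PHP, to keep it runnable alongside the other sketches in this post, but the idea carries over directly. The names `cacheDir`, `cachePathFor`, `loadContext`, and `fakeFetch` are all hypothetical, and the fetcher is a stand-in that counts calls instead of using the network.

```javascript
const crypto = require("crypto");
const fs = require("fs");
const os = require("os");
const path = require("path");

// A fresh temp directory keeps this sketch repeatable. A real deployment
// would use a fixed, writable directory that survives between requests.
const cacheDir = fs.mkdtempSync(path.join(os.tmpdir(), "jsonld-contexts-"));

// One file per context URL, named by a hash so the URL is filesystem-safe.
function cachePathFor(url) {
  const digest = crypto.createHash("sha256").update(url).digest("hex");
  return path.join(cacheDir, `${digest}.json`);
}

// `fetchContext` is supplied by the caller (an assumption of this sketch);
// it is only invoked when the context is not already on disk.
function loadContext(url, fetchContext) {
  const file = cachePathFor(url);
  if (fs.existsSync(file)) {
    return JSON.parse(fs.readFileSync(file, "utf8"));
  }
  const context = fetchContext(url);
  fs.writeFileSync(file, JSON.stringify(context));
  return context;
}

// Stand-in fetcher that counts calls instead of using the network.
let fetchCount = 0;
const fakeFetch = () => {
  fetchCount += 1;
  return { "@context": { name: "http://schema.org/name" } };
};

const first = loadContext("http://schema.org/", fakeFetch);
const second = loadContext("http://schema.org/", fakeFetch);
console.log(fetchCount, JSON.stringify(first) === JSON.stringify(second));
```

The second call never reaches the fetcher; it reads the file written by the first. A production version would also want an expiry policy, which this sketch leaves out.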

even with cached pre-parsed [ed: JSON-LD Context] documents the additional order of magnitude is due to the need to loop through the structures over and over, to detect many levels of *potential* indirection between prefixes, contexts, and possible aliases for prefixes or aliases.

That is not how JSON-LD processing works. Rather than go into the details, here’s a link to the JSON-LD processing algorithms.

json_decode is written in C in PHP and jsonld_compact is written in PHP and if jsonld_compact were written in C and merged into the PHP core and all of the hosting providers around the world upgraded to PHP 12.0 – it means that perhaps the negative performance impact of JSON-LD would be somewhat lessened “when pigs fly”.

You can do JSON-LD processing in 2ms in PHP 5, 0.7ms in PHP 7, and 1ms in Node 4. You don’t need a C implementation unless you need to shave those times off of your API calls.

If the JSON-LD community actually wants its work to be used outside the “Semantic Web” backwaters – or in situations where hipsters make all the decisions and never run their code into production, the JSON-LD community should stand up and publish a best practice to use JSON-LD in a way that maintains compatibility with JSON – so that APIs and be interoperable and performant in all programming languages. This document should be titled “High Performance JSON-LD” and be featured front and center when talking about JSON-LD as a way to define APIs.

I agree that we need to write more about high performance JSON-LD API design, because we have identified a few things that seem best-practice-y. The problem is that we’ve been too busy drinking our “tripel” mocha lattes and riding our fixies to the latest butchershop-by-day-faux-vapor-bar-by-night flashmob experiences to get around to it. I mean, we are hipsters after all. Play-play balance is important to us and writing best practices sounds like a real drag. :)

In all seriousness, JSON-LD is now published by several million sites. We know that people want some best practices from the JSON-LD community. Count this blog post as one of them. If you use JSON-LD, and that includes the developers that created those several million sites that publish JSON-LD, you are a part of the JSON-LD community. We created JSON-LD to help our fellow developers. If you think you have settled on a best practice with JSON-LD, don’t wait for someone else to write about it; please pay it forward and blog about it. Your community of fellow JSON-LD developers will thank you.

5 Comments

  1. Adrian Pohl says:

    Re. publication of JSON-LD on the web: new statistics from the Web Data Commons corpus were published yesterday, including numbers for embedded JSON-LD. In this crawl from November 2015, 596,229 domains and 35,486,192 URLs have been enriched with JSON-LD, see http://webdatacommons.org/structureddata/2015-11/stats/stats.html.

    Also, I noticed a typo: It should read “cached” instead of “uncached” in “CPU processing time for JSON vs. JSON-LD with an uncached context in PHP 5 is 67x slower.”

    • ManuSporny says: (Author)

      Thanks for the link to the Web Data Commons stats for 2015 on JSON-LD. Usage on 596,229 domains isn’t bad for a spec that is just barely two years old. :)

      re: typo – thanks for catching that; I’ve fixed it in the article text.

  2. Manu and Dave – Thank you so much for doing this. Your analysis is far more thorough than my banging around. I am glad the conversation has started and frankly would *love* it if the JSON-LD community could lead us through the pitfalls of finding our way from JSON to JSON-LD safely. A couple of points. (1) I looked at phpFastCache some time ago (independent of JSON-LD) and concluded it was a little “early” for production use. If you think it is solid and real and does not cause ISPs to shut you down because of too much memory use, I want to use it for lots of things. (2) Purely caching JSON-LD contexts does not solve the problem of “I can’t parse this API data unless (a) my network is up and (b) someone else’s server is up”. That is a concern. (3) Your most optimistic performance numbers are still very concerning for a standard like IMS Caliper, which might easily have a service receiving 100K requests per second. (4) Statements like “lots of folks are putting up JSON-LD” completely miss the point that the performance impact is on those consuming data, not those who produce the data. Unconstrained JSON-LD is fine for long-lived documents that are rarely accessed but contain complex/evolving data. My issue is *only* with JSON-LD as a hyper-scalable API. I do not think that you and Dave have a case that JSON-LD can scale to millions of requests per minute on reasonable hardware, so you make my original point better than I did, because you pulled out all the stops and your results show it is *still* costly to parse JSON-LD at scale. (5) I am very interested in your suggestion to use JSON Schema. I am curious if you mean “use JSON Schema instead of JSON-LD” or if you mean “use JSON Schema to enforce a set of constrained JSON-LD rules”. The first version seems like a bad idea to me but the second version is very interesting. As I said in my post, I *like* JSON-LD for lots of reasons and am only interested in a safe way to get from JSON to JSON-LD.
