This is part one in a small series about measuring software performance. There’s a lot of common sense covered, but I feel it is necessary to shed some light on it anyway.
Pete needs coffee and his coffee maker broke down. Pete’s browsing through Craigslist. He’s looking for a coffee maker and he’s fine with a used one if he can get it from nearby. While results may vary when Pete’s got his coffee, his brain processes what he sees on a web page in between 200 and 500 milliseconds. Of course this depends on the complexity of the page and outside distractions[citation needed].
Computers are very limited in what they can calculate, but they are incredibly fast and reliable. Human brains are a lot more sophisticated, but not as fast at raw computation. Fetching the Craigslist homepage takes about 150ms right now (I’m in Berlin) when I ask curl, and it takes Safari around 1.4 seconds (1400ms) to display the page.
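To make the first number concrete: a minimal sketch of measuring only the raw HTTP round trip, the same thing the curl number above captures. It says nothing about how long a browser needs to render the page, and the URL and units are taken straight from the example.

```python
import time
import urllib.request

URL = "http://craigslist.org/"

start = time.perf_counter()
with urllib.request.urlopen(URL) as response:
    body = response.read()                       # fetch the HTML, nothing more
elapsed_ms = (time.perf_counter() - start) * 1000

print(f"fetched {len(body)} bytes in {elapsed_ms:.0f}ms")
```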
This in part demonstrates the measuring dilemma. Pete never sees the 150ms response for http://craigslist.org/. He only sees that it takes a bit before his browser finishes loading. We’ll get back to that later.
The point here is: even if all parts of the system together came in under a 200ms response time, Pete (and everybody else) would not notice. Pages would change “instantly” as far as he (and everybody else) is concerned. While the fallacies of distributed computing (read: The Internet) will probably never get us there, at some point it no longer makes sense to speed things up, because no one will notice.
Let’s take a look at what a typical web app looks like. This is not exactly how Craigslist works (because I don’t know how Craigslist works), but it is a close enough approximation to illustrate the problems with benchmarking.
You have a web server, some middleware and a database. A user request comes in; the web server takes care of the networking and parses the HTTP request. The request gets handed to the middleware layer, which figures out what to run and then runs whatever is needed to serve the request. The middleware might talk to your database and other external resources like files or remote web services. The request bounces back to the web server, which sends out the resulting HTML. The HTML includes references to other resources living on your web server, like CSS, JS or image files, and the process starts anew for every resource. A little different each time, but in general all requests are similar. And along the way there are caches that store intermediate results to avoid expensive recomputation.
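If you wanted to see where the time goes, you would hang a timer on each of those stages. A minimal sketch; the stage functions below are made-up stand-ins, not anybody’s real stack:

```python
import time

def timed(label, fn, *args):
    """Run one pipeline stage and report how long it took."""
    start = time.perf_counter()
    result = fn(*args)
    print(f"{label}: {(time.perf_counter() - start) * 1000:.1f}ms")
    return result

# Hypothetical stand-ins for the components described above.
def parse_http(raw_bytes):
    return {"path": "/"}                      # web server: parse the HTTP request

def route(request):
    return render                             # middleware: figure out what to run

def render(request):
    rows = query_database(request)            # middleware talks to the database
    return "<html>" + ", ".join(rows) + "</html>"

def query_database(request):
    return ["coffee maker", "barely used"]    # stand-in for the real database call

raw = b"GET / HTTP/1.1\r\n\r\n"
request = timed("parse http", parse_http, raw)
handler = timed("route", route, request)
html = timed("render + db", handler, request)
```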
That’s a lot of moving parts. Getting a top-to-bottom profile of all components to figure out where the bottlenecks lie is pretty complex (but nice to have). I’ll start making up numbers now; the absolute values are not important, only how the numbers relate to each other. Say a request takes 1.5 seconds (1500ms) to be fully rendered in a browser.
In a simple case like Craigslist there is the initial HTML, a CSS file, a JS file and the favicon. Except for the HTML, these are all static resources that involve reading some data from a disk (or from memory) and serving it to the browser, which then renders it. The most notable things to do for performance are keeping the data small (gzip compression, high jpg compression) and avoiding requests altogether (HTTP-level caching in the browser). Making the web server any faster doesn’t buy us much (yeah, hand-wavy, but I don’t want to focus on static resources here; Pete wants his coffee). Let’s say all static resources take 500ms to serve & render.
(Read all about improving the client experience with proper use of HTTP from Steve Souders. The YSlow tool is indispensable for tuning a web site.)
That leaves us with 1000ms for the initial HTML. We’ll chop off 200ms for network latency [cf. Network Fallacies]. Let’s pretend HTTP parsing, middleware routing, middleware execution and database access share the remaining 800ms equally, 200ms each.
If you now set out to improve one part of the big puzzle that is your web app and gain 10ms in database access time, that is probably not time well spent (unless you have the numbers to prove it).
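The back-of-the-envelope arithmetic, using the same made-up numbers from above:

```python
# The fake 1500ms budget from the text, and what a 10ms database win
# buys you in the overall picture.
budget_ms = {
    "static resources": 500,
    "network latency": 200,
    "http parsing": 200,
    "middleware routing": 200,
    "middleware execution": 200,
    "database access": 200,
}
total = sum(budget_ms.values())                 # 1500ms end to end
saving = 10                                     # 10ms shaved off database access

print(f"{saving} of {total}ms = {saving / total:.1%}")  # roughly 0.7% of the total
```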
We established that there are a lot of moving parts. Each part has a variable performance characteristic, depending on load, disk I/O, the state of various caches (down to the CPU’s L2 cache) and OS scheduler behaviour that shifts with any input variable. It is nearly impossible to know every interfering factor, so any numbers you ever come up with should be taken with a grain of salt. In addition, when my system reports 1000ms and yours reports 1200ms, the only thing we can derive from that is that our systems are different, and we knew that before.
To combat the variables, profiles are usually run multiple times (and a lot of times!) so that statistics can tell you the margin of error you’re getting. Profiles should also run for a long time and with the same amounts of data that you will see in production. If you run a quick profile for a few seconds or minutes, you will hit empty caches and get skewed numbers. If your data does not have the same properties as the data in your production environment, you’ll get skewed results too.
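A minimal sketch of what that looks like in practice. `do_request` is a placeholder for whatever you are actually profiling, and the sample count is arbitrary:

```python
import statistics
import time

def do_request():
    time.sleep(0.01)  # stand-in for the real work being measured

samples = []
for _ in range(200):
    start = time.perf_counter()
    do_request()
    samples.append((time.perf_counter() - start) * 1000)

# Look at the spread, not just one number.
print(f"mean   {statistics.mean(samples):.1f}ms")
print(f"stdev  {statistics.stdev(samples):.1f}ms")
print(f"median {statistics.median(samples):.1f}ms")
print(f"p95    {sorted(samples)[int(len(samples) * 0.95)]:.1f}ms")
```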
Story time: Chris tried to find out how many documents of a certain size he could write into CouchDB. CouchDB has a feature that generates a UUID for every new document you store. The UUID variant it uses has a full 128 bits of randomness. The documents are then stored in a b+-tree. Turns out that for a b+-tree, truly random keys for any kind of access are the worst possible case to handle. Chris then switched to pre-generated sequential ids for his test and got a 10x improvement. Now he’s testing the best case for CouchDB, which happens to coincide with his application’s data, but your application might have a different key distribution, resulting in only a 2x or 5x improvement, or none at all.
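You can see the same shape of effect without CouchDB. The sketch below just uses sqlite (whose primary-key index is also a b-tree), so the size of the gap will not match Chris’s 10x and the row counts are arbitrary, but the variable being poked at is the same: random versus sequential keys.

```python
import os
import sqlite3
import tempfile
import time
import uuid

def insert(keys):
    """Insert one row per key into a fresh on-disk table and time it."""
    path = os.path.join(tempfile.mkdtemp(), "bench.db")
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE docs (id TEXT PRIMARY KEY, body TEXT)")
    start = time.perf_counter()
    db.executemany("INSERT INTO docs VALUES (?, ?)",
                   ((k, "x" * 100) for k in keys))
    db.commit()
    db.close()
    return time.perf_counter() - start

n = 200_000
random_keys = [uuid.uuid4().hex for _ in range(n)]      # 128 bits of randomness
sequential_keys = [f"{i:032x}" for i in range(n)]       # monotonically increasing

print(f"random ids:     {insert(random_keys):.2f}s")
print(f"sequential ids: {insert(sequential_keys):.2f}s")
```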
In a different case, the amount of data stored and retrieved could easily fit in memory, and Linux’s filesystem cache was smart enough to turn all disk access into memory access, which is naturally faster. But that doesn’t help if your production setup has more data than fits in memory.
Take home point: Profiling data matters.
The second part of this little series will look at pitfalls when profiling storage systems.
Tool X might give you 5ms response times and this is an order of magnitude faster than anything else on the market. Programming is all about trade-offs and everybody is bound by the same laws.
On the outside it might appear that everybody who is not using Tool X is a moron. But speed & latency are only part of the picture. We already established that going from 5ms to 50ms might not even be noticeable to anyone using your product. The expense of that speed can take several forms:
Memory: instead of doing computations over and over, Tool X might have a cute caching layer that avoids recomputation by storing results in memory. If you are CPU bound, that might be good; if you are memory bound, it might not. A trade-off.
Concurrency: the clever data structures in Tool X are extremely fast when only one request at a time is processed, and because it is so fast most of the time, it appears as if it processes multiple requests in parallel. Eventually, though, a high number of concurrent requests fills up the request queue and response times suffer. A variation on this is that Tool X might work exceptionally well on a single CPU or core, but not on many, leaving your beefy servers idling.
Reliability: making sure data is actually stored is an expensive operation. Making sure a data store is in a consistent state and not corrupted is another. There are two trade-offs here. One is buffers that hold data in memory before committing it to disk to get higher write throughput; in case of a power loss or a crash (hard- or software), that data is gone, which may or may not be acceptable for your application. The other is the consistency check that has to run after a failure; if you have a lot of data, this can take days. If you can afford to be offline, that’s okay, but maybe you can’t afford it.
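A rough sketch of that first trade-off: letting the OS buffer writes versus forcing every write to disk. The file names, record size and record count are arbitrary; only the relative difference matters.

```python
import os
import time

def write_records(path, count, sync):
    """Append `count` fixed-size records, optionally fsyncing each one."""
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(count):
            f.write(b"x" * 256)
            if sync:
                f.flush()
                os.fsync(f.fileno())  # don't claim success until it's on disk
    return time.perf_counter() - start

print(f"buffered: {write_records('buffered.dat', 2000, sync=False):.3f}s")
print(f"fsync'd:  {write_records('synced.dat', 2000, sync=True):.3f}s")
```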
Make sure you understand what requirements you have and pick the tool that meets them instead of the one with the prettiest numbers. Who’s the moron when your web application is offline for a day of repairs while your customers impatiently wait to get their jobs done, or worse, when you lose their data?
Yeah, you want to know which of these databases, caches, programming languages, language constructs or tools is faster, harder, stronger. Numbers are cool and you can draw pretty graphs that management types can compare and make decisions from.
The first thing a good exec knows is that she’s operating on insufficient data (as an aside, everybody does, all the time, but sometimes it is just not apparent to you) and that diagrams drawn from numbers are a very distilled view of reality. And graphs drawn from numbers that are effectively made up by bad profiling are not much more than a fairy tale.
If you are going to produce numbers, make sure you understand how much is and isn’t covered by your results. Before passing them on, make sure the receiving person knows as much.
I’m in the market for databases and key-value stores. Every solution has a sweet spot in terms of data, hardware, setup and operation, and there are enough permutations that you can pick the one that is closest to your problem. But how do you find out? Ideally, you download & install all possible candidates, create a profiling test suite with proper testing data, run extensive tests and compare the results. This can easily take weeks, and you might not have that much time.
I would like to ask developers [*] of storage systems to compile a set of profiling suites that simulate different usage patterns of their system (read-heavy & write-heavy loads, fault tolerance, distributed operation and a lot more). A fault tolerance suite should include steps necessary to get data live again, like any rebuild or checkup time. I would like users of these systems to help their developers to find out how to reliably measure different scenarios.
* I’m working on CouchDB and I’d like to have such a suite very much!
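The shape of such a suite does not have to be fancy. Here is a very rough skeleton of a workload driver with a configurable read/write mix; the get/put interface, ratios and key space are made up, and a real suite would plug in the actual storage system and realistic data.

```python
import random
import statistics
import time

def run_workload(get, put, ops=10_000, write_ratio=0.1, key_space=100_000):
    """Fire a mix of reads and writes at a store and report per-kind timings."""
    timings = {"read": [], "write": []}
    for _ in range(ops):
        key = str(random.randrange(key_space))
        start = time.perf_counter()
        if random.random() < write_ratio:
            put(key, "payload")
            kind = "write"
        else:
            get(key)
            kind = "read"
        timings[kind].append((time.perf_counter() - start) * 1000)
    for kind, samples in timings.items():
        if samples:
            print(f"{kind}: n={len(samples)} median={statistics.median(samples):.3f}ms")

# Example run with a plain dict standing in for a real store:
store = {}
run_workload(store.get, store.__setitem__, write_ratio=0.9)   # write-heavy
run_workload(store.get, store.__setitem__, write_ratio=0.05)  # read-heavy
```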
Even better, developers could agree (hehe) on a set of benchmarks that objectively measure performance for easy comparison. I know this is a lot of work and the results can still be questionable (you did read the part above, didn’t you?), but it’ll help our users a great deal when figuring out what to use.
Stay tuned for the next part in this series about things you can do wrong when testing databases & k-v stores.
good point on the B+ Tree. I’d be curious to see a comparison of the same test on MongoDB — what sort of numbers you get there
Yeah, it’d be great to run a similar test in MongoDB, maybe send them a request? :)
While you are right to steer benchmarking away from a "here to explain everything just by sheer numbers" approach, I still object to the numbers you are throwing around. While an initial response time of, say, 1000ms for a website might be OK for Pete (well, without his coffee even 2000ms might work fine here), as soon as you start using UI components that involve a server roundtrip, this is far beyond usable.
Users expect some response from a user interface within ~100 ms [citation needed], and everything excessively bigger than this number leaves the feeling that something just didn’t work.
So a static page transferred with all components and rendered within 2000ms might be OK, but a cute ajaxified drop-down box rendered in 500ms definitely is not. It all depends on the circumstances…
(by the way: 200msecs for HTTP parsing?)
Thanks for the comment. The numbers are all fake, just to illustrate that speeding up a single component doesn’t buy you a lot in the whole picture. Nothing else.
In the studies I have seen, to feel completely live you need 30ms or so, and in applications you see big productivity improvements down to 100ms or less, as the brain doesn’t feel like it is waiting and doesn’t get distracted.
"Turns out that for a b+-tree, truly random keys for any kind of access are the worst possible case to handle."
Can you clarify this a bit? Does this apply even in the case of random reads? Is the problem with the B+ tree of documents, or indexes (views)?
I can sort of see why sequential ids might be helpful when inserting a lot of documents, but I don’t see why it would affect reads very much, especially random reads.
b-trees are used for databases and view indexes. This discussion is relevant to the database b-trees.
B-trees are optimized for the case of slow lookup but fast IO to/from the underlying storage medium (tape, disk, etc.). Speeding up a b-tree on a filesystem works by caching the upper nodes in the filesystem cache. Truly random IO adds heavy churn to the FS cache and doesn’t result in as good a performance increase as you’d get with a cache that holds the set of b-tree nodes on the "most used path" for operations on the tree.
This article suffers from too client-centric an approach.
A real sysadmin is going to be interested in shaving as much off the run-time of their database-using scripts as possible, because there will be considerably more than one concurrent user.