Concluding the Erlang Workshop at ICFP in Edinburgh, Kenneth Lundin, lead manager of the Erlang/OTP development team, showed us what’s going to be new in Erlang.
The next release, R13B02, will be released at the end of September, and the following so-called service releases will come out roughly every two months.
The next major release, R14B, is targeted to be released in Q1 of 2010.
Ericsson’s customers see Erlang’s ability to scale near-linearly with the number of available CPUs and cores as a major selling point. Consequently, Ericsson continues to improve Erlang’s multi-core performance. The R13B02 release will only see a few minor improvements, though.
Some bullet points:
- erlang:demonitor(Mon, [flush]) and gen_server:call() are made faster by avoiding a full search of the message queue. So far, the number of times the message queue gets scanned was reduced from 3 to 1, and they keep working on this.

R13B02 will come with a new utility for packaging Erlang applications. It is called RelTool and its goal is to “make it easy to build your own standalone applications”. These applications are a single installation directory for your Erlang application, including the full runtime system and all dependent libraries. This makes it trivial to distribute Erlang applications without any dependencies.
Packages are built according to a recipe that, for now, needs to be handwritten but in the future will be built with a graphical tool. Recipes are used to transform a development environment directly into a deployable package for the target platform(s).
Long-term Erlang users were overheard saying: “This should have happened ten years ago.”
Kenneth finished with a list of things that Ericsson is working on, but doesn’t have a schedule to release yet.
Further improvement of multi-core performance:
And a few general improvements:
Good stuff.
I’m very proud to announce JSConf.eu, which I am co-organizing (with Malte & Holger) November 7th/8th here in Berlin.
From the official announcement:
JSConf in Washington, DC in April this year proved to be the homecoming of a JavaScript community that didn’t know it existed. The conference got top marks all around for content, community and fun. With JSConf.eu, we let that spirit live on and give JavaScript a home in Europe. (Why let the Americans have all the fun? :)
Mark November 7th/8th in your calendar, we’ll be presenting top speakers in the field and a relaxing atmosphere in Berlin, Germany’s pulsing capital. Join us for a two-day, community-focussed indie conference with top technical content around your favourite language and great opportunities to meet like-minded people.
Our speakers so far include:
- Amy Hoy
- Thomas Fuchs
- John Resig
- Dion Almaer
- Ben Galbraith
I’m looking forward to announcing even more great speakers in the coming weeks & months. The call for papers is still open:
We’re concerned about diversity in our industry and would like to encourage more women to do talks.
See you in Berlin!
Next week, I’ll be presenting CouchDB at Epicenter “The Irish Software Show”. My talk is on Wednesday the 26th at 14:30.
My friend Francesco is talking about Erlang at 11:00 the same day, don’t miss it!
Hope to see you there!
After the US Spring Tour in April this year, I’m about to embark on the EU Summer Tour.
I’ll be visiting London, Amsterdam, Zurich and Gran Canaria. Here’s when, how and why:
June 22nd–26th: CouchDB University & Factory, London
The CouchDB University is a three-day training course where J Chris and I teach a select group of students everything about CouchDB. Even with little prior knowledge, you’ll leave being able to build amazing CouchDB applications at small and large scale, as well as extend CouchDB itself.
The CouchDB Factory is a track at the Erlang Factory running all day Friday.
June 29th–30th: Kings of Code, Amsterdam
Kings of Code looks like it is going to be a kick-ass web developer conference featuring some of my favourite web people: Geoffrey Grosenbach, Joe Stump & Francisco Tolmasky. I’m fairly confident that the other speakers will be among my favourites after Kings of Code :)
J Chris will be talking about CouchDB.
It’s still under discussion, but I might talk about CouchDB and Erlang for web developers at one of the side events.
July 1st–2nd: ICOODB, Zurich
I’ll be taking the night train from Amsterdam to Zurich to give a three-hour tutorial as well as a 60-minute presentation on CouchDB at the International Conference on Object Databases. CouchDB is, strictly speaking, not an object-oriented database, but it stores objects and is of interest to the research community that meets in Zurich.
Prof. Stefan Edlich invited me to speak at ICOODB and I’m very happy I can make it.
July 1st–7th: GUADEC, Gran Canaria
Canonical, the kind folks behind the Ubuntu Linux distribution, are pushing CouchDB to become a centerpiece of the Ubuntu desktop data synchronization infrastructure. Merrily sync your contacts and calendar data between your machines and an online backup service, and share select data with your peers. And yeah, UbuntuOne is also related :)
Canonical is flying me out to attend the Linux Desktop Summit to talk to desktop application developers and show them how cool CouchDB is and where it is useful for them.
Also, Gran Canaria: I couldn’t say no. Thank you, Canonical!
As much as I am excited about the travels and meeting all of you out there, I’ll be missing three weeks in my favourite city, Berlin, and that makes me a little sad.
This is part two in a small series about measuring software performance. There’s a lot of common sense covered, but I feel it necessary to shed some light.
If you haven’t, check out part one.
Say you want to find out what’s behind the buzz of all these new #nosql databases. There’s a large number to choose from today. All options come in varying degrees of maturity and with different characteristics, so it’d be nice to know what solves your problem best. A non-exhaustive list of these databases and storage systems includes Memcache[DB], Tokyo Cabinet / Tyrant, Project Voldemort, Scalaris, Dynamite, Redis, Persevere, MongoDB, Solr or my favourite, CouchDB. And these are just some of the open source ones.
This article is not a comprehensive comparison of any of the mentioned systems. Instead it tries to give you an idea of what to look for when evaluating a storage system, or how to put the evaluations and benchmarks others have done into perspective.
We’ll look at some of the technical aspects of data storage systems: Applying common sense when reading benchmarks; b-trees and hashing; speed vs. concurrency; networked systems and their problems; low level data storage (disks’n stuff); and data reliability on single-nodes and multi-node systems.
There are plenty of non-technical criteria for deciding for or against a project, but things like commercial support or a healthy open source community are not part of this article.
From time to time you see some crazy numbers posted to the reddits of the internets that claim fantastic performance.
The (imaginary) SuperfastDB can store 450,000 items per second!
Wow.
No word on where the items are stored (in memory? on a harddrive? Spindles? Solid state?), what an item exactly is and how big it is, what the rest of the hardware was that this ran on, or how to reproduce it.
But boy, 450,000 a second!
My shoes can do 650,000 a second, but you’ve got to figure out what.
Context is as important as reproducibility. The last article here established that finding out that my system and your system come up with different numbers is not much help. Any sort of serious test must come with a set of scripts or programs and comprehensive instructions on how the tests were run.
Everything “cool” in computer science has been around for 25+ years. Actual innovation is rare. Advancements in hardware and new combinations of existing solutions make for new stuff coming out each day (that’s a good thing), but the fundamental rules are the same for all. We’re all running von Neumann machines, quicksort is still pretty quick and hashes and b-trees rule the storage world.
Let’s recap.
Hashing revolves around the idea of O(1) lookups: allocate a number of buckets, create a function that gives you the number of a bucket for any data item you might want to store, and make sure no two data items hit the same bucket (or work around that). The runtime characteristics come down to asking your function where to look for or store your data, plus the allocation of your set of buckets: if you need to store more items than you have buckets, some more work is required, which gives you O(N) operations that you can’t ignore in practice.
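To make the bucket idea concrete, here is a minimal Python sketch (all names are made up for illustration): the “function” is just hash() modulo the bucket count, and the resize step is exactly the O(N) work you can’t ignore.

```python
class NaiveHashStore:
    """Toy hash storage: constant-time lookups until the buckets fill up."""

    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]
        self.count = 0

    def _bucket(self, key):
        # The "function that gives you the number of a bucket" for any key.
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)
                return
        bucket.append((key, value))
        self.count += 1
        if self.count > 2 * len(self.buckets):
            self._grow()            # the O(N) step you can't ignore

    def get(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v
        raise KeyError(key)

    def _grow(self):
        # More items than buckets: allocate more buckets and rehash everything.
        old = [pair for bucket in self.buckets for pair in bucket]
        self.buckets = [[] for _ in range(2 * len(self.buckets))]
        for key, value in old:
            self._bucket(key).append((key, value))


store = NaiveHashStore()
store.put("pete", "coffee maker")
print(store.get("pete"))
```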
The other elephant in the room is b-trees. The fundamental idea here is to get to your data in a minimal number of steps traversing a tree because making a step is expensive, but reading your data is very fast comparatively. Steps are expensive because they translate to a head seek (that is the time your spinning hard drive needs to position the reading arm to find the spot to read your data from), but reading from a harddrive once the reading head is in place is fast.
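A quick back-of-the-envelope calculation shows why wide nodes matter; the numbers below are illustrative and not taken from any particular database:

```python
import math

items = 100_000_000                           # entries in the index
binary_tree = math.ceil(math.log2(items))     # two children per node
btree = math.ceil(math.log(items, 200))       # roughly 200 children per b-tree node

print(f"binary tree: ~{binary_tree} steps")   # ~27 head seeks
print(f"b-tree:      ~{btree} steps")         # ~4 head seeks
```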
There are a bunch of more interesting lookup structures like R-Trees for spatial queries, but they are mostly used for secondary indexes on top of a regular data set that lives in a hash or b-tree.
Concurrency is hard. The devil lies in the details and when briefly looking at things, the details are often overlooked. Suits the devil.
Creating storage systems that assume only one access occurs at a time is relatively easy. If resources are shared concurrently, things become tricky. The two larger schools of thought (and practice) are locking and no-locking (heh).
Locking means that the database has to keep track of everybody who wants to write to a part of the database, and which part that is.
No locking, also known as optimistic locking or MVCC, moves that burden to the person who is trying to write to the database: she must prove that she won’t be overwriting any existing data.
The trade-off here is leaner request handling on the server, which works well with remote & concurrent clients, at the expense of more complexity on the client (the person who wants to store something in our database).
Hybrid approaches are possible too: While MVCC is used internally, the database’s clients can rely on database-side locking (e.g. PostgreSQL or InnoDB).
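Here is a hedged sketch of what the optimistic, MVCC-style contract looks like from the writer’s side: every stored value carries a revision, and a write only succeeds if it names the revision it was based on. This mirrors the spirit of CouchDB’s _rev, but the MvccStore below is just an in-memory stand-in, not any real database API.

```python
import itertools


class ConflictError(Exception):
    pass


class MvccStore:
    """In-memory stand-in for an MVCC store: writers must prove what they read."""

    def __init__(self):
        self._data = {}              # key -> (revision, value)
        self._revs = itertools.count(1)

    def read(self, key):
        return self._data[key]       # returns (revision, value)

    def write(self, key, value, based_on_rev=None):
        current_rev = self._data.get(key, (None, None))[0]
        if current_rev != based_on_rev:
            raise ConflictError(f"{key} was updated by someone else")
        new_rev = next(self._revs)
        self._data[key] = (new_rev, value)
        return new_rev


store = MvccStore()
rev = store.write("phone", "555-1234")             # create
rev = store.write("phone", "+1-555-1234", rev)     # update based on what we read
try:
    store.write("phone", "other", based_on_rev=1)  # stale revision: rejected
except ConflictError as err:
    print("retry needed:", err)
```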
Just a quick note: We already talk about client and server here. There is a strong case for embedded databases like SQLite that don’t expose a concurrent user model to the outside. The program that needs an embedded database just includes it.
Another approach to using databases is having a dedicated computer running a database system and sharing it over the network with any number of clients using this database server. That dedicated computer can often really be “a bunch of servers”, or a cluster. More on that later.
A separate database server (networked or not) will need to spend some time to deal with connections, network failures, unspecified client behaviour and so on. The upside is a piece of infrastructure that can be maintained separately. An embedded database will thus be faster but probably won’t solve all of your problems and it will always be tied to your application.
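For contrast, the embedded case in Python’s standard-library sqlite3 module looks like this: no server process, no network, the whole database is a file (here a made-up coffee.db) inside the application.

```python
import sqlite3

# The database lives inside the application process; a file path or ":memory:".
conn = sqlite3.connect("coffee.db")
conn.execute("CREATE TABLE IF NOT EXISTS listings (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("INSERT INTO listings (title) VALUES (?)", ("used coffee maker, Berlin",))
conn.commit()

for row in conn.execute("SELECT id, title FROM listings"):
    print(row)
conn.close()
```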
When people tell me “SuperfastDB does 450,000 a second!” I ask “How many fsync()s is that?”. Let me explain:
A database system uses operating system services to use any hardware. The operating system exposes a harddrive through a filesystem. The database system talks to the filesystem and asks it to store or retrieve data on its behalf. The filesystem then goes ahead and tries to satisfy the database’s requests.
(I’ll not talk about databases that can use raw block devices to store data. They exist, but they are not as common as those that use the filesystem.)
The filesystem also tries to be clever – for good reasons. When the database requests a piece of data, the filesystem will not only find that piece and return it, it will also store it in a cache to avoid having to actually talk to the harddrive the next time this piece of data gets requested. When the data changes, the filesystem either removes it from the cache or updates it along with the harddrive. It might even go further and only store the new data that comes in with a write request in the cache, relying on a periodic task to write all of the cache back to the drive. Writing a bunch of pieces at once is more efficient than storing each one on its own.
More efficient equals faster, and faster is good, right? Well, it depends: if all goes well, this approach is a nice one. But you know computers, things will not go well 100% of the time. The failure scenarios are endless, but they boil down to the question: “What happens when your machine dies and you have data that has only been written to memory?” The answer isn’t too hard: that data is lost. If there is a delay between a write request finishing and the data being written (or “flushed”) to disk, any data that has been “written” during that delay is subject to loss.
There are cases where this is not a problem; in other cases it is. A developer should have the chance to decide. (Note that even your hardware could be lying to you about having stored data, but I’ll punt on that one; get proper hardware.)
So, flushing to disk needs to happen before you can rest assured your data has been stored. Your operating system has an API call that forces the filesystem to write its cache to disk. It is called fsync() (on UNIX systems) and it is an expensive operation. You can only do so many fsync()s in a second, and it is not a great many.
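If you want a feel for that number on your own hardware, a small sketch: append a tiny record and fsync() after every single write, then count how many complete in a second. os.fsync() is Python’s wrapper around the system call; the file name is arbitrary.

```python
import os
import time

fd = os.open("fsync-test.log", os.O_WRONLY | os.O_CREAT | os.O_APPEND)
record = b"x" * 100              # a small "item"

writes = 0
deadline = time.monotonic() + 1.0
while time.monotonic() < deadline:
    os.write(fd, record)
    os.fsync(fd)                 # force the data out of the OS cache onto the disk
    writes += 1

os.close(fd)
print(f"durable writes per second: {writes}")   # typically hundreds, not 450,000
```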
The 450,000 items were most likely just written to memory and not to disk.
When writing to the files on disk that represent what lives in a database (at the end of the day, your data ends up in one file or another on the filesystem), there are multiple options for handling updates.
An update is a change to your data item, for example, a new phone number. The intuitive way to handle this is to go and find the old phone number in the file, and overwrite it with the new number. Easy.
There are several problems with this approach: What to do if the new phone number is longer than the old one (say you added an international calling prefix)? The new number needs to be written to a different place and the change in location must be recorded. Not too big of an issue.
Back to failure scenarios: again, the reasons can be manifold, but what happens when we’ve (over-)written the first four digits of the old number with the new one and then the server dies, the power goes away or the database server crashes? The next time you want to read the phone number you get a mix of the old and the new one (if you are lucky) and you don’t even know that this is the case or which parts are missing. Your database file is inconsistent and you need to run an integrity check to find missing bits and correct half-written bytes. In the worst case that means scanning your entire database file a few times before you have resolved all inconsistencies. If you have a lot of data, that can take days.
To solve this, you always write the new phone number to a new place in the database file, and only when it has been fsync()ed to disk do you update the location of the phone number (and then flush that update to disk as well). You will never end up in a scenario where your database file is in an inconsistent state, and after a crash you are back online without an integrity check.

The trade-off is write speed (remember, fsync()s are expensive) in exchange for consistency-check speed after a failure.
A nice bonus is that if the “new place in the database” is the end of the file, you keep your disk-drive head busy with writing data to disk instead of seeking all over the place (remember: seeks are expensive).
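A sketch of that append-only shape, with made-up file names: each update is appended and flushed before a small pointer file is switched over to reference it, so a crash can never leave a half-overwritten record behind. A real database does this with far more care (and inside one file); this only shows the ordering of writes and flushes.

```python
import json
import os

DATA, PTR = "contacts.data", "contacts.ptr"


def append_update(record):
    """Append the new version, fsync it, then atomically point at it."""
    with open(DATA, "ab") as data:
        offset = data.seek(0, os.SEEK_END)          # where this version will live
        data.write((json.dumps(record) + "\n").encode())
        data.flush()
        os.fsync(data.fileno())                     # durable before we reference it

    tmp = PTR + ".tmp"
    with open(tmp, "w") as ptr:
        ptr.write(str(offset))
        ptr.flush()
        os.fsync(ptr.fileno())
    os.replace(tmp, PTR)                            # atomic pointer switch


def read_latest():
    with open(PTR) as ptr:
        offset = int(ptr.read())
    with open(DATA, "rb") as data:
        data.seek(offset)
        return json.loads(data.readline())


append_update({"name": "Pete", "phone": "555-1234"})
append_update({"name": "Pete", "phone": "+1-555-1234"})   # longer number, no overwrite
print(read_latest())
```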
So far, we’ve been looking at scenarios that involve a single database. We learned a great deal (I hope), but in reality we often deal with more than one database. The simplest reason to have two databases is redundancy. Failures can bring down your database temporarily or even permanently. If it is a temporary issue, waiting a bit (or a bit longer) to get up and running again might be an option, but often an application or service should be available at all times. And in a fatal failure, where a database server is lost beyond repair, your data is gone if you haven’t stored it in a second place.
“I’ll just make two copies, easy!” Yup, easy, until you look at the details (that damn devil again!).
It’s all about failures again. Consider a single read request. A client connects to a server and asks for a data item. The server looks it up and returns the data to the client. All is well. At any point things can go wrong. The network connection can drop (or slow down so much that client or server assume it dropped), the client can disappear (because of a network failure or crash), as can the server. Clients, servers and the protocols they speak need to be built around the assumption that any of these things (and many more) can go wrong. If any part is not designed to handle error cases, your system will do funny things, but it won’t reliably store and manage your data.
Add complexity: with each additional write target (store in two places) the possibility of error and the need for proper error handling grow quadratically. When evaluating a distributed storage system, looking at how errors are handled is vital.
Another reason to distribute data among multiple servers is capacity. The three metrics of interest here are read requests, write requests and data. If you have more requests or data than a single machine can handle, you need to move to multiple machines. Each metric calls for different strategies, but they often go along with each other. The need for fault tolerance that I discussed above needs to be considered alongside.
Growing read capacity is relatively easy once you’ve covered the base case: the source for reading data might not be the same as the target for writing data, and there can be a mismatch (cf. eventual consistency).
Distributing writes and data works by designating two machines to each handle 50% of the operations. A clever intermediary, a proxy server for example, decides which request goes where, and all is well: we can store twice as much and we can store at twice the speed. When we need to grow bigger yet, we add another server and tell the proxy to distribute the load equally among them. Adding a proxy for distribution introduces a single point of failure, and you don’t want those; there’s added complexity with this approach.
The diagram shows that there is another step needed that wasn’t included in the above description: the new node needs to get a copy of all data items that are assigned to it but currently live on the two existing nodes. The process of moving data items to new nodes is called resharding and needs to happen every time a new node is added.
Resharding can be an expensive operation if you have a lot of data. Techniques like consistent hashing help with minimising the number of items that need to move. If you are looking at a sharding database, you want to understand how the sharding is performed and whether you like the trade-offs.
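The core of consistent hashing fits in a few lines: nodes sit on a ring of hash values, each key belongs to the next node clockwise, and adding a node only claims the keys that fall on its new arc. The sketch below skips virtual nodes and replication, which real systems need; node names and key counts are made up.

```python
import bisect
import hashlib


def _hash(value):
    return int(hashlib.md5(value.encode()).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes):
        self._ring = sorted((_hash(n), n) for n in nodes)

    def node_for(self, key):
        h = _hash(key)
        points = [p for p, _ in self._ring]
        i = bisect.bisect(points, h) % len(self._ring)   # wrap around the ring
        return self._ring[i][1]

    def add_node(self, node):
        bisect.insort(self._ring, (_hash(node), node))


ring = HashRing(["node-a", "node-b"])
keys = [f"doc-{i}" for i in range(10000)]
before = {k: ring.node_for(k) for k in keys}

ring.add_node("node-c")
moved = sum(1 for k in keys if ring.node_for(k) != before[k])
print(f"{moved} of {len(keys)} keys moved")   # only the keys on node-c's new arc, not all of them
```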
The CAP Theorem states that out of consistency, availability and partition tolerance, a system can choose to support two at any given moment, but never three.
Consistency guarantees that all clients that talk to a cluster of nodes will always read the same data. Write operations are atomic across all nodes.
Availability guarantees that in any (reasonable) failure scenario, clients are still able to access their data.
Partition tolerance guarantees that when nodes in the cluster lose their network connection and two or more completely separated sub-clusters emerge, the system will still be able to store and retrieve data.
If you are aiming for a comparative benchmark of two or more systems, you should run your procedure by their authors. I have found that developers are happy to help out with benchmarks by clearing up misconceptions or sharing tricks to speed things up (which you can choose to ignore if you are looking for an out-of-the-box comparison, but that is rarely useful).
This is part one in a small series about measuring software performance. There’s a lot of common sense covered, but I feel it is necessary to shed some light.
Pete needs coffee and his coffee maker broke down. Pete’s browsing through Craigslist. He’s looking for a coffee maker and he’s fine with a used one if he can get it from nearby. While results may vary when Pete’s got his coffee, his brain processes what he sees on a web page in between 200 and 500 milliseconds. Of course this depends on the complexity of the page and outside distractions[citation needed].
Computers are very limited in what they can calculate, but they are incredibly fast and reliable. Human brains are a lot more sophisticated, but not as fast at raw computation. Rendering the Craigslist homepage takes about 150ms right now (I’m in Berlin) when I ask curl, and it takes Safari around 1.4 seconds (1400ms) to display the page.
This in part demonstrates the measuring dilemma. Pete never sees the 150ms response for http://craigslist.org/. He only sees that it takes a bit before his browser finishes loading. We’ll get back to that later.
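The 150ms figure is the kind of thing you can reproduce with curl, or with a few lines of Python like the sketch below; the 1.4 seconds only shows up in a real browser because it includes parsing, subresources and rendering.

```python
import time
import urllib.request

url = "http://craigslist.org/"

start = time.perf_counter()
with urllib.request.urlopen(url) as response:
    body = response.read()
elapsed_ms = (time.perf_counter() - start) * 1000

# Only the network + server time; rendering in a browser is a different number.
print(f"fetched {len(body)} bytes in {elapsed_ms:.0f}ms")
```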
The point here is, even if all parts of the system resulted in a sub-200ms response time, Pete (and everybody else) would not notice. Pages would change “instantly” as far as he (and everybody else) is concerned. While the fallacies of distributed computing (read: The Internet) will probably never get us there, at some point it does not make any more sense to speed things up because no one will notice.
Let’s take a look at what a typical web app looks like. This is not exactly how Craigslist works (because I don’t know how Craigslist works), but it is a close enough approximation to illustrate problems with benchmarking.
You have a web server, some middleware and a database. A user request comes in; the web server takes care of the networking and parses the HTTP request. The request gets handed to the middleware layer, which figures out what to run, then runs whatever is needed to serve the request. The middleware might talk to your database and other external resources like files or remote web services. The request bounces back to the web server, which sends out the resulting HTML. The HTML includes references to other resources living on your web server, like CSS, JS or image files, and the process starts anew for every resource. A little different each time, but in general, all requests are similar. And along the way there are caches to store intermediate results to avoid expensive recomputation.
That’s a lot of moving parts. Getting a top-to-bottom profile of all components to figure out where the bottlenecks lie is pretty complex (but nice to have). I’ll start making up numbers now; the absolute values are not important, only the numbers relative to each other. Say a request takes 1.5 seconds (1500ms) to be fully rendered in a browser.
In a simple case like Craigslist there is the initial HTML, a CSS file, a JS file and the favicon. Except for the HTML, these are all static resources and involve reading some data from a disk (or from memory) and serving it to the browser, which then renders it. The most notable things to do for performance are keeping data small (gzip compression, high jpg compression) and avoiding requests altogether (HTTP-level caching in the browser). Making the web server any faster doesn’t buy us much (yeah, hand-wavy, but I don’t want to focus on static resources here; Pete wants his coffee). Let’s say all static resources take 500ms to serve & render.
(Read all about improving the client experience with proper use of HTTP from Steve Souders. The YSlow tool is indispensable for tuning a web site.)
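Just to put a rough number on the “keeping data small” point, here is a tiny demo using Python’s gzip module as a stand-in for what the web server would do; the HTML here is made up and deliberately repetitive.

```python
import gzip

# A deliberately repetitive, made-up listings page.
html = b"<html><body>" + b"<li>used coffee maker, Berlin</li>" * 500 + b"</body></html>"

compressed = gzip.compress(html)
print(f"raw:     {len(html)} bytes")
print(f"gzipped: {len(compressed)} bytes")
print(f"ratio:   {len(compressed) / len(html):.1%}")
```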
That leaves us with 1000ms for the initial HTML. We’ll chop off 200ms for network latency [cf. Network Fallacies]. Let’s pretend HTTP parsing, middleware routing, middleware execution and database access share the remaining time equally, 200ms each.
If you now set out to improve one part of the big puzzle that is your web app and gain 10ms in the database access time, this is probably time not well spent (unless you have the numbers to prove it).
We established that there are a lot of moving parts. Each part has variable performance characteristics, based on load, disk I/O, the state of various caches (down to CPU L2 caches) and different OS scheduler behaviour based on any input variable. It is nearly impossible to know every interfering factor, so any numbers you ever come up with should be taken with a grain of salt. In addition, when my system reports 1000ms and yours reports 1200ms, the only thing we can derive from that is that our systems are different, and we knew that before.
To combat the variables, profiles are usually run multiple times (and a lot of times!) so that statistics can tell you the margin of error you’re getting. Profiles should also run for a long time, with the same amounts of data that you will see in production. If you run a quick profile for a few seconds or minutes, you will hit empty caches and get skewed numbers. If your data does not have the same properties as the data in your production environment, you’ll also get skewed results.
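In practice that means wrapping whatever you measure in a loop and reporting a spread, not a single number. A small sketch using Python’s statistics module; measure_request() is a hypothetical stand-in for whatever operation you are actually profiling.

```python
import random
import statistics
import time


def measure_request():
    """Stand-in for one profiled operation; replace with a real request."""
    start = time.perf_counter()
    time.sleep(random.uniform(0.008, 0.012))   # pretend work, roughly 10ms
    return (time.perf_counter() - start) * 1000


samples = [measure_request() for _ in range(200)]
print(f"mean   {statistics.mean(samples):6.2f}ms")
print(f"median {statistics.median(samples):6.2f}ms")
print(f"stdev  {statistics.stdev(samples):6.2f}ms")
print(f"p95    {sorted(samples)[int(len(samples) * 0.95)]:6.2f}ms")
```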
Story time: Chris tried to find out how many documents of a certain size he could write into CouchDB. CouchDB has a feature that generates a UUID for every new document you store. The UUID variant it uses has a full 128 bits of randomness. The documents are then stored in a b+-tree. It turns out that for a b+-tree, truly random keys are the worst possible case to handle for any kind of access. Chris then switched to pre-generated sequential ids for his test and got a 10x improvement. Now he’s testing the best case for CouchDB, which happens to coincide with his application’s data, but your application might have a different key distribution, resulting in only a 2x or 5x improvement, or none at all.
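For illustration, the two key schemes look roughly like this (a hypothetical snippet, not Chris’s actual test code); it is the ordering of the keys, not their format, that decides how kind they are to a b+-tree.

```python
import uuid


def random_ids(n):
    # 128 bits of randomness: every insert lands somewhere else in the b+-tree.
    return [uuid.uuid4().hex for _ in range(n)]


def sequential_ids(n, start=1):
    # Monotonically increasing: every insert lands near the "end" of the b+-tree.
    return [format(i, "032x") for i in range(start, start + n)]


print(random_ids(3))
print(sequential_ids(3))
```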
In a different case, the amount of data stored and retrieved could easily fit in memory, and Linux’s filesystem cache was smart enough to turn all disk access into memory access, which is naturally faster. But that doesn’t help if your production setup has more data than fits in memory.
Take home point: Profiling data matters.
The second part of this little series will look at pitfalls when profiling storage systems.
Tool X might give you 5ms response times and this is an order of magnitude faster than anything else on the market. Programming is all about trade-offs and everybody is bound by the same laws.
On the outside it might appear that everybody who is not using Tool X is a moron. But speed & latency are only part of the picture. We already established that going from 5ms to 50ms might not even be noticeable by anyone using your product. The expense for speed can be multiple things:
- Memory: instead of doing computations over and over, Tool X might have a cute caching layer that saves recomputation by storing results in memory. If you are CPU bound, that might be good; if you are memory bound, it might not. A trade-off.
- Concurrency: the clever data structures in Tool X are extremely fast when only one request at a time is processed, and because it is so fast most of the time, it appears as if it processed multiple requests in parallel. Eventually, though, a high number of concurrent requests fills up the request queue and response times suffer. A variation on this is that Tool X might work exceptionally well on a single CPU or core, but not on many, leaving your beefy servers idling.
- Reliability: making sure data is actually stored is an expensive operation. Making sure a data store is in a consistent state and not corrupted is another. There are two trade-offs here: buffers that store data in memory before committing it to disk ensure higher data throughput, but in case of a power loss or crash (hard- or software), the data is gone. This may or may not be acceptable for your application. The other is the consistency check that is required to run after a failure. If you have a lot of data, this can take days. If you can afford to be offline, that’s okay, but maybe you can’t afford it.
Make sure you understand what requirements you have and pick the tool that complies, instead of taking the one with the prettiest numbers. Who’s the moron when your web application is offline for a day of repairs while your customers impatiently wait to get their jobs done, or worse, when you lose their data?
Yeah, you want to know which one of these databases, caches, programming language, language constructs or tools are faster, harder, stronger. Numbers are cool and you can draw pretty graphs that management types can compare and make decisions from.
The first thing a good exec knows is that she’s operating on insufficient data (as an aside, everybody does, all the time, but sometimes it is just not apparent to you), and diagrams drawn from numbers are a very distilled view of reality. And graphs drawn from numbers that are effectively made up by bad profiling are not much more than a fairy tale.
If you are going to produce numbers, make sure you understand how much is and isn’t covered by your results. Before passing them on, make sure the receiving person knows as much.
I’m in the market for databases and key-value stores. Every solution has a sweet spot in terms of data, hardware, setup and operation, and there are enough permutations that you can pick the one that is closest to your problem. But how to find out? Ideally, you download & install all possible candidates, create a profiling test suite with proper testing data, run extensive tests and compare the results. This can easily take weeks and you might not have that much time.
I would like to ask developers [*] of storage systems to compile a set of profiling suites that simulate different usage patterns of their system (read-heavy & write-heavy loads, fault tolerance, distributed operation and a lot more). A fault tolerance suite should include steps necessary to get data live again, like any rebuild or checkup time. I would like users of these systems to help their developers to find out how to reliably measure different scenarios.
* I’m working on CouchDB and I’d like to have such a suite very much!
Even better, developers could agree (hehe) on a set of benchmarks that objectively measure performance for easy comparison. I know this is a lot of work and the results can still be questionable (you did read the above part, didn’t you?), but it’ll help our users a great deal when figuring out what to use.
Stay tuned for the next part in this series about things you can do wrong when testing databases & k-v stores.
Prior to last week’s Erlang Factory, we ran the CouchDB University, a three-day training course where Chris & I taught a group of eleven all about CouchDB.
First of all, thanks to all the students who signed up for this first installment of a commercial CouchDB training. It is great to see our work validated that way. It also shows that CouchDB is a hot topic.
We went through all areas of CouchDB: An introductory overview, the HTTP API basics, setup and administration, basic and advanced view theory and practice, replication and distributed setups, performance tricks, CouchApps and finally internals.
I feel we could have gone on for five days, despite fearing we would not have enough material. It panned out pretty well.
Special thanks to Kevin Ferguson and Paul Davis. Kevin gave an introduction to couchdb-lounge, meebo.com’s CouchDB add-on that adds sharding, auto-replication and failover. Paul gave a tour of couchdb-lucene, Robert Newson’s add-on that adds fulltext search to CouchDB.
The next round of training will be in London in June. If you missed the Palo Alto event and can’t make it to London, please get in touch and we can try and see if we can set something up in your neighbourhood.