Hello, this is Jan Lehnardt and you're visiting my blog. Thanks for stopping by.
plok — It reads like a blog, but it sounds harder!
The databases we use today were designed 20, 30 years ago. Let’s hop into the DeLorean and visit the 1970s.
Computers were massive machines. They were very expensive and there were only a few around, so access to them was scarce. They were not very fast either (any contemporary remote control can outperform them), and they were mostly used in scientific environments.
A scenario: A scientist does some research (they do that) and comes up with tables and tables of data that he now wants to analyze. He applies for a time slot on one of those massive machines, waits a few weeks, and then gets to submit his data and instructions to run on that machine. If the data set was too large or the instructions too clumsy and the calculations took longer than the booked time slot … well, you guessed it: back to the drawing board.
Relational databases were designed in that world. They require huge amounts of work up-front (schema design, data normalization, query design, performance tuning) to allow single queries to run at maximum speed. And this made sense. At the time.
The computing world and the usage patterns today are fundamentally different. Thousands of users update their status on Twitter and Facebook this minute, in parallel. They do that from all over the world with no interruption; the internet is always awake.
Hardware changed, too. Coming from huge single-CPU machines, we now have vast numbers of commodity boxes with multi-core and multi-CPU setups. Again, this is a fundamental difference.
What is astounding to me is that we still use databases that were designed for a completely different use case. What is even more astounding is that these databases are up to the task. There is a lot of work involved, but in the end, all high-traffic websites use some RDBMS. And I am very impressed that these RDBMSes actually work in this (for them) rather hostile environment.
Now this is not entirely true: “a lot of work involved” in this case means breaking with the fundamental design laws of relational data storage: denormalization, redundancy, eventual consistency. The result is scalability: making a service run on enough hardware that all users can send their messages, update their profiles and upload their videos.
How can I write eight paragraphs without mentioning CouchDB? I will just share a very small anecdote, and it is actually more about Erlang than CouchDB: At a consulting gig we have pretty big iron running an RDBMS, and after years of analysis, fine-tuning and caching we are getting a couple of hundred concurrent requests per server. This is enough for the application and probably enough to grow for a few more years.
Doing a simple concurrency benchmark reveals that I can serve around 1,000 concurrent read requests from CouchDB on my Mac Mini. The limit here is actually disk I/O, since CouchDB does not do any internal caching at the moment (it will, soon enough, don’t worry :), and the reads are random enough to eliminate disk and filesystem caches.
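For the curious, a benchmark along these lines can be sketched with ApacheBench (ab); the database name, document ID and host here are placeholders, 5984 is CouchDB’s default port, and your numbers will vary with hardware and document size:

```shell
# Fire 10,000 GET requests at a single document, 1,000 concurrently.
# "mydb" and "some_doc" are hypothetical names; substitute your own.
ab -n 10000 -c 1000 http://127.0.0.1:5984/mydb/some_doc
```

ab reports requests per second and latency percentiles when it finishes; to defeat caches as described above, you would want to randomize the document IDs rather than hit a single resource.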
This is not a fair comparison, though; the requests do not do the same calculations, so the numbers can’t be compared directly. But it is not fair the other way around either: CouchDB and the tests are not optimized at all, while the other setup has had a few years in the making.
What it does though, is illustrate nicely that optimizing for the single query in a relational database only gets you so far and embracing today’s computing world leads to more satisfying results.
And no, I am not saying RDBMSes are a stupid idea or going away anytime soon.
Update 3/6/08: Looks like Joe agrees :)
Thanks for the nice entry. Database abstraction tools (like ORMs) are a clue that RDBMSes are insufficient and that we will need new DB systems in the future. I hadn’t heard of CouchDB (a document database server, accessible via a RESTful JSON API) before, and it seems interesting. I’d like to keep track of all changes in the project.
I regularly see MySQL running with 3.5 to 4K concurrent accesses. The scaling limits have more to do with the number of processors than anything else (more processors == more issues).
Set your expectations higher, I suspect you guys will reach it :)
Heya Brian, I bet you are seeing much larger and higher traffic installations than I do :-)
I don’t think that the number of processors helps you get around the upfront consistency that SQL enforces. And the more you use modern systems, the more you need to bend an RDBMS to actually make it work. A different approach really can make a difference in simplicity here. But I know you are a fan, so thanks for the comment :)
And yeah, we’ve hit ab’s compiled-in limit of 20k concurrent requests to the same resource. We’re aiming much higher :)