The State of the Couch: The Invited Talk at the Erlang Workshop at ICFP in Edinburgh in 2009

Saturday, September 12. 2009

Slides (PDF, 1.6MB)

Transcript:

09:00 am

Good morning, thanks Carla for the introduction. I hope you’re all awake and reasonably caffeinated. First and foremost, thanks for inviting me here, I’m excited you’re interested in what I might be talking about.

My name is Jan Lehnardt, I’m a hacker. I make software for fun and profit. I also like to be my own boss, that’s why I put “Entrepreneur” on the slide. I come from The Web. From the traditional computer science perspective The Web is often looked down upon, not taken seriously. After all, it’s just a bunch of HTML pages and people looking at them. This is not hard science is it?

The Web is big. The money put into solving problems is significant and the problems that come with scale are significant. If you today, in 2009, still look down on The Web as a toy, you might want to reconsider. I’ll come back to The Web in a bit.

Finally, I’m a huge fan of Open Source. Earlier this year I was stuck…well, not “stuck”…I was working on a commercial closed source project for about four weeks. Over the weeks I got grumpier and grumpier, but I couldn’t put my finger on why that happened. When the project was done I got back to work on an Open Source project again. I committed some patches, got feedback, committed some patches from others, got feedback again and while going through all the interaction with smart, fun people, I got happy again. Turns out, working in an Open Source environment makes me happy. This was a personal revelation – although not a surprising one.

There’s a few things I am not: working in the telco industry. Show of hands, who is or has been working in telco (bunch of hands). I’m not a researcher, who here is a researcher? (bunch more). Finally, I’m not an Erlang expert, who here is (come on folks, this is a show of hands-heavy presentation, bunch of hands). Big question: what the hell am I doing here when I’m not one of these and together all of you are?

Who here has seen this logo? (shows CouchDB couch logo without title, 2/3 hands), Who has heard of this name (reveals title, “Apache CouchDB”, all, no surprise, the proceedings list my affiliation as CouchDB). Great! I work on CouchDB. This presentation is about how CouchDB came to be, what some of the important points in its development were and where we are now. I’ll round up with a look at the future, well no, that sounds too fancy, I’ll show a few things I’d like to see in the future.

From the CouchDB website on couchdb.org: “CouchDB is a distributed, fault-tolerant, highly concurrent database…” and there’s a bunch more. The typical reaction of an Erlang person is “Why did they reinvent mnesia?”. To answer that, I have to go back in time a little and explore how CouchDB came to be (cue piano fairy-tale music).

This is Damien Katz, CouchDB’s inventor. He’s the chief architect, lead developer and project lead of CouchDB. There’s a team working on it now, but he’s still in a central position. He set out to work on “something cool”. In his situation he couldn’t see how regular employment (he was with IBM previously) would allow him to do that. The conclusion is simple: quit, move to a place that’s cheaper to live in than the Boston area and live off your savings while figuring out what “cool stuff” means to you. So that’s what he did. The full story is captured in a video.

Initially he wanted to write a distributed file system for Windows. The reasons are simple. Everybody is using Windows and a distributed file system is a hot piece of tech. When he started reading the documentation on how to write file systems for Windows he could easily estimate that it would take him about a year to read all the docs. It is super-complicated, there’s a lot of cruft and it is entirely unworkable for a single person. There’s a reason why nobody is writing file systems for Windows.

Damien reconsidered and settled on a distributed database called “Couch”. “Cluster Of Unreliable Commodity Hardware”. It doesn’t really say what CouchDB does, but it sounds good! The idea that was partly revolutionary at the time (2004) was the Google model of moving to massive amounts of cheap machines instead of buying bigger and bigger machines. Key to the strategy is parallelizing computations in order to cope with the massive amount of data they shove around. Back in the day, this was a new thing, today, everybody is doing it.

Damien published a set of design goals for CouchDB and chief among them was reliability. It is a [fucking] database, it should take care of your data in a reliable way. He set out to start writing a robust storage engine, a query engine and a query language called “Fabric”. It was all written in C++ and single threaded. You might guess where this is going.

He then set out to “add concurrency”. Being a versed C++ developer having worked on the core of an industry strength database system for years he knew: with the reliability goals he set out for CouchDB, there’s no way he could write a concurrent C++ engine around his single threaded parts. Bear in mind that this is still a one-man show, he might be able to do this in a five year time frame, but that wasn’t really an option. He had proved able to solve impossible programming tasks in the past, but experience now told him: no way.

He went on The Web, researching concurrency more generally and on the programming blog Lambda the Ultimate he found Erlang. Who here is reading LTU? Go, do it!

Damien started evaluating Erlang for CouchDB, reading code, and documentation as much as he could to learn about the concurrency model. Coming from C++ and knowing about Java he was intrigued by process isolation, per-process stack and garbage collection. After a week he concluded:

Erlang so far seems to be a great fit for Couch. It seems like it will support all of my design goals and save me lots of development time compared to Java or C++.

So Erlang it was and in the course of about six weeks, he rewrote most of the CouchDB core in Erlang. That was in 2004.

He started releasing early versions of CouchDB and people started noticing it. That’s how I found it in 2006. I looked at it and liked it, helped with the Unix and Mac OS X ports and then stuck around, IMing with Damien regularly. In 2007 I started giving presentations and one of the topics I happen to know lots about was CouchDB, so I started at a local PHP user group. They liked my talk and blogged about it (The Web at work) and I got invited to Zurich to talk again. Then I applied for conferences and quickly, the word got around that CouchDB is worth a look. I’m not claiming all of CouchDB’s publicity, but in hindsight, I had an impact :)

In 2008 CouchDB became Apache CouchDB, IBM hired Damien to work on CouchDB in the process and CouchDB suddenly got a lot more validity. If you download a database from some guy’s blog, will it be there next week? An Apache project has a certain level of maturity and project stability that saves users a lot of headaches.

The State of the Couch

Today we’re here: We got rid of XML in favour of JSON, JSON is great! CouchDB has a HTTP REST API that reaps all the scalability benefits of The Web for CouchDB itself. Fabric has been replaced with JavaScript- and Erlang-based views using MapReduce. CouchDB is based around open standards and is native to The Web. Web developers don’t have to learn a lot of new things, it all feels very natural to them.
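To make the view idea a bit more concrete, here is a minimal sketch of how a MapReduce view works over JSON documents. It is a plain Python simulation and not CouchDB code: real views are written in JavaScript or Erlang, as mentioned above, and are stored and run inside CouchDB. The sample documents and the names map_bookmarks_by_tag, reduce_count and run_view are made up for illustration.

```python
import json
from collections import defaultdict

# Hypothetical sample documents; CouchDB stores documents as JSON objects.
docs = [
    {"_id": "a", "type": "bookmark", "tag": "erlang", "url": "http://erlang.org/"},
    {"_id": "b", "type": "bookmark", "tag": "couchdb", "url": "http://couchdb.org/"},
    {"_id": "c", "type": "bookmark", "tag": "erlang", "url": "http://lambda-the-ultimate.org/"},
    {"_id": "d", "type": "contact", "name": "Damien"},
]


def map_bookmarks_by_tag(doc):
    """Map step: emit one (key, value) row per bookmark document."""
    if doc.get("type") == "bookmark":
        yield doc["tag"], 1


def reduce_count(values):
    """Reduce step: combine all values that were emitted for one key."""
    return sum(values)


def run_view(documents, map_fn, reduce_fn):
    """Group the mapped rows by key, then reduce each group."""
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    # Return the rows ordered by key.
    return {key: reduce_fn(values) for key, values in sorted(groups.items())}


view_result = run_view(docs, map_bookmarks_by_tag, reduce_count)
print(json.dumps(view_result))  # prints {"couchdb": 1, "erlang": 2}
```

The map function only ever looks at one document at a time, which is what lets this kind of computation be spread over many machines.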

CouchDB is also simple: all of the core database is about 15,000 lines of code. This is very little code! Of course Erlang, like any functional language, creates compact code. Nonetheless, this makes for easier development, predictability and fewer bugs.

We’re in the process of releasing CouchDB 0.10.0, our first beta release. Earlier releases were traditionally called “alpha” because we didn’t have all the features in that we want for 1.0, but the features we had were already pretty stable. Versions as early as 0.7 have been used successfully in production since late 2007.

Today we have nine people committing new code and patches into CouchDB, four of them working on CouchDB-related things full time. Even though I’m not actually coding right now while giving this presentation, I consider this CouchDB work. Two of the committers work at least part-time on CouchDB things. We have over 100 production installations, small and big. And the CouchDB community is just amazing, the developers, users, helpers, everybody is really friendly, very energetic and enthusiastic. The community is in large part responsible for CouchDB being so much fun to work on.

There are three books being written (well, actually there are two, one is on hold as long as I speak here). If you get any book, get the O’Reilly one. It is open source and you can read it now or buy a printed copy. The others are worth looking at, too. There’s professional training, consulting & support available. We had our own conference tracks at the two Erlang Factories this year. And we know about a handful of funded start ups who use CouchDB in their core infrastructure.

A lot of big industry players are looking at, testing or are already using CouchDB. The BBC has a massively scalable, fault tolerant, multi-data centre key-value store built on CouchDB for example. The big thing coming up in October is the inclusion of CouchDB into the Ubuntu Linux distribution. Why is this a big thing? CouchDB is not just simply a package that people can install, or is even installed by default, but Ubuntu will ship with a system service for synchronizing personal data. Personal data could be your address book or bookmark collection. Why would you want to sync this data? You might have two or more computers, a laptop and desktop machine and it’s nice to be able to keep both in sync. In addition, Canonical, the makers of Ubuntu, offer a hosted service where individuals and companies keep an online backup of their personal data.

ZDNet estimates about 13 Million Ubuntu users world wide. At least. With the next release, these will be 13 Million CouchDB users. We’re preparing to get a whole lot more bug reports then. While these users don’t necessarily see that Erlang is at work, they very likely see some of the error messages and stack traces, so be prepared, too!

This short video shows a Firefox plugin that Canonical developed that stores Firefox bookmarks in CouchDB. This demo is also using Firefox to show the contents of the bookmarks database through CouchDB’s administration interface Futon. See how the bookmark “CouchDB” also shows up in the CouchDB document in the content area. When the CouchDB value gets changed, Firefox immediately reflects the changes.

What looks like a simple demo is a big deal. Canonical is pushing for a universal data synchronization infrastructure built on CouchDB replication, but not tied to CouchDB necessarily and open for other platforms and operating systems.

All of the above is pretty amazing (if you ask me). We couldn’t have done any of that without Erlang. Damien, building the foundations, the rest of the team, quickly being able to help with the implementation, the robustness, the concurrency and reliability, all these and many more factors helped to make CouchDB what it is today.

To everyone who is working or has worked on Erlang, everybody who is or has been participating in the Erlang community:

Thank you!

Some time to breathe.

The Bigger Picture

Next I want to frame CouchDB: where does it fit in with the rest of the world? Did anyone hear about the “NoSQL movement” or about “postrelational databases”? Not so many, okay. When I started out explaining CouchDB to other people and my pitch began “CouchDB is a non-relational database that…” I got odd looks: “non-relational”? What do you mean, a “database” is a “relational database” so CouchDB is not a database! I had to explain that a database in general is concerned with storing and retrieving data and that RDBMS are just one way to do that.

Now we have a rich ecosystem of specialized non-relational databases, a good bunch of them in Erlang (Riak, Dynomite, supposedly SimpleDB, Scalaris). They are all different solutions to some aspects of the wide spectrum of data storage and the lesson you should take away from NoSQL is that you should evaluate and find the right solution for your problem instead of blindly choosing one system.

To seasoned Erlang programmers this may be old news, you’ve been using mnesia for ages and it is not a relational database. In The Web and other industries, this is a big deal.

Local Data is King

What makes CouchDB special among these: Replication. Besides Lotus Notes, which is a large influence for CouchDB, no other system has as robust a replication system as CouchDB. Using replication for synching personal data is one use-case, as you’ve seen with the Ubuntu case. But it is equally well suited to building distributed, highly available clusters of CouchDB nodes like the one the BBC is using.

Taking this to its logical conclusion, we’d like to see CouchDB or something that supports CouchDB replication on any device that contains data that needs to be shared in some way. Replication gives your applications data locality that leads to a much richer user experience than the poll-for-changes model of The Web. Replication can be optimized for throughput, not latency: since individual requests are on local data, it doesn’t matter if replication before that took five minutes over your mobile connection while you were shopping. Now that you are standing in the queue, all your data is locally available on your mobile when you want it.
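The “sync in the background, read locally” idea can be sketched with a toy model. This is not the CouchDB replication protocol, only an illustration under loud assumptions: ToyDatabase, replicate and the checkpoint handling are invented names, revisions are plain counters, and “the higher revision wins” is a simplification of how conflicts would really be handled.

```python
class ToyDatabase:
    """In-memory stand-in for a database that keeps an ordered change log."""

    def __init__(self):
        self.docs = {}      # doc id -> (revision number, body)
        self.changes = []   # doc ids in the order they were written

    def put(self, doc_id, body, rev=None):
        """Store a document; bump the revision unless one is given."""
        current_rev = self.docs.get(doc_id, (0, None))[0]
        new_rev = current_rev + 1 if rev is None else rev
        self.docs[doc_id] = (new_rev, body)
        self.changes.append(doc_id)

    def changes_since(self, seq):
        """Return (current sequence number, doc ids written after `seq`)."""
        return len(self.changes), self.changes[seq:]


def replicate(source, target, checkpoint=0):
    """Copy everything that changed on `source` since `checkpoint`."""
    new_checkpoint, changed_ids = source.changes_since(checkpoint)
    for doc_id in dict.fromkeys(changed_ids):  # de-duplicate, keep order
        rev, body = source.docs[doc_id]
        target_rev = target.docs.get(doc_id, (0, None))[0]
        if rev > target_rev:  # simplification: the higher revision wins
            target.put(doc_id, body, rev=rev)
    return new_checkpoint


laptop, phone = ToyDatabase(), ToyDatabase()
laptop.put("shopping-list", {"items": ["milk"]})
checkpoint = replicate(laptop, phone)              # slow sync in the background
laptop.put("shopping-list", {"items": ["milk", "coffee"]})
checkpoint = replicate(laptop, phone, checkpoint)  # only the new change travels
print(phone.docs["shopping-list"])                 # local read, no network needed
```

The point of the checkpoint is that a later replication only has to move what changed since the last one, so it does not matter much how slow the link is.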

On top of the replication system and the built in web technologies we developed CouchApps, a special class of applications that consist of HTML and JavaScript only and are served out of CouchDB directly. There is no middleware, no Java, no PHP, no Ruby. CouchApps are data, they replicate to your peers like anything else, and since they run on the client, they are ultimately hackable. Our tagline for this is ‘getting kids in trouble for programming’. We’d rather have their near unlimited creativity put into useful shareable CouchApps than making them learn hard computer science first.

We’re already seeing other systems implement the CouchDB replication protocol. BrowserCouch is a pure JavaScript implementation of the core of CouchDB by Mozilla’s Atul Varma. The Google Chrome team wants to put in a native CouchDB B-Tree port into the Chrome browser. We’re committed to help everyone to implement the protocol.

Erlang Served Us Well

I said earlier, all this wouldn’t be possible without Erlang. We’re committed to the platform and we’re happy the community is as welcoming as it is.

I’d like to take this opportunity to start a discussion. As good as Erlang is technologically, a few things are worrying and while I understand most of the reasons for the status quo and the issues with changing the same, I’d like to push into a direction of change.

ENTER MY HUGE DISCLAIMER! We’re benefiting greatly from all the things you are working on and I don’t mean to piss anyone off. By no means!

Got it? Okay, here it goes:

Erlang is not Open Source

Of course, the source code is available, but Open Source is no longer about that alone. A (not so recent) survey among enterprises using Open Source technologies about the reasons why they were doing so revealed that the number one reason for adoption is the lack of a vendor lock-in. Other reasons like open code, transparent development, more stable code, more secure code and lower total cost of ownership matter, but by a large margin the lack of vendor lock-in was the most significant result. Enterprises have been fucked over by Microsoft, IBM, Sun and all the others for too long.

Erlang is exclusively developed by Ericsson and Erlang’s development is largely driven by the demands of Ericsson’s commercial customers. Kenneth Lundin (waves to Kenneth in the first row) & team have been very responsive to community requests for a closed-development company, but for a true Open Source project it is not enough. Roadmap and development are not transparent. If I work on a fix for a current release I don’t know if that problem has been fixed in the development version already, I don’t even know if the surrounding code still looks the same.

Most importantly though: the future of Erlang is in the hands of Ericsson. While they have a track record of getting it right for their customers and Open Source users with little friction, I only see friction increasing in the future. My company couch.io has been approached by venture capitalists with the interest of giving us funding money. When they discovered that our technological foundation is in the hands of a single company and that this company is not even focussing its efforts in the same areas we do, we had a hard time explaining why we still want to use Erlang despite the superior technology. This is a problem for major adoption of Erlang.

I hope I didn’t offend anyone. At the same time I hope to get the ball rolling towards a true, modern Open Source Erlang. The Erlang and the CouchDB communities are more disjoint than is true for other Erlang based projects and there is a gap that I’d like to see closed in the future.

The 13 Million users we hope to see soon are just the beginning of a big world of people who haven’t benefited from CouchDB or Erlang yet. I see a bright future for all of us and them if we play our cards right…

…let’s go and get ‘em!

Thank you.

Posted by Jan | Comments (3)

Comments

Truly enjoyed the transcript and slides! Very interesting and witty =)

#1 Alexis on 2009-09-07 13:43

Thanks :)

#1.1 Jan on 2009-09-07 16:19

Nice! Thanks for transcription.

#2 Hynek (Pichi) Vychodil on 2009-09-08 13:22