Hello, this is Jan Lehnardt and you're visiting my blog. Thanks for stopping by.
plok — It reads like a blog, but it sounds harder!
I rarely see the point of posting the slides of a presentation for people who didn't see the original presentation. Yet, this is often requested. I don't have a problem with posting my latest slides (PDF, 2.9M), but they are of little value without context and I do have a problem with posting things of little value, so here's the context.
Welcome to my presentation about CouchDb. This is a roughly 80-minute ride through the world of non-relational data storage.
I'm Jan, from Münster, Germany. I'm a developer focusing mostly on the web. If time permits, I'm studying computational linguistics. I mostly do freelance consulting work on the web and have gained experience with scaling high-traffic LAMP sites doing that. I'm also the co-founder of freisatz, a company bringing typographic bliss to everyone.
I had little time and no internet while preparing the slides. As a result, some are quite poor, they lack proper typography and have no fancy images to spice things up. Sorry!
If you have any questions, just interrupt me, there's no point in waiting for the end of the presentation and forgetting the question in the meantime.
Since this is a small group and I don't know most of you, it'd be nice if we could go around and everybody says what they're working on at the moment and what they expect to get out of this talk.
Ok, at Webtuesday, there are probably a few web developers present. CouchDb is not bound to web technology and you can do other cool things with it (I'll tell more about that later). But nice, you are a target audience, let's get moving!
If you're a web developer, especially coming from the PHP world, the first thing you get to know for storing data is files. You are then told that files are sub-optimal for concurrent access and that you should use a database. This is often (if not most of the time) synonymous with "use MySQL".
What I want now is that you go back to the beginning where you only want to store data. Forget everything you know about relational data, normalization and all the things that come with that. Be a newbie, you only want to store data.
A smart person, I can't remember right now who, came up with four principles of data storage. Four principles you can measure your data store (a database, for example) against to see how well it performs. This concept is called The Four Pillars of Data Management. The four pillars are discussed in turn below.
You want to put your data in a store that enforces all ACID properties. You want that to be efficient so your hardware is utilized optimally. This is a very basic principle, if your data store doesn't have this, it is probably not a data store at all!
You want to get back your data easily. You want to get single data items and you want to be able to retrieve groups of data items. You also want to be able to get reporting on your data. How much data you have, how much data of a certain type you have, how many types etc.
You also want to be able to aggregate the data itself: making total counts, calculating sums, averages and whatever else you can think of.
If you're running a website, either you or your users provide some kind of content. This is often natural language. And you often want to let your users search the data you have in your storage systems, again using natural language. You want to find past-tense verbs when the user entered the present tense, and all these things that make searching useful. This requires something a bit more sophisticated than find and grep.
You want… ah no, not again a sentence that begins with You want. Usually not all your data is public; it belongs to some sort of user or role, and these can be organized in groups. There should be a flexible permission system that easily fits your needs.
And communication to your data store should be encrypted, optimally using SSL.
A typical requirement in more mature systems is being able to spread your data across multiple servers, either to gain high availability or load balancing, either way, this should be easy. Optimally, a permanent and fast network connection is not required between your instances, so you can take parts offline, change them offline and have them synchronized back later, when you bring them online again. Traditional database systems have a really hard time doing this fourth pillar.
While going through the principles of data management, at no point is relational storage mentioned explicitly. It might be part of the Save pillar, but it doesn't have to be.
Enter CouchDb. It was designed specifically to implement all four pillars. And it was implemented in a way that is fast, reliable and fault-tolerant. It was also designed to leverage two of the bigger paradigm shifts we experienced in recent computing history. One, the move away from monolithic servers to clusters of cheap hardware. And two, the move from single-CPU machines to multi-CPU and multi-core-CPU systems. And CouchDb was built, unlike most older systems, with the web in mind.
This is a nice matrix of some of CouchDb's cooler features. I'm going to talk about them in more detail in a minute. The best feature is the last one. CouchDb is Open Source Software, licensed under the GPL. If you're missing anything, feel free to implement it!
The data storage module of CouchDb is quite remarkable, and I'll show you why. To allow simultaneous access to your data with both reading and writing threads, you have to be smart. When you use simple text files and a reader looks for data while, at the same time, a writer adds data, the reader gets confused. If you have more than one writer at the same time, your file gets corrupted and your data is usually lost. To manage this, some sort of locking is usually used. So when a write occurs, all other writers and readers have to wait until the write is done. Or when a read occurs, no write can happen, to avoid confusion. While this works, it doesn't scale well. And while this approach is still trying to figure out who is allowed to do what and when, CouchDb is already done with the whole thing.
CouchDb implements a lockless MVCC system. Data in CouchDb gets versioned. When a reader starts reading from the database, it only ever gets to see the version that existed when it began reading. If a write happens in the meantime, that is fine; the reader just doesn't know about it. And there can be as many readers reading at the same time as you want (or your hardware permits). Writes, on the other hand, are serialized. They happen one after the other, in the order they come in.
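The snapshot idea can be sketched in a few lines of Python. This is a toy model, nothing like CouchDb's actual Erlang implementation: writers append new immutable versions, and readers keep using the version that was current when they started.

```python
# Toy sketch of MVCC-style snapshot reads; not CouchDb's real code.
# Writers append immutable versions; readers never block writers.

class MVCCStore:
    def __init__(self):
        self._versions = [{}]  # list of immutable snapshots

    def snapshot(self):
        """A reader grabs the latest snapshot and keeps using it."""
        return self._versions[-1]

    def write(self, key, value):
        """Writes are serialized: copy the latest version, apply the change."""
        new = dict(self._versions[-1])
        new[key] = value
        self._versions.append(new)

store = MVCCStore()
store.write("greeting", "hello")
reader_view = store.snapshot()       # a reader starts here
store.write("greeting", "goodbye")   # a write happens in the meantime
print(reader_view["greeting"])       # reader still sees "hello"
print(store.snapshot()["greeting"])  # new readers see "goodbye"
```

The point is that the reader's snapshot stays valid and consistent no matter how many writes happen after it started.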
The other cool bit about CouchDb's data storage module is the way things actually get written to disk. This design is highly influenced by the way Erlang applications are written; I'll explain that in more detail when I get to CouchDb's architecture. CouchDb never overwrites data that has been written to disk. A database in CouchDb is a single file in your filesystem. Whatever you do to your data, CouchDb always appends a new version of your data instead of overwriting the old version. Even if you're deleting data, the deletion gets appended to the database file as the latest version of a data item. I'll first explain why this is a good thing and later how the drawbacks are handled. Because CouchDb never overwrites data, the data is always consistent and never corrupted when your server dies. Another mechanism makes sure that you are back online quickly after a crash. At the beginning of a database file, there's a header that maintains a count of the data items you put in. And this header exists twice in a row at the beginning of the file. When you make a change to your data, first the data gets appended to the database file, then the first header is updated to reflect the new data count, and then the second header is updated as a copy of the first. This makes the data store resistant against server crashes, hardware failure and power outages. There are three points during a write where a crash might cause data loss:
When data is appended to the end of the database file and the server crashes in the meantime. The next time CouchDb starts, it looks at the first database header and sees the data count from before the write happened. The partial data that was appended when the server died is lost, but it was incomplete anyway, and you probably have bigger problems when your servers die, so losing a single data item doesn't hurt at all.
When the data appending completes but the server crashes during the write of the first header. When CouchDb is restarted, the first header fails its integrity check and the second one gets read. The second one, just like in the first scenario, doesn't know about the latest data item, and again you don't lose much. The second header gets copied over the first and the database is up and running in no time.
When the server crashes during the write of the second header. On restart, the first header is checked, the new data is safe and the second header gets refreshed from the first one. No problem and no data loss again.
As a consequence, at worst you lose a single data item, not more. And your data is always in a consistent state. There's no need to run lengthy integrity checks on startup; CouchDb is back online immediately.
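Here's a toy Python sketch of the twin-header trick. The layout (an 8-byte count plus a CRC32 checksum) is made up for illustration and is not CouchDb's real file format; it just shows why a torn header write never loses more than the last item.

```python
# Toy model of twin-header crash recovery; header layout is invented.
import zlib

def make_header(count):
    """A header is the item count plus a checksum over it."""
    body = count.to_bytes(8, "big")
    return body + zlib.crc32(body).to_bytes(4, "big")

def read_header(raw):
    """Return the item count, or None if the integrity check fails."""
    body, crc = raw[:8], raw[8:12]
    if zlib.crc32(body).to_bytes(4, "big") != crc:
        return None  # torn write detected
    return int.from_bytes(body, "big")

def recover(header1, header2):
    """On restart: trust the first valid header, fall back to the twin."""
    count = read_header(header1)
    if count is None:
        count = read_header(header2)
    return count

good = make_header(7)
torn = good[:5] + b"\x00" * 7  # simulate a crash mid-write

assert recover(torn, good) == 7  # scenario 2: first header torn
assert recover(good, torn) == 7  # scenario 3: second header torn
```

In either crash scenario the surviving header still describes a complete, consistent database, so no integrity scan is needed.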
Being append-only comes with a cost. Your database files keep growing all the time. When you want to get rid of older revisions of your data, you can run a compaction routine that prunes out everything but the very latest version of your data. If you want to keep old versions around, just don't run compaction.
Now a short interlude about the storage model that CouchDb uses. You store your data in documents. Documents are simple lists of key-value pairs called fields, where you specify both the keys and the values. You can make up any structure you like if you want to, but you don't have to. Documents can also have attachments, just like emails have attachments, where you store arbitrary data. Documents are referred to by a document id. CouchDb automatically assigns globally unique ids (in distributed database systems, things like sequences don't make much sense), but you can also create your own ids and have CouchDb use them. There are also document revisions; they get updated every time you store a new version of a document, and they allow you to retrieve older versions of your documents until compaction gets rid of them.
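Sketched as a plain Python dict, a document might look like the following. The key layout here is made up for illustration; only the ideas (globally unique ids, revisions, free-form fields, email-style attachments) come from the description above.

```python
# Hypothetical shape of a document; key names are illustrative only.
import uuid

doc = {
    "id": uuid.uuid4().hex.upper(),  # globally unique, or pick your own
    "rev": "1",                      # bumped every time you store a new version
    "fields": {                      # any structure you like
        "Name": "Webtuesday",
        "City": "Zurich",
        "Attendees": 20,
    },
    "attachments": {},               # arbitrary data, like email attachments
}
```

Note there is no schema anywhere: two documents in the same database can have completely different fields.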
With that out of the way, we come to getting your data back. CouchDb provides a very simple REST API to manage the data. You just use regular HTTP to create, update and delete data. The actual payload gets transmitted as XML at the moment, but we're working on a transition to JSON, which we think is a better fit, since we don't actually need any of XML's advanced features like validation and transformability. In the end, if you need XML, just convert the JSON to XML and you're done.
Here are some examples of the REST API, using my very simple command-line client for issuing HTTP requests:
CouchDb@localhost:8888 > get /
<body>
  <h1>CouchDb</h1>
  <br>
  <h2>Welcome!</h2>
</body>

-- Gets CouchDb's start screen. It's actually meant to be displayed by your browser.

CouchDb@localhost:8888 > get /$all_dbs
<?xml version="1.0"?>
<dbs>
  <db>bert2/</db>
  <db>bertoli/</db>
  <db>bertoli2/</db>
  <db>bugshrink/</db>
  <db>foo/</db>
  <db>webtuesday/</db>
</dbs>

CouchDb@localhost:8888 > put /webtuesday ""
<?xml version="1.0"?>
<success/>

-- This creates a new database called webtuesday.

CouchDb@localhost:8888 > get /webtuesday
<?xml version="1.0"?>
<database id="webtuesday" doc_count="0" last_update_seq="0"/>

-- This gets information about the webtuesday database. You see there's nothing in it, which is not surprising at this point.

CouchDb@localhost:8888 > delete /webtuesday
<?xml version="1.0"?>
<success/>

-- This gets rid of the database again.

CouchDb@localhost:8888 > get /bugshrink
<?xml version="1.0"?>
<database id="bugshrink" doc_count="4" last_update_seq="7"/>

-- Here's a database that has some documents in it.

CouchDb@localhost:8888 > get /bugshrink/$all_docs
<?xml version="1.0"?>
<table id="$all_docs" xmlns="http://couchdb.org/table">
  <row docid="$tables" rev="1BC8E89"/>
  <row docid="2B3B17B0C93D2461DFEF714655E67F70" rev="4134847"/>
  <row docid="440CB5BE312E62F8528709A512733182" rev="EFB0140"/>
  <row docid="F19B2255F2CF582BA3F1A48D153FC933" rev="56C908F5"/>
</table>

-- And here's the list of documents in there.

CouchDb@localhost:8888 > get /bugshrink/2B3B17B0C93D2461DFEF714655E67F70
<?xml version="1.0"?>
<doc id="2B3B17B0C93D2461DFEF714655E67F70" rev="4134847" xmlns="http://couchdb.org/doc">
  <field name="BugId">
    <text>098F6BCD4621D373CADE4E832627B4F6</text>
  </field>
  <field name="Counter">
    <text>1</text>
  </field>
  <field name="Name">
    <text>test</text>
  </field>
  <field name="Type">
    <text>Project</text>
  </field>
</doc>
So much for the simple demo.
Here comes the reporting and data aggregation I mentioned earlier. CouchDb allows you to specify views on your data. That is, you specify a set of criteria that matches a subset of your data, plus the fields you want to get back when you query the view. Every time you add a document, CouchDb checks whether a view matches it; if it does, the document is compiled into the view. What a view definition looks like will come on a later slide. The language you write them in is called Fabric.
Now, with the conversion to JSON, we're thinking about something even more flexible. At the moment, what you define in a view is very similar to a WHERE clause in SQL (even if I told you to forget everything about SQL). We want to allow you to specify structural patterns (think regular expressions) over the JSON data you put in. So you can not only ask for "the documents that have this value in this field" but also "give me all documents that have a three-item list after two strings and an integer, all wrapped in a JSON object". We're not yet sure how that will look. We're not aware of anybody having done this before and don't yet have any practical use-case for it, but we're convinced that this is a super-cool thing to have! Since we're still refining what CouchDb is, if you disagree here or anywhere else, please let us know. Your input is valuable to us.
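For fun, here's one way such structural matching could work, as a hand-rolled Python matcher. This is pure speculation on my part; nothing like it exists in CouchDb yet. A pattern is a nested structure where `str` and `int` match any value of that type.

```python
# Speculative sketch of structural pattern matching over JSON-like data.

def matches(value, pattern):
    """True if `value` has the shape described by `pattern`."""
    if pattern in (str, int):       # a type matches any value of that type
        return isinstance(value, pattern)
    if isinstance(pattern, list):   # lists must match element by element
        return (isinstance(value, list) and len(value) == len(pattern)
                and all(matches(v, p) for v, p in zip(value, pattern)))
    if isinstance(pattern, dict):   # dicts must match key by key
        return (isinstance(value, dict)
                and all(k in value and matches(value[k], p)
                        for k, p in pattern.items()))
    return value == pattern         # anything else is a literal

# "two strings and an integer, then a three-item list, in an object":
pattern = {"data": [str, str, int, [int, int, int]]}

assert matches({"data": ["a", "b", 7, [1, 2, 3]]}, pattern)
assert not matches({"data": ["a", "b", "c", [1, 2, 3]]}, pattern)
```

A real implementation would of course need a serializable pattern syntax rather than Python type objects, but the recursive matching idea would be the same.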
When you want fulltext search, you need to take care of two things: indexing your content and searching the index.
Lucene is a toolkit that allows you to do just that. We integrated it with CouchDb. You can define which databases should get indexed and CouchDb takes care of the rest. We did it in a way that allows you to integrate any search technology you like. If you don't agree with us that Lucene is great, well, just use what you prefer. All you need is two programs that your operating system can execute as standalone daemons: the first is the indexer and the other the searcher daemon.
Every time a change happens in CouchDb and an indexer is registered (i.e. you enabled fulltext indexing), CouchDb notifies the indexer daemon via standard input. All the indexer has to do is listen on standard input and wait for CouchDb to send data. CouchDb sends the name of the database that got updated, followed by a newline. That's all. The indexer then asks CouchDb for the latest changes (revision ids are exchanged and a diff is calculated), fetches them and integrates them into the search index. That's all there is to it!
Searching works very similarly: if you send a fulltext query to CouchDb, it sends it over to the searcher along with the name of the database that should be searched. The searcher then performs the search and returns document and revision ids on successive lines for the documents that match the query. When there are no more documents, an empty line is sent. CouchDb then applies its permission system to the result, so you don't get back data you're not supposed to see, and returns the remaining documents.
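The indexer side of the protocol is simple enough to sketch in a few lines of Python. The actual daemons are standalone programs, and the fetch-and-index step is stubbed out here, so treat this as an outline of the stdin protocol described above, not a working indexer.

```python
# Sketch of an indexer daemon: read database names from standard input,
# one per line, and re-index each. The indexing itself is stubbed out.
import io
import sys

def run_indexer(stream, index_database):
    """Block on the stream; each non-empty line names a changed database."""
    for line in stream:
        name = line.strip()
        if name:
            index_database(name)  # would fetch the diff from CouchDb and index it

# A real daemon would call run_indexer(sys.stdin, ...). Here we simulate
# CouchDb notifying us about two updated databases:
seen = []
run_indexer(io.StringIO("webtuesday\nbugshrink\n"), seen.append)
print(seen)  # ['webtuesday', 'bugshrink']
```

Because the interface is just lines on standard input, the indexer can be written in any language at all, which is the whole point of the design.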
None of this is implemented yet, but CouchDb will have a permission system that assigns users and groups to documents, with distinct permissions for reading and writing, giving you fine-grained control over who can access what.
If you know PHP applications, you also know that each comes with its own user management system, they all suck for one reason or another, and integrating them is a pain in the arse. Ideally, MySQL's permission system would have been reused, but that didn't happen for one reason or another. We don't want that happening with CouchDb applications, so we're trying to create an authentication and user management system that can be re-used in every CouchDb application, making it easy to integrate new software with existing users.
For reasons explained earlier, you want to be able to share your data across multiple machines. CouchDb lets you do replication to get there. If you now think "Oh, I know MySQL replication": well, this is different. It is more like rsync, which you can run whenever you like, or automatically. Also, you don't need a single master server; you can have as many master servers as you like and they can all replicate from each other. Replication can happen either as push (here, slave, take this new data) or pull (dear master, please send over the news). Whatever you like.
The nice thing about event-driven replication is that you don't need a permanent internet connection. You can literally take a copy of your database offline, work with it and have the changes replicated back to your main database. And you can do that with any number of instances. This is particularly useful if you have external consultants who want to take your internal data to a customer where there's no (good) network connection, or if you want to be able to work in other places that have no connectivity.
If you don't have a constant stream of synchronization, update conflicts are bound to happen. That is, two servers independently update the same document with different contents. When these changes are replicated, a conflict occurs. CouchDb then reliably and deterministically chooses a winning change that becomes the new latest revision of the document. The other change gets saved as the previous revision. This ensures you always have complete and consistent data to work with, and you never lose any data. You can later manually promote the other change to be the winner, if you don't agree with CouchDb. The determinism of the approach means that two independent CouchDb instances facing the same conflict behave the same way, choosing the same winner.
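The determinism can be illustrated with a tiny Python sketch: if both replicas apply the same total ordering to the conflicting revisions, they agree on a winner without ever talking to each other. The ordering rule used here (lexicographic comparison of revision ids) is made up for illustration; the real rule is internal to CouchDb.

```python
# Sketch of deterministic conflict resolution; the ordering rule is invented.

def resolve(rev_a, rev_b):
    """Pick a winner and a loser; same inputs always give the same answer,
    regardless of which server runs this or in which argument order."""
    if rev_a["rev"] > rev_b["rev"]:
        winner, loser = rev_a, rev_b
    else:
        winner, loser = rev_b, rev_a
    return winner, loser  # the loser is kept as the previous revision

# Two servers see the same conflict, with arguments in opposite order:
server1 = resolve({"rev": "4134847", "body": "fix A"},
                  {"rev": "EFB0140", "body": "fix B"})
server2 = resolve({"rev": "EFB0140", "body": "fix B"},
                  {"rev": "4134847", "body": "fix A"})
assert server1 == server2  # both instances choose the same winner
```

Note that nothing is thrown away: the losing change survives as an earlier revision, so a human can still promote it later.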
CouchDb is mainly written in Erlang, on its OTP platform. Erlang/OTP was created by Joe Armstrong and his team at Ericsson. In the 80s, they were looking for new ways to program large telco switches and had some difficult requirements: no failure in any part of the system may affect another part (i.e. a running call), and software updates must be possible while the system is running, creating no downtime.
Erlang is a functional programming language and a runtime environment: a virtual machine that executes compiled bytecode. The fundamental approach to software design in Erlang is modularization and multithreading. Your program is split up into independent modules (similar to classes) that make up the functionality of your program. Every instance of a module runs in its own lightweight thread.
One other design decision is to completely avoid error handling. This keeps your code concise, and it is actually a smart approach. For example, say you want to handle an HTTP request and you're only interested in the "get" and "post" verbs. When you get a call, you check whether you have a get or post request and run your handling routines accordingly. Everything else causes the request handler to die. It is just terminated. The module that called the handler gets notified; if it doesn't expect the termination, and it usually doesn't, it dies as well. This propagates all the way back up the caller chain. If something doesn't work, you just don't care. You just terminate and start again, fresh and from scratch.
There are two kinds of threads in Erlang, worker threads and supervisor threads and they do exactly what you expect them to. If a worker thread dies, its supervisor just respawns it. The system as a whole doesn't care about errors, exceptions and all that.
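The worker/supervisor split can be mimicked in Python to show the idea. Treat this purely as an illustration: real Erlang supervisors restart whole lightweight processes and preserve no worker state, whereas this sketch just catches a crash and carries on with the next request.

```python
# Toy illustration of let-it-crash plus a supervisor; not real Erlang semantics.

def worker(verb):
    """Handle only what we understand; anything else makes the worker crash."""
    if verb not in ("get", "post"):
        raise RuntimeError("unknown verb")  # no error handling: let it crash
    return f"handled {verb}"

def supervise(verbs):
    """Respawn a fresh worker after every crash and keep serving requests."""
    results, restarts = [], 0
    for verb in verbs:
        try:
            results.append(worker(verb))
        except RuntimeError:
            restarts += 1  # the supervisor notices the death and respawns
    return results, restarts

results, restarts = supervise(["get", "delete", "post"])
# results == ["handled get", "handled post"], restarts == 1
```

The worker contains no defensive code at all; correctness of the system as a whole comes from the supervisor, not from the worker anticipating every failure.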
This explains, partly, the design for the storage module and the work CouchDb does to ensure that your database never gets corrupted.
Erlang also lets you push new code into your live system without taking the live system down. This sounds very scary at first, but it is a very cool thing to have.
Now, you know RDBMSes, so you can remember everything I asked you to forget earlier. Just to explain again how CouchDb organizes data, I'll go through common database terminology and explain how it applies to CouchDb.
Databases are, well, databases. Just a place where you put data that belongs to a certain domain. There's nothing to learn here.
This is the documents slide again, but we discussed that already. Remember, a document is a list of fields. Fields can have values. And documents can have attachments, just like email. That is pretty straightforward.
This is an example document in XML. You can see the id, the revision id, the fields, their names and the data.
Comparing this to slide 24, this is one reason why we want to switch to JSON. This is the same document in JSON and it is a lot less verbose.
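To make the comparison concrete, here is the bugshrink document from the REST examples earlier, rendered as JSON via Python. The exact JSON schema is still in flux, so the key layout below is just my guess at what it might look like.

```python
# The document from the earlier REST example, sketched in JSON form;
# the key names ("id", "rev", "fields") are illustrative, not final.
import json

doc = {
    "id": "2B3B17B0C93D2461DFEF714655E67F70",
    "rev": "4134847",
    "fields": {
        "BugId": "098F6BCD4621D373CADE4E832627B4F6",
        "Counter": "1",
        "Name": "test",
        "Type": "Project",
    },
}
print(json.dumps(doc, indent=2))
```

Even in this rough form it is obviously much less verbose than the XML version, with no namespaces, closing tags or `<text>` wrappers.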
Views in CouchDb are conceptually very similar to views in traditional database systems. You pre-define a subset of your data and all data that matches the subset can be retrieved easily and fast using the view.
Here's some slightly out of date code.
I wrote a PHP library for CouchDb that frees you from the actual HTTP and XML handling of the REST API. It uses PHP's object-oriented syntax, but it is actually just a collection of functions.
This shows how easily you can create a new document using the library.
There's also a C-Library by Dirk Schalge that does a similar thing. If C is for you, you can use that as well.
This is the first time I actually show what Fabric, the language to define views, looks like, and it is out of date again. Imagine the SELECT not being the same as an SQL SELECT; we will have a more proper name for it soon, so don't be confused. What SELECT * does is match all documents. You can also have Field="Value" instead of the * and only get documents that match. Or Field > 3, or whatever you want. In the COLUMN part, which is likely to be renamed already, you specify the fields you want to get back when you query the view. There you can also specify aggregation functions like sum() or avg(). On the roadmap is support for views that reference other views, so you can have some relationalism when retrieving data, if you like.
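The pieces of such a view can be mimicked in plain Python to show what they do. This mirrors the concepts only, not Fabric's actual syntax: the condition maps to a filter, the COLUMN part to a projection, and the aggregation functions to sum()/avg() over the matches.

```python
# Conceptual mirror of a Fabric view in Python; not Fabric syntax.

docs = [
    {"Type": "Bug", "Counter": 3},
    {"Type": "Bug", "Counter": 5},
    {"Type": "Project", "Counter": 1},
]

matches = [d for d in docs if d["Type"] == "Bug"]  # SELECT with Field="Value"
counters = [d["Counter"] for d in matches]         # the COLUMN part
total = sum(counters)                              # sum(Counter)
average = total / len(counters)                    # avg(Counter)
# total == 8, average == 4.0
```

The important difference from running this in your application is that CouchDb maintains the view incrementally as documents are added, so querying it doesn't rescan every document.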
CouchDb is mainly targeted at the web and you can do a lot of cool things there, but there's more. There's obviously the distributed application that lets you take parts offline and bring them back online later. A simple web interface to documents is inherently and automatically a wiki, because of CouchDb's revision system: it is already a version control system with a REST frontend.
Dirk, who did the C library, also made a FUSE plugin that allows you to access CouchDb through your filesystem. You can use it to mount, for example, your home directory into a CouchDb on your main machine and do the same on your laptop. You can then go and synchronize both mounts on the CouchDb level and voilà, you have a home directory that is replicated across as many machines as you like.
MySQL makes it very easy to create storage engines for their database. They abstract away all the SQL and leave you with the job of saving and retrieving data. It'd be quite easy to put CouchDb under MySQL this way; there's already a REST engine that can be used for that. With this, you would be able to store your data the way you are used to in MySQL, but use CouchDb for fast and efficient data retrieval and replication.
Most of this was done by Damien Katz. He worked for Iris, the company that built Notes for Lotus, which then went into IBM. Damien worked on the Notes database. He then worked for a startup that didn't make it into Web 2.0, and after that took two years off, living off his savings and building CouchDb. He's now been with MySQL for around two months, where he helps stabilize the upcoming 5.1 release.
Some pointers for you to further explore CouchDb
Thanks for having me, the talk and beers afterwards were a lot of fun. I hope you enjoyed it.
Cool presentation, Jan! Most people who put their slides online don't make the effort to provide the talking that goes with them, so the presentations are way harder to follow, especially example slides. It was really nice, thanks!
Excellent presentation Jan! CouchDb is a superbly designed system.