Fault Tolerant CouchDB

Wednesday, December 26. 2007

This is a follow-up piece to an earlier rant:

We are hitting a few walls with a CouchDB deployment and both Damien and I are a bit puzzled. This posting tries to attract someone with a clue to help us out. Our problems might result from not understanding the documentation correctly, but with evidently inaccurate material, we stand little chance.

Long story short: We’ve got it all sorted out.

Memory Hogging Spidermonkey

Sam Ruby relayed a hint from “a Mozilla Developer”. By invoking Spidermonkey with the -b parameter and a value of 1000000, we are able to keep the memory footprint constant. We haven’t measured how this impacts performance, though.

Crashing Erlang VM

#erlang on irc.freenode.org helped clarify how heart is supposed to work. We had interpreted the documentation as saying that heart is a monitoring process that restarts the Erlang VM when it crashes. That is wrong. Since heart is started from the Erlang VM (it is a child process in the process hierarchy), it cannot start a new VM when the old one crashes, because the OS wipes out all child processes before they can do anything.

What is heart good for, then? Apparently, the Erlang VM can get stuck (tip o’ the hat to noss). I don’t know how often or under what circumstances that happens (I guess it is rare), but it can happen. Heart is designed to check the VM’s health every now and then and, if the VM stops responding, launch a utility programme that takes care of the application restart.

A side note: the minimum timeout that heart allows for the Erlang VM to not respond to health checks is 11 seconds. The heart man page clearly states the fact, but heart behaves unintuitively when you specify, say, 10 seconds because you failed to differentiate between < and <=. Instead of falling back to the lowest possible value (11), it assumes the default value of 60 seconds, which makes testers (me) think nothing happens at all. Now this is clearly a PEBKAC and RTFM-type error, but to be frank, the fine manual is not very approachable, and I decided to fall back to heart.c to see how things actually work.
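
To make the trap concrete, here is a tiny Python model of the behaviour described above. It is not a port of heart.c, just a sketch of what I observed. The function name is made up for illustration, and I am assuming the value comes from the HEART_BEAT_TIMEOUT environment variable. Any upper limit heart may enforce is not modelled.

```python
# Toy model of how heart ends up with its timeout, as described in this post.
# This is an illustration, not code taken from heart.c.
MIN_TIMEOUT = 11       # smallest value heart accepts, in seconds
DEFAULT_TIMEOUT = 60   # what heart uses when the requested value is rejected


def effective_heart_timeout(requested):
    """Return the timeout in seconds that heart would actually use.

    `requested` is the configured value (assumed to come from
    HEART_BEAT_TIMEOUT), or None if nothing was configured.
    """
    if requested is None:
        return DEFAULT_TIMEOUT
    if requested < MIN_TIMEOUT:
        # The surprising part: the value is not raised to 11,
        # heart goes back to the 60 second default instead.
        return DEFAULT_TIMEOUT
    return requested
```

So asking for 10 seconds gets you 60, while asking for 11 gets you 11.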

Automatically restarting CouchDB

Noah Slater pimped the script that launches CouchDB so that, if you want, CouchDB gets restarted automatically when the Erlang process dies. This is quite nice. Since CouchDB takes almost no time to restart, you have a nearly uninterrupted service. We also have heart configured so that, when the Erlang VM gets stuck, it kills the VM process and does nothing else. The launch script then detects that the process is gone and restarts it. Detecting a stuck VM takes at least 11 seconds, as outlined above. If you need less, you need to hack heart.c.
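
The restart idea itself is simple. The following is a minimal Python sketch of the pattern, not Noah's actual script: start the process, wait until it is gone, start it again. The function name, the max_launches and delay parameters, and the stand-in command are all made up for illustration. The real CouchDB start command is not shown here.

```python
import subprocess
import sys
import time


def supervise(command, max_launches=None, delay=0.0):
    """Run `command` and start it again every time it exits.

    Returns the number of launches. With max_launches=None the loop
    never ends, which is what you would want for a real service.
    """
    launches = 0
    while max_launches is None or launches < max_launches:
        launches += 1
        process = subprocess.Popen(command)
        # Blocks until the process is gone, no matter whether it
        # crashed on its own or was killed from outside (e.g. by heart).
        process.wait()
        time.sleep(delay)  # optional pause before the next launch
    return launches


if __name__ == "__main__":
    # Stand-in for the real start command: a process that exits at once.
    stand_in = [sys.executable, "-c", "print('pretend CouchDB ran and died')"]
    supervise(stand_in, max_launches=3)
```

The loop does not care why the process went away, which is why letting heart kill only the VM and nothing else fits in so nicely.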

Thanks to all who sent in suggestions and words of help.

Posted by Jan