Hello, this is Jan Lehnardt and you're visiting my blog. Thanks for stopping by.
plok — It reads like a blog, but it sounds harder!
We are hitting a few walls with a CouchDB deployment and both Damien and I are a bit puzzled. This posting tries to attract someone with a clue to help us out. Our problems might result from not understanding the documentation correctly, but with evidently inaccurate material, we stand little chance. Here it goes.
Or its garbage collection is a little ineffective. CouchDB uses Spidermonkey, Mozilla’s JavaScript engine, to create views on its databases. The user provides a JavaScript function and CouchDB uses Spidermonkey to determine which documents to include in a view. The JavaScript script that evaluates and executes the user’s function runs as a daemon.
We have a global variable there, map_results (declaration in line 19), that gets reset to {} for each document and map function (line 84). Line 85 features a call to gc() to trigger the garbage collector manually. If we do not do that (i.e. comment out line 85), the process that runs the script keeps hogging memory, not releasing the intermediate values of map_results, which we think it should.
By calling gc(), we can keep the memory usage constant, but performance completely goes away. This is not really surprising, but less than optimal. We could call gc() only every so often, but that is a rather quirky workaround and not a real solution.
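For what it is worth, the "only every so often" workaround could look roughly like the sketch below. This is not the actual CouchDB view server code: makeThrottledGc, interval and the stub collector are hypothetical names made up for illustration. Inside the Spidermonkey shell you would pass the built-in gc() instead of the stub.

```javascript
// Sketch only: run the collector once every `interval` documents
// instead of after every single one. `makeThrottledGc` is a
// hypothetical helper, not part of CouchDB or Spidermonkey.
function makeThrottledGc(collect, interval) {
  var seen = 0;
  return function afterDocument() {
    seen += 1;
    if (seen >= interval) {
      seen = 0;
      collect();
      return true;  // a collection ran after this document
    }
    return false;   // skipped, memory keeps growing until the next run
  };
}

// Usage with a stub collector so the sketch runs anywhere.
var collections = 0;
var maybeGc = makeThrottledGc(function () { collections += 1; }, 100);
for (var doc = 0; doc < 1000; doc++) {
  // ... reset map_results and run the map function on the document ...
  maybeGc();
}
```

The interval trades memory for speed: a larger value means fewer collector runs but more garbage piling up in between, which is why it remains a workaround rather than a fix.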
We are now looking for somebody with Spidermonkey internals chops to shine some light on this issue.
Erlang has a reputation for being robust and fault tolerant. It also has a reputation for slightly inaccurate documentation. There are two issues here that go hand in hand with the previous problem.
When we use a hogging Spidermonkey with CouchDB and let it eat all available RAM on a (Linux OpenVZ vServer) machine with no swap configured, Erlang bails out with
eheap_alloc: Cannot allocate 2056916 bytes of memory (of type "old_heap").
Of course it can’t allocate the RAM, because there is none left, and Erlang should probably crash at this point. But we don’t want CouchDB to go away entirely. Following Erlang’s design principles, CouchDB employs the crash-only paradigm: if there’s an unrecoverable error, respawn and try again. So CouchDB does not need an orderly shutdown; you can just terminate it at any time. So far so good.
We now want to get CouchDB up and running again when it crashes, and here lies the problem. There are several ways to do this for other UNIX applications (daemontools comes to mind, or launchd on Darwin and Mac OS X), but Erlang comes with a built-in solution and we want to use that. It is called heart.
Start your Erlang application with the -heart parameter and Erlang will launch an external daemon (written in C) that periodically checks whether Erlang is still running and, if not, terminates the old virtual machine (if it is still around) and starts a new instance. That’s the theory. A good one, but it does not seem to work as advertised.
Heart is controlled by two environment variables, HEART_BEAT_TIMEOUT and HEART_COMMAND. HEART_BEAT_TIMEOUT is an integer that specifies the number of seconds heart allows Erlang to go without responding to its regular checks. HEART_COMMAND is the command that gets executed (using system() from libc) in case Erlang does not respond in time.
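To make the timeout semantics as we understand them concrete, here is a toy model of such a watchdog in JavaScript. It is not how heart is implemented (heart is a separate C program); Watchdog, beat and check are hypothetical names, and the clock is passed in explicitly so the sketch stays deterministic.

```javascript
// Toy model of the behaviour described above, not heart's real code.
// The monitored process calls beat(); the watchdog calls check()
// periodically and runs `command` once the timeout has been exceeded.
function Watchdog(timeoutSeconds, command, now) {
  this.timeoutSeconds = timeoutSeconds; // plays the role of HEART_BEAT_TIMEOUT
  this.command = command;               // plays the role of HEART_COMMAND
  this.lastBeat = now;
}

Watchdog.prototype.beat = function (now) {
  this.lastBeat = now;
};

Watchdog.prototype.check = function (now) {
  if (now - this.lastBeat > this.timeoutSeconds) {
    this.command();       // relaunch the monitored process
    this.lastBeat = now;  // start the clock for the new instance
    return true;
  }
  return false;
};

// Usage: a 10 second timeout and a stub restart command.
var restarts = 0;
var dog = new Watchdog(10, function () { restarts += 1; }, 0);
dog.beat(5);
var early = dog.check(12); // 7 seconds of silence, still within the timeout
var late = dog.check(16);  // 11 seconds of silence, the command runs
```

In this model the command only helps if the watchdog itself survives whatever killed the monitored process.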
The documentation is a bit unclear, but here are a few interpretations and their results.
By environment variables they mean UNIX global variables that you set with export VARNAME="value" in bash or setenv VARNAME value in csh. Here is a log of what happens:
# export HEART_COMMAND="/usr/local/bin/couchdb"
# export HEART_BEAT_TIMEOUT=10
# /usr/local/bin/couchdb
heart_beat_kill_pid = 24229
Erlang (BEAM) emulator version 5.5.5 [source] [async-threads:0] [kernel-poll:false]
couch 0.7.2 (LogLevel=info)
CouchDB is starting.
CouchDB has started. Time to relax.
Looks good so far, we have a running CouchDB and a heart command. Here is how things look in pstree -alp:
|-beam,4030 -Bd -- -root /usr/lib/erlang -progname erl -- -home /root -noshell -noinput -sasl errlog_type error -pa/usr/local/lib/couchdb/erlang/
| |-couchjs,4062 -e /usr/local/bin/couchjs -f
| |-heart,4034 -pid 4030 -ht 10
| |-inet_gethost,4044 4
| | `-inet_gethost,4045 4
| `-{beam},4031
When I now trigger enough hogging Spidermonkeys to exhaust the RAM, Erlang dies with the eheap_alloc message from above. Heart recognises that and launches our HEART_COMMAND. A new beam process appears, but it does not actually launch CouchDB.
But environment variable could also mean calling erl with -env VARNAME VALUE. /usr/local/bin/couchdb is a simple shell script that eventually launches erl. When we add the -env HEART_BEAT_TIMEOUT 10 -env HEART_COMMAND /usr/local/bin/couchdb parameters there, startup looks similar to what we see above. When we hit the RAM threshold, though, we only get an Erlang terminated message and output from heart that announces the new PID to monitor. There is, again, a beam running, but no CouchDB.
Now things become weird. When I remove the -env lines from the startup script again and go back to using export in bash, I do not get the eheap_alloc message but only the Erlang terminated one. Excuse me?
We also came across a case where, when we got the eheap_alloc message, the entire process group (as seen in the pstree output above) got killed, including heart, so it had no chance of relaunching CouchDB. The same thing happens when we send a SIGKILL (kill -9 $PID).
So there is the case of heart not exactly behaving as expected. Is our interpretation of the documentation incorrect?
And there is the case of heart not being effective at all because it gets killed before it can jump into action. What is up with that?
Open Source software has bad documentation? You’re kidding!
Jan, you should post that message (with all necessary information, such as the Erlang version number) to the erlang-questions mailing list (http://www.erlang.org/ml-archive/erlang-questions/ / erlang-questions [at] erlang [dot] org). I’m sure you’ll get help fast, and those documentation issues may be qualified as bugs.
As I just wrote back to Sam Ruby, who asked in a private email:
The embedding is obligated to call JS_GC or JS_MaybeGC when appropriate, and not run out of memory and force a "last ditch" GC. Is CouchDB using either of these APIs? See http://developer.mozilla.org/en/docs/JSAPI_Reference and click on individual API entry points.
Please feel free to post to the mozilla.dev.tech.js-engine newsgroup (available via Google Groups) with specific questions. That way others who may be hitting similar problems can help with fixes and share knowledge. Thanks,
/be