Hello, this is Jan Lehnardt and you're visiting my blog. Thanks for stopping by.
plok — It reads like a blog, but it sounds harder!
The RarestNews developer considers InnoDB and CouchDB for a re-architecture of his high-volume news site. He did his homework, but I couldn’t help commenting on a few things he wrote. The comment turned into a blog post, and since this is my blog, it should be posted here as well.
I am specifically referring to the paragraphs about InnoDB and CouchDB:
So, to be technical here: I’ve used MyISAM tables (never really liked InnoDB because of its slow writes, and at 100k new articles a day there is a lot of meta-data to write about them, like tags, dates, snippets, word frequencies, etc.) - it seemed like a good decision. The bad part was that on write, MyISAM locks the whole table. So 50 bots scouring the Web for news, writing and locking the whole table, made the site almost unresponsive.
I’m not yet sure how to solve it - with InnoDB, with PostgreSQL, or with some kind of new-age database like CouchDB, StrokeDB, maybe Amazon’s SimpleDB, etc…
They seem like a nice idea when you read about them, but… there are flaws. The main problem with CouchDB, for example, is its complete HDD-dependence. Modern memory is hundreds of times faster than a disk, so you’re using only 1% of the possible speed if you use an HDD-based database. And the second problem is its “do not overwrite” motto. It doesn’t reuse space that is no longer needed, so if I write a 100KB article to the database (along with some other data) and then rewrite this entry, there’s now 200KB stored on my drive - and each update eats 100KB more.
How to avoid it? Compact the database, so it creates a NEW file with only the latest 100KB, and delete the previous database file. So even if I didn’t change anything, I’ve had to write the same data 3 times (along with all of my database in the compaction process). What does that mean?
1) It’s AT LEAST 3 times slower than your HDD speed if you want to effectively use ALL of your hard drive, so now we have only 0.3% of the computer’s speed (compared to memory usage).
2) You can only use databases HALF the size of your HDD (but in reality more like 33%) to effectively use CouchDB (remember - the compaction process creates a NEW file, so it needs at least the same amount of space as the database already uses).
I just want to put this in perspective: CouchDB is still in its alpha stage and no performance work has been done yet. Expect the HDD dependency to become less of a problem. In the meantime, a caching HTTP proxy will do the trick for you.
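Since CouchDB speaks plain HTTP, putting a cache in front of it is a one-stanza affair. For example, a minimal nginx `proxy_cache` setup could look like this (port 5984 is CouchDB’s default; the cache path, zone name, and TTL are made up for illustration):

```nginx
# Hypothetical nginx front-end that caches CouchDB GET responses,
# so repeated reads never touch the HDD-backed database at all.
proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=couch:10m max_size=1g;

server {
    listen 80;

    location / {
        proxy_pass        http://127.0.0.1:5984;  # CouchDB's default port
        proxy_cache       couch;
        proxy_cache_valid 200 1m;                 # serve cached reads for a minute
    }
}
```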
The no-overwrite update is a design choice with the consequences you correctly lay out. But “effectively using your hard drive” might not mean “use the least amount of space at all times”. It is more like “don’t talk to the drive if you can avoid it, and make as few seeks as possible”, and that is what CouchDB is designed to do, at the expense of deferring another write operation to off-peak times with compaction. Compaction takes advantage of writing in bulk, which is just flushing data to disk without random seeks. So your equation is close, but not exact.
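To illustrate the design choice, here is a toy sketch (not CouchDB’s actual storage code) of the append-only-then-compact pattern: updates only ever append sequentially, and a later compaction pass streams the live versions into a fresh file.

```python
import json, os, tempfile

class AppendOnlyStore:
    """Toy append-only store: updates never overwrite old data in place;
    compaction later rewrites only the newest version of each document."""

    def __init__(self, path):
        self.path = path
        self.offsets = {}         # doc id -> byte offset of newest version
        open(path, "ab").close()  # make sure the file exists

    def put(self, doc_id, doc):
        # Appending is a sequential write: no seeking back into old data.
        with open(self.path, "ab") as f:
            self.offsets[doc_id] = f.tell()
            f.write(json.dumps({"id": doc_id, "doc": doc}).encode() + b"\n")

    def get(self, doc_id):
        with open(self.path, "rb") as f:
            f.seek(self.offsets[doc_id])
            return json.loads(f.readline())["doc"]

    def compact(self):
        # Stream only the live versions into a NEW file, then swap it in.
        # This is why free space >= the live data set is needed while it runs.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(self.path) or ".")
        new_offsets = {}
        with os.fdopen(fd, "wb") as out:
            for doc_id in self.offsets:
                new_offsets[doc_id] = out.tell()
                out.write(json.dumps({"id": doc_id, "doc": self.get(doc_id)}).encode() + b"\n")
        os.replace(tmp, self.path)
        self.offsets = new_offsets
```

Note how `compact()` is a separate call: the application decides when to pay that bulk-write cost, e.g. at off-peak times.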
You can use the bulk insertion feature yourself to get less fragmentation in the first place when your crawlers dump their data. This is also fairly fast (three seeks and a flush of data).
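A minimal sketch of that, assuming Python 3 and CouchDB’s standard `_bulk_docs` endpoint (the database name and document fields are made up for illustration):

```python
import json
from urllib import request

def bulk_docs_request(db_url, docs):
    """Build a POST to CouchDB's _bulk_docs endpoint, which writes many
    documents in one round trip instead of one write per document."""
    body = json.dumps({"docs": docs}).encode("utf-8")
    return request.Request(
        db_url.rstrip("/") + "/_bulk_docs",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# A crawler would batch its scraped items and send them in one go:
req = bulk_docs_request("http://localhost:5984/news", [
    {"title": "story one", "tags": ["a"]},
    {"title": "story two", "tags": ["b"]},
])
# request.urlopen(req) would perform the actual write against a running CouchDB
```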
To be frank, I don’t see a crawled news item changing that often. But then, I don’t know what you are doing with it at RarestNews.
Also, InnoDB is not the worst choice. It puts data integrity before speed (as CouchDB does as well), which will always be slower than MyISAM, which simply doesn’t care about integrity. InnoDB works hard to hit the HDD as infrequently as possible, and when it has to, to read and write in batches.
The difference between InnoDB and CouchDB is that with CouchDB’s compaction you can control when to do some of the work, while InnoDB’s mechanisms add to the current load of the system. So CouchDB lets you actually make smart use of your resources.
I’d like to recommend Theo Schlossnagle’s Scalable Internet Architectures.
Among a plethora of useful information, it discusses the design of a system similar to the one you are describing.
PS: I work with MySQL on the day job and work on CouchDB in my free time, so I am obviously biased in both ways.
PPS: If you have any questions regarding any of the above, feel free to contact me.
they both suck. databases that reimplement most of a filesystem inside a flat file, with their own arbitrary one-off sets of limitations.
Btrfs and Reiser4 are fine for databases,
but we could improve on them by allowing more customization of the metadata fields to shrink the minimum inode size.
MyISAM does not have to lock for inserts from one thread; this is the "concurrent insert" feature.
Also, while InnoDB writes/flushes its logs for every commit (as it should!), this can be optimised if you put the OS, tablespace and logfiles on separate spindles. If you also separate the binary log and the InnoDB logs, a commit might not involve seeks - and seeks are what makes your disk I/O slow.
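As a sketch, a my.cnf fragment along those lines might look like this (the mount points are placeholders for separate physical disks, not settings from the original comment):

```ini
[mysqld]
# Let MyISAM append new rows without blocking readers on the table
concurrent_insert         = 2

# Spread I/O so a commit's log flush doesn't compete with data-file seeks:
# each path below would live on its own spindle
datadir                   = /disk1/mysql-data
innodb_log_group_home_dir = /disk2/innodb-logs
log-bin                   = /disk3/binlog/mysql-bin
```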
Arjen, thanks for clarifying the InnoDB case. I started out pointing out misconceptions about CouchDB and I thought I’d do the InnoDB things as well, if only roughly.
Why are you putting the text into a DB in the first place?
Just store the meta-data in the DB instead of the raw text.
FWIW, we also have a high-volume news site and we store text+metadata on a non-shared InnoDB master.
Ian, you are replying on the wrong blog. I’m merely pointing out misconceptions about CouchDB. I don’t build this thing.