Less is more
One thing that makes me think I'm back on the right track with the CDB format and cursor stuff is that I'm ripping out quite a bit of hard-to-follow code in my experimental branch. Feels good :)
OK, It's not so bad
OK, so it's not as bad as I thought.
- First of all, I discovered the CDB database library and tried translating its format into Python.
My home-grown on-disk associative array format is basically an ordered linear list, but chunked up and written out as small pickled blocks containing a range of key-values, with the key ranges of each block kept in an index -- basically taking advantage of the speed of cPickle, which is much faster at reading data off of disk than looping through it with pure Python.
CDB uses a two-stage hashing algorithm, where a key is hashed, the hash is assigned to one of 256 buckets, and then within that bucket it's stored in a hashtable.
Turns out my CDB implementation is 2x faster than my homegrown format when the keys are close together (the homegrown format does OK there because it's inherently caching), but 100x faster for random keys.
That's pretty exciting. (Writing out the CDB format turns out to be roughly 30% slower, but I can live with that for those kinds of lookup improvements.)
Whoosh uses the fact that keys in the on-disk tables are ordered lexicographically (like in a B-tree) all over the place. Luckily the CDB format stores its keys and values in insertion order, so all I have to do is insert them in lexicographic order like I already am doing.
One thing the current format can do is (again like a B-tree) let me say "go to this key in the table", and if the key doesn't exist it returns the closest key following the one I asked for. This turns out to be very useful in a search engine (e.g. for prefix queries). But since the CDB format is a hash, trying to get the position of a non-existent key just gives you an error.
I solved this in a simplistic way: I subclassed the CDB format so that as I insert key/value pairs, every Nth key is added to an index list. So now I can simulate the old "position at or after this key" behavior by looking up the closest key <= the target key in the index list, and then reading linearly forward from there (since CDB stores keys and values sequentially) to the matching or following key.
The trick will be choosing N, which you would want to be small for small indexes (so you don't have to read linearly through large numbers of key-values), but big for big indexes (so the list of index keys doesn't get huge and take up a lot of RAM).
- Second, I did some more synthetic benchmarking and discovered that, even though using a cursor object is much slower than using a generator (probably because of all the dot-name lookups), it seems like that's more than compensated for by not having to read and process so much when you can use the cursor object's skip_to() method in an AND query to skip over non-matching postings.
- I've also got some momentum going adding documentation to Whoosh, even though none of it has been checked in yet. I'm using Sphinx, and while I still hate reStructuredText, the convenience of the rest of Sphinx makes it worth putting up with rST. I need to write scripts so I can generate the HTML docs, include them with source uploads to PyPI, and upload them to the Whoosh website when I make a release.

rss