Posts for the month of March 2009

Whoosh 0.1.11 will require reindexing

The latest PyPI release of Whoosh (0.1.11) contains changes to the index format (I changed to a sane storage format for the per-document field lengths) that are not backward compatible. You will need to recrease any existing indexes with the new version. Sorry for the inconvenience.

  • Posted: 2009-03-21 21:02
  • Author: matt
  • Categories: (none)
  • Comments (0)

Tradeoffs of additional speed

As a followup to my last post, I've been thinking about how speeding up Whoosh using array.tofile() and fromfile() raises a conflict between two of my goals for Whoosh.

Whoosh is supposed to be a fast (for Python) search library. But I also envisioned it as a useful bit of source code to hobbyists and maybe even serious researches, who could take advantage of the dynamic nature of Python to do quick experiments with it.

Using the array methods will speed up Whoosh by quite a bit (at the macro level, not 200x -- I meant to put a ;) after that bit -- but quite a bit). But while it gets Whoosh closer to being as fast as Python can go, it will also warp the implementation into something that doesn't make any sense outside of the Python interpreter.

That is, it would be a horrible way to write a search library except for the fact that it's also the fastest way given the nature of the Python interpreter, where increasing the percentage of your program that touches C code is more important than being clever or "doing it right".

I still think the speed improvements are worth it, because above all else I'd like Whoosh to be of practical use, for example to projects like Trac and MoinMoin? that have search functions but can't require native libraries. And the best way to do that is to be fast. But I'll be sorry to remove some of the "clever" bits.

  • Posted: 2009-03-21 20:45
  • Author: matt
  • Categories: (none)
  • Comments (1)

Speed improvements coming

The great perversion of trying to write high-performance code in Python is that it has almost nothing to do with being clever. It really comes down to writing your program in such as way that it touches as much C code as possible. Some things you would do for performance in another languag would be a disaster in Python. For example, it would be foolish to implement a perfect hash function in Python for better performance, because a Python implementation could never be as fast as the C implementation of the built-in dict class.

When I began implementing Whoosh, I quite shamelessly used Lucene as a template, and then elaborated on it, sometimes implementing things -- completely customizable posting formats, for example -- beyond what Lucene offers.

Lucene uses varints (variable-length integers) and deltas to compress postings lists, so instead of four bytes per number for a 32-bit integer, most numbers are stored in one byte (using deltas keeps the numbers small enough to fit in one byte). This not only saves space, but in low-level languages like C and Java, it's faster, because it helps the disk cache. My implementation of this method in Whoosh (along with the fact that I use zlib to compress stored document metadata by default) is one of the reasons Whoosh indexes are typically extremely small compared to those created by most search libraries.

The downside of using varints for Whoosh is that reading and writing varints is (obviously) not a built-in function of Python. So in some of the tightest loops in the search code, I'm using relatively slow Python constructs, like explicit for loops.

Recent tests I've run suggest using array.tofile() and array.fromfile() is about 200 times faster than writing and reading postings "manually" (that is, explicitly in Python). Which makes sense, because these array methods do both the looping and IO in C, where I'm currently doing it in the interpreter.

Of course the downside of using array is that it doesn't do any compression -- every number, no matter how small, takes up four bytes on disk. So you'd expect the on-disk size of a Whoosh index to get much larger, probably doubling. For my use-case (a medium-size online help system of about 6000 documents), I'd guess the index would probably go from approx. 4MB to somewhere around 8MB.

However, the uncompressed nature of on-disk arrays has a potential advantage: because the numbers are fixed-size, I could use offset file.seek()s to shortcut certain operations.

On balance, considering the price of disk space these days, 100-200x better searching performance probably justifies doubling the size of the index. Combined with ideas for chunked posting formats I've cribbed from the Google presentation that recently showed up on the search blogs, I think searching could potentially be much faster in a future version of Whoosh.

  • Posted: 2009-03-16 15:00
  • Author: matt
  • Categories: (none)
  • Comments (1)