Ibex: an Indexing System

Ibex: an Indexing System Dan Winship

danw@helixcode.com

2000 Helix Code, Inc. Introduction &Ibex; is a library for text indexing. It is being used by &Camel; to allow it to quickly search locally-stored messages, either because the user is looking for a specific piece of text, or because the application is contructing a vFolder or filtering incoming mail. Design Goals and Requirements for Ibex The design of &Ibex; is based on a number of requirements. First, obviously, it must be fast. In particular, searching the index must be appreciably faster than searching through the messages themselves, and constructing and maintaining the index must not take a noticeable amount of time. The indexes must not take up too much space. Many users have limited filesystem quotas on the systems where they read their mail, and even users who read mail on private machines have to worry about running out of space on their disks. The indexes should be able to do their job without taking up so much space that the user decides he would be better off without them. Another aspect of this problem is that the system as a whole must be clever about what it does and does not index: accidentally indexing a "text" mail message containing uuencoded, BinHexed, or PGP-encrypted data will drastically affect the size of the index file. Either the caller or the indexer itself has to avoid trying to index these sorts of things. The indexing system must allow data to be added to the index incrementally, so that new messages can be added to the index (and deleted messages can be removed from it) without having to re-scan all existing messages. It must allow the calling application to explain the structure of the data however it wants to, rather than requiring that the unit of indexing be individual files. This way, &Camel; can index a single mbox-format file and treat it as multiple messages. It must support non-ASCII text, given that many people send and receive non-English email, and even people who only speak English may receive email from people whose names cannot be written in the US-ASCII character set. While there are a number of existing indexing systems, none of them met all (or even most) of our requirements. The Implementation &Ibex; is still young, and many of the details of the current implementation are not yet finalized. With the current index file format, 13 megabytes of Info files can be indexed into a 371 kilobyte index file—a bit under 3% of the original size. This is reasonable, but making it smaller would be nice. (The file format includes some simple compression, but gzip can compress an index file to about half its size, so we can clearly do better.) The implementation has been profiled and optimized for speed to some degree. But, it has so far only been run on a 500MHz Pentium III system with very fast disks, so we have no solid benchmarks. Further optimization (of both the file format and the in-memory data structures) awaits seeing how the library is most easily used by &Evolution;: if the indexes are likely to be kept in memory for long periods of time, the in-memory data structures need to be kept small, but the reading and writing operations can be slow. On the other hand, if the indexes will only be opened when they are needed, reading and writing must be fast, and memory usage is less critical. Of course, to be useful for other applications that have indexing needs, the library should provide several options, so that each application can use the library in the way that is most suited for its needs.