The history of text-indexing in Evolution

CamelTextIndex was written to address several shortcomings in the existing libibex (referred to below as libibex2), which had itself been written to address shortcomings in the original libibex.

Mail indexing characteristics

First, I'll cover some of the scenarios that a mail indexing system must handle. They are slightly different from those of other indexing systems, or at least we wanted them to be:

1. Indexing a few new messages, which may potentially reference most of the alphabet in the index.
2. Indexing a whole mailbox for the first time.
3. Unindexing anywhere from a few to all existing messages during expunge.
4. Searching.

Cases 1, 3, and 4 occur the most often; however, case 2 is the most noticeable at first use, or if the index must be reset. So the code needs to be fast in all cases, which generally leads to trade-offs being made. Each implementation aimed to address or ignore these requirements in different ways, with the final implementation probably having the best balance so far.

The main issue is that the indexing is performed in real time. We index as we add messages. We index before we open the mailbox. We index as we remove messages. Because of this we need to approach things differently from many other indexing systems, most of which work with static data in an off-line mode. That allows them to index the whole body of content and use as much memory and CPU time as required. We probably need to look at doing offline, or at least delayed, indexing in the future - but this introduces some coherency problems with vFolders and any body searches. However, having the indexing library as a base part of Camel helps in implementing a mechanism to achieve this.

Ibex the first

The original ibex used a memory-based hash table to store words. This made the index very fast for both lookups and modifications. However, any query required a full load of the index into memory, and any update required a full write of the index to disk. After about 5,000-10,000 messages, even occasional additions to the index became appreciably slower, as the whole index had to be loaded into memory first. This obviously took a toll on memory as well. I won't cover the algorithms used; they were all pretty basic. The only real smarts were that index deletes were only flagged, and the flagged data was simply not written out when the index was saved.

Evolution 1.x, ibex 2

In an attempt to rectify the incremental update performance of libibex, it was completely rewritten to use an on-disk, block-based filesystem. (Note: the first attempt used libdb, but performance was so slow and the indices so large that it was dropped in favour of a custom filesystem-like data file.) The motivation was that a few extra disk lookups during retrieval wouldn't be noticeably slower, while the index should scale to many more messages with lower memory overhead and faster startup time.

The block filesystem contains 3 major components:

1. A hash table that maps messages to a wordid sequence list.
2. A hash table that maps words to a messageid sequence list.
3. A sequence filesystem that stores sequences of ids.

The ids are 32-bit identifiers that are unique for each word or message. They are also designed to be reversible and static; that is, given the id, you can map it to the string identifier it represents directly, without having to look it up in the hash table. Another feature of this design is that the database file should be kept in sync at all times with the state of the index. The message-to-wordid table is used to remove the messageid from each of the words a message contains when that message is expunged, and so on.
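To make the relationship between the two tables concrete, here is a small conceptual sketch in C. The type and field names are invented for illustration; the real records were packed into disk blocks rather than held as in-memory structs.

    /* Conceptual sketch only -- invented names, not the libibex2 on-disk layout. */
    #include <stdint.h>
    #include <stddef.h>

    typedef uint32_t ibex_id_t;   /* reversible, static 32-bit word or message id */

    /* Forward index: one record per message, listing the words it contains. */
    struct ibex_message_record {
        ibex_id_t  messageid;     /* maps directly back to the message name */
        ibex_id_t *wordids;       /* ids of the words appearing in this message */
        size_t     n_wordids;
    };

    /* Reverse index: one record per word, listing the messages containing it. */
    struct ibex_word_record {
        ibex_id_t  wordid;        /* maps directly back to the word string */
        ibex_id_t *messageids;    /* ids of the messages containing this word */
        size_t     n_messageids;
    };

The word-to-messageid side is what answers queries; the message-to-wordid side exists only so that removal can find every word sequence that needs updating.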
Indexing operation

The indexing operation consists of these basic steps:

1. Look up the messageid from the message name, using the messageid table.
2. Generate a list of words in the message.
3. For each word:
4.   Look up the wordid and sequence information.
5.   If the word doesn't exist, create a new word/wordid.
6.   Add the messageid to the word sequence.
7.   Add the wordid to the message sequence.

The initial implementation only used caching at the disk-block level. Unfortunately, the simple hash table design chosen (a fixed-size base table with chained buckets) scaled very poorly above about 10,000 messages. This approach proved to be too I/O intensive for practical use, so several other caches were added to improve performance:

1. Stage (1) above is done entirely in memory. At initial startup the whole list of potential names is read into an in-memory hash table.
2. Stage (4) above is also done entirely in memory; even a large cache provided little benefit, due to the wide distribution of potential words. This cache is only created when adding to the index.
3. Stage (6) uses the table from stage (4) and concatenates up to approximately one disk block's worth of messageids before writing them out to the word sequence.
4. Stage (7) concatenates all wordids for a given message before writing them out at once.

As you can see, the added complexity meant we ended up caching nearly as much as the original version! It also removed almost all of the startup-time benefit for incremental updates of the index, as the table was not stored as compactly on disk as in the original version. However, we only ever stored a subset of the index in memory, and only during updates, with some tricks to reduce memory usage for very rare words, so the overall memory use was still much lower.

Removing a message

Removing a message is fairly involved:

1. Look up the messageid and word sequence list from the messageid table.
2. For each wordid in the sequence list:
3.   Look up the message sequence list directly from the wordid table.
4.   Scan each block in the sequence, and remove any instances of the messageid.
5. Remove the message-to-messageid mapping from the messageid table.

Unfortunately caching helped very little here, particularly if many messages were removed. Also note that the file could never shrink, as the data could be spread randomly across it. Removal is an extremely expensive and unbounded process. Deleting all of the messages in a mailbox is extremely I/O intensive, with blocks potentially being accessed dozens of times.
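Sketched below is that removal path, with hypothetical helper functions standing in for the on-disk table operations (none of these names come from the real libibex2 code); only the control flow is intended to mirror the steps above.

    /* Sketch of libibex2 message removal -- helpers are hypothetical stand-ins. */
    #include <stdint.h>
    #include <stddef.h>

    typedef uint32_t ibex_id_t;

    /* Hypothetical helpers, not the real libibex2 API. */
    ibex_id_t message_table_lookup(const char *name, ibex_id_t **wordids, size_t *n_wordids);
    void      word_sequence_remove(ibex_id_t wordid, ibex_id_t messageid);
    void      message_table_remove(const char *name);

    void ibex2_unindex_message(const char *name)
    {
        ibex_id_t *wordids = NULL;
        size_t n_wordids = 0, i;

        /* 1. Look up the messageid and its word sequence list. */
        ibex_id_t messageid = message_table_lookup(name, &wordids, &n_wordids);

        /* 2-4. For each word the message contained, scan that word's messageid
         * sequence blocks and strip this messageid out.  Each word can touch
         * several blocks, which is why the cost is unbounded in the number of
         * distinct words in the message. */
        for (i = 0; i < n_wordids; i++)
            word_sequence_remove(wordids[i], messageid);

        /* 5. Finally remove the message -> messageid mapping itself. */
        message_table_remove(name);
    }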
Performing a query

Performing a query is fast:

1. Look up the messageid sequence list from the wordid table.
2. For each messageid:
3.   Look up the message name directly from the messageid table.

Even without caching this performs at a very acceptable level.

Summary

This index performs reasonably well up to about 10,000 messages for a complete re-index. With incremental updates, however, it degrades much faster; after only a few thousand messages have been added it becomes tiresomely slow and I/O bound. The index becomes more fragmented with random updates and removals, and heavily bogs down the system as you go much beyond those few thousand messages. The code is also very complicated and hard to follow. There are too many special cases, and it is buggy. Detected on-disk structure errors result in the index being reset, which, although it shrinks the index, is very slow.

The indices created are bulky, and never shrink. Because of the reverse index used for message removal, there is 50% redundant data at all times. Some overly tricky techniques (very much like ReiserFS's tail packing) are used to waste as little space as possible, at a great cost in performance.

One other problem is that because the index is disk based, we use a file descriptor continuously. With some users having more than 100 folders, they quickly run out of process file descriptors and Evolution fails. To get around this, a least-recently-used cache of index files is used to flush out and free file descriptors so they can be re-used. This makes it hard to lock the files; that problem still exists with the next implementation.

Anyway, a better solution is required.

CamelIndex

The first problem to address was the API: it was starting to age. Although adequate, the API wasn't terribly clean, reusable, or scalable. The first step was to objectise the library, and since we needed to use it in Camel, the best way was to create a CamelObject. CamelIndex was born: a mostly abstract class that provides a simple common interface for accessing indices, including cursors and utility and maintenance functions. In addition, a number of the features in libibex2 were simplified or rewritten and abstracted into the re-usable classes that follow. By providing simple cursors, more complex queries are easier to write and can execute more efficiently; Camel searching now does sub-string searches for all body queries, and still runs at a very healthy speed while using less memory than before.

CamelBlockFile

This is basically the same block filesystem used in libibex2. It handles disk I/O based on blocks (CamelBlock), flushing modified blocks to disk, and caching of recently accessed blocks. It was enhanced slightly to allow blocks to be locked in memory.

CamelKeyFile

This is a simple reverse-linked list of sequences of keyids. The main property of this file is that updates are only ever appended to the end of the file, which improves I/O characteristics markedly. When an existing keyid sequence is updated, the new entry simply points back to the start of the previous one and a pointer to the new entry is returned - i.e. a simple linked list.

CamelKeyTable

This is taken from the libibex2 code for mapping keys, with few changes. It uses a CamelBlockFile for its I/O. The key table is a linked list of blocks (CamelKeyBlock) which contain key strings, plus a data pointer and flags for each key. Each block is a packed array of string descriptors (CamelKeyKeys). A keyid (camel_key_t) is a 32-bit descriptor which identifies a key in a reversible way: the bottom 10 bits identify the index of the key within its key block, and the top 22 bits identify the key block itself. In this way, given the 32-bit keyid, we can reference the block containing the key directly (with at most 1 access), and then access the flags and key string using the key index. Keys can potentially be removed and their keyids reused by simply re-packing the key block; this was used in libibex2, but not in CamelIndex.

[diagram - camelkeyblock]
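As an illustration of the reversible keyid layout, here is a minimal sketch of the 10/22-bit packing. The macro names are invented, and the real code may derive the block reference differently (for example from a block offset rather than a plain block number); only the bit split follows the description above.

    /* Sketch of camel_key_t packing: low 10 bits = key index within its block,
     * high 22 bits = the key block itself. */
    #include <stdint.h>
    #include <stdio.h>

    typedef uint32_t camel_key_t;

    #define KEY_INDEX_BITS  10
    #define KEY_INDEX_MASK  ((1u << KEY_INDEX_BITS) - 1)    /* 0x3ff */

    #define KEYID_MAKE(block, index) \
            (((camel_key_t)(block) << KEY_INDEX_BITS) | ((index) & KEY_INDEX_MASK))
    #define KEYID_BLOCK(keyid)  ((keyid) >> KEY_INDEX_BITS)
    #define KEYID_INDEX(keyid)  ((keyid) & KEY_INDEX_MASK)

    int main(void)
    {
        camel_key_t keyid = KEYID_MAKE(1234, 56);

        /* Given only the 32-bit keyid we can get straight back to the block
         * holding the key string and flags, then index into it. */
        printf("block=%u index=%u\n",
               (unsigned) KEYID_BLOCK(keyid), (unsigned) KEYID_INDEX(keyid));
        return 0;
    }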
CamelPartitionTable

An implementation of a scalable, on-disk 'perfect' hash table. It uses the CamelBlockFile to handle its I/O. This is a completely new hash table implementation which was not present in libibex2. [FIXME: Reference the original paper the algorithm is based on.]

A partition table consists of a list of mapping blocks (CamelPartitionMapBlock), each a compact table that maps a range of hashids to a partition block (CamelPartitionKeyBlock), which contains the hashids of that range.

[diagram - camelpartitiontable]

The partition block only maps the hashid to a keyid (see CamelKeyTable), which means it can store a lot of keys in each block. To add a new value to the partition table:

1. Calculate the hash value of the key.
2. Find out which partition block the key will fit into, using the partition table.
3. If the partition block is full:
4.   If there is room in the next or previous block:
5.     Merge the 2 blocks together, and split at the half-way point.
6.     Update the partition table hash indices to match the blocks.
7.   Else:
8.     Create a new block, and split the existing block across it.
9.     Insert the new block into the partition table.
10. Else:
11.   Just add the key to the end of the block.

Steps 5 and 8 sort the partition key entries by hashid in order to find the midpoint. It may be beneficial to keep the hashids sorted at all times: splitting a block would then not require a sort, and key lookups could use a binary search. However, the incremental sort may be more expensive.

If the partition table itself fills up, a similar splitting function is performed on its blocks, and the table is stored over multiple blocks. With a block size of 1024 bytes, we can fit 127 block pointers, each referencing a block with 127 keys in it - around 16,000 keys. So we only need 1024 bytes of memory for each 16,000 on-disk keys (assuming full tables).

Removal is basically the same, but if we end up with an empty block we just remove it from the partition table. CamelTextIndex doesn't actually use removal, although it is implemented in CamelPartitionTable.

Lookup is very simple: we follow steps 1 and 2, then perform a linear search through the block to find a matching hashid. That is our key. This assumes a perfect hash; additionally, the code could use the keyid to look the key up in a key table and verify it is indeed the right one. That would require supporting duplicate hashids and would make block splitting slightly more complex, but only by a couple of lines of code. It is something that will probably have to be addressed in the future.

Using a partition table means we can tell with 1 disk access whether or not a key exists (assuming a perfect hash function), and with 1 more access look up all of the details of the key, since the keyid is reversible. Another feature is that the partition table is always self-balancing, for any data processed in any order. Yet another is that it is quite easy to order the writes to the partition table so that its structure is always consistent, even in the event of program failure - although this has been disabled in the current code to take maximal advantage of the block cache.
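To make the lookup path concrete, here is a minimal sketch using in-memory arrays in place of CamelBlockFile blocks. The type and field names are invented and the on-disk layout differs; it assumes the map entries are kept in ascending order of each block's upper hash bound, and (per the perfect-hash assumption above) treats a matching hashid as the key.

    /* Sketch of a partition-table lookup: find the block covering the hash,
     * then linearly scan it for a matching hashid. */
    #include <stdint.h>
    #include <stddef.h>

    typedef uint32_t camel_hash_t;
    typedef uint32_t camel_key_t;

    struct partition_key {              /* one entry in a partition block */
        camel_hash_t hashid;
        camel_key_t  keyid;
    };

    struct partition_block {            /* the hashid -> keyid entries for one range */
        struct partition_key *keys;
        size_t used;
    };

    struct partition_map {              /* maps "hash <= hashid" to a partition block */
        camel_hash_t hashid;            /* upper bound of this block's range */
        struct partition_block *block;
    };

    /* Returns the keyid for the hash, or 0 if it is not present
     * (assuming 0 is never a valid keyid). */
    camel_key_t partition_lookup(const struct partition_map *map, size_t n_map,
                                 camel_hash_t hash)
    {
        size_t i, j;

        for (i = 0; i < n_map; i++) {
            if (hash <= map[i].hashid) {            /* step 2: pick the block */
                const struct partition_block *pb = map[i].block;

                for (j = 0; j < pb->used; j++)      /* linear scan of the block */
                    if (pb->keys[j].hashid == hash)
                        return pb->keys[j].keyid;
                break;
            }
        }
        return 0;
    }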
CamelTextIndex

CamelTextIndex is the implementation of CamelIndex now used by Camel for indexing mail. It shares some features with the second incarnation of libibex, but is generally simpler. It uses the previously described classes to implement the CamelIndex interface.

Indexing operation

The indexing operation is similar to libibex2's, but without the requirement to maintain the reverse index:

1. Look up the messageid from the message name, using the messageid partition table.
2. Generate a list of words in the message.
3. For each word:
4.   Look up the wordid and sequence information.
5.   Append the messageid to the word sequence.

In practice we also have a word cache, which collects up to 32 messageids for each word before they are written to the key file.

Removing a message

Removal is not immediate. This is one of the major performance improvements in CamelIndex.

1. Look up the messageid from the message name partition table.
2. Use the messageid to set a flag in the message key table indicating that the message has been deleted.
3. Remove the key hash from the partition table.

This comes down to a maximum of 2 disk reads and 2 disk writes. libibex2 had unbounded maximums, depending on the number of words in a given message. The key file is not changed. Because data is never actually removed from the files, an additional, optional step is required: compressing the indices.

Performing a query

Performing a query is much the same as with libibex2. We usually have slightly less disk I/O because of a more efficient and scalable hash table implementation, and improved locality of reference in the key table data.

1. Look up the wordid from the word partition table.
2. Use the wordid to get the data pointer directly from the key table.
3. Iterate through the key file, reading blocks backwards through the file.

Compressing

Although it could have benefited from it, libibex2 never compressed its indices - the only way to compress an index was to remove it and have it rebuilt. CamelIndex requires a compression stage, as data is otherwise never removed from it. Because of the much greater locality of reference, the compression stage is actually much faster than incremental removal of data inside the data files would be.

Compression comprises the following steps:

1. Open a new temporary index, with an index block file and an index key file.
2. For each message in the message partition table:
3.   If the message is not marked deleted, add it to the new message partition table, and record the old-messageid to new-messageid mapping.
4. For each word in the word partition table:
5.   For each messageid in the word sequence list:
6.     If the messageid maps to a new messageid, remap it, else discard it.
7.   Concatenate up to 256 messageids in a row before writing to the key file, to improve lookups.
8.   Create a new word in the new word key table.
9.   Add the wordid and new sequence id to the word partition table.

Note that at step 8 we could (should?) also check whether the word has any messages associated with it, and if not, discard the word from the new index.

After compression, the name partition index only contains names which are not deleted, and the key file is compressed into larger blocks, which take up less space and are faster to retrieve.

During index operations a number of statistics are kept which trigger an automatic compress when the file fragmentation or the number of deleted messages exceeds a threshold. So the index maintains itself, and does not need manual compression.
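To make steps 5 and 6 of the compression pass concrete, here is a minimal sketch of the messageid remapping over a single word's sequence, using plain arrays instead of the key file. All names are invented, and the real code additionally batches up to 256 ids per block as noted in step 7; it is assumed here that a remap value of 0 marks a deleted message.

    /* Sketch of the compression remap: rewrite a word's messageid sequence,
     * translating surviving ids and dropping ids of expunged messages. */
    #include <stdint.h>
    #include <stddef.h>

    typedef uint32_t camel_key_t;

    /* remap[old] holds the new messageid, or 0 if the message was deleted.
     * Rewrites seq in place and returns the number of ids kept. */
    size_t compress_remap_sequence(const camel_key_t *remap, size_t remap_len,
                                   camel_key_t *seq, size_t seq_len)
    {
        size_t in, out = 0;

        for (in = 0; in < seq_len; in++) {
            camel_key_t old = seq[in];

            if (old < remap_len && remap[old] != 0)
                seq[out++] = remap[old];    /* keep, under its new id */
            /* else: the message was expunged, so the reference is discarded */
        }
        return out;
    }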