From ab7d6ef097bf18a3ebb97c4d73947fe8d2f059c9 Mon Sep 17 00:00:00 2001 From: nobody Date: Sat, 28 Dec 2002 04:49:29 +0000 Subject: This commit was manufactured by cvs2svn to create tag 'GHEX_2_0_0'. svn path=/tags/GHEX_2_0_0/; revision=19188 --- camel/devel-docs/camel-index.txt | 407 --------------------------------------- 1 file changed, 407 deletions(-) delete mode 100644 camel/devel-docs/camel-index.txt (limited to 'camel/devel-docs/camel-index.txt') diff --git a/camel/devel-docs/camel-index.txt b/camel/devel-docs/camel-index.txt deleted file mode 100644 index 0ebc250db0..0000000000 --- a/camel/devel-docs/camel-index.txt +++ /dev/null @@ -1,407 +0,0 @@ - -The history of text-indexing in evolution. - -CamelTextIndex was written to address several shortcomings in the -existing libibex (referred to as libibex2), which had been written to -address shortcomings in the original libibex. - -Mail indexing characteristics - -First, i'll cover some of the scenarios that a mail indexing system -must cover. They are slightly different from other indexing systems, -at least we wanted them to be. - -1. Indexing a few new messages, they may potentially reference most of - the alphabet in the index. -2. Indexing a whole mailbox for the first time -3. Unindexing anywhere from a few to all existing messages during expunge. -4. Searching. - -Cases 1, 3, and 4 occur the most often, however 2 is the most noticeable -at first use, or if the index must be reset. So the code needs to -work fast in all cases, which generally leads to trade-offs being -made. Each implementation aimed to address or ignore these -requirements in different ways, with the final implementation probably -having the best balance so far. - -The main issue is that the indexing be performed real time. We index -as we add the messages. We index before we open the mailbox. We -index as we remove messages. Because of this we need to approach -things differently to many other indexing systems; most of which work -with static data in an off-line mode. This allows them to index the -whole body of content and use as much memory and cpu time as required. - -We probably need to look at doing offline, or at least delayed -indexing in the future - but this introduces some coherency problems -with vFolders and any body searches. However, having the indexing -library a base part of Camel helps in implementing a mechanism to -achieve this. - -Ibex the first - -The original ibex used a memory-based hash table to store words. This made -the index very fast for both lookups and modifications. However any -queries required a full load of the index into memory, and any updates -required a full write of the index to disk. After about 5-10 000 -messages occasionaly adding to the index became appreciably slower as -the whole index needed to be loaded into memory first. This obviously -took a toll on memory as well. - -I wont cover the algorithms used, they were all pretty basic, the only -real smarts were that index deletes were only flagged and that data -not written to disk when the index saved. - -Evolution 1.x, ibex 2. - -In an attempt to rectify the incremental update performance of -libibex, it was completely rewritten to use an on-disk block-based -filesystem. - -Note: the first attempt used libdb - however performance was so slow -and the indices were so large it was dropped in favour of a custom -filesystem-like data file. - -The motivation was that a few extra disk lookups during -retrieval wouldn't be noticeably slower, however it should be able to -scale up to many more messages with lower memory overhead and slower -startup time. - -The block filesystem contains 3 major components: - -1. A hash table that mapped message's to a word id sequence list. -2. A hash table that mapped word's to a message id sequence list. -3. A sequence filesystem that stored sequences of id's. - -The id's are 32 bit identifiers that are unique for each word or -message. They are also designed to be reversible and static. -That is, given the id, you can map it to the string identifier that it -represents directly, without having to look it up in the hash table. - -Other features of this design is that the database file should be -kept in sync at all times with the state of the index. The message to -wordid tables are used to remove the messageid's from the word's it -contains when the message is expunged, and so on. - -Indexing operation - -The indexing operation consists of the basic steps: - -1. Lookup the messageid from the message name, using the messageid table. -2. Generate a list of words in the message -3. For each word: -4. Lookup the wordid and sequence information -5. If the word doesn't exist, create a new word/wordid -6. Add the messageid to the word sequence. -7. Add the wordid to the message sequence. - -The initial implementation only used caching at the disk-block level. -Unfortunately, the simple hash table design chosen (fixed sized base -table with chained buckets) scaled very poorly above about 10 000 -messages. So this approach proved to be too i/o intensive for -practical use, and several other caches were added to improve -performance: - -1. Stage (1) above is done entirely in memory. At initial startup - the whole list of potential names is read into an in-memory hash - table. -2. Stage (4) above is also done entirely in memory. Even a large - cache provided little benefit due to wide distribution of potential - words. This cache is only created when adding to the index. -3. Stage (6) uses the table from stage (4) and concatenates upto - approximately one disk blocks worth of messageid's before writing - them out to the word sequence. -4. Stage (7) concatenates all wordid's for a given message before - writing them out at once. - -As you can see, the added complexity meant we nearly have to cache as -much as the original version! This also almost removed all of the -startup-time benefit for incremental update of the index, as the table -was not stored as compactly on disk as the original version. - -However, we only ever stored a subset of the index in memory, and only -during updates, with some tricks to reduce memory usage for very rare -words, so the overall memory use was still much lower. - -Removing a message - -Removing a message is fairly involved: - -1. Lookup the messageid and word sequence list from the messageid table. -2. For each wordid in the sequence list -3. Lookup the message sequence list directly from the wordid table. -4. Scan each block in the sequence, and remove any instances of the - messageid. -5. Remove the message to messageid mapping in the messageid table. - -Unfortunately caching helped very little here, particularly if many -messages were removed. Also note that the file could never shrink as -the data could be spread randomly over it. Removal is an extremely -expensive an unbounded process. Deleting all of the messages in a -mailbox is extremely i/o intensive, with blocks potentially being -accessed dozens of times. - -Performing a query - -Performing a query is fast: - -1. Lookup the messageid sequence list from the wordid table. -2. For each messageid -3. Lookup the message name directly from the messageid table. - -Even without caching this performs at a very acceptable level. - -Summary - -This index performs reasonably well upto about 10 000 messages for a -complete re-index. However with incremental updates it degrads much -faster, only a few thousand messages added and it becomes tiresomely -slow and i/o bound. The index becomes more fragmented with random -updates and removals and heavily bogs down the system as you go much -beyond those few thousand messages. - -The code is also very complicated and hard to follow. There are too -many special cases, and it is buggy. Detected on-disk structure -errors result in the index being reset, which although it shrinks the -index, is very slow. - -The indices created are bulky, and never shrink. Because of the -reverse index used for message removal, there is 50% redundant data at -all times. Some overly tricky techniques (very much like ReiserFS's -tail packing) are used to waste as little space as possible, with a -great impact on performance. - -One other problem is that because the index is disk based, we -use a file descriptor continuously. With some users having ->100 folders, they quickly run out of process file descriptors and -evolution fails. To get around this a cache of least recently used -index files is used to flush away and free file descriptors so they -can be re-used. This makes it hard to lock the files; this problem -still exists with the next implementation. - -Anyway, a better solution is required. - -CamelIndex - -The first problem to address was the api. It was starting to age. -Although adequate, the api wasn't terribly clean, reusable, or -scalable. The first thing was to objectise the library, and since we -needed to use it in Camel, the best way was to create a CamelObject. - -CamelIndex was born. A mostly abstract class that provides a simple -common interface for accessing indices, including cursors and utility -and maintenance functions. - -In addition, a number of the features in libibex2 were simplified or -rewritten and abstracted into the re-usable classes that follow. - -By providing simple cursors, more complex queries were easier to write -and can execute more efficiently; camel searching now does sub-string -searches for all body queries, and still runs at a very healthy speed -and uses less memory than before. - -CamelBlockFile - -This is basically the same block filesystem used in libibex2. It -handles disk i/o based on blocks (CamelBlock), flushing modified -blocks to disk, and caching of recently accessed blocks. It was -enhanced slightly to allow blocks to be locked in memory. - -CamelKeyFile - -This is a simple reverse-linked list of sequences of keyid's. - -The main property of this file is that updates are only ever appended -to the end of the file, which improves i/o characteristics markedly. - -When an existing keyid sequence is updated, it simply points back to -the start of the previous one, and provides a pointer to the new -entry. i.e. a simple linked list. - -CamelKeyTable - -This is taken from the libibex2 code for mapping keys, with few -changes. It uses a CamelBlockFile for its i/o. - -The key table is a linked list of blocks (CamelKeyBlock) which contain -key strings and and a data pointer and flags for each key. Each block -is a packed array of string descriptors (CamelKeyKey's). - -A keyid (camel_key_t) is a 32 bit descriptor which identifies this key -in a reversible way. In this case the bottom 10 bits are used to -identify the index of the key within the key block, and the top 22 -bits are used to identify the key block itself. In this way, given -the 32 bit key id, we can reference the block containing the key -directly (with at most 1 access), and access the flags and key string -using the key index. - -Keys can potentially be removed and their keyid's reused by simply -re-packing the key block. This was used in libibex2, but not in -CamelIndex. - -[diagram - camelkeyblock] - -CamelPartitionTable - -An implementation of a scalable, on-disk 'perfect' hash table. It -uses the CamelBlockFile to handle its i/o. This is a completely new -hash table implementation which was not present in libibex2. - -[FIXME: Reference the original paper the algorithm is based on.] - -A partition table consists of a list of mapping blocks -(CamelPartitionMapBlock), which is a compact table that maps a range -of hashid's to a partition block (CamelPartitionKeyBlock), which -contains hashid's of that range. - -[diagram - camelpartitiontable] - -The partition block only maps the hashid to a keyid (see CamelKeyTable) -which means it can store a lot of keys in each block. - -To add a new value to the partition table: - -1. Calculate the hash value of the key -2. Find out which partition block the key will fit into, using the - partition table. -3. If the partition block is full: -4. If there is room in the next or previous block: -5. Merge the 2 blocks together, and split at the half-way point -6. Update the partition table hash indices to match the blocks -7. Else -8. Create a new block, and split the existing block across it -9. Insert the new block into the partition table -10. Else -11. Just add the key to the end of the block. - -Steps 5 and 8 perform a sorting of the partition key entries by hashid -to find the midpoint. It may be beneficial to store the hashid's -sorted always, it would then not require a sort to split the blocks. -This would also benefit key lookups by being able to use a binary -search. However, the incremental sort may be more expensive. - -If the partition table itself fills up, then perform a similar -splitting function on its blocks, and store it over multiple blocks. -With a block size of 1024 bytes, we can fit 127 blocks pointers, each -with 127 keys in it - around 16000 keys. So we only need 1024 bytes -of memory for each 16000 on-disk keys (assuming full tables). - -Removal is basically the same, but if we end up with an empty block we -just remove it from the partition table. CamelTextIndex doesn't -actually use removal although it is implemented in -CamelPartitionTable. - -Lookup is very simple. We basically follow steps 1 and 2, and then -perform a linear search through the block to find a matching hash id. -That is our key. This is assuming a perfect hash, additionally the -code could use the keyid to lookup in a keytable to verify the key is -indeed the right one. This would require having to support duplicate -hashid's and would make block splitting slightly more complex, but -only by a couple of lines of code. This is something that will -probably have to be addressed in the future. - -Using a partition table means that we can tell with 1 disk access -whether or not a key exists (assuming a perfect hash function), and 1 -more access to look up all of the details of the key since the keyid -is reversible. Another feature is that the partition table is always -self-balancing for any data processed in any order. - -Yet one more feature is that it is quite easy to order the writes to -the partition table so that its structure is always consistent, even -in the event of program failure. Although this has been disabled in -the current code to take maximal advantage of the block cache. - -CamelTextIndex - -CamelTextIndex is the implementation of CamelIndex now used by camel -for indexing mail. It shares some features with the second -incarnation of libibex, but is generally simpler. It uses the -previously described classes to implement the CamelIndex interface. - -Indexing operation - -Indexing operation is similar to libibex2, but without the requirement -to maintain the reverse index. - -1. Lookup the messageid from the message name, using the messageid - partition table. -2. Generate a list of words in the message -3. For each word -4. Lookup the wordid and sequence information. -5. Append the messageid to the word sequence. - -In practice we also have a word cache which caches upto 32 messageid's -for each word before it is written to the key file. - -Removing a message - -Removal is not immediate. This is one of the major performance -improvements in CamelIndex. - -1. Lookup the messageid from the message name partition table -2. Use the messageid to set a flag in the message key table to - indicate the message has been deleted. -3. Remove the key hash from the partition table. - -This comes down to a maximum of 2 disk reads and 2 disk writes. -libibex2 had unbounded maximums, depending on the number of words in a -given message. The key file is not changed. - -Because data is not removed from the files at all, an additional -optional step is required, that of compressing the indices. - -Performing a query - -Performing a query is much the same as with libibex2. We usually have -slightly less disk i/o because of a more efficient and scalable hash -table implementation, and improved locality of reference of the key -table data. - -1. Lookup the messageid from the message name partition table -2. Use the messageid to get the data pointer directly from the key - table. -3. Iterate through the key file, reading blocks backwards through the - file. - -Compressing - -Although it could have benefited from it, libibex2 did not ever -compress indices - the only way to compress an index was to remove it -and have it be rebuilt. - -CamelIndex requires a compression stage as data is never removed from -it otherwise. Because of the much greater locality of reference, the -compression stage is actually much faster than an incremental removal -of data inside the data files. - -Compressing comprises the following steps: - -1. Open a new temporary index, an index block file and an index key - file. -2. For each message in the message partition table -3. If the message is not marked deleted, add it to the new message - partition table, and recored the old messageid to new messageid - mapping. -4. For each word in the word partition table -5. For each messageid's in the word sequence list -6. If the messageid maps to a new messageid, remap the messageid, - else discard it. -7. Concatenate upto 256 messageid's in a row before writing to the - key file, to improve lookups. -8. Create a new word in the new word key table -9. Add the wordid and new sequence id to the word partition table. - -Note that at step 8 we could (should?) also check if the word has any -messages associated with it, and discard the word from the new index. - -After compression, the name partition index only contains names which -are not deleted, and the key file is compressed into larger blocks -which takes up less space and is faster to retrieve. - -During index operations a number of statistics are taken which trigger -an automatic compress when the file fragmentation or number of deleted -messages exceed a threshold. So the index maintains itself, and does -not need manual compression. - - - - -- cgit v1.2.3