aboutsummaryrefslogtreecommitdiffstats
path: root/camel/devel-docs/camel-index.txt
diff options
context:
space:
mode:
authornobody <nobody@localhost>2003-05-03 19:02:31 +0800
committernobody <nobody@localhost>2003-05-03 19:02:31 +0800
commitd34577c91655ef19466b05f9270e2bc5c4ab83fa (patch)
tree1445967f5c0ef4c749bc1b4030295ea1a12e3c00 /camel/devel-docs/camel-index.txt
parent19f2626e65d1700ff9c631a70ecb917f98dfcb38 (diff)
downloadgsoc2013-evolution-R3_1.tar
gsoc2013-evolution-R3_1.tar.gz
gsoc2013-evolution-R3_1.tar.bz2
gsoc2013-evolution-R3_1.tar.lz
gsoc2013-evolution-R3_1.tar.xz
gsoc2013-evolution-R3_1.tar.zst
gsoc2013-evolution-R3_1.zip
This commit was manufactured by cvs2svn to create tag 'R3_1'.R3_1
svn path=/tags/R3_1/; revision=21091
Diffstat (limited to 'camel/devel-docs/camel-index.txt')
-rw-r--r--camel/devel-docs/camel-index.txt407
1 files changed, 0 insertions, 407 deletions
diff --git a/camel/devel-docs/camel-index.txt b/camel/devel-docs/camel-index.txt
deleted file mode 100644
index 0ebc250db0..0000000000
--- a/camel/devel-docs/camel-index.txt
+++ /dev/null
@@ -1,407 +0,0 @@
-
-The history of text-indexing in evolution.
-
-CamelTextIndex was written to address several shortcomings in the
-existing libibex (referred to as libibex2), which had been written to
-address shortcomings in the original libibex.
-
-Mail indexing characteristics
-
-First, i'll cover some of the scenarios that a mail indexing system
-must cover. They are slightly different from other indexing systems,
-at least we wanted them to be.
-
-1. Indexing a few new messages, they may potentially reference most of
- the alphabet in the index.
-2. Indexing a whole mailbox for the first time
-3. Unindexing anywhere from a few to all existing messages during expunge.
-4. Searching.
-
-Cases 1, 3, and 4 occur the most often, however 2 is the most noticeable
-at first use, or if the index must be reset. So the code needs to
-work fast in all cases, which generally leads to trade-offs being
-made. Each implementation aimed to address or ignore these
-requirements in different ways, with the final implementation probably
-having the best balance so far.
-
-The main issue is that the indexing be performed real time. We index
-as we add the messages. We index before we open the mailbox. We
-index as we remove messages. Because of this we need to approach
-things differently to many other indexing systems; most of which work
-with static data in an off-line mode. This allows them to index the
-whole body of content and use as much memory and cpu time as required.
-
-We probably need to look at doing offline, or at least delayed
-indexing in the future - but this introduces some coherency problems
-with vFolders and any body searches. However, having the indexing
-library a base part of Camel helps in implementing a mechanism to
-achieve this.
-
-Ibex the first
-
-The original ibex used a memory-based hash table to store words. This made
-the index very fast for both lookups and modifications. However any
-queries required a full load of the index into memory, and any updates
-required a full write of the index to disk. After about 5-10 000
-messages occasionaly adding to the index became appreciably slower as
-the whole index needed to be loaded into memory first. This obviously
-took a toll on memory as well.
-
-I wont cover the algorithms used, they were all pretty basic, the only
-real smarts were that index deletes were only flagged and that data
-not written to disk when the index saved.
-
-Evolution 1.x, ibex 2.
-
-In an attempt to rectify the incremental update performance of
-libibex, it was completely rewritten to use an on-disk block-based
-filesystem.
-
-Note: the first attempt used libdb - however performance was so slow
-and the indices were so large it was dropped in favour of a custom
-filesystem-like data file.
-
-The motivation was that a few extra disk lookups during
-retrieval wouldn't be noticeably slower, however it should be able to
-scale up to many more messages with lower memory overhead and slower
-startup time.
-
-The block filesystem contains 3 major components:
-
-1. A hash table that mapped message's to a word id sequence list.
-2. A hash table that mapped word's to a message id sequence list.
-3. A sequence filesystem that stored sequences of id's.
-
-The id's are 32 bit identifiers that are unique for each word or
-message. They are also designed to be reversible and static.
-That is, given the id, you can map it to the string identifier that it
-represents directly, without having to look it up in the hash table.
-
-Other features of this design is that the database file should be
-kept in sync at all times with the state of the index. The message to
-wordid tables are used to remove the messageid's from the word's it
-contains when the message is expunged, and so on.
-
-Indexing operation
-
-The indexing operation consists of the basic steps:
-
-1. Lookup the messageid from the message name, using the messageid table.
-2. Generate a list of words in the message
-3. For each word:
-4. Lookup the wordid and sequence information
-5. If the word doesn't exist, create a new word/wordid
-6. Add the messageid to the word sequence.
-7. Add the wordid to the message sequence.
-
-The initial implementation only used caching at the disk-block level.
-Unfortunately, the simple hash table design chosen (fixed sized base
-table with chained buckets) scaled very poorly above about 10 000
-messages. So this approach proved to be too i/o intensive for
-practical use, and several other caches were added to improve
-performance:
-
-1. Stage (1) above is done entirely in memory. At initial startup
- the whole list of potential names is read into an in-memory hash
- table.
-2. Stage (4) above is also done entirely in memory. Even a large
- cache provided little benefit due to wide distribution of potential
- words. This cache is only created when adding to the index.
-3. Stage (6) uses the table from stage (4) and concatenates upto
- approximately one disk blocks worth of messageid's before writing
- them out to the word sequence.
-4. Stage (7) concatenates all wordid's for a given message before
- writing them out at once.
-
-As you can see, the added complexity meant we nearly have to cache as
-much as the original version! This also almost removed all of the
-startup-time benefit for incremental update of the index, as the table
-was not stored as compactly on disk as the original version.
-
-However, we only ever stored a subset of the index in memory, and only
-during updates, with some tricks to reduce memory usage for very rare
-words, so the overall memory use was still much lower.
-
-Removing a message
-
-Removing a message is fairly involved:
-
-1. Lookup the messageid and word sequence list from the messageid table.
-2. For each wordid in the sequence list
-3. Lookup the message sequence list directly from the wordid table.
-4. Scan each block in the sequence, and remove any instances of the
- messageid.
-5. Remove the message to messageid mapping in the messageid table.
-
-Unfortunately caching helped very little here, particularly if many
-messages were removed. Also note that the file could never shrink as
-the data could be spread randomly over it. Removal is an extremely
-expensive an unbounded process. Deleting all of the messages in a
-mailbox is extremely i/o intensive, with blocks potentially being
-accessed dozens of times.
-
-Performing a query
-
-Performing a query is fast:
-
-1. Lookup the messageid sequence list from the wordid table.
-2. For each messageid
-3. Lookup the message name directly from the messageid table.
-
-Even without caching this performs at a very acceptable level.
-
-Summary
-
-This index performs reasonably well upto about 10 000 messages for a
-complete re-index. However with incremental updates it degrads much
-faster, only a few thousand messages added and it becomes tiresomely
-slow and i/o bound. The index becomes more fragmented with random
-updates and removals and heavily bogs down the system as you go much
-beyond those few thousand messages.
-
-The code is also very complicated and hard to follow. There are too
-many special cases, and it is buggy. Detected on-disk structure
-errors result in the index being reset, which although it shrinks the
-index, is very slow.
-
-The indices created are bulky, and never shrink. Because of the
-reverse index used for message removal, there is 50% redundant data at
-all times. Some overly tricky techniques (very much like ReiserFS's
-tail packing) are used to waste as little space as possible, with a
-great impact on performance.
-
-One other problem is that because the index is disk based, we
-use a file descriptor continuously. With some users having
->100 folders, they quickly run out of process file descriptors and
-evolution fails. To get around this a cache of least recently used
-index files is used to flush away and free file descriptors so they
-can be re-used. This makes it hard to lock the files; this problem
-still exists with the next implementation.
-
-Anyway, a better solution is required.
-
-CamelIndex
-
-The first problem to address was the api. It was starting to age.
-Although adequate, the api wasn't terribly clean, reusable, or
-scalable. The first thing was to objectise the library, and since we
-needed to use it in Camel, the best way was to create a CamelObject.
-
-CamelIndex was born. A mostly abstract class that provides a simple
-common interface for accessing indices, including cursors and utility
-and maintenance functions.
-
-In addition, a number of the features in libibex2 were simplified or
-rewritten and abstracted into the re-usable classes that follow.
-
-By providing simple cursors, more complex queries were easier to write
-and can execute more efficiently; camel searching now does sub-string
-searches for all body queries, and still runs at a very healthy speed
-and uses less memory than before.
-
-CamelBlockFile
-
-This is basically the same block filesystem used in libibex2. It
-handles disk i/o based on blocks (CamelBlock), flushing modified
-blocks to disk, and caching of recently accessed blocks. It was
-enhanced slightly to allow blocks to be locked in memory.
-
-CamelKeyFile
-
-This is a simple reverse-linked list of sequences of keyid's.
-
-The main property of this file is that updates are only ever appended
-to the end of the file, which improves i/o characteristics markedly.
-
-When an existing keyid sequence is updated, it simply points back to
-the start of the previous one, and provides a pointer to the new
-entry. i.e. a simple linked list.
-
-CamelKeyTable
-
-This is taken from the libibex2 code for mapping keys, with few
-changes. It uses a CamelBlockFile for its i/o.
-
-The key table is a linked list of blocks (CamelKeyBlock) which contain
-key strings and and a data pointer and flags for each key. Each block
-is a packed array of string descriptors (CamelKeyKey's).
-
-A keyid (camel_key_t) is a 32 bit descriptor which identifies this key
-in a reversible way. In this case the bottom 10 bits are used to
-identify the index of the key within the key block, and the top 22
-bits are used to identify the key block itself. In this way, given
-the 32 bit key id, we can reference the block containing the key
-directly (with at most 1 access), and access the flags and key string
-using the key index.
-
-Keys can potentially be removed and their keyid's reused by simply
-re-packing the key block. This was used in libibex2, but not in
-CamelIndex.
-
-[diagram - camelkeyblock]
-
-CamelPartitionTable
-
-An implementation of a scalable, on-disk 'perfect' hash table. It
-uses the CamelBlockFile to handle its i/o. This is a completely new
-hash table implementation which was not present in libibex2.
-
-[FIXME: Reference the original paper the algorithm is based on.]
-
-A partition table consists of a list of mapping blocks
-(CamelPartitionMapBlock), which is a compact table that maps a range
-of hashid's to a partition block (CamelPartitionKeyBlock), which
-contains hashid's of that range.
-
-[diagram - camelpartitiontable]
-
-The partition block only maps the hashid to a keyid (see CamelKeyTable)
-which means it can store a lot of keys in each block.
-
-To add a new value to the partition table:
-
-1. Calculate the hash value of the key
-2. Find out which partition block the key will fit into, using the
- partition table.
-3. If the partition block is full:
-4. If there is room in the next or previous block:
-5. Merge the 2 blocks together, and split at the half-way point
-6. Update the partition table hash indices to match the blocks
-7. Else
-8. Create a new block, and split the existing block across it
-9. Insert the new block into the partition table
-10. Else
-11. Just add the key to the end of the block.
-
-Steps 5 and 8 perform a sorting of the partition key entries by hashid
-to find the midpoint. It may be beneficial to store the hashid's
-sorted always, it would then not require a sort to split the blocks.
-This would also benefit key lookups by being able to use a binary
-search. However, the incremental sort may be more expensive.
-
-If the partition table itself fills up, then perform a similar
-splitting function on its blocks, and store it over multiple blocks.
-With a block size of 1024 bytes, we can fit 127 blocks pointers, each
-with 127 keys in it - around 16000 keys. So we only need 1024 bytes
-of memory for each 16000 on-disk keys (assuming full tables).
-
-Removal is basically the same, but if we end up with an empty block we
-just remove it from the partition table. CamelTextIndex doesn't
-actually use removal although it is implemented in
-CamelPartitionTable.
-
-Lookup is very simple. We basically follow steps 1 and 2, and then
-perform a linear search through the block to find a matching hash id.
-That is our key. This is assuming a perfect hash, additionally the
-code could use the keyid to lookup in a keytable to verify the key is
-indeed the right one. This would require having to support duplicate
-hashid's and would make block splitting slightly more complex, but
-only by a couple of lines of code. This is something that will
-probably have to be addressed in the future.
-
-Using a partition table means that we can tell with 1 disk access
-whether or not a key exists (assuming a perfect hash function), and 1
-more access to look up all of the details of the key since the keyid
-is reversible. Another feature is that the partition table is always
-self-balancing for any data processed in any order.
-
-Yet one more feature is that it is quite easy to order the writes to
-the partition table so that its structure is always consistent, even
-in the event of program failure. Although this has been disabled in
-the current code to take maximal advantage of the block cache.
-
-CamelTextIndex
-
-CamelTextIndex is the implementation of CamelIndex now used by camel
-for indexing mail. It shares some features with the second
-incarnation of libibex, but is generally simpler. It uses the
-previously described classes to implement the CamelIndex interface.
-
-Indexing operation
-
-Indexing operation is similar to libibex2, but without the requirement
-to maintain the reverse index.
-
-1. Lookup the messageid from the message name, using the messageid
- partition table.
-2. Generate a list of words in the message
-3. For each word
-4. Lookup the wordid and sequence information.
-5. Append the messageid to the word sequence.
-
-In practice we also have a word cache which caches upto 32 messageid's
-for each word before it is written to the key file.
-
-Removing a message
-
-Removal is not immediate. This is one of the major performance
-improvements in CamelIndex.
-
-1. Lookup the messageid from the message name partition table
-2. Use the messageid to set a flag in the message key table to
- indicate the message has been deleted.
-3. Remove the key hash from the partition table.
-
-This comes down to a maximum of 2 disk reads and 2 disk writes.
-libibex2 had unbounded maximums, depending on the number of words in a
-given message. The key file is not changed.
-
-Because data is not removed from the files at all, an additional
-optional step is required, that of compressing the indices.
-
-Performing a query
-
-Performing a query is much the same as with libibex2. We usually have
-slightly less disk i/o because of a more efficient and scalable hash
-table implementation, and improved locality of reference of the key
-table data.
-
-1. Lookup the messageid from the message name partition table
-2. Use the messageid to get the data pointer directly from the key
- table.
-3. Iterate through the key file, reading blocks backwards through the
- file.
-
-Compressing
-
-Although it could have benefited from it, libibex2 did not ever
-compress indices - the only way to compress an index was to remove it
-and have it be rebuilt.
-
-CamelIndex requires a compression stage as data is never removed from
-it otherwise. Because of the much greater locality of reference, the
-compression stage is actually much faster than an incremental removal
-of data inside the data files.
-
-Compressing comprises the following steps:
-
-1. Open a new temporary index, an index block file and an index key
- file.
-2. For each message in the message partition table
-3. If the message is not marked deleted, add it to the new message
- partition table, and recored the old messageid to new messageid
- mapping.
-4. For each word in the word partition table
-5. For each messageid's in the word sequence list
-6. If the messageid maps to a new messageid, remap the messageid,
- else discard it.
-7. Concatenate upto 256 messageid's in a row before writing to the
- key file, to improve lookups.
-8. Create a new word in the new word key table
-9. Add the wordid and new sequence id to the word partition table.
-
-Note that at step 8 we could (should?) also check if the word has any
-messages associated with it, and discard the word from the new index.
-
-After compression, the name partition index only contains names which
-are not deleted, and the key file is compressed into larger blocks
-which takes up less space and is faster to retrieve.
-
-During index operations a number of statistics are taken which trigger
-an automatic compress when the file fragmentation or number of deleted
-messages exceed a threshold. So the index maintains itself, and does
-not need manual compression.
-
-
-
-