From 21f39505370402403081c6b17f4ce2637637164a Mon Sep 17 00:00:00 2001
From: Not Zed <NotZed@Ximian.com>
Date: Fri, 19 Apr 2002 02:04:35 +0000
Subject: Start of a white-paperish document describing camel-index and older

2002-04-19  Not Zed  <NotZed@Ximian.com>

        * devel-docs/camel-index.txt: Start of a white-paperish document
        describing camel-index and older libibex.

svn path=/trunk/; revision=16531
---
 camel/ChangeLog                  |   5 +
 camel/devel-docs/camel-index.txt | 407 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 412 insertions(+)
 create mode 100644 camel/devel-docs/camel-index.txt

diff --git a/camel/ChangeLog b/camel/ChangeLog
index c2b7d9d9c3..fbfe9c60ce 100644
--- a/camel/ChangeLog
+++ b/camel/ChangeLog
@@ -1,3 +1,8 @@
+2002-04-19  Not Zed  <NotZed@Ximian.com>
+
+	* devel-docs/camel-index.txt: Start of a white-paperish document
+	describing camel-index and older libibex.
+
 2002-04-18  Not Zed  <NotZed@Ximian.com>
 
 	* providers/local/camel-local-store.c (rename_folder): If we get a
diff --git a/camel/devel-docs/camel-index.txt b/camel/devel-docs/camel-index.txt
new file mode 100644
index 0000000000..0ebc250db0
--- /dev/null
+++ b/camel/devel-docs/camel-index.txt
@@ -0,0 +1,407 @@
+
+The history of text-indexing in evolution.
+
+CamelTextIndex was written to address several shortcomings in the
+existing libibex (referred to as libibex2), which had been written to
+address shortcomings in the original libibex.
+
+Mail indexing characteristics
+
+First, i'll cover some of the scenarios that a mail indexing system
+must cover.  They are slightly different from other indexing systems,
+at least we wanted them to be.
+
+1. Indexing a few new messages, they may potentially reference most of
+   the alphabet in the index.
+2. Indexing a whole mailbox for the first time
+3. Unindexing anywhere from a few to all existing messages during expunge.
+4. Searching.
+
+Cases 1, 3, and 4 occur the most often, however 2 is the most noticeable
+at first use, or if the index must be reset.  So the code needs to
+work fast in all cases, which generally leads to trade-offs being
+made.  Each implementation aimed to address or ignore these
+requirements in different ways, with the final implementation probably
+having the best balance so far.
+
+The main issue is that the indexing be performed real time.  We index
+as we add the messages.  We index before we open the mailbox.  We
+index as we remove messages.  Because of this we need to approach
+things differently to many other indexing systems; most of which work
+with static data in an off-line mode.  This allows them to index the
+whole body of content and use as much memory and cpu time as required.
+
+We probably need to look at doing offline, or at least delayed
+indexing in the future - but this introduces some coherency problems
+with vFolders and any body searches.  However, having the indexing
+library a base part of Camel helps in implementing a mechanism to
+achieve this.
+
+Ibex the first
+
+The original ibex used a memory-based hash table to store words.  This made
+the index very fast for both lookups and modifications.   However any
+queries required a full load of the index into memory, and any updates
+required a full write of the index to disk.  After about 5-10 000
+messages occasionaly adding to the index became appreciably slower as
+the whole index needed to be loaded into memory first.  This obviously
+took a toll on memory as well.
+
+I wont cover the algorithms used, they were all pretty basic, the only
+real smarts were that index deletes were only flagged and that data
+not written to disk when the index saved.
+
+Evolution 1.x, ibex 2.
+
+In an attempt to rectify the incremental update performance of
+libibex, it was completely rewritten to use an on-disk block-based
+filesystem.
+
+Note: the first attempt used libdb - however performance was so slow
+and the indices were so large it was dropped in favour of a custom
+filesystem-like data file.
+
+The motivation was that a few extra disk lookups during
+retrieval wouldn't be noticeably slower, however it should be able to
+scale up to many more messages with lower memory overhead and slower
+startup time.
+
+The block filesystem contains 3 major components:
+
+1. A hash table that mapped message's to a word id sequence list.
+2. A hash table that mapped word's to a message id sequence list.
+3. A sequence filesystem that stored sequences of id's.
+
+The id's are 32 bit identifiers that are unique for each word or
+message.  They are also designed to be reversible and static.
+That is, given the id, you can map it to the string identifier that it
+represents directly, without having to look it up in the hash table.
+
+Other features of this design is that the database file should be
+kept in sync at all times with the state of the index.  The message to
+wordid tables are used to remove the messageid's from the word's it
+contains when the message is expunged, and so on.
+
+Indexing operation
+
+The indexing operation consists of the basic steps:
+
+1. Lookup the messageid from the message name, using the messageid table.
+2. Generate a list of words in the message
+3. For each word:
+4.   Lookup the wordid and sequence information
+5.   If the word doesn't exist, create a new word/wordid
+6.   Add the messageid to the word sequence.
+7.   Add the wordid to the message sequence.
+
+The initial implementation only used caching at the disk-block level.
+Unfortunately, the simple hash table design chosen (fixed sized base
+table with chained buckets) scaled very poorly above about 10 000
+messages.  So this approach proved to be too i/o intensive for
+practical use, and several other caches were added to improve
+performance:
+
+1. Stage (1) above is done entirely in memory.  At initial startup
+   the whole list of potential names is read into an in-memory hash
+   table.
+2. Stage (4) above is also done entirely in memory.  Even a large
+   cache provided little benefit due to wide distribution of potential
+   words.  This cache is only created when adding to the index.
+3. Stage (6) uses the table from stage (4) and concatenates upto
+   approximately one disk blocks worth of messageid's before writing
+   them out to the word sequence.
+4. Stage (7) concatenates all wordid's for a given message before
+   writing them out at once.
+
+As you can see, the added complexity meant we nearly have to cache as
+much as the original version!  This also almost removed all of the
+startup-time benefit for incremental update of the index, as the table
+was not stored as compactly on disk as the original version.
+
+However, we only ever stored a subset of the index in memory, and only
+during updates, with some tricks to reduce memory usage for very rare
+words, so the overall memory use was still much lower.
+
+Removing a message
+
+Removing a message is fairly involved:
+
+1. Lookup the messageid and word sequence list from the messageid table.
+2. For each wordid in the sequence list
+3.   Lookup the message sequence list directly from the wordid table.
+4.   Scan each block in the sequence, and remove any instances of the
+     messageid.
+5. Remove the message to messageid mapping in the messageid table.
+
+Unfortunately caching helped very little here, particularly if many
+messages were removed.  Also note that the file could never shrink as
+the data could be spread randomly over it.  Removal is an extremely
+expensive an unbounded process.  Deleting all of the messages in a
+mailbox is extremely i/o intensive, with blocks potentially being
+accessed dozens of times.
+
+Performing a query
+
+Performing a query is fast:
+
+1. Lookup the messageid sequence list from the wordid table.
+2. For each messageid
+3.   Lookup the message name directly from the messageid table.
+
+Even without caching this performs at a very acceptable level.
+
+Summary
+
+This index performs reasonably well upto about 10 000 messages for a
+complete re-index.  However with incremental updates it degrads much
+faster, only a few thousand messages added and it becomes tiresomely
+slow and i/o bound.  The index becomes more fragmented with random
+updates and removals and heavily bogs down the system as you go much
+beyond those few thousand messages.
+
+The code is also very complicated and hard to follow.  There are too
+many special cases, and it is buggy.  Detected on-disk structure
+errors result in the index being reset, which although it shrinks the
+index, is very slow.
+
+The indices created are bulky, and never shrink.  Because of the
+reverse index used for message removal, there is 50% redundant data at
+all times.  Some overly tricky techniques (very much like ReiserFS's
+tail packing) are used to waste as little space as possible, with a
+great impact on performance.
+
+One other problem is that because the index is disk based, we
+use a file descriptor continuously.  With some users having
+>100 folders, they quickly run out of process file descriptors and
+evolution fails.  To get around this a cache of least recently used
+index files is used to flush away and free file descriptors so they
+can be re-used.  This makes it hard to lock the files; this problem
+still exists with the next implementation.
+
+Anyway, a better solution is required.
+
+CamelIndex
+
+The first problem to address was the api.  It was starting to age.
+Although adequate, the api wasn't terribly clean, reusable, or
+scalable.  The first thing was to objectise the library, and since we
+needed to use it in Camel, the best way was to create a CamelObject.
+
+CamelIndex was born.  A mostly abstract class that provides a simple
+common interface for accessing indices, including cursors and utility
+and maintenance functions.
+
+In addition, a number of the features in libibex2 were simplified or
+rewritten and abstracted into the re-usable classes that follow.
+
+By providing simple cursors, more complex queries were easier to write
+and can execute more efficiently; camel searching now does sub-string
+searches for all body queries, and still runs at a very healthy speed
+and uses less memory than before.
+
+CamelBlockFile
+
+This is basically the same block filesystem used in libibex2.  It
+handles disk i/o based on blocks (CamelBlock), flushing modified
+blocks to disk, and caching of recently accessed blocks.  It was
+enhanced slightly to allow blocks to be locked in memory.
+
+CamelKeyFile
+
+This is a simple reverse-linked list of sequences of keyid's.
+
+The main property of this file is that updates are only ever appended
+to the end of the file, which improves i/o characteristics markedly.
+
+When an existing keyid sequence is updated, it simply points back to
+the start of the previous one, and provides a pointer to the new
+entry.  i.e. a simple linked list.
+
+CamelKeyTable
+
+This is taken from the libibex2 code for mapping keys, with few
+changes.  It uses a CamelBlockFile for its i/o.
+
+The key table is a linked list of blocks (CamelKeyBlock) which contain
+key strings and and a data pointer and flags for each key.  Each block
+is a packed array of string descriptors (CamelKeyKey's).
+
+A keyid (camel_key_t) is a 32 bit descriptor which identifies this key
+in a reversible way.  In this case the bottom 10 bits are used to
+identify the index of the key within the key block, and the top 22
+bits are used to identify the key block itself.  In this way, given
+the 32 bit key id, we can reference the block containing the key
+directly (with at most 1 access), and access the flags and key string
+using the key index.
+
+Keys can potentially be removed and their keyid's reused by simply
+re-packing the key block.  This was used in libibex2, but not in
+CamelIndex.
+
+[diagram - camelkeyblock]
+
+CamelPartitionTable
+
+An implementation of a scalable, on-disk 'perfect' hash table.  It
+uses the CamelBlockFile to handle its i/o.  This is a completely new
+hash table implementation which was not present in libibex2.
+
+[FIXME: Reference the original paper the algorithm is based on.]
+
+A partition table consists of a list of mapping blocks
+(CamelPartitionMapBlock), which is a compact table that maps a range
+of hashid's to a partition block (CamelPartitionKeyBlock), which
+contains hashid's of that range.
+
+[diagram - camelpartitiontable]
+
+The partition block only maps the hashid to a keyid (see CamelKeyTable)
+which means it can store a lot of keys in each block.
+
+To add a new value to the partition table:
+
+1. Calculate the hash value of the key
+2. Find out which partition block the key will fit into, using the
+   partition table.
+3. If the partition block is full:
+4.   If there is room in the next or previous block:
+5.     Merge the 2 blocks together, and split at the half-way point
+6.     Update the partition table hash indices to match the blocks
+7.   Else
+8.     Create a new block, and split the existing block across it
+9.     Insert the new block into the partition table
+10. Else
+11.   Just add the key to the end of the block.
+
+Steps 5 and 8 perform a sorting of the partition key entries by hashid
+to find the midpoint.  It may be beneficial to store the hashid's
+sorted always, it would then not require a sort to split the blocks.
+This would also benefit key lookups by being able to use a binary
+search.  However, the incremental sort may be more expensive.
+
+If the partition table itself fills up, then perform a similar
+splitting function on its blocks, and store it over multiple blocks.
+With a block size of 1024 bytes, we can fit 127 blocks pointers, each
+with 127 keys in it - around 16000 keys.  So we only need 1024 bytes
+of memory for each 16000 on-disk keys (assuming full tables).
+
+Removal is basically the same, but if we end up with an empty block we
+just remove it from the partition table.  CamelTextIndex doesn't
+actually use removal although it is implemented in
+CamelPartitionTable.
+
+Lookup is very simple.  We basically follow steps 1 and 2, and then
+perform a linear search through the block to find a matching hash id.
+That is our key.  This is assuming a perfect hash, additionally the
+code could use the keyid to lookup in a keytable to verify the key is
+indeed the right one.  This would require having to support duplicate
+hashid's and would make block splitting slightly more complex, but
+only by a couple of lines of code.  This is something that will
+probably have to be addressed in the future.
+
+Using a partition table means that we can tell with 1 disk access
+whether or not a key exists (assuming a perfect hash function), and 1
+more access to look up all of the details of the key since the keyid
+is reversible.  Another feature is that the partition table is always
+self-balancing for any data processed in any order.
+
+Yet one more feature is that it is quite easy to order the writes to
+the partition table so that its structure is always consistent, even
+in the event of program failure.  Although this has been disabled in
+the current code to take maximal advantage of the block cache.
+
+CamelTextIndex
+
+CamelTextIndex is the implementation of CamelIndex now used by camel
+for indexing mail.  It shares some features with the second
+incarnation of libibex, but is generally simpler.  It uses the
+previously described classes to implement the CamelIndex interface.
+
+Indexing operation
+
+Indexing operation is similar to libibex2, but without the requirement
+to maintain the reverse index.
+
+1. Lookup the messageid from the message name, using the messageid
+   partition table.
+2. Generate a list of words in the message
+3. For each word
+4.  Lookup the wordid and sequence information.
+5.  Append the messageid to the word sequence.
+
+In practice we also have a word cache which caches upto 32 messageid's
+for each word before it is written to the key file.
+
+Removing a message
+
+Removal is not immediate.  This is one of the major performance
+improvements in CamelIndex.
+
+1. Lookup the messageid from the message name partition table
+2. Use the messageid to set a flag in the message key table to
+   indicate the message has been deleted.
+3. Remove the key hash from the partition table.
+
+This comes down to a maximum of 2 disk reads and 2 disk writes.
+libibex2 had unbounded maximums, depending on the number of words in a
+given message.  The key file is not changed.
+
+Because data is not removed from the files at all, an additional
+optional step is required, that of compressing the indices.
+
+Performing a query
+
+Performing a query is much the same as with libibex2.  We usually have
+slightly less disk i/o because of a more efficient and scalable hash
+table implementation, and improved locality of reference of the key
+table data.
+
+1. Lookup the messageid from the message name partition table
+2. Use the messageid to get the data pointer directly from the key
+   table.
+3. Iterate through the key file, reading blocks backwards through the
+   file.
+
+Compressing
+
+Although it could have benefited from it, libibex2 did not ever
+compress indices - the only way to compress an index was to remove it
+and have it be rebuilt.
+
+CamelIndex requires a compression stage as data is never removed from
+it otherwise.  Because of the much greater locality of reference, the
+compression stage is actually much faster than an incremental removal
+of data inside the data files.
+
+Compressing comprises the following steps:
+
+1. Open a new temporary index, an index block file and an index key
+   file.
+2. For each message in the message partition table
+3.   If the message is not marked deleted, add it to the new message
+     partition table, and recored the old messageid to new messageid
+     mapping.
+4. For each word in the word partition table
+5.   For each messageid's in the word sequence list
+6.     If the messageid maps to a new messageid, remap the messageid,
+       else discard it.
+7.     Concatenate upto 256 messageid's in a row before writing to the
+       key file, to improve lookups.
+8.   Create a new word in the new word key table
+9.   Add the wordid and new sequence id to the word partition table.
+
+Note that at step 8 we could (should?) also check if the word has any
+messages associated with it, and discard the word from the new index.
+
+After compression, the name partition index only contains names which
+are not deleted, and the key file is compressed into larger blocks
+which takes up less space and is faster to retrieve.
+
+During index operations a number of statistics are taken which trigger
+an automatic compress when the file fragmentation or number of deleted
+messages exceed a threshold.  So the index maintains itself, and does
+not need manual compression.
+
+
+
+
-- 
cgit v1.2.3