
The history of text-indexing in Evolution.

CamelTextIndex was written to address several shortcomings in the
existing libibex (referred to as libibex2), which had been written to
address shortcomings in the original libibex.

Mail indexing characteristics

First, I'll cover some of the scenarios that a mail indexing system
must cover.  They are slightly different from those of other indexing
systems, or at least we wanted them to be.

1. Indexing a few new messages, which may potentially reference most of
   the alphabet in the index.
2. Indexing a whole mailbox for the first time.
3. Unindexing anywhere from a few to all existing messages during expunge.
4. Searching.

Cases 1, 3, and 4 occur most often; however, case 2 is the most
noticeable at first use, or if the index must be reset.  So the code
needs to be fast in all cases, which generally leads to trade-offs
being made.  Each implementation addressed or ignored these
requirements in different ways, with the final implementation probably
having the best balance so far.

The main issue is that the indexing is performed in real time.  We
index as we add the messages.  We index before we open the mailbox.
We index as we remove messages.  Because of this we need to approach
things differently from many other indexing systems, most of which
work with static data in an off-line mode.  That allows them to index
the whole body of content and use as much memory and CPU time as
required.

We probably need to look at doing offline, or at least delayed,
indexing in the future - but this introduces some coherency problems
with vFolders and any body searches.  However, having the indexing
library as a base part of Camel helps in implementing a mechanism to
achieve this.

Ibex the first

The original ibex used a memory-based hash table to store words.  This
made the index very fast for both lookups and modifications.  However,
any query required a full load of the index into memory, and any
update required a full write of the index to disk.  After about
5,000-10,000 messages, even occasionally adding to the index became
appreciably slower, as the whole index needed to be loaded into memory
first.  This obviously took a toll on memory as well.

I won't cover the algorithms used; they were all pretty basic.  The
only real smarts were that index deletes were only flagged, and that
the deleted data was simply not written back to disk when the index
was saved.

Evolution 1.x, ibex 2.

In an attempt to rectify the incremental update performance of
libibex, it was completely rewritten to use an on-disk block-based
filesystem.

Note: the first attempt used libdb - however, performance was so slow
and the indices were so large that it was dropped in favour of a
custom filesystem-like data file.

The motivation was that a few extra disk lookups during retrieval
wouldn't be noticeably slower, while the index should be able to scale
up to many more messages with lower memory overhead and faster
startup.

The block filesystem contains 3 major components:

1. A hash table that maps messages to a wordid sequence list.
2. A hash table that maps words to a messageid sequence list.
3. A sequence filesystem that stores the sequences of ids.

The ids are 32-bit identifiers that are unique for each word or
message.  They are also designed to be reversible and static.
That is, given the id, you can map it directly to the string
identifier it represents, without having to look it up in the hash
table.

Another feature of this design is that the database file should be
kept in sync with the state of the index at all times.  The
message-to-wordid table is used to remove the messageid from the
words a message contains when that message is expunged, and so on.

Indexing operation

The indexing operation consists of these basic steps:

1. Look up the messageid from the message name, using the messageid table.
2. Generate a list of words in the message.
3. For each word:
4.   Look up the wordid and sequence information.
5.   If the word doesn't exist, create a new word/wordid.
6.   Add the messageid to the word sequence.
7.   Add the wordid to the message sequence.
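
A rough sketch of that loop in C follows; every type and helper name
here is a hypothetical stand-in invented for illustration, not the
actual libibex2 interface.

    #include <glib.h>

    /* Hypothetical stand-ins for the libibex2 tables and id sequences. */
    typedef struct _NameTable NameTable;   /* message name -> messageid + wordid sequence */
    typedef struct _WordTable WordTable;   /* word -> wordid + messageid sequence */
    typedef struct _Sequence  Sequence;    /* an on-disk list of 32-bit ids */

    guint32   name_table_lookup_or_add (NameTable *names, const char *name);
    Sequence *name_table_sequence      (NameTable *names, guint32 messageid);
    gboolean  word_table_lookup        (WordTable *words, const char *word,
                                        guint32 *wordid, Sequence **seq);
    guint32   word_table_add           (WordTable *words, const char *word,
                                        Sequence **seq);
    void      sequence_append          (Sequence *seq, guint32 id);

    /* Steps 1-7 of the indexing operation; word_list is the word list
     * produced by step 2. */
    static void
    index_message (NameTable *names, WordTable *words,
                   const char *name, GPtrArray *word_list)
    {
        guint32 messageid = name_table_lookup_or_add (names, name);     /* 1 */
        Sequence *msgseq = name_table_sequence (names, messageid);
        guint i;

        for (i = 0; i < word_list->len; i++) {                          /* 3 */
            const char *word = word_list->pdata[i];
            guint32 wordid;
            Sequence *wordseq;

            if (!word_table_lookup (words, word, &wordid, &wordseq))    /* 4 */
                wordid = word_table_add (words, word, &wordseq);        /* 5 */

            sequence_append (wordseq, messageid);                       /* 6 */
            sequence_append (msgseq, wordid);                           /* 7 */
        }
    }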

The initial implementation only used caching at the disk-block level.
Unfortunately, the simple hash table design chosen (a fixed-size base
table with chained buckets) scaled very poorly above about 10,000
messages.  So this approach proved to be too i/o intensive for
practical use, and several other caches were added to improve
performance:

1. Stage (1) above is done entirely in memory.  At initial startup
   the whole list of potential names is read into an in-memory hash
   table.
2. Stage (4) above is also done entirely in memory.  Even a large
   cache provided little benefit due to the wide distribution of
   potential words.  This cache is only created when adding to the index.
3. Stage (6) uses the table from stage (4) and concatenates up to
   approximately one disk block's worth of messageids before writing
   them out to the word sequence.
4. Stage (7) concatenates all wordids for a given message before
   writing them out at once.

As you can see, the added complexity meant we ended up having to cache
nearly as much as the original version!  It also removed almost all of
the startup-time benefit for incremental updates of the index, as the
table was not stored as compactly on disk as in the original version.

However, we only ever stored a subset of the index in memory, and only
during updates, with some tricks to reduce memory usage for very rare
words, so the overall memory use was still much lower.

Removing a message

Removing a message is fairly involved:

1. Look up the messageid and word sequence list from the messageid table.
2. For each wordid in the sequence list:
3.   Look up the message sequence list directly from the wordid table.
4.   Scan each block in the sequence, and remove any instances of the
     messageid.
5. Remove the message-to-messageid mapping in the messageid table.
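
Sketched the same way, extending the hypothetical declarations from
the indexing example above, the nested loop shows why removal is so
expensive:

    guint32   name_table_lookup   (NameTable *names, const char *name,
                                   Sequence **wordids);
    void      name_table_remove   (NameTable *names, const char *name);
    Sequence *word_table_sequence (WordTable *words, guint32 wordid);
    guint     sequence_length     (Sequence *seq);
    guint32   sequence_nth        (Sequence *seq, guint n);
    void      sequence_remove     (Sequence *seq, guint32 id);

    /* Steps 1-5 of message removal: every word the message contained has
     * its messageid sequence scanned and rewritten - unbounded work. */
    static void
    unindex_message (NameTable *names, WordTable *words, const char *name)
    {
        Sequence *wordids;
        guint32 messageid = name_table_lookup (names, name, &wordids);  /* 1 */
        guint i;

        for (i = 0; i < sequence_length (wordids); i++) {               /* 2 */
            guint32 wordid = sequence_nth (wordids, i);
            Sequence *msgids = word_table_sequence (words, wordid);     /* 3 */

            sequence_remove (msgids, messageid);                        /* 4 */
        }

        name_table_remove (names, name);                                /* 5 */
    }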

Unfortunately caching helped very little here, particularly if many
messages were removed.  Also note that the file could never shrink, as
the data could be spread randomly over it.  Removal is an extremely
expensive and unbounded process.  Deleting all of the messages in a
mailbox is extremely i/o intensive, with blocks potentially being
accessed dozens of times.

Performing a query

Performing a query is fast:

1. Look up the messageid sequence list from the wordid table.
2. For each messageid:
3.   Look up the message name directly from the messageid table.

Even without caching this performs at a very acceptable level.
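
A sketch using the same hypothetical helpers as above:

    const char *name_table_name (NameTable *names, guint32 messageid);

    /* Return the names of all messages containing 'word' (steps 1-3). */
    static GPtrArray *
    query_word (NameTable *names, WordTable *words, const char *word)
    {
        GPtrArray *matches = g_ptr_array_new ();
        guint32 wordid;
        Sequence *msgids;
        guint i;

        if (word_table_lookup (words, word, &wordid, &msgids))          /* 1 */
            for (i = 0; i < sequence_length (msgids); i++)              /* 2 */
                g_ptr_array_add (matches, (gpointer)
                    name_table_name (names, sequence_nth (msgids, i))); /* 3 */

        return matches;
    }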

Summary

This index performs reasonably well up to about 10,000 messages for a
complete re-index.  However, with incremental updates it degrades much
faster; after only a few thousand messages have been added it becomes
tiresomely slow and i/o bound.  The index becomes more fragmented with
random updates and removals, and heavily bogs down the system as you
go much beyond those few thousand messages.

The code is also very complicated and hard to follow.  There are too
many special cases, and it is buggy.  Detected on-disk structure
errors result in the index being reset, which although it shrinks the
index, is very slow.

The indices created are bulky, and never shrink.  Because of the
reverse index used for message removal, there is 50% redundant data at
all times.  Some overly tricky techniques (very much like ReiserFS's
tail packing) are used to waste as little space as possible, with a
great impact on performance.

One other problem is that because the index is disk based, we use a
file descriptor continuously.  With some users having >100 folders,
they quickly run out of process file descriptors and Evolution fails.
To get around this, a cache of least-recently-used index files is used
to flush out and free file descriptors so they can be re-used.  This
makes it hard to lock the files; this problem still exists with the
next implementation.

Anyway, a better solution is required.

CamelIndex

The first problem to address was the API.  It was starting to age.
Although adequate, the API wasn't terribly clean, reusable, or
scalable.  The first step was to turn the library into an object, and
since we needed to use it in Camel, the best way was to create a
CamelObject.

CamelIndex was born: a mostly abstract class that provides a simple
common interface for accessing indices, including cursors as well as
utility and maintenance functions.

In addition, a number of the features in libibex2 were simplified or
rewritten and abstracted into the re-usable classes that follow.

By providing simple cursors, more complex queries become easier to
write and can execute more efficiently; Camel searching now does
sub-string searches for all body queries, and still runs at a very
healthy speed while using less memory than before.
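
Purely for illustration, a cursor-driven sub-string body search could
be shaped like the sketch below.  The type and function names are
assumptions made for this example (check the Camel headers for the
real interface), and error handling and cursor cleanup are omitted.

    #include <string.h>

    /* Assumed declarations, only approximating the real CamelIndex API. */
    typedef struct _CamelIndex CamelIndex;
    typedef struct _CamelIndexCursor CamelIndexCursor;

    CamelIndexCursor *camel_index_words       (CamelIndex *idx);
    CamelIndexCursor *camel_index_find        (CamelIndex *idx, const char *word);
    const char       *camel_index_cursor_next (CamelIndexCursor *cursor);

    /* Call 'match' for every message whose body contains 'fragment' as a
     * sub-string of any indexed word. */
    static void
    search_substring (CamelIndex *idx, const char *fragment,
                      void (*match) (const char *name, void *data), void *data)
    {
        CamelIndexCursor *words = camel_index_words (idx);
        const char *word;

        /* walk every word in the index via the word cursor */
        while ((word = camel_index_cursor_next (words))) {
            if (strstr (word, fragment)) {
                CamelIndexCursor *names = camel_index_find (idx, word);
                const char *name;

                while ((name = camel_index_cursor_next (names)))
                    match (name, data);
            }
        }
    }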

CamelBlockFile

This is basically the same block filesystem used in libibex2.  It
handles disk i/o based on blocks (CamelBlock), flushing modified
blocks to disk, and caching of recently accessed blocks.  It was
enhanced slightly to allow blocks to be locked in memory.

CamelKeyFile

This is a simple reverse-linked list of sequences of keyids.

The main property of this file is that updates are only ever appended
to the end of the file, which improves i/o characteristics markedly.

When an existing keyid sequence is updated, the new entry simply
points back to the start of the previous one, and the caller gets a
pointer to the new entry - i.e. a simple linked list.
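
The following is a minimal, self-contained model of that idea over a
plain stdio file.  It is not the actual CamelKeyFile record layout,
just the append-only, reverse-linked structure described above.

    #include <stdio.h>

    typedef unsigned int camel_key_t;

    /* Write a small header so that offset 0 can safely mean "no previous
     * block". */
    static void
    key_file_init (FILE *f)
    {
        static const char magic[8] = "KEYFILE";     /* illustrative magic */

        fwrite (magic, sizeof (magic), 1, f);
    }

    /* Append one block of ids belonging to a sequence; 'prev' is the offset
     * of the sequence's previous block (0 if none).  Returns the new block
     * offset, which becomes the sequence's new head. */
    static long
    key_file_append (FILE *f, long prev, const camel_key_t *ids, unsigned int count)
    {
        long offset;

        fseek (f, 0, SEEK_END);
        offset = ftell (f);
        fwrite (&prev, sizeof (prev), 1, f);
        fwrite (&count, sizeof (count), 1, f);
        fwrite (ids, sizeof (ids[0]), count, f);
        return offset;
    }

    /* Walk a sequence from its newest block back to its oldest. */
    static void
    key_file_dump (FILE *f, long offset)
    {
        while (offset != 0) {
            long prev;
            unsigned int count, i;
            camel_key_t id;

            fseek (f, offset, SEEK_SET);
            fread (&prev, sizeof (prev), 1, f);
            fread (&count, sizeof (count), 1, f);
            for (i = 0; i < count; i++) {
                fread (&id, sizeof (id), 1, f);
                printf (" %u", id);
            }
            offset = prev;
        }
        printf ("\n");
    }

Extending a sequence is then just head = key_file_append (f, head,
ids, n): older blocks are never touched again, and a reader follows
the prev pointers backwards through the file, which is exactly how a
word's messageid list is walked at query time.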

CamelKeyTable

This is taken from the libibex2 code for mapping keys, with few
changes.  It uses a CamelBlockFile for its i/o.

The key table is a linked list of blocks (CamelKeyBlock) which contain
the key strings, plus a data pointer and flags for each key.  Each
block is a packed array of string descriptors (CamelKeyKeys).

A keyid (camel_key_t) is a 32-bit descriptor which identifies a key
in a reversible way.  In this case the bottom 10 bits are used to
identify the index of the key within the key block, and the top 22
bits are used to identify the key block itself.  In this way, given
the 32-bit keyid, we can reference the block containing the key
directly (with at most 1 access), and access the flags and key string
using the key index.
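
A trivial illustration of that packing follows.  The macro names are
invented here, and the real code may derive a block address rather
than a plain block number from the top bits, but the principle is the
same.

    #include <stdio.h>

    typedef unsigned int camel_key_t;   /* 32-bit keyid */

    #define KEYID_MAKE(blockid, index)  ((camel_key_t)(((blockid) << 10) | ((index) & 0x3ff)))
    #define KEYID_BLOCK(keyid)          ((keyid) >> 10)
    #define KEYID_INDEX(keyid)          ((keyid) & 0x3ff)

    int
    main (void)
    {
        camel_key_t keyid = KEYID_MAKE (1234, 56);

        /* Given only the 32-bit keyid we can get straight back to the key
         * block and the slot within it, with no table lookup. */
        printf ("keyid %08x -> block %u, index %u\n",
                keyid, KEYID_BLOCK (keyid), KEYID_INDEX (keyid));
        return 0;
    }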

Keys can potentially be removed and their keyids reused by simply
re-packing the key block.  This was used in libibex2, but is not used
in CamelIndex.

[diagram - camelkeyblock]

CamelPartitionTable

An implementation of a scalable, on-disk 'perfect' hash table.  It
uses the CamelBlockFile to handle its i/o.  This is a completely new
hash table implementation which was not present in libibex2.

[FIXME: Reference the original paper the algorithm is based on.]

A partition table consists of a list of mapping blocks
(CamelPartitionMapBlock); each is a compact table that maps a range
of hashids to a partition block (CamelPartitionKeyBlock), which
contains the hashids of that range.

[diagram - camelpartitiontable]

The partition block only maps the hashid to a keyid (see CamelKeyTable)
which means it can store a lot of keys in each block.

To add a new value to the partition table:

1. Calculate the hash value of the key
2. Find out which partition block the key will fit into, using the
   partition table.
3. If the partition block is full:
4.   If there is room in the next or previous block:
5.     Merge the 2 blocks together, and split at the half-way point
6.     Update the partition table hash indices to match the blocks
7.   Else
8.     Create a new block, and split the existing block across it
9.     Insert the new block into the partition table
10. Else
11.   Just add the key to the end of the block.

Steps 5 and 8 sort the partition key entries by hashid to find the
midpoint.  It may be beneficial to always store the hashids sorted;
splitting a block would then not require a sort.  This would also
benefit key lookups by allowing a binary search.  However, keeping the
entries sorted incrementally may be more expensive.
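
A control-flow sketch of the insert in C, over hypothetical in-memory
stand-ins - the helper names are invented for this sketch, whereas the
real CamelPartitionTable works on CamelBlockFile blocks:

    #include <glib.h>

    typedef guint32 camel_hash_t;
    typedef guint32 camel_key_t;

    typedef struct _PartitionKeyBlock PartitionKeyBlock;  /* hashid -> keyid entries */
    typedef struct _PartitionTable    PartitionTable;     /* hash range -> key block map */

    /* Assumed helpers. */
    PartitionKeyBlock *partition_table_find_block (PartitionTable *pt, camel_hash_t hash);
    PartitionKeyBlock *partition_block_next       (PartitionTable *pt, PartitionKeyBlock *block);
    gboolean           partition_block_full       (PartitionKeyBlock *block);
    gboolean           partition_block_has_space  (PartitionKeyBlock *block);
    void               partition_block_add        (PartitionKeyBlock *block,
                                                   camel_hash_t hash, camel_key_t keyid);
    void               partition_blocks_rebalance (PartitionTable *pt,
                                                   PartitionKeyBlock *a, PartitionKeyBlock *b);
    void               partition_block_split      (PartitionTable *pt, PartitionKeyBlock *block);

    static void
    partition_table_insert (PartitionTable *pt, camel_hash_t hash, camel_key_t keyid)
    {
        PartitionKeyBlock *block = partition_table_find_block (pt, hash);   /* 1, 2 */

        if (partition_block_full (block)) {                                 /* 3 */
            PartitionKeyBlock *next = partition_block_next (pt, block);

            if (next && partition_block_has_space (next)) {                 /* 4 */
                /* 5, 6: merge the two blocks, re-split at the midpoint and
                 * update the map's hash boundaries to match */
                partition_blocks_rebalance (pt, block, next);
            } else {
                /* 7-9: split 'block' across a brand-new block and insert
                 * the new block into the partition map */
                partition_block_split (pt, block);
            }
            /* the key may now belong in a different block */
            block = partition_table_find_block (pt, hash);
        }

        partition_block_add (block, hash, keyid);                           /* 10, 11 */
    }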

If the partition table itself fills up, a similar splitting operation
is performed on its blocks, and the table is stored over multiple
blocks.  With a block size of 1024 bytes we can fit 127 block
pointers, each pointing to a block with 127 keys in it - around
16,000 keys.  So we only need 1024 bytes of memory for each 16,000
on-disk keys (assuming full tables).

Removal is basically the same, but if we end up with an empty block we
just remove it from the partition table.  CamelTextIndex doesn't
actually use removal although it is implemented in
CamelPartitionTable.

Lookup is very simple.  We basically follow steps 1 and 2, and then
perform a linear search through the block to find a matching hashid.
That is our key.  This assumes a perfect hash; additionally, the code
could use the keyid to look the key up in a key table and verify that
it is indeed the right one.  This would require supporting duplicate
hashids and would make block splitting slightly more complex, but
only by a couple of lines of code.  This is something that will
probably have to be addressed in the future.
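
The lookup, sketched with the same assumed helpers as the insert
example above:

    guint partition_block_length (PartitionKeyBlock *block);
    void  partition_block_entry  (PartitionKeyBlock *block, guint i,
                                  camel_hash_t *hash, camel_key_t *keyid);

    /* One map lookup plus one block read, then a linear scan of the block.
     * Returns 0 if the hash is not present (assuming 0 is never a valid
     * keyid). */
    static camel_key_t
    partition_table_lookup (PartitionTable *pt, camel_hash_t hash)
    {
        PartitionKeyBlock *block = partition_table_find_block (pt, hash);
        guint i;

        for (i = 0; i < partition_block_length (block); i++) {
            camel_hash_t h;
            camel_key_t keyid;

            partition_block_entry (block, i, &h, &keyid);
            if (h == hash)
                return keyid;   /* a perfect hash is assumed, as noted above */
        }
        return 0;
    }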

Using a partition table means that we can tell with 1 disk access
whether or not a key exists (assuming a perfect hash function), and 1
more access to look up all of the details of the key since the keyid
is reversible.  Another feature is that the partition table is always
self-balancing for any data processed in any order.

Yet one more feature is that it is quite easy to order the writes to
the partition table so that its structure is always consistent, even
in the event of program failure - although this has been disabled in
the current code to take maximal advantage of the block cache.

CamelTextIndex

CamelTextIndex is the implementation of CamelIndex now used by Camel
for indexing mail.  It shares some features with the second
incarnation of libibex, but is generally simpler.  It uses the
previously described classes to implement the CamelIndex interface.

Indexing operation

The indexing operation is similar to libibex2's, but without the
requirement to maintain the reverse index.

1. Look up the messageid from the message name, using the messageid
   partition table.
2. Generate a list of words in the message.
3. For each word:
4.  Look up the wordid and sequence information.
5.  Append the messageid to the word sequence.

In practice we also have a word cache which buffers up to 32
messageids for each word before they are written to the key file.
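
Roughly, the cache behaves like the sketch below; the structure and
helper names are illustrative rather than the real CamelTextIndex
code, and the key-file append is the same idea as in the CamelKeyFile
sketch earlier.

    #include <glib.h>

    #define WORD_CACHE_SIZE 32

    typedef guint32 camel_key_t;
    typedef struct _KeyFile KeyFile;

    /* Assumed: append ids to a sequence, returning its new tail offset. */
    long key_file_append_ids (KeyFile *kf, long prev,
                              const camel_key_t *ids, guint count);

    typedef struct {
        camel_key_t wordid;
        long tail;                              /* newest key-file block for this word */
        guint used;
        camel_key_t buffer[WORD_CACHE_SIZE];    /* pending messageids */
    } WordCache;

    static void
    word_cache_add (WordCache *wc, KeyFile *kf, camel_key_t messageid)
    {
        wc->buffer[wc->used++] = messageid;
        if (wc->used == WORD_CACHE_SIZE) {
            /* flush: one key-file append covers up to 32 messages */
            wc->tail = key_file_append_ids (kf, wc->tail, wc->buffer, wc->used);
            wc->used = 0;
        }
    }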

Removing a message

Removal is not immediate.  This is one of the major performance
improvements in CamelIndex.

1. Look up the messageid from the message name partition table.
2. Use the messageid to set a flag in the message key table to
   indicate the message has been deleted.
3. Remove the key hash from the partition table.

This comes down to a maximum of 2 disk reads and 2 disk writes.
libibex2 had unbounded maximums, depending on the number of words in a
given message.  The key file is not changed.

Because data is not removed from the files at all, an additional,
optional step is required: compressing the indices.

Performing a query

Performing a query is much the same as with libibex2.  We usually have
slightly less disk i/o because of a more efficient and scalable hash
table implementation, and improved locality of reference of the key
table data.

1. Look up the wordid from the word partition table.
2. Use the wordid to get the data pointer directly from the key
   table.
3. Iterate through the key file, reading blocks backwards through the
   file to collect the messageids.  (The messageids can then be mapped
   back to message names directly, since the ids are reversible.)

Compressing

Although it could have benefited from it, libibex2 never compressed
indices - the only way to compress an index was to remove it and have
it be rebuilt.

CamelIndex requires a compression stage as data is never removed from
it otherwise.  Because of the much greater locality of reference, the
compression stage is actually much faster than an incremental removal
of data inside the data files.

Compressing comprises the following steps:

1. Open a new temporary index, an index block file and an index key
   file.
2. For each message in the message partition table:
3.   If the message is not marked deleted, add it to the new message
     partition table, and record the old messageid to new messageid
     mapping.
4. For each word in the word partition table:
5.   For each messageid in the word sequence list:
6.     If the messageid maps to a new messageid, remap it,
       otherwise discard it.
7.     Concatenate up to 256 messageids in a row before writing to the
       key file, to improve lookups.
8.   Create a new word in the new word key table.
9.   Add the wordid and new sequence id to the word partition table.

Note that at step 8 we could (should?) also check whether the word has
any messages associated with it, and if not, discard the word from the
new index.
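
In outline, and using assumed helper names rather than the real
functions, the compression pass looks something like this:

    #include <glib.h>

    typedef guint32 camel_key_t;
    typedef struct _TextIndex TextIndex;
    typedef struct _Cursor Cursor;

    /* Assumed helpers over the old and new index files. */
    Cursor     *index_names         (TextIndex *idx);
    Cursor     *index_words         (TextIndex *idx);
    gboolean    cursor_next         (Cursor *c, camel_key_t *id, const char **name);
    gboolean    index_name_deleted  (TextIndex *idx, camel_key_t nameid);
    camel_key_t index_add_name      (TextIndex *idx, const char *name);
    Cursor     *index_word_ids      (TextIndex *idx, camel_key_t wordid);
    long        key_file_append_ids (TextIndex *idx, long prev,
                                     const camel_key_t *ids, guint count);
    void        index_add_word      (TextIndex *idx, const char *word, long tail);

    static void
    text_index_compress (TextIndex *old, TextIndex *new)   /* 1: 'new' is the temporary index */
    {
        GHashTable *remap = g_hash_table_new (g_direct_hash, g_direct_equal);
        Cursor *c, *ids;
        camel_key_t oldid, msgid, mapped;
        const char *name;

        /* 2, 3: copy the live names, recording old messageid -> new
         * messageid (this sketch assumes 0 is never a valid keyid) */
        c = index_names (old);
        while (cursor_next (c, &oldid, &name)) {
            if (!index_name_deleted (old, oldid))
                g_hash_table_insert (remap, GUINT_TO_POINTER (oldid),
                                     GUINT_TO_POINTER (index_add_name (new, name)));
        }

        /* 4-9: copy the words, remapping or discarding each messageid and
         * batching up to 256 of them per key-file write */
        c = index_words (old);
        while (cursor_next (c, &oldid, &name)) {
            camel_key_t batch[256];
            guint used = 0;
            long tail = 0;

            ids = index_word_ids (old, oldid);
            while (cursor_next (ids, &msgid, NULL)) {                   /* 5 */
                mapped = GPOINTER_TO_UINT (g_hash_table_lookup (remap,
                                GUINT_TO_POINTER (msgid)));
                if (mapped == 0)
                    continue;                                           /* 6: deleted */
                batch[used++] = mapped;
                if (used == 256) {                                      /* 7 */
                    tail = key_file_append_ids (new, tail, batch, used);
                    used = 0;
                }
            }
            if (used > 0)
                tail = key_file_append_ids (new, tail, batch, used);

            /* 8, 9: create the word in the new key table and record it,
             * with its new sequence, in the word partition table (the
             * empty-word check mentioned above would go here) */
            index_add_word (new, name, tail);
        }

        g_hash_table_destroy (remap);
    }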

After compression, the name partition index only contains names which
are not deleted, and the key file is compressed into larger blocks,
which take up less space and are faster to retrieve.

During index operations a number of statistics are kept which trigger
an automatic compression when the file fragmentation or the number of
deleted messages exceeds a threshold.  So the index maintains itself,
and does not need manual compression.