1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
|
<!doctype article PUBLIC "-//Davenport//DTD DocBook V3.0//EN" [
<!entity Evolution "<application>Evolution</application>">
<!entity Camel "Camel">
<!entity Ibex "Ibex">
]>
<article class="whitepaper" id="ibex">
<artheader>
<title>Ibex: an Indexing System</title>
<authorgroup>
<author>
<firstname>Dan</firstname>
<surname>Winship</surname>
<affiliation>
<address>
<email>danw@helixcode.com</email>
</address>
</affiliation>
</author>
</authorgroup>
<copyright>
<year>2000</year>
<holder>Helix Code, Inc.</holder>
</copyright>
</artheader>
<sect1 id="introduction">
<title>Introduction</title>
<para>
&Ibex; is a library for text indexing. It is being used by
&Camel; to allow it to quickly search locally-stored messages,
either because the user is looking for a specific piece of text,
or because the application is contructing a vFolder or filtering
incoming mail.
</para>
</sect1>
<sect1 id="goals">
<title>Design Goals and Requirements for Ibex</title>
<para>
The design of &Ibex; is based on a number of requirements.
<itemizedlist>
<listitem>
<para>
First, obviously, it must be fast. In particular, searching
the index must be appreciably faster than searching through
the messages themselves, and constructing and maintaining
the index must not take a noticeable amount of time.
</para>
</listitem>
<listitem>
<para>
The indexes must not take up too much space. Many users have
limited filesystem quotas on the systems where they read
their mail, and even users who read mail on private machines
have to worry about running out of space on their disks. The
indexes should be able to do their job without taking up so
much space that the user decides he would be better off
without them.
</para>
<para>
Another aspect of this problem is that the system as a whole
must be clever about what it does and does not index:
accidentally indexing a "text" mail message containing
uuencoded, BinHexed, or PGP-encrypted data will drastically
affect the size of the index file. Either the caller or the
indexer itself has to avoid trying to index these sorts of
things.
</para>
</listitem>
<listitem>
<para>
The indexing system must allow data to be added to the index
incrementally, so that new messages can be added to the
index (and deleted messages can be removed from it) without
having to re-scan all existing messages.
</para>
</listitem>
<listitem>
<para>
It must allow the calling application to explain the
structure of the data however it wants to, rather than
requiring that the unit of indexing be individual files.
This way, &Camel; can index a single mbox-format file and
treat it as multiple messages.
</para>
</listitem>
<listitem>
<para>
It must support non-ASCII text, given that many people send
and receive non-English email, and even people who only
speak English may receive email from people whose names
cannot be written in the US-ASCII character set.
</para>
</listitem>
</itemizedlist>
<para>
While there are a number of existing indexing systems, none of
them met all (or even most) of our requirements.
</para>
</sect1>
<sect1 id="implementation">
<title>The Implementation</title>
<para>
&Ibex; is still young, and many of the details of the current
implementation are not yet finalized.
</para>
<para>
With the current index file format, 13 megabytes of Info files
can be indexed into a 371 kilobyte index file—a bit under
3% of the original size. This is reasonable, but making it
smaller would be nice. (The file format includes some simple
compression, but <application>gzip</application> can compress an
index file to about half its size, so we can clearly do better.)
</para>
<para>
The implementation has been profiled and optimized for speed to
some degree. But, it has so far only been run on a 500MHz
Pentium III system with very fast disks, so we have no solid
benchmarks.
</para>
<para>
Further optimization (of both the file format and the in-memory
data structures) awaits seeing how the library is most easily
used by &Evolution;: if the indexes are likely to be kept in
memory for long periods of time, the in-memory data structures
need to be kept small, but the reading and writing operations
can be slow. On the other hand, if the indexes will only be
opened when they are needed, reading and writing must be fast,
and memory usage is less critical.
</para>
<para>
Of course, to be useful for other applications that have
indexing needs, the library should provide several options, so
that each application can use the library in the way that is
most suited for its needs.
</para>
</sect1>
</article>
|