doc/white-papers/mail/ibex.sgml


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158

<!doctype article PUBLIC "-//Davenport//DTD DocBook V3.0//EN" [
<!entity Evolution "<application>Evolution</application>">
<!entity Camel "Camel">
<!entity Ibex "Ibex">
]>

<article class="whitepaper" id="ibex">

  <artheader>
    <title>Ibex: an Indexing System</title>

    <authorgroup>
      <author>
    <firstname>Dan</firstname>
    <surname>Winship</surname>
    <affiliation>
      <address>
        <email>danw@helixcode.com</email>
      </address>
    </affiliation>
      </author>
    </authorgroup>

    <copyright>
      <year>2000</year>
      <holder>Helix Code, Inc.</holder>
    </copyright>

  </artheader>

  <sect1 id="introduction">
    <title>Introduction</title>

    <para>
      &Ibex; is a library for text indexing. It is being used by
      &Camel; to allow it to quickly search locally-stored messages,
      either because the user is looking for a specific piece of text,
      or because the application is contructing a vFolder or filtering
      incoming mail.
    </para>
  </sect1>

  <sect1 id="goals">
    <title>Design Goals and Requirements for Ibex</title>

    <para>
      The design of &Ibex; is based on a number of requirements.

    <itemizedlist>
      <listitem>
        <para>
      First, obviously, it must be fast. In particular, searching
      the index must be appreciably faster than searching through
      the messages themselves, and constructing and maintaining
      the index must not take a noticeable amount of time.
    </para>
      </listitem>

      <listitem>
        <para>
      The indexes must not take up too much space. Many users have
      limited filesystem quotas on the systems where they read
      their mail, and even users who read mail on private machines
      have to worry about running out of space on their disks. The
      indexes should be able to do their job without taking up so
      much space that the user decides he would be better off
      without them.
    </para>

    <para>
      Another aspect of this problem is that the system as a whole
      must be clever about what it does and does not index:
      accidentally indexing a "text" mail message containing
      uuencoded, BinHexed, or PGP-encrypted data will drastically
      affect the size of the index file. Either the caller or the
      indexer itself has to avoid trying to index these sorts of
      things.
    </para>
      </listitem>

      <listitem>
        <para>
      The indexing system must allow data to be added to the index
      incrementally, so that new messages can be added to the
      index (and deleted messages can be removed from it) without
      having to re-scan all existing messages.
    </para>
      </listitem>

      <listitem>
        <para>
      It must allow the calling application to explain the
      structure of the data however it wants to, rather than
      requiring that the unit of indexing be individual files.
      This way, &Camel; can index a single mbox-format file and
      treat it as multiple messages.
    </para>
      </listitem>

      <listitem>
        <para>
      It must support non-ASCII text, given that many people send
      and receive non-English email, and even people who only
      speak English may receive email from people whose names
      cannot be written in the US-ASCII character set.
    </para>
      </listitem>
    </itemizedlist>

    <para>
      While there are a number of existing indexing systems, none of
      them met all (or even most) of our requirements.
    </para>
  </sect1>

  <sect1 id="implementation">
    <title>The Implementation</title>

    <para>
      &Ibex; is still young, and many of the details of the current
      implementation are not yet finalized.
    </para>

    <para>
      With the current index file format, 13 megabytes of Info files
      can be indexed into a 371 kilobyte index file&mdash;a bit under
      3% of the original size. This is reasonable, but making it
      smaller would be nice. (The file format includes some simple
      compression, but <application>gzip</application> can compress an
      index file to about half its size, so we can clearly do better.)
    </para>

    <para>
      The implementation has been profiled and optimized for speed to
      some degree. But, it has so far only been run on a 500MHz
      Pentium III system with very fast disks, so we have no solid
      benchmarks.
    </para>

    <para>
      Further optimization (of both the file format and the in-memory
      data structures) awaits seeing how the library is most easily
      used by &Evolution;: if the indexes are likely to be kept in
      memory for long periods of time, the in-memory data structures
      need to be kept small, but the reading and writing operations
      can be slow. On the other hand, if the indexes will only be
      opened when they are needed, reading and writing must be fast,
      and memory usage is less critical.
    </para>

    <para>
      Of course, to be useful for other applications that have
      indexing needs, the library should provide several options, so
      that each application can use the library in the way that is
      most suited for its needs.
    </para>
  </sect1>
</article>