Diffstat (limited to 'doc/white-papers')
-rw-r--r-- doc/white-papers/mail/ChangeLog 4
-rw-r--r-- doc/white-papers/mail/ibex.sgml 158
2 files changed, 162 insertions, 0 deletions
diff --git a/doc/white-papers/mail/ChangeLog b/doc/white-papers/mail/ChangeLog
index 6d4e8b7f8a..5933582d40 100644
--- a/doc/white-papers/mail/ChangeLog
+++ b/doc/white-papers/mail/ChangeLog
@@ -1,3 +1,7 @@
+2000-03-01 Dan Winship <danw@helixcode.com>
+
+ * ibex.sgml: Ibex white paper
+
2000-02-29 Dan Winship <danw@helixcode.com>
* camel.sgml: Reorg a bit more, make the <PRE> section narrower,
diff --git a/doc/white-papers/mail/ibex.sgml b/doc/white-papers/mail/ibex.sgml
new file mode 100644
index 0000000000..dcb8f5ca4b
--- /dev/null
+++ b/doc/white-papers/mail/ibex.sgml
@@ -0,0 +1,158 @@
+<!doctype article PUBLIC "-//Davenport//DTD DocBook V3.0//EN" [
+<!entity Evolution "<application>Evolution</application>">
+<!entity Camel "Camel">
+<!entity Ibex "Ibex">
+]>
+
+<article class="whitepaper" id="ibex">
+
+ <artheader>
+ <title>Ibex: an Indexing System</title>
+
+ <authorgroup>
+ <author>
+ <firstname>Dan</firstname>
+ <surname>Winship</surname>
+ <affiliation>
+ <address>
+ <email>danw@helixcode.com</email>
+ </address>
+ </affiliation>
+ </author>
+ </authorgroup>
+
+ <copyright>
+ <year>2000</year>
+ <holder>Helix Code, Inc.</holder>
+ </copyright>
+
+ </artheader>
+
+ <sect1 id="introduction">
+ <title>Introduction</title>
+
+ <para>
+ &Ibex; is a library for text indexing. It is being used by
+ &Camel; to allow it to quickly search locally-stored messages,
+ either because the user is looking for a specific piece of text,
+ or because the application is constructing a vFolder or filtering
+ incoming mail.
+ </para>
+ </sect1>
+
+ <sect1 id="goals">
+ <title>Design Goals and Requirements for Ibex</title>
+
+ <para>
+ The design of &Ibex; is based on a number of requirements.
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ First, obviously, it must be fast. In particular, searching
+ the index must be appreciably faster than searching through
+ the messages themselves, and constructing and maintaining
+ the index must not take a noticeable amount of time.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The indexes must not take up too much space. Many users have
+ limited filesystem quotas on the systems where they read
+ their mail, and even users who read mail on private machines
+ have to worry about running out of space on their disks. The
+ indexes should be able to do their job without taking up so
+ much space that the user decides he would be better off
+ without them.
+ </para>
+
+ <para>
+ Another aspect of this problem is that the system as a whole
+ must be clever about what it does and does not index:
+ accidentally indexing a "text" mail message containing
+ uuencoded, BinHexed, or PGP-encrypted data will drastically
+ affect the size of the index file. Either the caller or the
+ indexer itself has to avoid trying to index these sorts of
+ things.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The indexing system must allow data to be added to the index
+ incrementally, so that new messages can be added to the
+ index (and deleted messages can be removed from it) without
+ having to re-scan all existing messages.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ It must allow the calling application to explain the
+ structure of the data however it wants to, rather than
+ requiring that the unit of indexing be individual files.
+ This way, &Camel; can index a single mbox-format file and
+ treat it as multiple messages.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ It must support non-ASCII text, given that many people send
+ and receive non-English email, and even people who only
+ speak English may receive email from people whose names
+ cannot be written in the US-ASCII character set.
+ </para>
+ </listitem>
+ </itemizedlist>
+
+ <para>
+ While there are a number of existing indexing systems, none of
+ them met all (or even most) of our requirements.
+ </para>
+ </sect1>
+
+ <sect1 id="implementation">
+ <title>The Implementation</title>
+
+ <para>
+ &Ibex; is still young, and many of the details of the current
+ implementation are not yet finalized.
+ </para>
+
+ <para>
+ With the current index file format, 13 megabytes of Info files
+ can be indexed into a 371 kilobyte index file&mdash;a bit under
+ 3% of the original size. This is reasonable, but making it
+ smaller would be nice. (The file format includes some simple
+ compression, but <application>gzip</application> can compress an
+ index file to about half its size, so we can clearly do better.)
+ </para>
+
+ <para>
+ The implementation has been profiled and optimized for speed to
+ some degree. However, it has so far been run only on a 500MHz
+ Pentium III system with very fast disks, so we have no solid
+ benchmarks.
+ </para>
+
+ <para>
+ Further optimization (of both the file format and the in-memory
+ data structures) awaits seeing how the library is most easily
+ used by &Evolution;: if the indexes are likely to be kept in
+ memory for long periods of time, the in-memory data structures
+ need to be kept small, but the reading and writing operations
+ can be slow. On the other hand, if the indexes will only be
+ opened when they are needed, reading and writing must be fast,
+ and memory usage is less critical.
+ </para>
+
+ <para>
+ Of course, to be useful for other applications that have
+ indexing needs, the library should provide several options, so
+ that each application can use the library in the way that is
+ best suited to its needs.
+ </para>
+ </sect1>
+</article>
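
Reviewer note: three of the design goals in the white paper (incremental add and remove, caller-defined units of indexing, and non-ASCII text) can be illustrated with a toy inverted index. The sketch below is not Ibex's API or file format, neither of which the paper describes. It is written in Python rather than the library's own implementation language, it keeps everything in memory, and every name in it (TinyIndex, add, remove, find, the "mbox:N" labels) is hypothetical.

```python
import re
from collections import defaultdict


class TinyIndex:
    """Toy in-memory inverted index: word -> set of caller-chosen unit names.

    Hypothetical illustration only; this is not how Ibex is implemented.
    """

    def __init__(self):
        self._postings = defaultdict(set)  # word -> names of units containing it
        self._words_of = defaultdict(set)  # name -> words indexed under that name

    @staticmethod
    def _tokenize(text):
        # For str patterns, \w matches Unicode word characters,
        # so accented and non-Latin words are kept whole.
        return {word.casefold() for word in re.findall(r"\w+", text)}

    def add(self, name, text):
        """Index text under an arbitrary name chosen by the caller."""
        for word in self._tokenize(text):
            self._postings[word].add(name)
            self._words_of[name].add(word)

    def remove(self, name):
        """Drop one unit without rescanning any other unit."""
        for word in self._words_of.pop(name, ()):
            names = self._postings[word]
            names.discard(name)
            if not names:
                del self._postings[word]

    def find(self, word):
        """Return the names of all units that contain the word."""
        return set(self._postings.get(word.casefold(), ()))


# The caller decides what a "unit" is: here, two messages from one mbox file.
index = TinyIndex()
index.add("mbox:1", "Lunch on Friday?")
index.add("mbox:2", "From: Jos\u00e9, re: lunch")
both = index.find("lunch")      # names of both messages
index.remove("mbox:1")          # message deleted; nothing else is re-read
remaining = index.find("lunch")
```

Because the index stores only the names it is given, the caller can treat one mbox file as many messages. The reverse map from name to words is what lets a deletion touch only that message's entries, at the cost of extra memory, which is the kind of trade-off the implementation section leaves open.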