Diffstat (limited to 'doc')
-rw-r--r-- | doc/white-papers/mail/ChangeLog |   4
-rw-r--r-- | doc/white-papers/mail/ibex.sgml | 158
2 files changed, 162 insertions, 0 deletions
diff --git a/doc/white-papers/mail/ChangeLog b/doc/white-papers/mail/ChangeLog
index 6d4e8b7f8a..5933582d40 100644
--- a/doc/white-papers/mail/ChangeLog
+++ b/doc/white-papers/mail/ChangeLog
@@ -1,3 +1,7 @@
+2000-03-01  Dan Winship  <danw@helixcode.com>
+
+        * ibex.sgml: Ibex white paper
+
 2000-02-29  Dan Winship  <danw@helixcode.com>
 
         * camel.sgml: Reorg a bit more, make the <PRE> section narrower,
diff --git a/doc/white-papers/mail/ibex.sgml b/doc/white-papers/mail/ibex.sgml
new file mode 100644
index 0000000000..dcb8f5ca4b
--- /dev/null
+++ b/doc/white-papers/mail/ibex.sgml
@@ -0,0 +1,158 @@
+<!doctype article PUBLIC "-//Davenport//DTD DocBook V3.0//EN" [
+<!entity Evolution "<application>Evolution</application>">
+<!entity Camel "Camel">
+<!entity Ibex "Ibex">
+]>
+
+<article class="whitepaper" id="ibex">
+
+  <artheader>
+    <title>Ibex: an Indexing System</title>
+
+    <authorgroup>
+      <author>
+        <firstname>Dan</firstname>
+        <surname>Winship</surname>
+        <affiliation>
+          <address>
+            <email>danw@helixcode.com</email>
+          </address>
+        </affiliation>
+      </author>
+    </authorgroup>
+
+    <copyright>
+      <year>2000</year>
+      <holder>Helix Code, Inc.</holder>
+    </copyright>
+
+  </artheader>
+
+  <sect1 id="introduction">
+    <title>Introduction</title>
+
+    <para>
+      &Ibex; is a library for text indexing. It is being used by
+      &Camel; to allow it to quickly search locally-stored messages,
+      either because the user is looking for a specific piece of text,
+      or because the application is constructing a vFolder or filtering
+      incoming mail.
+    </para>
+  </sect1>
+
+  <sect1 id="goals">
+    <title>Design Goals and Requirements for Ibex</title>
+
+    <para>
+      The design of &Ibex; is based on a number of requirements.
+
+      <itemizedlist>
+        <listitem>
+          <para>
+            First, obviously, it must be fast. In particular, searching
+            the index must be appreciably faster than searching through
+            the messages themselves, and constructing and maintaining
+            the index must not take a noticeable amount of time.
+          </para>
+        </listitem>
+
+        <listitem>
+          <para>
+            The indexes must not take up too much space. Many users have
+            limited filesystem quotas on the systems where they read
+            their mail, and even users who read mail on private machines
+            have to worry about running out of space on their disks. The
+            indexes should be able to do their job without taking up so
+            much space that the user decides he would be better off
+            without them.
+          </para>
+
+          <para>
+            Another aspect of this problem is that the system as a whole
+            must be clever about what it does and does not index:
+            accidentally indexing a "text" mail message containing
+            uuencoded, BinHexed, or PGP-encrypted data will drastically
+            affect the size of the index file. Either the caller or the
+            indexer itself has to avoid trying to index these sorts of
+            things.
+          </para>
+        </listitem>
+
+        <listitem>
+          <para>
+            The indexing system must allow data to be added to the index
+            incrementally, so that new messages can be added to the
+            index (and deleted messages can be removed from it) without
+            having to re-scan all existing messages.
+          </para>
+        </listitem>
+
+        <listitem>
+          <para>
+            It must allow the calling application to explain the
+            structure of the data however it wants to, rather than
+            requiring that the unit of indexing be individual files.
+            This way, &Camel; can index a single mbox-format file and
+            treat it as multiple messages.
+          </para>
+        </listitem>
+
+        <listitem>
+          <para>
+            It must support non-ASCII text, given that many people send
+            and receive non-English email, and even people who only
+            speak English may receive email from people whose names
+            cannot be written in the US-ASCII character set.
+          </para>
+        </listitem>
+      </itemizedlist>
+
+    <para>
+      While there are a number of existing indexing systems, none of
+      them met all (or even most) of our requirements.
+    </para>
+  </sect1>
+
+  <sect1 id="implementation">
+    <title>The Implementation</title>
+
+    <para>
+      &Ibex; is still young, and many of the details of the current
+      implementation are not yet finalized.
+    </para>
+
+    <para>
+      With the current index file format, 13 megabytes of Info files
+      can be indexed into a 371 kilobyte index file&mdash;a bit under
+      3% of the original size. This is reasonable, but making it
+      smaller would be nice. (The file format includes some simple
+      compression, but <application>gzip</application> can compress an
+      index file to about half its size, so we can clearly do better.)
+    </para>
+
+    <para>
+      The implementation has been profiled and optimized for speed to
+      some degree. But it has so far only been run on a 500MHz
+      Pentium III system with very fast disks, so we have no solid
+      benchmarks.
+    </para>
+
+    <para>
+      Further optimization (of both the file format and the in-memory
+      data structures) awaits seeing how the library is most easily
+      used by &Evolution;: if the indexes are likely to be kept in
+      memory for long periods of time, the in-memory data structures
+      need to be kept small, but the reading and writing operations
+      can be slow. On the other hand, if the indexes will only be
+      opened when they are needed, reading and writing must be fast,
+      and memory usage is less critical.
+    </para>
+
+    <para>
+      Of course, to be useful for other applications that have
+      indexing needs, the library should provide several options, so
+      that each application can use the library in the way that is
+      most suited for its needs.
+    </para>
+  </sect1>
+</article>
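
Note on the design goals: three of the requirements listed in the "goals" section (incremental add and remove, caller-defined indexing units, and non-ASCII text) can be illustrated with a small sketch. The Python below is a hypothetical toy and makes no claim about Ibex's actual API, data structures, or file format, none of which appear in this diff. The class TinyIndex, its methods, and the unit names such as "inbox#1" are all invented for illustration.

```python
import re
from collections import defaultdict

# Hypothetical illustration only: not Ibex's API or file format.
# For str patterns, \w matches Unicode word characters, so names such as
# "Jürgen" are kept whole rather than split at the non-ASCII letter.
_WORD = re.compile(r"\w+")


class TinyIndex:
    """Toy inverted index keyed by caller-chosen unit names."""

    def __init__(self):
        self._postings = defaultdict(set)  # word -> names of units containing it
        self._words_of = {}                # unit name -> words indexed for it

    def add(self, name, text):
        """Index one unit (for example, one message inside an mbox file)."""
        self.remove(name)                  # re-adding replaces the old entry
        words = {w.casefold() for w in _WORD.findall(text)}
        self._words_of[name] = words
        for w in words:
            self._postings[w].add(name)

    def remove(self, name):
        """Drop one unit without touching any other unit's entries."""
        for w in self._words_of.pop(name, ()):
            names = self._postings[w]
            names.discard(name)
            if not names:
                del self._postings[w]

    def find(self, word):
        """Return the sorted names of all units containing the word."""
        return sorted(self._postings.get(word.casefold(), ()))


index = TinyIndex()
index.add("inbox#1", "Lunch on Friday?")
index.add("inbox#2", "Re: lunch, says Jürgen")
index.remove("inbox#1")  # a deleted message leaves the index incrementally
```

Because the sketch remembers which words each unit contributed, removing a message touches only that message's postings, and no other message has to be re-scanned. The caller decides what a "unit" is simply by choosing the name it passes in.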