


<!doctype article PUBLIC "-//Davenport//DTD DocBook V3.0//EN" []>

<!-- SGMLized by Bertrand <Bertrand.Guiheneuf@inria.fr> -->

<article id="index">
 <artheader>
 <authorgroup>
   <author>
    <firstname>Giao</firstname>
    <surname>Nguyen</surname>
    </author>
   </authorgroup>
 <title>An in-depth look at the virtual folder mechanism</title>
  <abstract>
  <para>
  This document describes a different way of approaching mail
  organization and how all things are possible in this brave new
  world. This document does not describe physical storage issues nor
  interface issues.
  </para>
  <para>
  Historically mail has been organized into folders. These folders
  usually mapped to a single storage medium. The relationship between
  mail organization and storage medium was one to one. There was one
  mail organization for every storage medium. This scheme had its
  limitations.
  </para>
  <para>  
  Efforts at categorization are only meaningful at the instant that
  one categorizes. To find any piece of data, regardless of how well
  it was categorized, requires some amount of searching. Therefore, any
  attempt to eliminate searching is doomed to fail. It's time to embrace
  searching as a way of life.
  </para>
  <para>  
  These are the terms and their definitions. The example rules used are
  based on the syntax for VM (http://www.wonderworks.com/vm/) by Kyle
  Jones, whose ideas form the basis for this. I'm only adding the
  existence of summary files to aid in scaling. I currently use VM and
  its virtual-folder rules for my daily mail purposes. To date, my only
  complaints are speed (it has no caches) and that, for the uninitiated,
  it's not very user-friendly.
  </para>
  <para>  
  Comments, questions, rants, etc. should be directed at Giao Nguyen
  (grail@cafebabe.org) who will try to address issues in a timely
  manner.
  </para>
 </abstract>
</artheader>
 <sect1 id="definitions">
  <title>Definitions</title>
  <sect2>
   <title>Store</title> 
   <para>
   A location where mail can be found. This may be a file (Berkeley
   mbox), directory (MH), IMAP server, POP3 server, Exchange server,
   Lotus Notes server, or a stack of Post-Its by your monitor fed through
   some OCR system.
   </para>
  </sect2>
  <sect2>
   <title>Message</title> 
   <para>  
   An individual mail message.
   </para>
  </sect2>
  <sect2>
   <title>Vfolder</title> 
   <para>  
   A group of messages sharing some commonality. This is the result of a
   query. A vfolder may be contained in a store, but a store need not
   hold only one vfolder. There is always an implicit vfolder rule which
   matches all messages: a store contains the vfolder which is the result
   of the query (any). The term is short for virtual folder or maybe view
   folder. I dunno.
   </para>
  </sect2>
  <sect2>
   <title>Default-vfolder</title> 
   <para>  
   The vfolder defined by (any) applied to the store. This is not the
   inbox. The inbox could easily be defined by a query. A default rule
   for the inbox could be (new) but it doesn't have to be. Mine happens
   to be (or (unread) (new)).
   </para>
  </sect2>
  <sect2>
   <title>Folder</title> 
   <para>  
   The classical mail folder approach: one message organization per
   store.
   </para>
  </sect2>
  <sect2>
   <title>Query</title> 
   <para>  
   A search for messages. The result of this is a vfolder. There are two
   kinds of queries: named queries and lambda queries. More on this
   later.
   </para>
  </sect2>
  <sect2>
   <title>Summary file </title> 
   <para>  
   An external file that contains pointers to messages which are matches
   for a named query. In addition to pointers, the summary file should
   also contain signatures of the store for sanity checks. When the term
   "index" is used as a verb, it means to build a summary file for a
   given name-value pair.
   </para>
  </sect2>
 </sect1>

 <sect1>
  <title>Queries</title> 
  <para>  
  Named queries are analogous to classical mail folders. Because named
  queries may be reused, summary files are kept as caches to reduce
  the overall cost of viewing a vfolder. Summary files are superior to
  folders in that they allow the same message to appear in multiple
  vfolders without being duplicated. Duplicating messages defeats
  attempts at tagging a message with additional user information
  like annotations. Named queries will define folders.
  </para>
  <para>
  Lambda queries are similar to named queries except that they have no
  name. These are created on the fly by the user to filter out or
  include certain messages.
  </para>
  <para>
  All queries can be layered on top of each other. A lambda query can be 
  layered on a named query and a named query can be layered on a lambda
  query. The possibilities are endless.
  </para>
  <para>
  The layerings can be done as boolean operations (and, or, not). Short
  circuiting should be used. 
  </para>
  <para>
  Examples:
  <programlisting>
  (and (author "Giao")
       (unread))
  </programlisting>
  The (unread) query should only be evaluated on the results of (author
  "Giao").
  <programlisting>
  (or (author "Giao")
      (unread))
  </programlisting>
  Both of these queries should be evaluated. Any matches are added to the
  resulting vfolder.
  </para>
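  <para>
  The layering rules above can be sketched in a few lines of Python.
  This is only an illustration under assumptions of mine: queries are
  written as nested tuples instead of the VM syntax, messages are plain
  dictionaries, and the names PREDICATES and evaluate are hypothetical.
  </para>
<programlisting>
```python
# Hypothetical message model: a dict with "id", "author" and "unread" keys.
PREDICATES = {
    "author": lambda msg, name: name in msg["author"],
    "unread": lambda msg: msg["unread"],
    "any": lambda msg: True,
}


def evaluate(query, candidates):
    """Return the messages in `candidates` that match `query` (a nested tuple)."""
    op, args = query[0], query[1:]
    if op == "and":
        # Short circuit: each operand only sees what survived the previous one.
        for sub in args:
            if not candidates:
                break
            candidates = evaluate(sub, candidates)
        return candidates
    if op == "or":
        # Every operand is evaluated; any match is added to the result.
        matched = set()
        for sub in args:
            matched.update(m["id"] for m in evaluate(sub, candidates))
        return [m for m in candidates if m["id"] in matched]
    if op == "not":
        excluded = {m["id"] for m in evaluate(args[0], candidates)}
        return [m for m in candidates if m["id"] not in excluded]
    # Leaf query such as ("author", "Giao") or ("unread",).
    return [m for m in candidates if PREDICATES[op](m, *args)]
```
</programlisting>
  <para>
  With this sketch, ("and", ("author", "Giao"), ("unread",)) runs the
  unread test only on the messages that already matched the author test.
  </para>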
 </sect1>
 
 <sect1>
  <title>Summary files</title> 
  <para>    
  Summary files are only meaningful when applied to the context of the
  default-vfolder of a store.
  </para>
  <para>
  Summary files should be generated for queries of the form:
  <programlisting>
  (function "constant value")
  </programlisting>
  Summary files should never be generated for queries of the form:
  <programlisting>
  (function (function1))
  
  (and (function "value")
       (another-function "another value"))
  </programlisting>
  Given a query of the form:
  <programlisting>
  (and (function "value")
       (another-function "another value"))
  </programlisting>
  The system should use one summary file for (function "value") and
  another summary file for (another-function "another value"). I will
  call the form (function "constant value") the "plain form".
  </para>
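  <para>
  As a rough sketch of the rule above, the following Python decides
  whether a query is in the plain form and collects the plain
  sub-queries of a compound query, each of which would get its own
  summary file. Queries are modelled as nested tuples, and the function
  names are my own; neither is part of the proposal.
  </para>
<programlisting>
```python
def is_plain(query):
    """True for queries of the form (function "constant value")."""
    return (
        len(query) == 2
        and isinstance(query[0], str)
        and isinstance(query[1], str)
    )


def summary_queries(query):
    """Collect the plain sub-queries whose summary files a query can use."""
    if is_plain(query):
        return [query]
    found = []
    for part in query[1:]:
        # Only nested queries can contain further plain forms.
        if isinstance(part, tuple):
            found.extend(summary_queries(part))
    return found
```
</programlisting>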
  <para>
  It should be noted that the signature of the store should be based on
  the assumption that new data may have been added to the store since
  the application generated the summary file. Signatures generated on
  the entirety of the store will most likely be meaningless for things
  like POP/IMAP servers. 
  </para>
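  <para>
  One possible way to build a signature that tolerates appended data is
  to sign only the part of the store that existed when the summary file
  was written. The sketch below assumes an append-only store that can be
  read as bytes (an mbox file, say); it does not address POP/IMAP
  servers, and the use of SHA-256 and the function names are assumptions
  of mine.
  </para>
<programlisting>
```python
import hashlib


def make_signature(store_bytes):
    """Sign the store as it looked when the summary file was written."""
    return {
        "length": len(store_bytes),
        "digest": hashlib.sha256(store_bytes).hexdigest(),
    }


def signature_still_valid(signature, store_bytes):
    """True if the part of the store covered by the summary is unchanged."""
    covered = store_bytes[: signature["length"]]
    if len(covered) != signature["length"]:
        # The store shrank, so the pointers in the summary cannot be trusted.
        return False
    return hashlib.sha256(covered).hexdigest() == signature["digest"]
```
</programlisting>
  <para>
  Mail appended after the signature was taken leaves the check intact,
  and only the appended region needs to be indexed.
  </para>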
 </sect1>

 <sect1>
  <title>Incremental indexing</title> 
  <para>
  When new messages are detected, all known queries should be evaluated
  on the new messages. Vfolders should be notified of new messages that
  are positive matches for their queries. The indexes generated by this
  process should be merged into the current indexes for the vfolder.
  </para>
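  <para>
  A minimal sketch of this step, assuming each known query is available
  as a predicate over a single message and each index is a list of
  message ids. The names and data shapes are hypothetical.
  </para>
<programlisting>
```python
def index_new_messages(named_queries, summaries, new_messages):
    """Evaluate every known query on the new messages and merge the hits.

    named_queries: dict mapping a query name to a predicate over one message.
    summaries:     dict mapping a query name to a list of message ids.
    Returns the newly matched ids per vfolder, usable as a notification.
    """
    notifications = {}
    for name, matches in named_queries.items():
        hits = [m["id"] for m in new_messages if matches(m)]
        if not hits:
            continue
        index = summaries.setdefault(name, [])
        for message_id in hits:
            # Merge into the existing index without duplicating entries.
            if message_id not in index:
                index.append(message_id)
        notifications[name] = hits
    return notifications
```
</programlisting>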
 </sect1>

 <sect1>
  <title>Can I have multiple stores?</title> 
  <para> 
  I don't see why not. Again, the inbox is a vfolder so you can get a
  unified inbox consisting of all new mail sent to all your stores or
  you can get inboxes for each store or any combination your heart
  desires. You get your cake, eat it, and someone else cleans the dishes!
  </para>
 </sect1>

 <sect1>
  <title>Why all this?</title> 
  <para> 
  Consider the dynamic nature of the following query:
  <programlisting>
  (and (author "Giao")
       (sent-after (today-midnight)))
  </programlisting>
  today-midnight would be a function that is evaluated at run-time to
  calculate the appropriate object.
  </para>
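  <para>
  To make the run-time evaluation concrete, here is a small sketch in
  Python. The message layout (a "sent" timestamp per message) and the
  optional now argument, which exists only to make the function
  testable, are assumptions.
  </para>
<programlisting>
```python
from datetime import datetime, time


def today_midnight(now=None):
    """Return 00:00 of the current day, computed when the query runs."""
    now = now or datetime.now()
    return datetime.combine(now.date(), time.min)


def sent_after(messages, moment):
    """Keep the messages sent strictly after `moment`."""
    return [m for m in messages if m["sent"] > moment]
```
</programlisting>
  <para>
  Because today_midnight is called each time the query is evaluated, the
  same stored query yields a different vfolder tomorrow.
  </para>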
 </sect1>

 <sect1>
  <title>Scenarios of usage and their solutions</title> 
  <sect2>
   <title>Message alterations</title>
   <para>
   This is a fuzzy area that should be left to the UI to handle. Messages
   are altered. Read status is altered when a new message is read, for
   example. How do we handle this if our query is for unread messages?
   Upon viewing, the state would change.
   </para>
   <para>
   One idea is to not evaluate the queries unless we're changing between
   vfolder views. This assumes that one can only view a particular
   vfolder at a time. For multi-vfolder viewing, a message change should
   propagate through the vfolder system. Certain effects (as in our
   example) would not be intuitive.
   </para>
   <para> 
   It would not be a clean solution to make special cases but they may be 
   necessary where certain defined fields are ignored when they are
   changed. Some combination of the above rules can be used. I don't
   think it's an easy solution.
   </para>
  </sect2>
  <sect2>
   <title>Message inclusion and exclusion</title>
   <para>
   Messages are included and excluded also with queries. The final query
   will have the form of:
   <programlisting>
   (and (author "Giao")
        (criteria value)
        (not (criteria other-value)))
   </programlisting>
   The criteria may be userland labels or Message-IDs. What are the
   performance issues involved in this? With short circuiting, it's not
   a major problem.
   </para>
   <para>    
   The criteria and values are determined by the UI. The vfolder
   mechanism isn't concerned with such issues.
   </para>
   <para>   
   Messages can be included and excluded at will. The idea is often
   called "arbitrary inclusion/exclusion". This can be done by
   Message-IDs or other fields. It's been noted that Message-IDs are not
   unique. 
   </para>
   <para>  
   I propose that any given vfolder is allocated an inclusion label and an 
   exclusion label. These should be randomly generated. This should be
   part of the vfolder description. It should be noted that the vfolder
   description has not been drafted yet.
   </para>
   <para>   
   The result is that the rule for a given named query is:
   <programlisting>
   (and (user-query)
        (label inclusion-label)
        (not (label exclusion-label)))
   </programlisting>
   </para>
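   <para>
   Since the vfolder description has not been drafted, the following is
   only a guess at how the labels might be allocated and combined with
   the user's query. The dictionary layout, the label prefixes and the
   function names are all hypothetical; the rule itself follows the form
   given above, with the exclusion label wrapped in a (label ...) test.
   </para>
<programlisting>
```python
import secrets


def new_vfolder(name, user_query):
    """Create a vfolder description with randomly generated labels."""
    return {
        "name": name,
        "user_query": user_query,
        "inclusion_label": "incl-" + secrets.token_hex(8),
        "exclusion_label": "excl-" + secrets.token_hex(8),
    }


def full_rule(vfolder):
    """Combine the user query with the two labels into one rule string."""
    return '(and {query} (label "{incl}") (not (label "{excl}")))'.format(
        query=vfolder["user_query"],
        incl=vfolder["inclusion_label"],
        excl=vfolder["exclusion_label"],
    )
```
</programlisting>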
  </sect2>
  <sect2>
   <title>Query scheduling</title>
   <para>
   Consider the following extremely dynamic queries:
   <programlisting>
   A:
   (and (author "Giao")
        (sent-after (today-midnight)))
   
   B:
   (and (sent-after (today-midnight))
        (author "Giao"))
   
   C:
   (or (author "Giao")
       (sent-after (today-midnight)))
   </programlisting>
   Query A would be significantly faster because (author "Giao") is not
   dynamic. A summary file could be generated for that sub-query. Query B
   is slow but could be optimized if there were a query compiler of some
   sort. Query C demonstrates a query to which no good optimization can
   be applied. Summary files come with a certain amount of
   baggage.
   </para>
   <para>
   It seems then that for boolean 'and' operations, plain forms should be 
   moved forward and other queries should be moved such that they are
   evaluated later. I would expect that the majority of queries would be
   of the plain form.
   </para>
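   <para>
   The reordering described above could be done by a very small query
   compiler. The sketch below, with queries as nested tuples and
   hypothetical function names, moves plain forms to the front of every
   'and' and leaves 'or' untouched.
   </para>
<programlisting>
```python
def is_plain(query):
    """True for queries of the form (function "constant value")."""
    return len(query) == 2 and all(isinstance(part, str) for part in query)


def schedule(query):
    """Recursively reorder 'and' operands so plain forms are evaluated first."""
    if not isinstance(query, tuple) or not query:
        return query
    op = query[0]
    operands = [schedule(part) for part in query[1:]]
    if op == "and":
        # list.sort is stable, so operands keep their order within each group.
        operands.sort(
            key=lambda part: 0 if isinstance(part, tuple) and is_plain(part) else 1
        )
    return (op, *operands)
```
</programlisting>
   <para>
   Applied to query B, this yields query A; query C comes back unchanged.
   </para>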
   <para>  
   The baggage is twofold. First, the summary file is tied to the query
   and to the store where the query originates. Second, a hashing
   function for strings needs to be applied to the query so that the
   query and the summary file can be associated. This hashing function
   could be similar to the hashing function described in Brian Kernighan
   and Rob Pike's "The Practice of Programming". (FIXME: Stick page
   number here)
   </para>
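   <para>
   As an illustration, here is a simple multiplicative string hash of
   the general kind found in programming texts, used to derive a summary
   file name. The multiplier of 31, the 32-bit range and the file naming
   scheme are assumptions of mine and are not taken from the book or
   from any existing implementation.
   </para>
<programlisting>
```python
MULTIPLIER = 31  # assumption: a small odd multiplier commonly used for strings


def query_hash(text, buckets=2**32):
    """Multiplicative hash: h = MULTIPLIER * h + byte, modulo `buckets`."""
    h = 0
    for byte in text.encode("utf-8"):
        h = (MULTIPLIER * h + byte) % buckets
    return h


def summary_file_name(store_id, query_text):
    """Hypothetical naming scheme tying a summary file to store and query."""
    return "{}-{:08x}.summary".format(store_id, query_hash(query_text))
```
</programlisting>
   <para>
   The query text would need to be written in one canonical form before
   hashing, otherwise two spellings of the same query would end up with
   different summary files.
   </para>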
  </sect2>
  <sect2>
   <title>Archives</title>
   <para>
   Many people are concerned that archives won't be preserved, archives
   aren't supported, and many other archive related issues. This is the
   short version.
   </para>
   <para>    
   Archives are just that, archives. Archives are stores. Take your
   vfolder, export it to a store. You are done. If you load up the store
   again, then the default-vfolder of that store is the view of the
   vfolder, except the query is different.
   </para>
   <para>    
   The point of vfolders is not to do away with the classical folder
   representation but to move the queries to the front, where they
   make data management easier, because ordinary people think in terms
   of queries rather than in terms of files.
   </para>    
  </sect2>
 </sect1>

 <sect1>
  <title>Miscellany</title>
  <sect2>
   <title>Annotations</title>
   <para>
   There should be a scheme to add annotations to messages. Common mail
   user agents have used a tag in the message header to mark messages as
   read/unread, for example. Extending this, we have the ability to add
   our own data to a message to add meaning to it. If we have a good
   scheme for doing this, new possibilities open up.
   </para>
   <sect3>
    <title>Keywords</title>  
    <para>
    When sending a message, a message could have certain keywords attached 
    to it. While this can be done with the subject line, the subject line
    has a tendency to be munged by other mail applications. One popular
    example is the "[rR]e:" prefix. Using the subject line also breaks the 
    "contract" with other mail user agents. Using keywords in another
    field in the message header allows the sender to assist the recipient
    in organizing data automatically. Note that the sender can only
    provide hints as the sender is unlikely to know the organization
    schemes of the recipient.
    </para>
   </sect3>
  </sect2>
  <sect2>
   <title>Scope</title>  
   <para>
   Let us assume that we have multiple stores. Does a query work on a
   given store? Or does it work on all stores? Or is it configurable such 
   that a query can work on a user-selected list of stores?
   </para>
  </sect2>
 </sect1>

 <sect1>
  <title>Alternatives to the above</title>
  <para>
  Jim Meyer (purp@selequa.com) is putting together some notes on where
  annotations need to be located. They'll be placed here, along with
  any contributions I may have to them.
  </para>
 </sect1>
</article>