<!doctype article PUBLIC "-//Davenport//DTD DocBook V3.0//EN" []>

<!-- SGMLized by Bertrand <Bertrand.Guiheneuf@aful.org> -->

<article id="index">
  <artheader>
    <authorgroup>
      <author>
    <firstname>Giao</firstname>
    <surname>Nguyen</surname>
      </author>
    </authorgroup>

    <title>An in-depth look at the virtual folder mechanism</title>
    <abstract>
      <para>
    This document describes a different way of approaching mail
    organization and how all things are possible in this brave new
    world. This document does not describe physical storage issues
    nor interface issues.
      </para>
      <para>
    Historically mail has been organized into folders. These
    folders usually mapped to a single storage medium. The
    relationship between mail organization and storage medium was
    one to one. There was one mail organization for every storage
    medium. This scheme had its limitations.
      </para>
      <para>  
    Efforts at categorization are only meaningful at the instant that
    one categorizes. To find any piece of data, regardless of how well
    it was categorized, requires some amount of searching. Therefore, any
    attempt to eliminate searching is doomed to fail. It's time to embrace
    searching as a way of life.
      </para>
      <para>  
    The terms used in this document are defined below. The example rules
    are based on the syntax of VM (http://www.wonderworks.com/vm/) by Kyle
    Jones, whose ideas form the basis for this. I'm only adding the
    existence of summary files to aid in scaling. I currently use VM and
    its virtual-folder rules for my daily mail purposes. To date, my only
    complaints are speed (it has no caches) and that, for the uninitiated,
    it's not very user-friendly.
      </para>
      <para>  
    Comments, questions, rants, etc. should be directed at Giao Nguyen
    (grail@cafebabe.org) who will try to address issues in a timely
    manner.
      </para>
    </abstract>
  </artheader>

  <!-- Definitions -->
  <sect1 id="definitions">
    <title>Definitions</title>
    <sect2>
      <title>Store</title> 
      <para>
    A location where mail can be found. This may be a file (Berkeley
    mbox), a directory (MH), an IMAP server, a POP3 server, an Exchange
    server, a Lotus Notes server, or even a stack of Post-Its by your
    monitor fed through some OCR system.
      </para>
    </sect2>

    <sect2>
      <title>Message</title> 
      <para>  
    An individual mail message.
      </para>
    </sect2>
    <sect2>
      <title>Vfolder</title> 
      <para>  
    A group of messages sharing some commonality. This is the result of a
    query. A vfolder may be contained in a store, but a store need not
    hold only one vfolder. There is always an implicit vfolder rule which
    matches all messages: a store contains the vfolder which is the result
    of the query (any). The name is short for virtual folder or maybe view
    folder. I dunno.
      </para>
    </sect2>
    <sect2>
      <title>Default-vfolder</title> 
      <para>  
    The vfolder defined by (any) applied to the store. This is not the
    inbox. The inbox could easily be defined by a query. A default rule
    for the inbox could be (new) but it doesn't have to be. Mine happens
    to be (or (unread) (new)).
      </para>
    </sect2>
    <sect2>
      <title>Folder</title> 
      <para>  
    The classical mail folder approach: one message organization per
    store.
      </para>
    </sect2>
    <sect2>
      <title>Query</title> 
      <para>  
    A search for messages. The result of this is a vfolder. There are two
    kinds of queries: named queries and lambda queries. More on this
    later.
      </para>
    </sect2>
    <sect2>
      <title>Summary file</title>
      <para>  
    An external file that contains pointers to messages which are matches
    for a named query. In addition to pointers, the summary file should
    also contain signatures of the store for sanity checks. When the term
    "index" is used as a verb, it means to build a summary file for a
    given name-value pair.
      </para>
    </sect2>
  </sect1>

  <!-- Queries -->
  <sect1>
    <title>Queries</title> 
    <para>  
      Named queries are analogous to classical mail folders. Because named
      queries may be reused, summary files are kept as caches to reduce
      the overall cost of viewing a vfolder. Summary files are superior to
      folders in that they allow the same message to appear in multiple
      vfolders without duplication. Duplicating a message defeats attempts
      at tagging it with additional user information such as annotations,
      because each copy would have to be tagged separately. Named queries
      will define folders.
    </para>
    <para>
      Lambda queries are similar to named queries except that they have no
      name. These are created on the fly by the user to filter out or
      include certain messages.
    </para>
    <para>
      All queries can be layered on top of each other. A lambda query can be 
      layered on a named query and a named query can be layered on a lambda
      query. The possibilities are endless.
    </para>
    <para>
      The layering is done with boolean operations (and, or, not).
      Short-circuit evaluation should be used.
    </para>
    <para>
      Examples:
      <programlisting>
(and (author "Giao")
  (unread))
      </programlisting>
      The (unread) query should only be evaluated on the results of (author
      "Giao").
      <programlisting>
(or (author "Giao")
  (unread))
      </programlisting>
      Both of these queries should be evaluated. Any matches are added to the
      resulting vfolder.
    </para>
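    <para>
      The layering rules above can be sketched in Python. This is only
      a minimal sketch, not a definitive implementation: it assumes
      queries are written as nested tuples and messages as dictionaries
      with "author" and "read" keys. The evaluate function and those
      field names are hypothetical. It shows 'and' handing each
      sub-query only the survivors of the previous one, while 'or'
      evaluates every sub-query on the full input.
      <programlisting>
def evaluate(query, messages):
    """Return the messages (dicts) that match a nested-tuple query."""
    op, *args = query
    if op == "and":
        # Short circuit: each sub-query only sees what survived so far.
        result = messages
        for sub in args:
            if not result:
                break
            result = evaluate(sub, result)
        return result
    if op == "or":
        # Every sub-query is evaluated on the full input; any match is kept.
        keep = set()
        for sub in args:
            keep.update(id(m) for m in evaluate(sub, messages))
        return [m for m in messages if id(m) in keep]
    if op == "not":
        excluded = {id(m) for m in evaluate(args[0], messages)}
        return [m for m in messages if id(m) not in excluded]
    if op == "any":
        return list(messages)
    if op == "author":
        return [m for m in messages if args[0] in m["author"]]
    if op == "unread":
        return [m for m in messages if not m["read"]]
    raise ValueError("unknown query function: " + op)
      </programlisting>
      Calling evaluate(("and", ("author", "Giao"), ("unread",)), messages)
      runs the (unread) test only on the messages already matched by
      (author "Giao").
    </para>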
  </sect1>

  <!-- Summary files -->
  <sect1>
    <title>Summary files</title> 
    <para>    
      Summary files are only meaningful when applied to the context of the
      default-vfolder of a store.
    </para>
    <para>
      Summary files should be generated for queries of the form:
      <programlisting>
(function "constant value")
      </programlisting>
      Summary files should never be generated for queries of the form:
      <programlisting>
(function (function1))

(and (function "value")
  (another-function "another value"))
      </programlisting>
      Given a query of the form:
      <programlisting>
(and (function "value")
  (another-function "another value"))
      </programlisting>
      the system should use one summary file for (function "value") and
      another summary file for (another-function "another value"). I will
      call the (function "constant value") form the "plain form".
    </para>
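    <para>
      As a rough illustration, the plain forms inside a compound query
      could be collected as follows. The nested-tuple representation
      and the names is_plain and plain_forms are assumptions made for
      this sketch. Each plain form returned would map to one summary
      file, and anything else would be evaluated without one.
      <programlisting>
def is_plain(query):
    """True for (function "constant value"): a name plus one string."""
    return (
        isinstance(query, tuple)
        and len(query) == 2
        and isinstance(query[0], str)
        and isinstance(query[1], str)
    )


def plain_forms(query):
    """Collect the plain forms inside a query, in order of appearance."""
    if is_plain(query):
        return [query]
    found = []
    for part in query[1:]:
        if isinstance(part, tuple):
            found.extend(plain_forms(part))
    return found
      </programlisting>
      A query such as (function (function1)) contains no plain form, so
      this sketch returns an empty list for it.
    </para>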
    <para>
      The signature of the store should be designed on the assumption
      that new data may have been added to the store since the application
      generated the summary file. A signature computed over the entirety
      of the store will most likely be meaningless for stores such as
      POP or IMAP servers.
    </para>
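    <para>
      One possible way to build such a signature, sketched under the
      assumption of a byte-oriented, append-only store such as an mbox
      file: record how many bytes were indexed along with a digest of
      just that part. The function names and the choice of SHA-1 are
      illustrative only. A server-backed store would need a different
      notion of "the part already indexed", which this sketch does not
      cover.
      <programlisting>
import hashlib


def make_signature(store_bytes):
    """Record how much of the store was indexed and a digest of that part."""
    return len(store_bytes), hashlib.sha1(store_bytes).hexdigest()


def check_signature(store_bytes, signature):
    """Return "unchanged", "appended" or "invalid" for a stored signature."""
    length, digest = signature
    if length > len(store_bytes):
        return "invalid"  # the store shrank, pointers cannot be trusted
    if hashlib.sha1(store_bytes[:length]).hexdigest() != digest:
        return "invalid"  # the already indexed part was rewritten
    return "unchanged" if length == len(store_bytes) else "appended"
      </programlisting>
      With this scheme, mail appended to the store leaves the summary
      file usable, and only the new tail needs to be indexed.
    </para>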
  </sect1>

  <!-- Incremental Indexing -->
  <sect1>
    <title>Incremental indexing</title> 
    <para>
      When new messages are detected, all known queries should be evaluated
      on the new messages. Vfolders should be notified of new messages that
      are positive matches for their queries. The indexes generated by this
      process should be merged into the current indexes for each vfolder.
    </para>
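    <para>
      A minimal sketch of that loop follows. It assumes each known
      query is available as a predicate on a message and each index is
      a plain list of message ids. The name index_new_messages and the
      data layout are hypothetical.
      <programlisting>
def index_new_messages(queries, indexes, new_messages):
    """Evaluate every known query on new messages and merge the matches.

    queries maps a vfolder name to a predicate on a message.
    indexes maps a vfolder name to the list of message ids it holds.
    Returns the ids added per vfolder, to be used as notifications.
    """
    notifications = {}
    for name, matches in queries.items():
        fresh = [m["id"] for m in new_messages if matches(m)]
        if fresh:
            # Merge the new matches into the existing index.
            indexes.setdefault(name, []).extend(fresh)
            notifications[name] = fresh
    return notifications
      </programlisting>
      Only the new messages are examined, so the cost depends on the
      amount of new mail rather than on the size of the store.
    </para>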
  </sect1>

  <!-- Can I have multiple stores -->
  <sect1>
    <title>Can I have multiple stores?</title> 
    <para> 
      I don't see why not. Again, the inbox is a vfolder, so you can get a
      unified inbox consisting of all new mail sent to all your stores, or
      you can get inboxes for each store, or any combination your heart
      desires. You get your cake, eat it, and someone else cleans the dishes!
    </para>
  </sect1>

  <!-- Why all this? -->
  <sect1>
    <title>Why all this?</title> 
    <para> 
      Consider the dynamic nature of the following query:
      <programlisting>
(and (author "Giao")
  (sent-after (today-midnight)))
      </programlisting>
      today-midnight would be a function that is evaluated at run-time to
      calculate the appropriate object.
    </para>
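    <para>
      For illustration, such a run-time function might look like the
      sketch below. The "sent" field, the sent_after helper and the
      optional now argument (there only to make the behaviour easy to
      check) are assumptions, not part of any existing interface.
      <programlisting>
from datetime import datetime, time


def today_midnight(now=None):
    """Return the most recent midnight, computed when called."""
    now = now or datetime.now()
    return datetime.combine(now.date(), time.min)


def sent_after(messages, moment):
    """Keep the messages whose "sent" datetime is later than moment."""
    return [m for m in messages if m["sent"] > moment]
      </programlisting>
      Because today_midnight is evaluated every time the query runs,
      the same query yields a different vfolder from one day to the
      next.
    </para>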
  </sect1>

  <!-- Scenarios of usage and their solutions -->
  <sect1>
    <title>Scenarios of usage and their solutions</title> 
    <sect2>
      <title>Message alterations</title>
      <para>
    This is a fuzzy area that should be left to the UI to handle. Messages
    are altered. Read status is altered when a new message is read, for
    example. How do we handle this if our query is for unread messages?
    Upon viewing, the state would change and the message would no longer
    match.
      </para>
      <para>
    One idea is to not re-evaluate the queries unless we're changing between
    vfolder views. This assumes that only one vfolder can be viewed at a
    time. For multi-vfolder viewing, a message change should propagate
    through the vfolder system. Certain effects (as in our example) would
    not be intuitive.
      </para>
      <para> 
    Special cases are not a clean solution, but they may be necessary:
    changes to certain defined fields could be ignored. Some combination
    of the above rules can be used. I don't think there is an easy
    solution.
      </para>
    </sect2>
    <sect2>
      <title>Message inclusion and exclusion</title>
      <para>
    Messages are also included and excluded with queries. The final query
    will have the form:
    <programlisting>
(and (author "Giao")
  (criteria value)
  (not (criteria other-value)))
    </programlisting>
    The criteria may be labels of some sort, such as userland labels or
    Message-IDs. What are the performance issues involved in this? With
    short circuiting, it's not a major problem.
      </para>
      <para>    
    The criteria and values are determined by the UI. The vfolder
    mechanism isn't concerned with such issues.
      </para>
      <para>   
    Messages can be included and excluded at will. The idea is often
    called "arbitrary inclusion/exclusion". This can be done by
    Message-IDs or other fields. It's been noted that Message-IDs are not
    unique. 
      </para>
      <para>  
    I propose that any given vfolder is allocated an inclusion label and an 
    exclusion label. These should be randomly generated. This should be
    part of the vfolder description. It should be noted that the vfolder
    description has not been drafted yet.
      </para>
      <para>   
    The resulting rule for a given named query is:
    <programlisting>
(and (user-query)
  (label inclusion-label)
  (not (label exclusion-label)))
    </programlisting>
      </para>
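      <para>
    A sketch of how the labels might be generated and combined with
    the user's query. Since the vfolder description has not been
    drafted, the dictionary layout, the label prefixes and the
    function names here are all placeholders. Writing the exclusion
    term as (not (label exclusion-label)) is also an assumption about
    the intended form.
    <programlisting>
import secrets


def new_vfolder(user_query):
    """Attach freshly generated inclusion and exclusion labels to a query."""
    return {
        "query": user_query,
        "inclusion-label": "incl-" + secrets.token_hex(8),
        "exclusion-label": "excl-" + secrets.token_hex(8),
    }


def full_rule(vfolder):
    """Combine the user query with the two labels of the vfolder."""
    return (
        "and",
        vfolder["query"],
        ("label", vfolder["inclusion-label"]),
        ("not", ("label", vfolder["exclusion-label"])),
    )
    </programlisting>
    Random labels keep the inclusion and exclusion marks of one vfolder
    from colliding with those of another.
      </para>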
    </sect2>
    <sect2>
      <title>Query scheduling</title>
      <para>
    Consider the following extremely dynamic queries:
    <programlisting>
A:
(and (author "Giao")
  (sent-after (today-midnight)))

B:
(and (sent-after (today-midnight))
  (author "Giao"))

C:
(or (author "Giao")
  (sent-after (today-midnight)))
    </programlisting>
    Query A would be significantly faster because (author "Giao") is not
    dynamic, so a summary file could be generated for it. Query B is
    slow but could be optimized if there were a query compiler of some
    sort. Query C demonstrates a query for which there is no good
    optimization. Summary files come with a certain amount of baggage,
    described below.
      </para>
      <para>
    It seems then that for boolean 'and' operations, plain forms should be 
    moved forward and other queries should be moved such that they are
    evaluated later. I would expect that the majority of queries would be
    of the plain form.
      </para>
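      <para>
    The reordering could be done by a very small query compiler along
    these lines. This is a sketch under the assumption that queries
    are nested tuples; schedule and is_plain are hypothetical names.
    Only 'and' is reordered, since every operand of an 'or' has to be
    evaluated anyway.
    <programlisting>
def is_plain(query):
    """True for (function "constant value"): a name plus one string."""
    return (
        isinstance(query, tuple)
        and len(query) == 2
        and isinstance(query[1], str)
    )


def schedule(query):
    """Move plain forms to the front of every 'and', keeping their order."""
    if not isinstance(query, tuple) or is_plain(query):
        return query
    op, *args = query
    args = [schedule(a) for a in args]
    if op == "and":
        # sorted() is stable, so plain forms keep their relative order.
        args = sorted(args, key=lambda a: not is_plain(a))
    return (op, *args)
    </programlisting>
    Applied to a query shaped like query B, with the dynamic term
    first, this produces the shape of query A.
      </para>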
      <para>  
    Summary files carry two requirements. First, a summary file is tied
    to both the query and the store where the query originates. Second,
    a hashing function for strings needs to be computed over the query so
    that the query and the summary file can be associated. This hashing
    function could be similar to the one described in Brian Kernighan
    and Rob Pike's "The Practice of Programming". (FIXME: Stick page
    number here)
      </para>
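      <para>
    As an illustration only, a simple multiplicative string hash could
    tie a query to its summary file as sketched below. The multiplier,
    the 32-bit range and the file naming scheme are assumptions made
    for this example and are not quoted from any reference.
    <programlisting>
MULTIPLIER = 31  # assumed constant for this sketch
BUCKETS = 2 ** 32  # assumed range of the hash value


def query_hash(text):
    """Multiplicative hash: h = MULTIPLIER * h + byte, over all bytes."""
    h = 0
    for byte in text.encode("utf-8"):
        h = (MULTIPLIER * h + byte) % BUCKETS
    return h


def summary_file_name(store_id, query_text):
    """Derive a summary file name from the store and the query text."""
    return "%s-%08x.summary" % (store_id, query_hash(query_text))
    </programlisting>
    Including the store in the name reflects the first requirement,
    that a summary file belongs to one query on one store.
      </para>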
    </sect2>
    <sect2>
      <title>Archives</title>
      <para>
    Many people are concerned that archives won't be preserved, archives
    aren't supported, and many other archive related issues. This is the
    short version.
      </para>
      <para>    
    Archives are just that, archives. Archives are stores. Take your
    vfolder, export it to a store. You are done. If you load up the store
    again, then the default-vfolder of that store is the view of the
    vfolder, except the query is different.
      </para>
      <para>    
    The point of vfolders is not to do away with the classical folder
    representation but to move queries to the front, where they make
    data management easier for people who think in terms of queries
    rather than files. Ordinary people don't think in terms of files.
      </para>    
    </sect2>
  </sect1>

  <!-- Miscellany -->
  <sect1>
    <title>Miscellany</title>
    <sect2>
      <title>Annotations</title>
      <para>
    There should be a scheme to add annotations to messages. Common mail
    user agents have used a tag in the message header to mark messages as
    read or unread, for example. Extending this, we can add our own data
    to a message to give it meaning. A good scheme for doing this opens
    up new possibilities.
      </para>
      <sect3>
    <title>Keywords</title>  
    <para>
      When sending a message, a message could have certain keywords attached 
      to it. While this can be done with the subject line, the subject line
      has a tendency to be munged by other mail applications. One popular
      example is the "[rR]e:" prefix. Using the subject line also breaks the 
      "contract" with other mail user agents. Using keywords in another
      field in the message header allows the sender to assist the recipient
      in organizing data automatically. Note that the sender can only
      provide hints as the sender is unlikely to know the organization
      schemes of the recipient.
    </para>
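    <para>
      A small sketch of the idea using Python's standard email package.
      The text above only says "another field", so the use of a
      Keywords header field here is an assumption, as are the helper
      names add_keywords and read_keywords.
      <programlisting>
from email.message import EmailMessage


def add_keywords(message, keywords):
    """Store sender-chosen keywords in a Keywords header field."""
    del message["Keywords"]  # drop any earlier value; no error if absent
    message["Keywords"] = ", ".join(keywords)


def read_keywords(message):
    """Return the keywords from the header as a list, or [] if none."""
    raw = message.get("Keywords", "")
    return [word.strip() for word in raw.split(",") if word.strip()]
      </programlisting>
      The subject line is left untouched, and the recipient is free to
      treat the keywords as hints or to ignore them.
    </para>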
      </sect3>
    </sect2>
    <sect2>
      <title>Scope</title>  
      <para>
    Let us assume that we have multiple stores. Does a query work on a
    given store? Or does it work on all stores? Or is it configurable such 
    that a query can work on a user-selected list of stores?
      </para>
    </sect2>
  </sect1>

  <!-- Alternatives to the above -->
  <sect1>
    <title>Alternatives to the above</title>
    <para>
      Jim Meyer (purp@selequa.com) is putting together some notes on where
      annotations need to be located. They'll be located here, along with
      any contributions I may have to them.
    </para>
  </sect1>
</article>