3 Replies Latest reply on Mar 21, 2005 2:33 PM by joe.cheng

    Indexing

    acoliver

       

      "mikezzz" wrote:
      or maybe implement a Lucene index


      This is not such a bad idea. Perhaps not to support the Mail API, but to support protocols that require search. We could use Lucene, or roll our own index that allows joins directly from the index to the message rows. I think this could be a POC task for M4. If it proves useful, implement it in M5.


      My temptation is to hold off on indexing until M5. The reason is that IMAP has various search functions that we'll be half-implementing in M4, and these will dictate things about indexing.

      My only problems with Lucene are that it may be difficult to index concurrently enough to keep writes from getting really slow, and that we'll probably fragment the hell out of things. We may need a different sort of index.

      Having every database do this itself (see http://www.oracle.com/technology/products/text/index.html, in particular the technical whitepaper) would be ideal. However, supporting that across multiple databases is going to be a real challenge.

      "oracle technical whitepaper" wrote:


      SELECT score(1), product_id, product_name
      FROM product_information
      WHERE CONTAINS
      (product_description, 'monitor NEAR "high resolution"', 1) > 0
      ORDER BY score(1) DESC;


      However, with MySQL gaining triggers and PostgreSQL already having them, it is possible that we'll be able to write database packages to accomplish this.

      Thoughts?


        • 1. Re: Indexing

          I don't have many thoughts at this point, mostly questions.

          What are the requirements of the protocols that use search? Is IMAP the only one? MAPI? (I need to do some reading.)

          Is simple keyword/pattern search enough, or do we need to consider terms together (locality, ordering, etc.)? How are various types of searches expressed?

          Does the index need to be up to date with regard to the Mailbox/Mail Store, or could we update asynchronously?
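
          To make the asynchronous option concrete, here is a minimal sketch, not a proposed design. The class `AsyncIndexer`, its method names, and the toy word index are all hypothetical; the point is only that the write path enqueues the message and returns, while a background worker does the indexing, so searches may briefly lag behind the store.

```python
import queue
import threading


class AsyncIndexer:
    """Hypothetical sketch: the store enqueues changes, a worker indexes them later."""

    def __init__(self):
        self._pending = queue.Queue()
        self._index = {}  # word -> set of message ids (toy index, an assumption)
        self._lock = threading.Lock()
        worker = threading.Thread(target=self._run, daemon=True)
        worker.start()

    def message_stored(self, msg_id, body):
        # Called on the write path: cheap, does no indexing work itself.
        self._pending.put((msg_id, body))

    def _run(self):
        # Background worker: drains the queue and updates the index.
        while True:
            msg_id, body = self._pending.get()
            try:
                with self._lock:
                    for word in body.lower().split():
                        self._index.setdefault(word, set()).add(msg_id)
            finally:
                self._pending.task_done()

    def flush(self):
        # Block until everything queued so far has been indexed.
        self._pending.join()

    def search(self, word):
        with self._lock:
            return set(self._index.get(word.lower(), ()))


indexer = AsyncIndexer()
indexer.message_stored(1, "High resolution monitor")
indexer.flush()  # without this, the search below might run before indexing
print(indexer.search("monitor"))
```

          The trade-off is visible in `flush()`: a search issued right after a write is only guaranteed to see the message if something waits for the queue to drain.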

          Can we be pluggable and leverage the database to do the heavy lifting (if it has the functionality to do so) and support a fallback solution for lesser DBs (e.g. HSQL) that is functionally equivalent? Is this necessary?

          I'm not trying to get answers for these now, but to throw up some points of information to steer the implementation (I'm sure there will be others). I will have a think about some of these during M4 (I'll probably just play with a few tools then forget about it :-). I agree it is unlikely we will be able, or need, to produce anything usable until M5. It will probably have to evolve over several milestones. I think it will be a bit of a tough nut to crack and will probably require a few people weighing in on it, at least from the design perspective.
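
          One way to picture the pluggable idea, as a sketch under stated assumptions: every backend exposes the same `search(text)` call, a database-specific backend is used when one is registered, and a plain scan is the fallback. All names here (`ScanSearch`, `NativeTextSearch`, `choose_backend`, the `run_query` callable) are hypothetical, and no real database API is being described.

```python
class ScanSearch:
    """Fallback backend: case-insensitive substring scan over stored bodies."""

    def __init__(self, messages):
        self._messages = messages  # dict: msg_id -> body text

    def search(self, text):
        needle = text.lower()
        return sorted(
            msg_id
            for msg_id, body in self._messages.items()
            if needle in body.lower()
        )


class NativeTextSearch:
    """Placeholder backend that delegates to the database's own text search."""

    def __init__(self, run_query):
        # run_query is a hypothetical callable: text -> iterable of message ids.
        self._run_query = run_query

    def search(self, text):
        return sorted(self._run_query(text))


def choose_backend(db_name, messages, native_backends):
    """Pick the native backend registered for db_name, else fall back to a scan."""
    factory = native_backends.get(db_name)
    return factory() if factory else ScanSearch(messages)


messages = {1: "High resolution monitor", 2: "Keyboard"}
backend = choose_backend("hsqldb", messages, native_backends={})
print(backend.search("MONITOR"))
```

          Whether the two kinds of backend can really be made functionally equivalent is the open question; the sketch only shows where the switch would sit.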

          Mike.

          • 2. Re: Indexing
            acoliver

            Yes. :-) This is good.

            These thoughts lead you to why the big boy is DB-specific. (I envision Exchange as a large round boy in overalls)

            • 3. Re: Indexing
              joe.cheng

               

              "mikezzz" wrote:
              Is simple keyword/pattern search enough, or do we need to consider terms together (locality, ordering, etc.)? How are various types of searches expressed?


              For IMAP, it is simple substring matching--see RFC 3501 section 6.4.4. Note that substring matching is not something Lucene is good at; Lucene really wants to work at the token level. (In fact, I'm not aware of any indexing techniques to speed up simple substring matching, but then again I'm no expert in the field.)
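
              To illustrate the mismatch, here is a toy comparison. It is not Lucene and not a real IMAP implementation; `tokenize`, `token_match`, and `substring_match` are made-up helpers. RFC 3501 specifies string search keys as case-insensitive substring matches, whereas a token-level index only finds whole tokens.

```python
import re


def tokenize(text):
    # Crude stand-in for an analyzer: lowercase alphanumeric runs.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def token_match(text, term):
    # Token-level lookup: the term must equal a whole token.
    return term.lower() in tokenize(text)


def substring_match(text, term):
    # IMAP-style string key: case-insensitive substring of the field.
    return term.lower() in text.lower()


body = "High resolution monitor"
print(token_match(body, "resolution"))   # whole token
print(token_match(body, "solut"))        # fragment inside a token
print(substring_match(body, "solut"))    # same fragment, substring semantics
```

              The fragment "solut" sits inside the token "resolution", so the token lookup misses it while the substring check finds it.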

              It is not clear that fast search for IMAP is very important anyway, since some of the most common IMAP clients only search their own local store. See this flamefest on comp.mail.imap: http://tinyurl.com/5wla3

              By the way, in that thread, the post by Mark Crispin (creator of IMAP) dated Nov. 6 2001 at 7:00pm lists some relatively non-obvious requirements for IMAP SEARCH implementations.