3 Replies Latest reply on Apr 2, 2013 11:59 AM by rhauch

    LIMIT 1 Bug and thoughts about JOINs

    clementp

      Just filed https://issues.jboss.org/browse/MODE-1865: LIMIT 1 doesn't work properly with JOINs. (The workaround is to use LIMIT 2 and trim the results yourself.) The optimization that bails out early in a LHS/RHS nested join is incorrectly pushed down.

       

      As part of the bug hunting process, I discovered a potentially much more significant issue in terms of JOIN performance. I suspect the developers know about it but I'd like to solicit some feedback from the community on whether this is an actual problem.

       

      Essentially (quoting from my comment in the bug), the NestedLoopJoinComponent yields all rows from both the LHS and RHS join selectors. This can be a pretty significant problem: both selectors must yield a relatively small number of rows or else it will blow up, because the complexity of the join is basically O(N * M), where N is the number of rows yielded by the LHS selector and M the number yielded by the RHS selector. For instance, in a typical folder repository in ModeShape, if one were to search for a file within a parent folder (i.e. find all files named foo.txt under /tmp), ModeShape would load all files named foo.txt (which can be numerous) and all folders named "/tmp" (which hopefully is unique). A typical SQL engine would instead first find the folder named "/tmp" and then search for all files named "foo.txt" with a parent of "/tmp" (in ModeShape's case, instructing Lucene to look for documents with a path starting with /tmp).

       

      This is, BTW, not how a typical SQL join works, since it would likely "drive" the RHS of the join with results from the LHS. See http://blogs.msdn.com/b/craigfr/archive/2006/07/26/679319.aspx
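To make the difference concrete, here is a minimal sketch of the two strategies. It is not ModeShape code: the `Node` type, the `comparisons` counter, and the in-memory `indexByParent` map (standing in for an index lookup such as a Lucene path query) are all assumptions made up for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class JoinSketch {
    // Hypothetical row type: a file name plus the path of its parent folder.
    static final class Node {
        final String name;
        final String parentPath;
        Node(String name, String parentPath) {
            this.name = name;
            this.parentPath = parentPath;
        }
    }

    // Counts how many RHS rows each strategy had to look at.
    static int comparisons = 0;

    // Both sides fully loaded, every pair compared: O(N * M).
    static List<Node> naiveJoin(List<String> folderPaths, List<Node> files) {
        List<Node> out = new ArrayList<>();
        for (String folder : folderPaths) {
            for (Node file : files) {
                comparisons++;
                if (file.parentPath.equals(folder)) {
                    out.add(file);
                }
            }
        }
        return out;
    }

    // Stand-in for an index on the parent path (assumption, not a real ModeShape structure).
    static Map<String, List<Node>> indexByParent(List<Node> files) {
        Map<String, List<Node>> index = new HashMap<>();
        for (Node file : files) {
            index.computeIfAbsent(file.parentPath, k -> new ArrayList<>()).add(file);
        }
        return index;
    }

    // LHS-driven join: each LHS row issues a targeted lookup on the RHS,
    // so only matching RHS rows are ever touched.
    static List<Node> drivenJoin(List<String> folderPaths, Map<String, List<Node>> filesByParent) {
        List<Node> out = new ArrayList<>();
        for (String folder : folderPaths) {
            List<Node> matches = filesByParent.getOrDefault(folder, Collections.<Node>emptyList());
            comparisons += matches.size();
            out.addAll(matches);
        }
        return out;
    }
}
```

With one "/tmp" folder and many foo.txt files scattered across the repository, the naive version looks at every file, while the driven version only looks at the files that actually sit under "/tmp".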

       

      Looking at the code, I can't see how JOINs can be efficient without redesigning how query execution works (instead of execute() returning a List of Object[], it probably needs to allow cursor-like access to the query data). The query planner also needs to know how to have the LHS of a join create RHS sub-queries, so that ModeShape isn't loading all rows from both sides whenever it joins.
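Roughly what I have in mind, as a sketch only: the names `drivenJoin`, `rhsSubQuery`, and `limit` below are hypothetical and do not correspond to anything in the ModeShape code base. The join is exposed as a lazy cursor (a plain `java.util.Iterator` here), each LHS row triggers its own RHS sub-query, and LIMIT is applied on top of the joined cursor rather than pushed down into either side.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Function;

class CursorJoinSketch {
    // Lazily joins two sides: for every LHS row, run a RHS sub-query
    // (a hypothetical callback) and stream its rows to the caller.
    static <L, R> Iterator<R> drivenJoin(final Iterator<L> lhs,
                                         final Function<L, Iterator<R>> rhsSubQuery) {
        return new Iterator<R>() {
            private Iterator<R> current = Collections.emptyIterator();

            @Override
            public boolean hasNext() {
                // Advance to the next LHS row only when the current RHS cursor is drained.
                while (!current.hasNext() && lhs.hasNext()) {
                    current = rhsSubQuery.apply(lhs.next());
                }
                return current.hasNext();
            }

            @Override
            public R next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return current.next();
            }
        };
    }

    // LIMIT applied to the joined rows; stops pulling as soon as enough rows are collected.
    static <T> List<T> limit(Iterator<T> rows, int max) {
        List<T> out = new ArrayList<>();
        while (out.size() < max && rows.hasNext()) {
            out.add(rows.next());
        }
        return out;
    }
}
```

Because the limit sits above the join, an LHS row with no RHS match is simply skipped instead of ending the query early, and no further sub-queries are issued once the limit is reached.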

        • 1. Re: LIMIT 1 Bug and thoughts about JOINs
          rhauch

          Yes, there are lots of complexities in implementing a full-blown query optimizer and executor, especially around joins. We've taken an initial pass at the basic functionality, but as you've noted there are areas where ModeShape's query implementation is less than ideal.

           

          A well-written, efficient, and functional query engine is a huge beast in and of itself, and I'm not sure we have the resources or the interest to turn our simplistic (yet functional and comparatively pretty good) query system into a truly optimized implementation. So IMO we should continue to make small improvements to our query system, but we should plan to replace it with a real query engine. In fact, it was always our long-term plan to embed the Teiid query engine, once it was small and lightweight enough. Recently, the Teiid folks have made it possible to embed the relational query engine, which is the heart of the Teiid system.

           

          We need to wrap up 3.2, but I would be very pleased to have our community start working on this effort. I've logged MODE-1869 to encompass this effort, and initially targeted it to 3.3.

           

          Thoughts?

          • 2. Re: LIMIT 1 Bug and thoughts about JOINs
            clementp

            Thanks for the response. I guess it's possible to denormalize the data structure in order to obviate the need for joins right now until Teiid is integrated. I'll be happy to help or try out the new code when the time comes around.

             

            Speaking of which, when is 3.2 coming out?

            • 3. Re: LIMIT 1 Bug and thoughts about JOINs
              rhauch

              See "Estimated for upcoming 3.2.0.Final release" for the latest on the 3.2 release timeframe.