LIMIT 1 Bug and thoughts about JOINs
clementp Mar 22, 2013 9:55 PM
Just filed https://issues.jboss.org/browse/MODE-1865, where LIMIT 1 doesn't work properly with JOINs. (There is a workaround: use LIMIT 2 and then trim the results yourself.) The optimization to bail out in a LHS/RHS nested join is incorrectly pushed down.
As part of the bug hunting process, I discovered a potentially much more significant issue in terms of JOIN performance. I suspect the developers know about it but I'd like to solicit some feedback from the community on whether this is an actual problem.
Essentially (quoting from my comment in the bug), the NestedLoopJoinComponent yields all rows from both the LHS and RHS join selectors. This can be a pretty significant problem: both selectors must yield a relatively small number of rows or else the join blows up, since its complexity is basically O(N * M), where N is the number of rows yielded by the LHS selector and M the number yielded by the RHS selector. For instance, in a typical folder repository in ModeShape, if one were to search for a file within a parent folder (i.e. find all files named foo.txt under /tmp), ModeShape would load all files named foo.txt (which can be numerous) and all folders named "/tmp" (which hopefully is unique). A typical SQL engine would instead first find the folder named "/tmp" and then search for all files named "foo.txt" with a parent of "/tmp" (in ModeShape's case, instructing Lucene to look for documents with a path starting with /tmp).
This is, BTW, not how a typical SQL query join would work, since it would likely "drive" the RHS of the join with results from the LHS. http://blogs.msdn.com/b/craigfr/archive/2006/07/26/679319.aspx
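To make the difference concrete, here is a rough Java sketch of the two strategies. It is not ModeShape code: JoinSketch, joinAllRows, joinDriven, parentOf and the filesUnder lookup are all hypothetical names, and the lookup function merely stands in for whatever constrained sub-query (e.g. a Lucene path query) the engine would issue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical illustration only; none of these names are ModeShape classes.
final class JoinSketch {

    // Materialize-then-join: both sides are fully loaded up front and every
    // pair is compared, so the work is proportional to N * M.
    static List<String[]> joinAllRows(List<String> folders, List<String> files) {
        List<String[]> out = new ArrayList<>();
        for (String folder : folders) {
            for (String file : files) {
                if (parentOf(file).equals(folder)) {
                    out.add(new String[] {folder, file});
                }
            }
        }
        return out;
    }

    // LHS-driven join: each LHS row parameterizes a narrower RHS lookup, so
    // only RHS rows that can match that LHS row are ever fetched.
    static List<String[]> joinDriven(List<String> folders,
                                     Function<String, List<String>> filesUnder) {
        List<String[]> out = new ArrayList<>();
        for (String folder : folders) {
            for (String file : filesUnder.apply(folder)) {
                out.add(new String[] {folder, file});
            }
        }
        return out;
    }

    // Returns the parent path of an absolute path ("/tmp/foo.txt" -> "/tmp").
    static String parentOf(String path) {
        int slash = path.lastIndexOf('/');
        return slash <= 0 ? "/" : path.substring(0, slash);
    }
}
```

Both methods return the same rows; the difference is how many RHS rows have to be loaded to get there. In the driven version the cost depends on how selective the per-row lookup is, which is the part an index would have to make cheap.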
Looking at the code, I can't see how JOINs can be efficient without redesigning how query execution works (instead of execute() returning a List of Object[], it probably needs to allow cursor-like access to the query data). Also, the query planner needs to know how to have the LHS of a join create RHS sub-queries, so that ModeShape isn't loading all rows from both sides whenever it joins.
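As a rough idea of what cursor-like access could look like, here is a minimal sketch built on plain java.util.Iterator. Everything in it (CursorJoinSketch, join, limit, rhsFor) is a made-up name for illustration and says nothing about how ModeShape's components are actually structured. The join pulls one LHS row at a time, opens an RHS cursor for that row, and the consumer applies LIMIT by simply not pulling any further.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Function;

// Hypothetical illustration only; not ModeShape's API.
final class CursorJoinSketch {

    // Lazy join: rows are produced on demand. For each LHS row, rhsFor opens
    // an RHS cursor (standing in for an RHS sub-query driven by that row).
    static <L, R> Iterator<Object[]> join(Iterator<L> lhs,
                                          Function<L, Iterator<R>> rhsFor) {
        return new Iterator<Object[]>() {
            private L current;
            private Iterator<R> rhs;

            @Override
            public boolean hasNext() {
                // Advance to the next LHS row until an RHS cursor has a row.
                while (rhs == null || !rhs.hasNext()) {
                    if (!lhs.hasNext()) {
                        return false;
                    }
                    current = lhs.next();
                    rhs = rhsFor.apply(current);
                }
                return true;
            }

            @Override
            public Object[] next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return new Object[] {current, rhs.next()};
            }
        };
    }

    // LIMIT as a consumer: stops pulling from the cursor after n rows.
    static List<Object[]> limit(Iterator<Object[]> rows, int n) {
        List<Object[]> out = new ArrayList<>();
        while (out.size() < n && rows.hasNext()) {
            out.add(rows.next());
        }
        return out;
    }
}
```

With this shape, limit(join(...), 1) touches only the first LHS row and its RHS cursor, instead of materializing both sides first.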