7 Replies Latest reply on Jul 27, 2012 9:06 AM by eric.wittmann

    Implementing the S-RAMP Query API

    eric.wittmann

      Overview

      The S-RAMP specification (section 4) specifies an XPath 2.0 based query language.  The S-RAMP Query grammar is actually a subset of the XPath 2.0 grammar.  The challenge for an S-RAMP implementation is to properly parse and execute queries that conform to this dialect.  This discussion is intended to help us decide the best approach to implementing the Query dialect for the Overlord S-RAMP implementation.

       

      Issues/Concerns

      For the most part, the S-RAMP Query dialect is a straightforward use of XPath.  However, there are some nuances that must be considered when deciding on an implementation approach.  Some of these, in no particular order, are:

       

      • Parsing the query dialect into an AST/model
      • Validating the resulting model
      • Ensuring the model is suitable for use by the S-RAMP provider (ModeShape right now)
      • Handling the classifiedBy set of custom s-ramp functions
      • Determining whether arbitrary relationship depths are supported (e.g. give me all WSDLs that import any XSD that imports xyz.xsd); this is not clear to me after reading the spec

       

      Implementation Thoughts

      I think the first decision to make is:  how will we parse the query into a model?

       

      Some options that come to mind:

      • Use javacc or some other parser generator to create our own parser specific to the S-RAMP Query dialect (I think I might favor this approach).
      • Use an existing XPath query parser such as Saxon or Xalan (it's unclear whether these could easily be leveraged to simply do the parsing)

       

      Once the query is parsed into a model, then I think static validation is a simple matter of analysis of the model.  I don't think there are decisions to be made here.

       

      We do need to make sure that the model we produce is easily consumed by the provider.  I'll assume that any model we produce can be visited/traversed to make it easy for a provider to convert the query into something native.  In the case of our existing ModeShape provider, I think this actually means converting from S-RAMP XPath into ModeShape XPath.  These two dialects are different enough that I believe we would definitely want to convert to a Model and then back again.  For other providers the resulting provider query might be SQL or some other native language.
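
      A minimal sketch of what a traversable model might look like, covering only the query shapes used later in this post. Every class and method name here (QueryNode, QueryVisitor, ArtifactSet, RelationshipPredicate, PropertyEquals, SrampTextRenderer) is hypothetical and not taken from the S-RAMP spec or any existing code. The sample visitor simply renders the model back to S-RAMP query text; a provider would presumably write its own visitor that emits its native query language instead.

```java
// Hypothetical AST for a small slice of the S-RAMP query dialect.
// All names are illustrative assumptions, not part of any real API.
interface QueryNode {
    <R> R accept(QueryVisitor<R> visitor);
}

interface QueryVisitor<R> {
    R visitArtifactSet(ArtifactSet node);
    R visitRelationship(RelationshipPredicate node);
    R visitPropertyEquals(PropertyEquals node);
}

// e.g. /s-ramp/wsdl/WsdlDocument[predicate]; predicate may be null.
final class ArtifactSet implements QueryNode {
    final String model;
    final String type;
    final QueryNode predicate;

    ArtifactSet(String model, String type, QueryNode predicate) {
        this.model = model;
        this.type = type;
        this.predicate = predicate;
    }

    public <R> R accept(QueryVisitor<R> visitor) {
        return visitor.visitArtifactSet(this);
    }
}

// e.g. includedXsds[predicate]; predicate may be null.
final class RelationshipPredicate implements QueryNode {
    final String relationship;
    final QueryNode predicate;

    RelationshipPredicate(String relationship, QueryNode predicate) {
        this.relationship = relationship;
        this.predicate = predicate;
    }

    public <R> R accept(QueryVisitor<R> visitor) {
        return visitor.visitRelationship(this);
    }
}

// e.g. @someProperty='true'
final class PropertyEquals implements QueryNode {
    final String property;
    final String value;

    PropertyEquals(String property, String value) {
        this.property = property;
        this.value = value;
    }

    public <R> R accept(QueryVisitor<R> visitor) {
        return visitor.visitPropertyEquals(this);
    }
}

// One possible visitor: renders the model back to S-RAMP query text.
// A provider would emit its own native query syntax here instead.
final class SrampTextRenderer implements QueryVisitor<String> {
    public String visitArtifactSet(ArtifactSet node) {
        return "/s-ramp/" + node.model + "/" + node.type + predicateSuffix(node.predicate);
    }

    public String visitRelationship(RelationshipPredicate node) {
        return node.relationship + predicateSuffix(node.predicate);
    }

    public String visitPropertyEquals(PropertyEquals node) {
        return "@" + node.property + "='" + node.value + "'";
    }

    // Wraps a nested predicate in brackets, or returns "" when there is none.
    private String predicateSuffix(QueryNode predicate) {
        return predicate == null ? "" : "[" + predicate.accept(this) + "]";
    }
}
```

      Calling query.accept(new SrampTextRenderer()) on a model walks it recursively, so nested relationship predicates need no special handling in the visitor.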

       

      That leads to the following concern I have:  how do we handle S-RAMP classifications?

       

      I think this is a challenging issue that needs to be solved by each provider in its own way.  For example, if the provider supports something like XPath and supports user-defined functions (e.g. eXist XML database) then the solution might be straightforward.  However, not all providers will have an easy time with this.  It's possible that the ontologies will need to be normalized by the provider for easy querying.  That would work reasonably well, although making ontology changes after the fact becomes a more challenging operation (existing normalized artifacts would need to be updated).

       

      Another (unrelated) issue with querying is relationships.  How deep into the relationship hierarchy can the query go?

       

      The S-RAMP spec provides Query examples like this:

       

      /s-ramp/wsdl/WsdlDocument[includedXsds[@someProperty='true']]

       

      This query should return all WsdlDocument artifacts that include an XSD which has 'someProperty' set to 'true'.  That's pretty straightforward, but what about a query like this:

       

      /s-ramp/wsdl/WsdlDocument[includedXsds[includedXsds[@someProperty='true']]]

       

      That query should return all WsdlDocument artifacts that include any XSD that itself includes an XSD with 'someProperty' set to 'true'.  You can see how the depth is infinitely deep.  Does the S-RAMP spec allow this?  I couldn't quite tell based on my reading of it.

       

      I'll stop here - let the discussion begin!!

        • 1. Re: Implementing the S-RAMP Query API
          rhauch

          Assuming that we're talking about the implementation of the S-RAMP Query API, one option is for the implementation to take the query (supplied in the client's request), parse it into an S-RAMP query AST (multiple approaches are possible, as described by Eric), and then generate a JCR/ModeShape query using one of several techniques that use the JCR API to:

           

          1. build a JCR-QOM (Query Object Model) object structure
          2. create a JCR-SQL2 query expression (as a string)
          3. create a JCR XPath query expression (as a string)

           

          Each of these has its advantages, though I think #3 is the least attractive because the JCR XPath language has been deprecated and is simply not very powerful (compared to the other two). Plus, if you're already creating an AST, generating JCR XPath may not be much different or easier than generating JCR-SQL2 or JCR-QOM. The JCR-QOM approach and the JCR-SQL2 approach are basically equivalent in functionality/capability and vary only in the code necessary to transform the S-RAMP Query AST into a JCR query. Option #1 is a bit more efficient (as ModeShape doesn't need to parse a string, but instead can directly use the supplied QOM), but Option #2 is probably a bit easier to code to (perhaps using a simple visitor on the AST).

           

           

          We do need to make sure that the model we produce is easily consumed by the provider.  I'll assume that any model we produce can be visited/traversed to make it easy for a provider to convert the query into something native.  In the case of our existing ModeShape provider, I think this actually means converting from S-RAMP XPath into ModeShape XPath.  These two dialects are different enough that I believe we would definitely want to convert to a Model and then back again.  For other providers the resulting provider query might be SQL or some other native language.

           

          I might suggest that the S-RAMP implementation parse and analyze/validate the submitted query, but then submit the query in S-RAMP AST form to the provider, which should then return the results in some form defined by the S-RAMP implementation. The JCR API defines a similar approach (the JCR-QOM is basically its AST), and it works great! In fact, it would allow the various providers to implement the query execution in the most appropriate way. For example, the ModeShape-specific provider could then use ModeShape's extension to the JCR-SQL2 grammar that adds support for non-correlated subqueries, set queries, limits, offsets, set criteria, reference criteria, depth criteria, and other features. (There's also a corresponding extension to the JCR-JQOM object model within ModeShape's public API.) We added these extensions because they're extremely useful, and without them you have to do a LOT more coding.

           

           

          I think the first decision to make is:  how will we parse the query into a model?

           

          Some options that come to mind:

          • Use javacc or some other parser generator to create our own parser specific to the S-RAMP Query dialect (I think I might favor this approach).
          • Use an existing XPath query parser such as Saxon or Xalan (it's unclear whether these could easily be leveraged to simply do the parsing)

           

          Another option is to simply write a parser to handle the subset of XPath. This avoids the complexities associated with parser generators, while keeping the code very simple and straightforward. ModeShape did this with its JCR XPath parser, which produces a simple AST using domain-specific classes.
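
          To give a feel for how small a hand-written parser can be, here is a rough recursive-descent sketch. It is not ModeShape's parser and does not cover the full S-RAMP grammar: it assumes queries of the form /s-ramp/model/type with optional nested relationship predicates ending in a single @property='value' test, and it does not handle whitespace, functions, or boolean operators. The names SrampQueryParser and ParsedQuery are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical result holder; a real implementation would build a richer AST.
final class ParsedQuery {
    final String model;
    final String type;
    final List<String> relationshipPath; // nested relationship names, outermost first
    final String property;               // null when there is no property test
    final String value;                  // null when there is no property test

    ParsedQuery(String model, String type, List<String> relationshipPath,
                String property, String value) {
        this.model = model;
        this.type = type;
        this.relationshipPath = relationshipPath;
        this.property = property;
        this.value = value;
    }
}

// Hand-written recursive-descent parser for a tiny subset of the dialect.
final class SrampQueryParser {
    private final String input;
    private int pos = 0;

    private SrampQueryParser(String input) {
        this.input = input;
    }

    static ParsedQuery parse(String query) {
        SrampQueryParser p = new SrampQueryParser(query);
        ParsedQuery result = p.parseQuery();
        if (p.pos != query.length()) {
            throw p.error("unexpected trailing input");
        }
        return result;
    }

    // query := '/s-ramp/' name '/' name predicate?
    private ParsedQuery parseQuery() {
        expect("/s-ramp/");
        String model = name();
        expect("/");
        String type = name();
        List<String> relationships = new ArrayList<String>();
        String[] test = new String[2]; // [0] = property, [1] = value
        if (peek('[')) {
            parsePredicate(relationships, test);
        }
        return new ParsedQuery(model, type, relationships, test[0], test[1]);
    }

    // predicate := '[' ( '@' name '=' literal | name predicate? ) ']'
    private void parsePredicate(List<String> relationships, String[] test) {
        expect("[");
        if (peek('@')) {
            expect("@");
            test[0] = name();
            expect("=");
            test[1] = literal();
        } else {
            relationships.add(name());
            if (peek('[')) {
                parsePredicate(relationships, test); // recursion handles any depth
            }
        }
        expect("]");
    }

    private String name() {
        int start = pos;
        while (pos < input.length() && isNameChar(input.charAt(pos))) {
            pos++;
        }
        if (start == pos) {
            throw error("expected a name");
        }
        return input.substring(start, pos);
    }

    // literal := "'" characters "'" (no escape handling in this sketch)
    private String literal() {
        expect("'");
        int end = input.indexOf('\'', pos);
        if (end < 0) {
            throw error("unterminated string literal");
        }
        String text = input.substring(pos, end);
        pos = end + 1;
        return text;
    }

    private static boolean isNameChar(char c) {
        return Character.isLetterOrDigit(c) || c == '_' || c == '-' || c == '.';
    }

    private boolean peek(char c) {
        return pos < input.length() && input.charAt(pos) == c;
    }

    private void expect(String token) {
        if (!input.startsWith(token, pos)) {
            throw error("expected '" + token + "'");
        }
        pos += token.length();
    }

    private IllegalArgumentException error(String message) {
        return new IllegalArgumentException(message + " at position " + pos);
    }
}
```

          Each grammar rule maps to one method, which is what keeps this style easy to read and debug compared with generated parser code.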

           

           

          That leads to the following concern I have:  how do we handle S-RAMP classifications?

           

          I think this is a challenging issue that needs to be solved by each provider in its own way.  For example, if the provider supports something like XPath and supports user-defined functions (e.g. eXist XML database) then the solution might be straightforward.  However, not all providers will have an easy time with this.  It's possible that the ontologies will need to be normalized by the provider for easy querying.  That would work reasonably well, although making ontology changes after the fact becomes a more challenging operation (existing normalized artifacts would need to be updated).

           

          I can offer some suggestions related to how they might be handled within the JCR providers. There are two approaches:

           

          1. Define a node structure that represents the tags (ontology), and then create REFERENCE properties on the various S-RAMP node types so that various nodes can simply refer to the appropriate tag node. This may not scale with all JCR implementations, and would probably make updating the tags (ontology) more difficult (as you'd want to reuse nodes rather than replacing the tag/ontology node structure outright), and could cause referential integrity issues. This also makes queries a bit more convoluted and less obvious.
          2. Define STRING multi-valued properties on the various S-RAMP node types whose values are names of the tags. This is highly scalable, and since the S-RAMP implementation would be the one setting these values, it can ensure that only valid tags are used. This would also be much more amenable to updating the tags/ontology, since it wouldn't require that tags removed from the ontology also be removed from the artifact nodes. Queries would also make a lot more sense; for example, "... WHERE [sramp:category] = 'tag1' ..." or even "... WHERE [sramp:category] IN ('tag1','tag2') ...". Finally, there is no referential integrity problem.

           

          BTW, this is where JCR-QOM and JCR-SQL2 offer a lot more flexibility and capability than JCR XPath.

           

           

          Another (unrelated) issue with querying is relationships.  How deep into the relationship hierarchy can the query go?

           

          The S-RAMP spec provides Query examples like this:

           

          /s-ramp/wsdl/WsdlDocument[includedXsds[@someProperty='true']]

           

           

          This query should return all WsdlDocument artifacts that include an XSD which has 'someProperty' set to 'true'.  That's pretty straightforward, but what about a query like this:

           

          /s-ramp/wsdl/WsdlDocument[includedXsds[includedXsds[@someProperty='true']]]

           

           

          That query should return all WsdlDocument artifacts that include any XSD that itself includes an XSD with 'someProperty' set to 'true'.  You can see how the depth is infinitely deep.  Does the S-RAMP spec allow this?  I couldn't quite tell based on my reading of it.

           

          That is a tough one, and I think shows the limitation of using an XPath-based query language. This is far easier in a SQL-like language, especially one that supports joins (like JCR-SQL2 and JCR-QOM).

          • 2. Re: Implementing the S-RAMP Query API
            eric.wittmann

            Each of these has its advantages, though I think #3 is the least attractive because the JCR XPath language has been deprecated and is simply not very powerful (compared to the other two). Plus, if you're already creating an AST, generating JCR XPath may not be much different or easier than generating JCR-SQL2 or JCR-QOM. The JCR-QOM approach and the JCR-SQL2 approach are basically equivalent in functionality/capability and vary only in the code necessary to transform the S-RAMP Query AST into a JCR query. Option #1 is a bit more efficient (as ModeShape doesn't need to parse a string, but instead can directly use the supplied QOM), but Option #2 is probably a bit easier to code to (perhaps using a simple visitor on the AST).

             

            Thanks, that's very helpful on the ModeShape side of things.

             

            I might suggest that the S-RAMP implementation parse and analyze/validate the submitted query, but then submit the query in S-RAMP AST form to the provider, which should then return the results in some form defined by the S-RAMP implementation. The JCR API defines a similar approach (the JCR-QOM is basically its AST), and it works great! In fact, it would allow the various providers to implement the query execution in the most appropriate way.

             

            Yes, that's precisely what I had in mind.  Parse the S-RAMP query into an AST, then pass that model to the provider.  As long as the model is easy to traverse/visit, the provider can do something sensible with it.

             

             

            Another option is to simply write a parser to handle the subset of XPath. This avoids the complexities associated with parser generators, while keeping the code very simple and straightforward. ModeShape did this with its JCR XPath parser, which produces a simple AST using domain-specific classes.

             

            As long as the grammar is simple enough, I think this approach can work well.  I looked at the references (briefly) and it looks pretty good.  I would certainly favor this approach over a parser generator (always a pain to work with).  I believe the S-RAMP query grammar is simple enough.

             

            I can offer some suggestions related to how they might be handled within the JCR providers. There are two approaches:

             

            1. Define a node structure that represents the tags (ontology), and then create REFERENCE properties on the various S-RAMP node types so that various nodes can simply refer to the appropriate tag node. This may not scale with all JCR implementations, and would probably make updating the tags (ontology) more difficult (as you'd want to reuse nodes rather than replacing the tag/ontology node structure outright), and could cause referential integrity issues. This also makes queries a bit more convoluted and less obvious.
            2. Define STRING multi-valued properties on the various S-RAMP node types whose values are names of the tags. This is highly scalable, and since the S-RAMP implementation would be the one setting these values, it can ensure that only valid tags are used. This would also be much more amenable to updating the tags/ontology, since it wouldn't require that tags removed from the ontology also be removed from the artifact nodes. Queries would also make a lot more sense; for example, "... WHERE [sramp:category] = 'tag1' ..." or even "... WHERE [sramp:category] IN ('tag1','tag2') ...". Finally, there is no referential integrity problem.

             

            I think option #2 is the way to go, and is what I was trying to get at with my "normalize the ontology" comments.  The difficulty, I believe, is that since the ontology is hierarchical, and the S-RAMP query spec defines the classifiedByAnyOf and exactlyClassifiedByAnyOf functions, the tags added to an artifact would need to be a normalized set of tags derived from the ontology hierarchy.  This isn't hard, but adds a potential challenge to updating the ontology (if things get removed or relocated in the hierarchy, do we need to update all affected artifacts??).

             

            That is a tough one, and I think shows the limitation of using an XPath-based query language. This is far easier in a SQL-like language, especially one that supports joins (like JCR-SQL2 and JCR-QOM).

             

            Perhaps I'll interpret the spec in a favorable way in this case.

            • 3. Re: Implementing the S-RAMP Query API
              rhauch

              Yes, that's precisely what I had in mind.  Parse the S-RAMP query into an AST, then pass that model to the provider.  As long as the model is easy to traverse/visit, the provider can do something sensible with it.

               

              Especially if you provide some utilities that the providers can use (e.g. some sort of visitor mechanism).

               

               

              I think option #2 is the way to go, and is what I was trying to get at with my "normalize the ontology" comments.  The difficulty, I believe, is that since the ontology is hierarchical, and the S-RAMP query spec defines the classifiedByAnyOf and exactlyClassifiedByAnyOf functions, the tags added to an artifact would need to be a normalized set of tags derived from the ontology hierarchy.  This isn't hard, but adds a potential challenge to updating the ontology (if things get removed or relocated in the hierarchy, do we need to update all affected artifacts??).

               

              Actually, handling hierarchical classes is not too difficult, either. Consider this farcical ontology:  A, B subClassOf A, C subClassOf A, D subClassOf B, E (where "subClassOf" matches that used in OWL). If the artifacts were tagged with only the values corresponding to the classifications they were explicitly assigned (e.g., if classified as "A", then the multi-valued STRING property would contain only the "A" value), then all the artifacts that are classified exactly by any of B or E could be found with this query:

               

                   SELECT * FROM [sramp:artifact] WHERE [sramp:classifiedBy] IN ('B', 'E')

               

              whereas all the artifacts that are classified by any of B or E could be found with this query:

               

                  SELECT * FROM [sramp:artifact] WHERE [sramp:classifiedBy] IN ('B', 'E', 'D')

               

              Note that all we have to do for the 'classified by any of' case is also add to the classification values criteria those classifications that are subtypes of those explicitly included in the S-RAMP query. A simple caching mechanism can return the subtypes for a given classification, and even use a JCR listener to know when to discard the cached information. (And because the subtypes of a given classification are idempotent, the cache doesn't even have to use a lock.) Let me know if you want me to clarify any of this.
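
              Here is a rough sketch of that query-time expansion. It is only an illustration under a few assumptions: the ontology is handed in as a simple child-to-parent map, the class name ClassificationExpander and its methods are invented for this example, and the node type and property names are the ones used in the queries above. The invalidate() method stands in for whatever a JCR listener would call when the ontology changes.

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper that expands classifications to their subtypes at query time.
final class ClassificationExpander {
    // child -> parent, e.g. "B" -> "A" for "B subClassOf A"; roots have no entry.
    private final Map<String, String> parentOf;
    // classification -> itself plus all transitive subtypes. Recomputing an entry
    // always yields the same set, so a racing duplicate computation is harmless.
    private final Map<String, Set<String>> cache = new ConcurrentHashMap<String, Set<String>>();

    ClassificationExpander(Map<String, String> parentOf) {
        this.parentOf = parentOf;
    }

    // Call when the ontology changes (for example from a listener) to drop stale entries.
    void invalidate() {
        cache.clear();
    }

    Set<String> selfAndSubtypes(String classification) {
        Set<String> cached = cache.get(classification);
        if (cached == null) {
            cached = compute(classification);
            cache.put(classification, cached);
        }
        return cached;
    }

    // Repeatedly sweeps the ontology until no new subtype is discovered.
    private Set<String> compute(String root) {
        Set<String> result = new LinkedHashSet<String>();
        result.add(root);
        boolean grew = true;
        while (grew) {
            grew = false;
            for (Map.Entry<String, String> entry : parentOf.entrySet()) {
                if (result.contains(entry.getValue()) && result.add(entry.getKey())) {
                    grew = true;
                }
            }
        }
        return result;
    }

    // exact == true  : only the listed classifications ("exactly classified by any of")
    // exact == false : listed classifications plus their subtypes ("classified by any of")
    String toJcrSql2(Collection<String> classifications, boolean exact) {
        if (classifications.isEmpty()) {
            throw new IllegalArgumentException("at least one classification is required");
        }
        Set<String> values = new LinkedHashSet<String>();
        for (String classification : classifications) {
            if (exact) {
                values.add(classification);
            } else {
                values.addAll(selfAndSubtypes(classification));
            }
        }
        StringBuilder sql = new StringBuilder(
                "SELECT * FROM [sramp:artifact] WHERE [sramp:classifiedBy] IN (");
        String separator = "";
        for (String value : values) {
            // Assumption: single quotes inside a literal are escaped by doubling them.
            sql.append(separator).append('\'').append(value.replace("'", "''")).append('\'');
            separator = ", ";
        }
        return sql.append(')').toString();
    }
}
```

              With the sample ontology, expanding B and E adds only D, since D is the one subtype of B and E has none. Nothing about the hierarchy is written to the artifacts, so an ontology change only requires clearing the cache.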

               

              This approach is much better than storing on the artifacts each classification and its subtypes, because that encodes the hierarchical nature of the classifications inside the persisted store, and this persisted information can then become out of date.

               

              (ModeShape does exactly this with its query system; simply map "classification" to JCR's "node type". Remember that in JCR, a node has a primary node type and 0 or more mixin node types, and the node types form an inheritance hierarchy just like the classification ontology.)

               

              UPDATED the 3rd to last paragraph to more accurately reflect that the classification criteria only needs to be changed.

              • 4. Re: Implementing the S-RAMP Query API
                eric.wittmann

                Note that all we have to do for the 'classified by any of' case is also add to the classification values those classifications that are subtypes of those explicitly included in the S-RAMP query. A simple caching mechanism can return the subtypes for a given classification, and even use a JCR listener to know when to discard the cached information. (And because the subtypes of a given classification are idempotent, the cache doesn't even have to use a lock.) Let me know if you want me to clarify any of this.

                 

                No, no - of course - rip apart the hierarchy as part of the query, don't store the classifications on the artifacts.    I don't know what I was thinking!

                • 5. Re: Implementing the S-RAMP Query API
                  rhauch

                  I just updated my previous post to clarify what I was saying.

                   

                  No, no - of course - rip apart the hierarchy as part of the query, don't store the classifications on the artifacts.    I don't know what I was thinking!

                  Just to be clear, we can and need to store some classification values on the artifacts, but only those to which the artifacts are explicitly assigned. But we shouldn't store on the artifacts any classification values that are inferred based upon the classification ontology's inheritance hierarchy, since we can deal with inheritance at query time using classification criteria.

                  • 6. Re: Implementing the S-RAMP Query API
                    eric.wittmann

                    Agreed!

                     

                    I wasn't clear in my reply, sorry.  Here's what I should have said:

                     

                      "..., don't store the parent classifications on the artifacts (only store those explicitly set)."

                     

                     

                    • 7. Re: Implementing the S-RAMP Query API
                      eric.wittmann

                      By the way - I meant to follow up to say that this is the approach I ended up taking with the query parser.  It seems to have worked out very well.  I did lift a small amount of code from ModeShape to help with tokenizing.

                      Randall Hauch wrote:

                       

                      Another option is to simply write a parser to handle the subset of XPath. This avoids the complexities associated with parser generators, while keeping the code very simple and straightforward. ModeShape did this with its JCR XPath parser, which produces a simple AST using domain-specific classes.