5 Replies Latest reply on Feb 21, 2012 9:00 AM by Randall Hauch

    Location and Identification Properties

    Brian Wallis Master

      I have a connector I'm playing with and I am unsure about part of the interface.

       

      org.modeshape.graph.Location.create(Path path, Property idProperty)

       

      and other variations on create.

       

      What are identification properties? Are these just the properties that I see in the nodes at the JCR frontend? If they aren't, then how do I get properties into the nodes I am creating via the graph api?

       

      thanks

        • 1. Re: Location and Identification Properties
          Randall Hauch Master

          The "identification properties" are just the subset of regular properties that the connector uses to uniquely identify the node, for example "jcr:uuid" or "mode:uuid". (ModeShape's JCR layer stores the UUID for a non-referenceable node in the "mode:uuid" property, but as soon as that node is made referenceable the UUID is kept in "jcr:uuid". In either case, Location has a method to get the UUID object.)

           

          The Location instances are used to uniquely identify a node and to encapsulate everything necessary to do this: the Path and any identification properties. IOW, a Location can contain just a Path, or just identification properties, or all of the above. But Location objects never contain the normal properties for the node. Every node is identifiable by just the path, or by the "identification properties" (e.g., "jcr:uuid" or "mode:uuid" properties).

           

          If your connector talks to an external system that doesn't have the ability to track the UUID of the node, you should only use the Path to create the Location. But if the external system can track it, it is more efficient to make use of it and to create a Location with both the Path and UUID.

           

          Generally, a connector that understands UUIDs for nodes should anticipate Location objects in requests to have either a Path, or a UUID, or both. But it should always set the actual location on the request to a Location object with both the Path and UUID. This way, the JCR layer will be able to choose (based upon the situation) whether to send requests with a Path, or a UUID, or both.
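For anyone who finds this easier to follow in code, here is a minimal sketch of the "accept either identifier, answer with both" idea. The `NodeLocation` and `LocationIndex` classes are hypothetical stand-ins invented for this example; they are not ModeShape's `Location` class or any part of its connector API, and a real connector would look things up in its external system rather than in two maps.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.UUID;

// Hypothetical stand-in for a location: a path, a UUID, or both.
// This is NOT org.modeshape.graph.Location.
final class NodeLocation {
    final String path; // may be null when only the UUID is known
    final UUID uuid;   // may be null when only the path is known

    NodeLocation(String path, UUID uuid) {
        if (path == null && uuid == null) {
            throw new IllegalArgumentException("a location needs a path, a UUID, or both");
        }
        this.path = path;
        this.uuid = uuid;
    }
}

// Hypothetical in-memory index that tracks both identifiers for every node.
final class LocationIndex {
    private final Map<String, UUID> uuidByPath = new HashMap<>();
    private final Map<UUID, String> pathByUuid = new HashMap<>();

    void register(String path, UUID uuid) {
        uuidByPath.put(path, uuid);
        pathByUuid.put(uuid, path);
    }

    // Accepts a location carrying either identifier and returns the
    // "actual" location carrying both, or empty if no such node exists.
    // When a UUID is supplied it is used for the lookup in preference to the path.
    Optional<NodeLocation> resolve(NodeLocation requested) {
        if (requested.uuid != null) {
            String path = pathByUuid.get(requested.uuid);
            return path == null
                ? Optional.empty()
                : Optional.of(new NodeLocation(path, requested.uuid));
        }
        UUID uuid = uuidByPath.get(requested.path);
        return uuid == null
            ? Optional.empty()
            : Optional.of(new NodeLocation(requested.path, uuid));
    }
}
```

Preferring the UUID when both identifiers are present is a choice made for this sketch only; the point is simply that the answer always carries both.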

           

          If a connector only tracks the Path of nodes (and not the UUID), then the JCR layer has to always look up nodes by path, which may be more expensive for some situations.

           

           

          ... how do I get properties into the nodes I am creating via the graph api?

           

          If you're using the JCR API to create the nodes, then ModeShape will create the appropriate Request objects and send them to the connector. However, if you're asking how a connector populates an incoming ReadNodeRequest object with the properties, that's done by simply calling "addProperty(...)" on the ReadNodeRequest.

           

          Perhaps you're asking a different question tho?

          • 2. Re: Location and Identification Properties
            Brian Wallis Master

            Thanks, that was the question.

             

            When I wrote this connector some months ago I was a bit confused about the separation of ReadAllChildrenRequest and ReadAllPropertiesRequest. Apparently I still am :-)

             

            I think I've got it now.

             

            brian...

            • 3. Re: Location and Identification Properties
              Randall Hauch Master

              I was a bit confused about the separation of ReadAllChildrenRequest and ReadAllPropertiesRequest

               

              For anyone else reading this thread, the basic difference is that ModeShape tries to read only the information necessary to answer the question. Much of the time, ModeShape issues a ReadNodeRequest (which is essentially a single request that is equivalent to a ReadAllChildrenRequest and a ReadAllPropertiesRequest), but occasionally it only needs the properties or only needs the children.

               

              The default RequestProcessor contains a default implementation of ReadNodeRequest that issues a process(ReadAllPropertiesRequest) call followed by a process(ReadAllChildrenRequest) call. Since these are called on the same connection within the same process and within the same transaction, it will be pretty quick. However, some connectors may want to override the process(ReadNodeRequest) method to do their work more efficiently.
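The shape of that default, and of an override, can be sketched roughly as follows. Every class here (`PropertiesRequest`, `ChildrenRequest`, `NodeRequest`, `SketchProcessor` and its subclasses) is a hypothetical stand-in written for this example, not ModeShape's RequestProcessor or request types. The `lookups` counter is an assumption added purely to make the difference visible.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for the three kinds of read request.
class PropertiesRequest {
    final String path;
    final Map<String, String> properties = new LinkedHashMap<>();
    PropertiesRequest(String path) { this.path = path; }
}

class ChildrenRequest {
    final String path;
    final List<String> children = new ArrayList<>();
    ChildrenRequest(String path) { this.path = path; }
}

class NodeRequest {
    final String path;
    final Map<String, String> properties = new LinkedHashMap<>();
    final List<String> children = new ArrayList<>();
    NodeRequest(String path) { this.path = path; }
}

abstract class SketchProcessor {
    abstract void process(PropertiesRequest request);
    abstract void process(ChildrenRequest request);

    // Default: answer a whole-node read by composing the two narrower reads.
    void process(NodeRequest request) {
        PropertiesRequest props = new PropertiesRequest(request.path);
        process(props);
        ChildrenRequest kids = new ChildrenRequest(request.path);
        process(kids);
        request.properties.putAll(props.properties);
        request.children.addAll(kids.children);
    }
}

// Implements only the two narrow reads; each one visits the pretend store once.
class CountingProcessor extends SketchProcessor {
    int lookups = 0; // how many times the pretend backing store was consulted

    @Override
    void process(PropertiesRequest request) {
        lookups++;
        request.properties.put("title", "example");
    }

    @Override
    void process(ChildrenRequest request) {
        lookups++;
        request.children.add(request.path + "/child");
    }
}

// Overrides the whole-node read so one visit to the store fills in everything.
class SingleLookupProcessor extends CountingProcessor {
    @Override
    void process(NodeRequest request) {
        lookups++;
        request.properties.put("title", "example");
        request.children.add(request.path + "/child");
    }
}
```

In this sketch a whole-node read costs two lookups with `CountingProcessor` and one with `SingleLookupProcessor`, while both fill the request with the same data. Whether an override pays off in a real connector depends on how expensive each trip to the external system is.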

              • 4. Re: Location and Identification Properties
                Brian Wallis Master

                I have a further question about the ReadAllChildren and ReadAllProperties requests.

                 

                ReadAllChildren would be called once per node to find all the child nodes and then ReadAllProperties will be called for each child node. Is that correct?

                 

                So, if I have a data set and document model that ends up with 100,000 nodes under the root node then there will be a call that returns 100,000 child nodes and then 100,000 calls to ReadAllProperties.

                 

                This isn't likely to work, is it?

                 

                Under this point of my model there are unlikely to be more than a few hundred child nodes at any level but at the top level I have a very flat data set without much to distinguish the nodes unless I introduce something artificial (such as a couple of levels introduced using a few bytes from a sha1 hash of a key value as suggested elsewhere).

                 

                I've been wondering about the model I've decided on and am wondering if the large number of top level nodes is going to cause problems over time.

                 

                Any recommendations on this? I cannot see this working well with the 2.x style of connectors and don't know enough about how 3.x will work to know.

                 

                brian...

                • 5. Re: Location and Identification Properties
                  Randall Hauch Master

                  ReadAllChildren would be called once per node to find all the child nodes and then ReadAllProperties will be called for each child node. Is that correct?

                  It depends. Sometimes ModeShape only needs to know the properties for a node, and so it might issue a ReadAllPropertiesRequest to the connector. For example, ModeShape really only indexes the properties of a node; yes it uses the children when (re)indexing a subgraph of some depth, but at the bottom of that subgraph it only needs the properties and will not need the children.

                   

                  This is somewhat of an edge case - most of the time ModeShape needs the whole node. You can always implement all of the process(...) methods in your RequestProcessor subclass - it takes more work to implement, but the benefit is that each method does what is expected and no more.

                  So, if I have a data set and document model that ends up with 100,000 nodes under the root node then there will be a call that returns 100,000 child nodes and then 100,000 calls to ReadAllProperties.

                   

                  This isn't likely to work, is it?

                   

                  Under this point of my model there are unlikely to be more than a few hundred child nodes at any level but at the top level I have a very flat data set without much to distinguish the nodes unless I introduce something artificial (such as a couple of levels introduced using a few bytes from a sha1 hash of a key value as suggested elsewhere).

                   

                  I've been wondering about the model I've decided on and am wondering if the large number of top level nodes is going to cause problems over time.

                   

                  Any recommendations on this? I cannot see this working well with the 2.x style of connectors and don't know enough about how 3.x will work to know.

                   

                  ModeShape 2.x (and Jackrabbit, for that matter) doesn't work as well with flat hierarchies. If there's a way for you to use multiple levels to represent that "very flat data set", then ModeShape 2.x will perform better. And I would suggest that versioning nodes that have 100K children might be pretty expensive, so that should be avoided.
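One way to get those extra levels is the approach mentioned in the question: derive a couple of intermediate folder names from the first bytes of a SHA-1 hash of each node's key. A minimal sketch, assuming the key is a string that is already a valid path segment; the `ShardedPaths` class and the two-level layout are illustrative choices, not anything ModeShape provides.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class ShardedPaths {
    // Builds "/xx/yy/<key>", where xx and yy are the first two bytes of the
    // key's SHA-1 hash written as two lowercase hex digits each.
    static String pathFor(String key) {
        byte[] hash;
        try {
            hash = MessageDigest.getInstance("SHA-1")
                                .digest(key.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is one of the algorithms every Java platform must supply.
            throw new IllegalStateException("SHA-1 not available", e);
        }
        // Mask with 0xff so negative bytes format as 00..ff rather than sign-extended.
        return String.format("/%02x/%02x/%s", hash[0] & 0xff, hash[1] & 0xff, key);
    }
}
```

For example, `ShardedPaths.pathFor("abc")` returns `"/a9/99/abc"`. Two one-byte levels give 65,536 possible leaf folders, so 100,000 keys would average fewer than two nodes per folder, with at most 256 children at each of the two upper levels. A single level would give 256 folders of roughly 390 nodes each.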

                   

                  ModeShape 3.0, on the other hand, will be asking the connector for an iterator over the references to the children (where a reference is basically a string identifier), and this iterator will be embedded within the NodeIterator returned by "getNodes()" and "getNodes(String)". Each connector will be free to implement that as it likes. When ModeShape stores the node internally in Infinispan, we start out by storing a node's references to all children in the JSON document for that node. But a background process will look for nodes with a number of child node references beyond some threshold and break up the list of child node references into multiple JSON documents. We'll have to tune what that threshold is, but the rest of ModeShape already can deal with multiple segments of child node references.
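To illustrate the general idea of breaking one long list of child references into segments, here is a small sketch. It is not ModeShape's actual algorithm or storage format: the `ChildSegments` class is hypothetical, the references are plain strings, and the segment size is whatever the caller passes in, since the real threshold is described above as still to be tuned.

```java
import java.util.ArrayList;
import java.util.List;

final class ChildSegments {
    // Splits the references into consecutive segments of at most
    // maxPerSegment entries each, preserving the original order.
    static List<List<String>> split(List<String> refs, int maxPerSegment) {
        if (maxPerSegment < 1) {
            throw new IllegalArgumentException("maxPerSegment must be at least 1");
        }
        List<List<String>> segments = new ArrayList<>();
        for (int start = 0; start < refs.size(); start += maxPerSegment) {
            int end = Math.min(start + maxPerSegment, refs.size());
            // Copy the slice so each segment is independent of the input list.
            segments.add(new ArrayList<>(refs.subList(start, end)));
        }
        return segments;
    }
}
```

Splitting five references with a segment size of two yields three segments holding two, two, and one reference. Reading the children back in order is then a matter of walking the segments one after another.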