13 Replies Latest reply on Sep 27, 2013 12:30 PM by markaddleman

    Some questions/observations on TEIID-2301

    markaddleman

      I've been playing around with the latest code with a focus on [#TEIID-2301] Programatically invalidate a single execution's cache using cache API - JBoss Issue Tracker.  A few observations and questions:

      • I don't see Invalidation.LAZY referenced anywhere in the engine, and it seems to produce the same behavior as Invalidation.IMMEDIATE.  Should the behavior be different?
      • Perhaps it's because LAZY hasn't been implemented yet, but a continuous join query always consults both executions even if both have LAZY invalidations
      • How would I achieve caching a query for both continuous and non-continuous execution (perhaps both submitted concurrently) while still causing the continuous query to pause between executions?
      • Continuous executions against a stored procedure don't pause executions on IMMEDIATE / LAZY invalidation the same way they do for tables

       

      My code isn't in test case form but I can create some if that will help illustrate.

       

      If the invalidation feature isn't fully baked yet, what Jira should I use to track its progress?

        • 1. Re: Some questions/observations on TEIID-2301
          shawkins

          > I don't see Invalidation.LAZY referenced anywhere in the engine, and it seems to produce the same behavior as Invalidation.IMMEDIATE.  Should the behavior be different?

           

          A specific action is not needed for the engine with lazy invalidation (it's just skipping the lookup).  For most scenarios you probably won't see a difference.

           

          > Perhaps it's because LAZY hasn't been implemented yet, but a continuous join query always consults both executions even if both have LAZY invalidations

           

          Lazy means invalidate after obtaining new results - if other cache accesses are not marked as lazy, or if you fail to produce new results, the original entry will still be used.  With immediate invalidation, the cache entry is removed on the initial consult of the cache.

           

          > How would I achieve caching a query for both continuous and non-continuous execution (perhaps both submitted concurrently) while still causing the continuous query to pause between executions?

           

          I think we're getting features/requirements intermixed.  The caching logic isn't (and likely shouldn't be) concerned with the notion of continuous queries and related source coordination.  The engine's view is just to ask whether there's a valid entry or not and then use it.  That is why the issue initially suggested using the notion of source modifications to notify the engine about invalidated entries.

           

          Coming at this from the high-level view, continuous executions cover two cases:

          1. A non-terminating stream of results (such that the translator never returns null and the query plan doesn't need all results to make progress)

          2. A repeated execution of terminating results.

           

          The second case leaves open the question as to when to re-execute.  One initial proposal was to use a general source hint to coordinate the sources.  Another is some form of a heartbeat source.  At one point it was proposed to add more logic to the RequestOptions.  The latter probably makes the most sense, as the re-execution notion is elevated to an engine concern and doesn't require adulterating sql (all of the source-based approaches suffer from nuances: the sources may not be consulted for whatever reason, there is replanning and possibly different sources, or dynamic sql where the sources aren't known a priori).  This could be done in a sufficiently generic way - such as a Coordinator interface with a single method begin() that is allowed to throw a DNA (DataNotAvailableException), and we'd add RequestOptions.setCoordinator.  Would something like that alleviate a lot of the pause/coordination issues, or do the decisions need to go much deeper into the Executions/ExecutionFactories?
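
          To make the shape concrete, something like this - to be clear, none of it exists yet; it's just a sketch of the proposal:

          import org.teiid.translator.DataNotAvailableException;

          // hypothetical API sketch - Coordinator and RequestOptions.setCoordinator
          // do not exist in Teiid today
          public interface Coordinator {

              /**
               * Called before each execution cycle.  Throwing
               * DataNotAvailableException delays the next cycle.
               */
              void begin() throws DataNotAvailableException;
          }

          // hypothetical usage:
          // RequestOptions options = new RequestOptions().continuous(true);
          // options.setCoordinator(coordinator);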

           

          > Continuous executions against a stored procedure don't pause executions on IMMEDIATE / LAZY invalidation the same way they do for tables

           

          You'll have to elaborate more here as I'm not quite sure what you mean.

          • 2. Re: Some questions/observations on TEIID-2301
            markaddleman

            The caching logic isn't (and likely shouldn't be) concerned with the notion of continuous queries and related source coordination.  The engine's view is just to ask whether there's a valid entry or not and then use it.

            I agree, and your explanation brings into sharp focus the behavior that I had observed.  I had originally attributed the execution pause behavior to the invalidation setting, but now I see that it is truly due to DataNotAvailableException and dataAvailable() functionality.

            Coming at this from the high-level view, continuous executions cover two cases:

            1. A non-terminating stream of results (such that the translator never returns null and the query plan doesn't need all results to make progress)

            2. A repeated execution of terminating results.

             

            The second case leaves open the question as to when to re-execute.  One initial proposal was to use a general source hint to coordinate the sources.  Another is some form of a heartbeat source.  At one point it was proposed to add more logic to the RequestOptions.  The latter probably makes the most sense, as the re-execution notion is elevated to an engine concern and doesn't require adulterating sql (all of the source-based approaches suffer from nuances: the sources may not be consulted for whatever reason, there is replanning and possibly different sources, or dynamic sql where the sources aren't known a priori).  This could be done in a sufficiently generic way - such as a Coordinator interface with a single method begin() that is allowed to throw a DNA (DataNotAvailableException), and we'd add RequestOptions.setCoordinator.  Would something like that alleviate a lot of the pause/coordination issues, or do the decisions need to go much deeper into the Executions/ExecutionFactories?

            You have a very good memory!  Now that we've had a while to play with the continuous executions feature and have a bit better understanding of how all this fits together, I'll try to organize my thoughts around the case of repeated execution of terminating results.  Thinking aloud:

             

            Practically, I see three logics for the coordinator:

            1. The engine should drive a new execution when ALL sources indicate data available
            2. The engine drives a new execution on some polling interval
            3. The engine drives a new execution when ANY source indicates data available

            For our application, we care mostly about cases #2 and #3.  I like that your setCoordinator idea leaves the coordination logic up to the client. 

             

            The polling case is pretty straightforward and allows the client to specify policies for handling late polling intervals.  It does bring up a question of allowing for overlapping executions.  I can see how some applications would care about that but not ours.  Applications that care are probably near real-time and I doubt Teiid can make those kinds of promises anyway.  I'd say skip that use case.
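
            To make the polling case concrete, here's roughly what I'd write against the hypothetical Coordinator sketched above (DataNotAvailableException and its delay constructor are real translator API; everything else is a sketch):

            import org.teiid.translator.DataNotAvailableException;

            public class PollingCoordinator implements Coordinator {

                private final long intervalMs;
                private long lastStart = -1;

                public PollingCoordinator(long intervalMs) {
                    this.intervalMs = intervalMs;
                }

                @Override
                public synchronized void begin() throws DataNotAvailableException {
                    long now = System.currentTimeMillis();
                    if (lastStart >= 0 && now - lastStart < intervalMs) {
                        // too early - ask the engine to wait out the remainder of the interval
                        throw new DataNotAvailableException(intervalMs - (now - lastStart));
                    }
                    lastStart = now; // a late poll simply starts the next interval from now
                }
            }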

             

            The logic for the ANY case is a bit more interesting -

            1. The first time, begin() does not throw DNA, allowing the execution to proceed
            2. The execution factories set up source-specific async notifications indicating that a source's data has changed, probably using the request id + execution id as a key
            3. The executions proceed normally and eventually all end
            4. The coordinator's begin() is called and throws DNA
            5. When any notification from #2 pops, the coordinator informs the engine to restart (see the sketch after this list)
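
            In code, roughly (again assuming the hypothetical Coordinator interface; how the engine re-polls begin() after the DNA is exactly the coordination detail in question):

            import java.util.concurrent.atomic.AtomicBoolean;
            import org.teiid.translator.DataNotAvailableException;

            public class AnySourceCoordinator implements Coordinator {

                // true initially so the first execution proceeds (step 1)
                private final AtomicBoolean dataChanged = new AtomicBoolean(true);

                /** Invoked by the source-specific async notifications (step 2). */
                public void sourceChanged() {
                    dataChanged.set(true);
                }

                @Override
                public void begin() throws DataNotAvailableException {
                    // steps 4 and 5: hold re-execution until some source reports a change
                    if (!dataChanged.getAndSet(false)) {
                        throw new DataNotAvailableException();
                    }
                }
            }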

             

            As you say, caching is a separate issue, but to complete the picture, all participating executions would cache their results using Invalidation.NONE.  When any source's data changes, the appropriate cache entry would be invalidated on the next getCacheDirective() call.  It seems like the coordinator would have to be the repository for all the necessary flags.  In order to clean up memory properly, the coordinator would register some cleanup routine as a command context listener.
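
            On the caching side, I'd expect the translator piece to look roughly like this - CacheDirective and its Invalidation enum are the TEIID-2301 API, while the flag repository is my invention:

            import org.teiid.language.Command;
            import org.teiid.metadata.RuntimeMetadata;
            import org.teiid.translator.CacheDirective;
            import org.teiid.translator.ExecutionContext;
            import org.teiid.translator.ExecutionFactory;
            import org.teiid.translator.TranslatorException;

            // hypothetical helper; in practice the coordinator would own this state
            class InvalidationFlags {
                boolean isInvalidated(ExecutionContext context) {
                    return false; // real logic would consult per-request flags
                }
            }

            public class InvalidatingExecutionFactory extends ExecutionFactory<Object, Object> {

                private final InvalidationFlags flags = new InvalidationFlags();

                @Override
                public CacheDirective getCacheDirective(Command command,
                        ExecutionContext executionContext, RuntimeMetadata metadata)
                        throws TranslatorException {
                    CacheDirective directive = new CacheDirective();
                    if (flags.isInvalidated(executionContext)) {
                        // this source's data changed - refresh on this pass
                        directive.setInvalidation(CacheDirective.Invalidation.IMMEDIATE);
                    } else {
                        // keep serving the cached results
                        directive.setInvalidation(CacheDirective.Invalidation.NONE);
                    }
                    return directive;
                }
            }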

             

            This plan makes me think the following:

            1. The coordinator should be available from the command context
            2. The begin() method should take the CommandContext as an argument
            3. It is probably convenient to put a restartExecute() method in the command context
            4. As you pointed out in another conversation, the execution id can change if the engine has to re-plan.  I still think that's perfectly valid and probably good, but it does bring up a memory leak possibility.  Without some more interaction with the engine, the coordinator doesn't know about re-plans and, thus, won't know to clean up memory associated with execution ids that it will never see again.  For our app, the memory that we're talking about is pretty minor, and we could probably introduce some cleanup heuristic for very long running continuous queries, so we'd be fine.  But it does bring up an interesting point:  how difficult would it be to provide an executionExpired(CommandContext, String executionId) method in the Coordinator's interface?  Alternatively, provide a stable primary key object that the coordinator could use in a WeakHashMap (see the sketch after this list)?
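
            A minimal sketch of the WeakHashMap idea, assuming the engine handed the coordinator a stable, engine-owned key object per execution (the key object and SourceFlags are hypothetical):

            import java.util.Map;
            import java.util.WeakHashMap;

            public class CoordinatorState {

                public static class SourceFlags {
                    public volatile boolean invalidated;
                }

                // entries disappear once the engine drops its reference to the key,
                // e.g. after a re-plan, so no explicit cleanup is needed
                private final Map<Object, SourceFlags> state = new WeakHashMap<Object, SourceFlags>();

                public synchronized SourceFlags flagsFor(Object executionKey) {
                    SourceFlags flags = state.get(executionKey);
                    if (flags == null) {
                        flags = new SourceFlags();
                        state.put(executionKey, flags);
                    }
                    return flags;
                }
            }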

             

            Do you envision the coordinator being consulted on dataAvailable() through a continueExecution(Execution) method?  I don't see much reason for this but it seems like a natural extension.

             

            > Continuous executions against a stored procedure don't pause executions on IMMEDIATE / LAZY invalidation the same way they do for tables

             

            You'll have to elaborate more here as I'm not quite sure what you mean.

            Let me rerun my test and if I still think there's a problem, I'll post on a separate thread with a test case.

            • 3. Re: Some questions/observations on TEIID-2301
              markaddleman

              Thinking about the polling case for the coordinator (#2, above) - the event stream that determines when a poll occurs could be derived from the results of a separate continuous query.  That means it would be useful for the solution to [#TEIID-2577] Translator API to provide session backed connection - JBoss Issue Tracker to apply to the coordinator as well.  I think this means making the connection object available from the command context.

              • 4. Re: Re: Some questions/observations on TEIID-2301
                shawkins

                > Practically, I see three logics for the coordinator:

                1. The engine should drive a new execution when ALL sources indicate data available
                2. The engine drives a new execution on some polling interval
                3. The engine drives a new execution when ANY source indicates data available

                 

                #2 is primarily why introducing a higher-level coordinator makes sense.  #3 you could argue is effectively the logic today, although there may be some considerations there with coordination (for example, you have mentioned before something like a DNA extension that restarts the entire execution cycle).  #1 I'm not sure how it would be different from #3 given the current logic - that is, sources (and really just source queries) are determined on demand.  This means that late binding due to dynamic sql (in which the source accessed could be data driven) or even subquery evaluation that can be avoided won't be considered as a source.  It seems like for #1 to function as you would intuitively want, you have to know ahead of time what sources are involved.

                 

                > When any notification from #2 pops, the coordinator informs the engine to restart

                 

                Yes, having the coordinator saves you from having all of your sources throw their own non-strict indefinite DNA on re-execution, and keeps the conceptual restart of execution more closely aligned to what you want.

                 

                But based upon the subsequent questions and thoughts (and my preliminary response that was getting a little long) I am wondering if it is a fruitful path to take.  It seems like it's leading to a lot of new API that will have quite a few couplings.  Perhaps it would be enough to bake in a few re-execute strategies as part of RequestOptions?  Interval, Any, or All - we would assume that a replan restarts the execution cycle and that for any/all the sources are based upon the previous execution cycle's sources.
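
                Something along these lines, purely hypothetical:

                // hypothetical - sketching what baking the strategies into
                // RequestOptions could look like; none of this is existing API
                public enum ReexecutionStrategy {
                    INTERVAL, // re-execute on a fixed polling interval
                    ANY,      // re-execute when any previous-cycle source reports new data
                    ALL       // re-execute only when all previous-cycle sources report new data
                }

                // RequestOptions options = new RequestOptions().continuous(true);
                // options.setReexecutionStrategy(ReexecutionStrategy.ANY); // hypothetical setter
                // options.setReexecutionInterval(5000);                    // hypothetical, for INTERVAL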

                • 5. Re: Re: Re: Some questions/observations on TEIID-2301
                  markaddleman

                  I'll see your simplification and raise you one:  I believe that all of our use cases can be met with anonymous procedure blocks and two re-execute strategies, any and all.

                   

                  First, let's start with the interval strategy.  I posit a DELAY stored procedure that returns whatever result set is meaningful but its execution behavior is to delay for some time period.  Executing the following anonymous procedure block would achieve the same effect as an Interval re-execution strategy:

                  BEGIN
                    CALL DELAY(...);
                    SELECT * FROM t1 JOIN t2...
                  END

                  I think it's obvious that an event-based re-execution strategy could be achieved using the same technique.
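
                  For completeness, the DELAY procedure itself should be nearly trivial - roughly this as a ProcedureExecution (DELAY is my invention; ProcedureExecution and DataNotAvailableException are real translator API; this ignores execution reuse across cycles):

                  import java.util.Arrays;
                  import java.util.List;
                  import org.teiid.translator.DataNotAvailableException;
                  import org.teiid.translator.ProcedureExecution;
                  import org.teiid.translator.TranslatorException;

                  public class DelayExecution implements ProcedureExecution {

                      private final long delayMs;
                      private boolean waited;
                      private boolean rowReturned;

                      public DelayExecution(long delayMs) {
                          this.delayMs = delayMs;
                      }

                      @Override
                      public void execute() throws TranslatorException {
                      }

                      @Override
                      public List<?> next() throws TranslatorException, DataNotAvailableException {
                          if (!waited) {
                              waited = true;
                              // non-blocking delay: the engine polls again after delayMs
                              throw new DataNotAvailableException(delayMs);
                          }
                          if (!rowReturned) {
                              rowReturned = true;
                              return Arrays.asList(1); // whatever result set is meaningful
                          }
                          return null; // done - lets the block proceed to the real query
                      }

                      @Override
                      public List<?> getOutputParameterValues() throws TranslatorException {
                          return null;
                      }

                      @Override
                      public void close() {
                      }

                      @Override
                      public void cancel() throws TranslatorException {
                      }
                  }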

                   

                  That leaves the issue of Any versus All.  Ultimately, I see this as a convenience, since we can force the Any behavior that we really want.  From our exchanges, I'm gaining a deeper appreciation for how the engine is designed and, specifically in this case, what strict is intended to do and how to best separate caching concerns from re-execution concerns.  I'm going to play around with these ideas over the next few days and see how they work out.

                  • 6. Re: Re: Re: Some questions/observations on TEIID-2301
                    shawkins

                    > I believe that all of our use cases can be met with anonymous procedure blocks

                     

                    Yes, that is brilliant.  As long as you don't mind adulterating your SQL, that clearly conveys the intent of polling.

                     

                    > Ultimately, I see this as a convenience as we can force the Any behavior that we really want

                     

                    Just to make sure we have the same understanding: sans the caching consideration, we effectively do the ANY strategy.  To extend it further would be to add a check prior to re-execution to see if any source has reported dataAvailable after the current execution has ended?  Is there any meaning for a dataAvailable received after results have been pulled, but before the execution cycle has ended?

                     

                    I see ALL, while conceptually pretty clear, as being more problematic.  Given that some nuances of processing, such as dependent joins (unless issued serially), will cause additional reusable executions to be created, is the notion of a source here any reusable execution for a given translator/source combination?

                     

                    > From our exchanges, I'm gaining a deeper appreciation for how the engine is designed and, specifically in this case, what strict is intended to do and how to best separate caching concerns from re-execution concerns.

                     

                    Yes, it's imperative to keep each concern as narrowly scoped as possible.

                    • 7. Re: Some questions/observations on TEIID-2301
                      markaddleman

                      Sans the caching consideration, we effectively do the ANY strategy.  To extend it further would be to add a check prior to re-execution to see if any source has reported dataAvailable after the current execution has ended?

                      I believe so.  I'm still trying to reorient my understanding of throwing DNA and strict, so let me ask a few questions:  Throwing DNA with strict=false allows the engine to re-execute without the same execution indicating dataAvailable?  The only time the engine would re-execute is when some other execution within the query indicates dataAvailable?  We nearly always use strict=true but, I confess, this is because early on we were burned by strict=false: we noticed an infinite looping behavior with it.  I'll have to rerun some experiments now that I have a better understanding.
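
                      To make sure we're talking about the same thing, this is the pattern I mean (the delay constructor and setStrict are translator API as I understand them; the commented semantics are exactly what I'm asking about):

                      import org.teiid.translator.DataNotAvailableException;

                      public class DnaExamples {

                          // thrown from an Execution's next() - semantics per my current understanding
                          static void pause(boolean strict) throws DataNotAvailableException {
                              DataNotAvailableException dna = new DataNotAvailableException(5000);
                              // strict=true: wait the full 5s before this execution is polled again
                              // strict=false: 5s is a hint; the engine may re-poll earlier, e.g. when
                              //               another execution signals dataAvailable() (my question above)
                              dna.setStrict(strict);
                              throw dna;
                          }
                      }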

                       

                      Is there any meaning for a dataAvailable received after results have been pulled, but before the execution cycle has ended?

                      I think so:  It indicates that the source has new data that, presumably, matches the user's query.  Under a continuous query, the client would want to see the result set reflecting the change.  But, all this may be moot.  See below.

                       

                      I see ALL, while conceptually pretty clear, as being more problematic.  Given that some nuances of processing, such as dependent joins (unless issued serially), will cause additional reusable executions to be created, is the notion of a source here any reusable execution for a given translator/source combination?

                      Because my understanding of the engine's behavior was tainted by my misunderstanding of strict, I was under the impression that ALL was the default behavior and ANY was the hard one.  In fact, I was struggling to come up with a practical reason why anyone would want ALL behavior.  We don't want it.  I say we drop it from the discussion unless you can come up with a use case for it.  I can't.

                       

                      Taking a step back for a moment to review some of my assumptions coming into this - I started my thinking with the desire to re-execute queries when a source indicates there is a reason to re-execute the query.  Sources break down into two buckets:  cacheable and not-cacheable.  A query is re-executed on some event (time-based or otherwise).  In all cases, the re-execute event is the same as the invalidation event.  I'm pretty sure that every time we invalidate the cache, we want to re-execute the query.  The not-cacheable sources only have a re-execute event.

                       

                      When a source is cacheable, we can require that it publish a re-execute event after it handles its invalidation event.  The re-execute event would be associated with the engine's request id.  The receiver of the re-execute events could be exposed as a stored proc that emits a result set for any invalidation events matching its execution's request id.  Then, our anonymous procedure block would look something like this:

                       

                      BEGIN
                         SELECT 1 FROM EXECUTION_COORDINATION_LATCH(request_id) LIMIT 1
                         UNION
                         SELECT 1 FROM <event table for non-cacheable sources> LIMIT 1;
                         ...
                      END

                       

                      We can write the EXECUTION_COORDINATION_LATCH as a regular translator.  I don't think there is any new functionality required of the engine beyond anonymous procedure blocks.  The trick, I think, is to get the UNION of re-execution event queries to behave in an ANY fashion.  As you pointed out, that's current engine behavior.  If all this supposition chains together in reality, I see no value in adding a request option for ANY vs ALL behavior.
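
                      A sketch of what I have in mind for the latch execution - the event registry and its keying are hypothetical, ExecutionContext.dataAvailable() is real API, and this ignores execution reuse across cycles:

                      import java.util.Arrays;
                      import java.util.List;
                      import org.teiid.translator.DataNotAvailableException;
                      import org.teiid.translator.ExecutionContext;
                      import org.teiid.translator.ProcedureExecution;
                      import org.teiid.translator.TranslatorException;

                      // hypothetical event plumbing for the re-execute events described above
                      interface ReexecuteEventRegistry {
                          void subscribe(ExecutionContext context, Runnable onEvent);
                          void unsubscribe(ExecutionContext context);
                      }

                      public class LatchExecution implements ProcedureExecution {

                          private final ExecutionContext context;
                          private final ReexecuteEventRegistry registry;
                          private volatile boolean fired;
                          private boolean rowReturned;

                          public LatchExecution(ExecutionContext context, ReexecuteEventRegistry registry) {
                              this.context = context;
                              this.registry = registry;
                          }

                          @Override
                          public void execute() throws TranslatorException {
                              // subscribe to re-execute events published for this request
                              registry.subscribe(context, new Runnable() {
                                  @Override
                                  public void run() {
                                      fired = true;
                                      context.dataAvailable(); // wake the engine
                                  }
                              });
                          }

                          @Override
                          public List<?> next() throws TranslatorException, DataNotAvailableException {
                              if (rowReturned) {
                                  return null; // this cycle's single row has been delivered
                              }
                              if (!fired) {
                                  // negative delay: wait indefinitely for dataAvailable(),
                                  // per my reading of the translator docs
                                  throw new DataNotAvailableException(-1);
                              }
                              fired = false;
                              rowReturned = true;
                              return Arrays.asList(1);
                          }

                          @Override
                          public List<?> getOutputParameterValues() throws TranslatorException {
                              return null;
                          }

                          @Override
                          public void close() {
                              registry.unsubscribe(context);
                          }

                          @Override
                          public void cancel() throws TranslatorException {
                          }
                      }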

                       

                      My next step is to POC this solution.

                      • 8. Re: Some questions/observations on TEIID-2301
                        markaddleman

                        A brief update before I head on vacation:

                        • The anonymous procedure block approach is working very well.  I'm a little surprised and very pleased by how little code is required to get the behavior I'm looking for.
                        • There are a couple of little weirdnesses.  They are mainly due to the lifespan of executions and the fact that (it appears) different CommandContext instances are used for different executions within a UNION.  It might be helpful to add hashCode() and equals() to the CommandContext.  I'll share more when I get back from vacation.
                        • I think I found a bug related to DataNotAvailableException and isForkable.  It appears that when isForkable=false, the delay specified in DataNotAvailableException is not honored.  I have seen similar behavior even when isForkable=true but it's a Heisenbug: when I add logging, it goes away.  It smells like a race condition.  Test case is attached.
                        • 9. Re: Some questions/observations on TEIID-2301
                          markaddleman

                          I don't see a way to upload files on this new message board system.  I have opened [TEIID-2676] Delay in DataNotAvailableException not honored when isForkable=false - JBoss Issue Tracker and attached the test case there.

                          • 10. Re: Some questions/observations on TEIID-2301
                            shawkins

                            > There are a couple of little weirdnesses.  They are mainly due to the lifespan of executions and the fact that (it appears) different CommandContext instances are used for different executions within a UNION.  It might be helpful to add hashCode() and equals() to the CommandContext.  I'll share more when I get back from vacation.

                             

                            Do you mean CommandContext or ExecutionContext?  There is only a single CommandContext created for each user request, so there shouldn't be two of them.

                             

                            > I think I found a bug related to DataNotAvailableException and isForkable.  It appears that when isForkable=false, the delay specified in DataNotAvailableException is not honored.  I have seen similar behavior even when isForkable=true but it's a Heisenbug: when I add logging, it goes away.  It smells like a race condition.  Test case is attached.

                             

                            In looking at the class attached to the issue, I see an initial pause.  Can you respond over on the issue with what you are seeing?

                            • 11. Re: Some questions/observations on TEIID-2301
                              markaddleman

                              > Do you mean CommandContext or ExecutionContext?  There is only a single CommandContext created for each user request, so there shouldn't be two of them.

                              CommandContext.  I'm pretty sure I got two different instances of CommandContext for each execution when my query included a UNION:

                              SELECT 1 FROM coordinate_executions
                              UNION
                              SELECT 1 FROM delay WHERE delayMs=1000
                              WITHOUT RETURN;

                               

                              I ended up writing a CommandContext wrapper whose equals and hashCode behavior was based on request id so I could tie together state across the entire command.  I doubt I'll have time to experiment further until I get back from vacation in about a week.
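
                              For reference, the wrapper is roughly this; it assumes a request id accessor on CommandContext (getRequestId() here):

                              import org.teiid.CommandContext;

                              public class CommandContextKey {

                                  private final CommandContext context;
                                  private final String requestId;

                                  public CommandContextKey(CommandContext context) {
                                      this.context = context;
                                      this.requestId = context.getRequestId();
                                  }

                                  public CommandContext getContext() {
                                      return context;
                                  }

                                  @Override
                                  public boolean equals(Object other) {
                                      if (!(other instanceof CommandContextKey)) {
                                          return false;
                                      }
                                      // identity is the request id, so two CommandContext
                                      // clones from the same user request compare equal
                                      return requestId.equals(((CommandContextKey) other).requestId);
                                  }

                                  @Override
                                  public int hashCode() {
                                      return requestId.hashCode();
                                  }
                              }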

                              • 12. Re: Some questions/observations on TEIID-2301
                                shawkins

                                > I ended up writing a CommandContext wrapper whose equals and hashCode behavior was based on request id so I could tie together state across the entire command.  I doubt I'll have time to experiment further until I get back from vacation in about a week.

                                 

                                After thinking about it, yes you can have two different instances, but they will represent the same user request.  As needed we create clones that share all of the higher level state of the request.  I guess the question is why do you need them to be the same instance?  Does it just offer some convenience over using the request id?

                                • 13. Re: Some questions/observations on TEIID-2301
                                  markaddleman

                                  > Does it just offer some convenience over using the request id?

                                   

                                  Yes, it's purely convenience.  Essentially, this is due to how I structured my code around using a Guava loading cache as a map of request id -> state.  When the cache entry is initialized, I attach a command listener to the first command instance seen by the cache, but I must tie all future cache requests to the same request id.  Thus, I want the cache keyed by request id, but I must have access to the command in order to attach the listener.
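
                                  Concretely, the structure is roughly this.  Guava's Cache.get(key, Callable) is real; the command listener hook and its commandClosed signature are written from memory and may be off; RequestState stands in for our per-request state:

                                  import java.util.concurrent.Callable;
                                  import java.util.concurrent.ExecutionException;
                                  import com.google.common.cache.Cache;
                                  import com.google.common.cache.CacheBuilder;
                                  import org.teiid.CommandContext;
                                  import org.teiid.CommandListener;

                                  public class RequestStates {

                                      public static class RequestState {
                                          // our per-request coordination state lives here
                                      }

                                      private final Cache<String, RequestState> states =
                                              CacheBuilder.newBuilder().build();

                                      public RequestState stateFor(final CommandContext context)
                                              throws ExecutionException {
                                          // keyed by request id, but initialized from the first
                                          // CommandContext instance seen, which is the one that
                                          // gets the cleanup listener
                                          return states.get(context.getRequestId(),
                                                  new Callable<RequestState>() {
                                              @Override
                                              public RequestState call() {
                                                  attachCleanupListener(context);
                                                  return new RequestState();
                                              }
                                          });
                                      }

                                      private void attachCleanupListener(final CommandContext context) {
                                          // listener API as I recall it - the intent is just:
                                          // when the command closes, drop the cache entry
                                          context.addListener(new CommandListener() {
                                              @Override
                                              public void commandClosed(CommandContext ctx) {
                                                  states.invalidate(ctx.getRequestId());
                                              }
                                          });
                                      }
                                  }

                                  I use Cache.get(key, Callable) rather than a CacheLoader because the loader needs the CommandContext, not just the request id key.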