7 Replies Latest reply on Aug 14, 2007 2:27 AM by Ben Wang

    JBAS-4574 and JBAS-1476

    Adrian Brock Master

      As mentioned on JBAS-4574, you need to discuss things on the forums
      before opening JIRA issues. This avoids wrong conclusions and hacks
      to work around problems (JBAS-1476) when the original problem should be fixed.

      On JBAS-4574 Ben appears to claim (it's not actually clear what he claims)
      that the problem is that the naming context is getting cached.
      This is irrelevant. For HAJNDI the naming context uses the cluster view
      for the transport (regardless of how many naming contexts you create).

      It is the fact that it is using the cluster view from a different cluster instance (JBAS-1476)
      that is the problem. It doesn't update the view because the "view change" is less
      than it previously was.

      That's not to say that there isn't a case for flushing the cached naming context
      on a naming CommunicationException and transparently recreating it,
      but that doesn't help in this case since the underlying problem will not be fixed.
      And such a change would require significant testing.

        • 1. Re: JBAS-4574 and JBAS-1476
          Brian Stansberry Master

          I'm not going to get into any issues related to SFSBs, etc., as my understanding of the issue from the customer test Ben showed me is that it was related to caching of the HA-JNDI proxy in the static org.jnp.interfaces.NamingContext.haServers map. If there's more to it beyond that, I'll let Ben comment.

          I have a simple unit test that shows the issue. I haven't checked it in this evening because the test deploys an instance of org.jboss.naming.NamingService and I'm concerned that might screw up the AS in some way.

          But, here's essentially what the test does:

          Properties env = new Properties();
          env.setProperty("java.naming.provider.url", namingURL);
          
          Context ctx1 = new InitialContext(env);
          assertEquals("VALUE", ctx1.lookup("NamingRestartBinding"));
          
          // HOLD ONTO REF to ctx1 so the weak ref to its Naming stub does
          // not get gc'ed from the static map in org.jnp.interfaces.NamingContext.
          
          // Redeploy the local and HA naming services
          redeploy("naming-restart.sar");
          
          Context ctx2 = new InitialContext(env);
          try
          {
             // This lookup will fail
             assertEquals(ObjectBinder.VALUE, ctx2.lookup(ObjectBinder.NAME));
          }
          catch (NamingException e)
          {
             log.error("Caught NamingException", e);
             fail(e.getMessage());
          }


          The test deploys both an alternate local JNDI and an alternate HA-JNDI. (I figure bouncing the real services is not very friendly to other tests ;) ) The test fails when I run it against the HA-JNDI service; it passes with regular JNDI.

          When I look into it in detail, the failure mode is clear. The lookup by ctx1 results in a naming proxy being cached. The server is restarted, so the RMI stub in the cached proxy no longer matches the one exported by the server. When ctx2 does a lookup, the cached proxy is used and the call fails with "java.rmi.NoSuchObjectException: no such object in table". I see no indication that the failure has anything at all to do with the correctness or incorrectness of the viewId.
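
          The caching behaviour described here can be modelled in a few lines. This is only a simplified sketch, not the actual org.jnp.interfaces.NamingContext code: the class name ProxyCacheSketch, its methods, and the placeholder URL are assumptions made for illustration. It shows why a second context reuses the first one's proxy for as long as something still holds a strong reference to it.

```java
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical model of a static, weakly referenced proxy cache keyed by provider URL.
final class ProxyCacheSketch {
    private static final Map<String, WeakReference<Object>> CACHE = new ConcurrentHashMap<>();

    // Returns the cached proxy for the URL if it is still reachable, otherwise creates one.
    static Object getOrCreate(String url, Supplier<Object> factory) {
        WeakReference<Object> ref = CACHE.get(url);
        Object proxy = (ref == null) ? null : ref.get();
        if (proxy == null) {
            proxy = factory.get();
            CACHE.put(url, new WeakReference<>(proxy));
        }
        return proxy;
    }

    // Drops the cached entry so the next caller gets a fresh proxy.
    static void flush(String url) {
        CACHE.remove(url);
    }

    public static void main(String[] args) {
        String url = "jnp://localhost:1100"; // placeholder URL, an assumption
        Object first = getOrCreate(url, Object::new);   // plays the role of ctx1
        Object second = getOrCreate(url, Object::new);  // plays the role of ctx2
        System.out.println(first == second);            // true: the cached proxy is reused
        flush(url);
        Object third = getOrCreate(url, Object::new);
        System.out.println(first == third);             // false: a fresh proxy was created
    }
}
```

          In this model the server restart is not represented at all; the point is only that nothing on the client side replaces the cached entry until it is explicitly flushed or garbage collected.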

          If I let the test continue after the failure and do another lookup with ctx2, it succeeds, since the failure flushes the stale proxy out of the haServers cache.
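
          Since a second lookup succeeds once the stale proxy has been flushed, a client could paper over the failure by retrying. The following is a minimal sketch of that idea, not a fix for the underlying problem. The class RetryingLookup and the Lookup interface are hypothetical, and it assumes the failure reaches the caller as a javax.naming.NamingException, as it does in the test above.

```java
import javax.naming.CommunicationException;
import javax.naming.NamingException;

// Hypothetical client-side helper: retry a JNDI operation a bounded number of times.
final class RetryingLookup {
    // A lookup that may fail with a NamingException, e.g. () -> ctx.lookup(name).
    interface Lookup<T> {
        T run() throws NamingException;
    }

    static <T> T lookupWithRetry(Lookup<T> lookup, int maxAttempts) throws NamingException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        NamingException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return lookup.run();
            } catch (NamingException e) {
                last = e; // remember the failure and try again
            }
        }
        throw last; // every attempt failed
    }

    public static void main(String[] args) throws NamingException {
        // Simulated lookup: fails once (stale proxy), then succeeds.
        final int[] calls = {0};
        String value = lookupWithRetry(() -> {
            if (calls[0]++ == 0) {
                throw new CommunicationException("simulated stale proxy");
            }
            return "VALUE";
        }, 2);
        System.out.println(value + " after " + calls[0] + " attempts");
    }
}
```

          Against a real server the lambda would wrap a call such as ctx2.lookup(ObjectBinder.NAME). Retrying on every NamingException is deliberately crude; a real client would probably want to skip the retry for failures like NameNotFoundException.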

          The interesting thing is the test passes with regular JNDI. Not sure at this point why. In both cases the call uses RMI. With regular JNDI a simple RMI stub is used; with HA-JNDI the RMI stub is encapsulated in an HARMIClient.

          Part of the problem here is the use of RMI for the HA-JNDI transport. If Remoting's socket transport were used, bouncing the server would not invalidate the client-side InvokerLocator.

          [OT] Re: 'the "view change" is less than it previously was' after a cluster restart. No. The viewId passed between the server and HA clients is not a counter. It is a hash of the service's cluster topology.
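
          If the viewId is a hash, then ordering comparisons such as "less than" carry no meaning and only equality says anything about the topology. A rough sketch of that distinction follows. The hashing used here (List.hashCode over member addresses) and the names ViewIdSketch, viewId, and topologyChanged are assumptions for illustration, not the actual JBoss computation.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of a hash-based view id.
final class ViewIdSketch {
    // Derive an id from the ordered list of member addresses (assumed scheme).
    static long viewId(List<String> members) {
        return members.hashCode();
    }

    // With a hash, the only meaningful question is whether the ids differ.
    static boolean topologyChanged(long clientViewId, long serverViewId) {
        return clientViewId != serverViewId;
    }

    public static void main(String[] args) {
        long before = viewId(Arrays.asList("node1:1099", "node2:1099"));
        long after = viewId(Arrays.asList("node1:1099"));
        System.out.println(topologyChanged(before, after));  // true
        System.out.println(topologyChanged(before, before)); // false
    }
}
```

          Whether the id after a restart is numerically larger or smaller than the old one is an accident of the hash, so a check based on ordering would behave unpredictably.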

          • 2. Re: JBAS-4574 and JBAS-1476
            Ben Wang Master

            Basically, this is most of it. We will also need to fix the retry interceptor logic to flush out the cached server info so that a subsequent JNDI lookup will succeed.

            • 3. Re: JBAS-4574 and JBAS-1476
              Adrian Brock Master

              Brian's comment just confirms to me that the problem is still not understood,
              making JIRA issues to fix "random stuff" premature.

              When you know why JNP/RMI works (assuming it really does; it sounds like it
              shouldn't? :-) ) and JNP/HARMI doesn't work, then you'll know what the fix is.

              • 4. Re: JBAS-4574 and JBAS-1476
                Adrian Brock Master

                 

                "bstansberry@jboss.com" wrote:

                [OT] Re: 'the "view change" is less than it previously was' after a cluster restart. No. The viewId passed between the server and HA clients is not a counter. It is a hash of the service's cluster topology.


                Nice fix. You just broke all the old clients that assume it's a counter.

                • 5. Re: JBAS-4574 and JBAS-1476
                  Brian Stansberry Master

                   

                  "adrian@jboss.org" wrote:
                  "bstansberry@jboss.com" wrote:

                  [OT] Re: 'the "view change" is less than it previously was' after a cluster restart. No. The viewId passed between the server and HA clients is not a counter. It is a hash of the service's cluster topology.


                  Nice fix. You just broke all the old clients that assume it's a counter.


                  What fix? I never changed anything; this is the way viewId has worked in any clustering code I've looked at. I have no idea where this counter concept you have comes from; maybe some early version. It's not the way it works at least since 4.0.3 (and I didn't touch it there).

                  • 6. Re: JBAS-4574 and JBAS-1476
                    Adrian Brock Master

                     

                    "bstansberry@jboss.com" wrote:
                    "adrian@jboss.org" wrote:
                    "bstansberry@jboss.com" wrote:

                    [OT] Re: 'the "view change" is less than it previously was' after a cluster restart. No. The viewId passed between the server and HA clients is not a counter. It is a hash of the service's cluster topology.


                    Nice fix. You just broke all the old clients that assume it's a counter.


                    What fix? I never changed anything; this is the way viewId has worked in any clustering code I've looked at. I have no idea where this counter concept you have comes from; maybe some early version. It's not the way it works at least since 4.0.3 (and I didn't touch it there).


                    You calling me an old timer? :-)

                    My guess is Sacha changed it when he refactored the whole thing to do his
                    cluster family stuff?
                    Either that or I'm just going senile. ;-)

                    • 7. Re: JBAS-4574 and JBAS-1476
                      Ben Wang Master

                      I have been traveling, so I only found the time to look into this over the last couple of days. As I mentioned, the first part comes from the JNDI cache problem. I have posted a forum topic on it and proposed a fix.
                      http://www.jboss.com/index.html?module=bb&op=viewtopic&p=4073831#4073831

                      Please take a look and see whether there are any other issues with it.

                      -Ben