50 Replies · Latest reply on Nov 18, 2008 12:43 PM by vblagojevic

    Changes on the JBM stack on JBoss5

    clebert.suconic

      Brian,

      I have just taken the configuration you changed on JBoss5 into our clustered-testsuite.

      from here:

      http://anonsvn.jboss.org/repos/jbossas/trunk/cluster/src/resources/jgroups/jgroups-channelfactory-stacks.xml


      into here:


      http://viewvc.jboss.org/cgi-bin/viewvc.cgi/messaging/branches/Branch_1_4/tests/etc/server/default/deploy/mock-channelfactory-stacks.xml?r1=4985&r2=5284


      and I started seeing these errors on our testsuite:


      SomeException:
      ...
      Caused by: java.lang.IllegalStateException: Could not flush the cluster and proceed with state retrieval
       at org.jgroups.JChannel.getState(JChannel.java:1041)
       at org.jgroups.JChannel.getState(JChannel.java:973)
       at org.jgroups.JChannel.getState(JChannel.java:927)
       at org.jboss.messaging.core.impl.postoffice.GroupMember.start(GroupMember.java:152)
      ...
      
      



      Any idea why this would happen?

      Also, any particular reason why you needed to change the config on the jboss5 tree? If you did, maybe you could help us with a better config as this one is clearly not working.

      At this point I would like to release with the config we have tested so far.


      Thanks

        • 1. Re: Changes on the JBM stack on JBoss5
          brian.stansberry

          Please send me the logs showing the full startup that results in the error.

          What JGroups version are you testing against? AS is using 2.6.6.CR1 (2.6.6.GA will be released tomorrow); 2.6.5.GA is fine too. I don't know of any reason why a fairly recent earlier version would give you trouble though.

          I'll post separately about the differences between your r4985 and r5284.

          • 2. Re: Changes on the JBM stack on JBoss5
            clebert.suconic

            I'm using 2.6.5.GA, as I thought that was the version available at JBoss5.


            It is hard to replicate this error. I have to run the whole testsuite (about 2-3 hours) and it will happen eventually.

            • 3. Re: Changes on the JBM stack on JBoss5
              brian.stansberry

              The differences in the control channel config:

              UDP.singleton_name. In the AS this instance of the UDP transport protocol is shared across numerous channels. Giving the transport a name is what allows this to work.

              UDP.mcast_addr and mcast_port. These are just the values for the shared transport rather than the old JBM-specific ones. Don't see what difference this would make.

              UDP.loopback=true. We were seeing inscrutable startup failures for the AS when this was false on machines that had improperly configured multicast. With true you get startup failures (nodes can't cluster because multicast doesn't work) but they aren't inscrutable. I'd consider changing this back if we could somehow establish it's a cause of whatever your problem is.

              UDP.enable_bundling=false is just putting in the config file the default you had before. No change.

              UDP.ip_ttl=2. Longstanding AS default value to limit multicast propagation. In most testsuites, multicast doesn't even need to propagate off the test machine, so I doubt this is your problem.

              UDP.timer.num_threads=12. Your unspecified value defaults to 4. This is because the transport is meant to be shared between different services in the AS, so the number of threads available to run timer tasks is increased.

              UDP.thread_pool.min_threads="20". Old value = 1. With thread_pool.queue_enabled="true" and thread_pool.queue_max_size="1000", once that single min_thread was carrying a message up the stack or handling it at the application level, you would have to receive 1000 more messages and fill the queue before a 2nd thread would be created in the pool to take a message off the queue. With a shared transport, it's possible those messages are for completely unrelated services; while the 1 thread is busy, say in the session replication cache, 1000 JBM messages pile up in the queue. You need a larger number of min threads to ensure threads are available to read the queue. Testing with just one showed very poor performance in multi-node clusters. I can't see why having more threads available in a pool would cause a problem.

              UDP.thread_pool.rejection_policy="discard". Was "run". You can hang the entire cluster with "run", since it allows the single thread that reads messages off the wire to end up going into code that blocks in NAKACK or UNICAST or even into arbitrary application code. With multi-node clusters in tests under load, we found it was quite easy to hang the cluster with "run".
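The two thread-pool points above can be illustrated with a small, self-contained sketch using plain java.util.concurrent rather than the JGroups pool itself (the class name and the scaled-down queue size of 5 instead of 1000 are mine, for illustration only): a pool with one core thread will not create a second thread until its bounded queue is completely full, which is why one busy thread plus a large queue stalls message delivery for everyone sharing the transport.

```java
import java.util.concurrent.*;

// Stdlib-only illustration of why thread_pool.min_threads=1 plus a bounded
// queue starves the channel: a ThreadPoolExecutor only grows past its core
// size once the queue is FULL.
public class PoolGrowthDemo {

    // Submits `tasks` blocked tasks to a pool configured like the old stack
    // (1 core thread, max 20, bounded queue) and reports the pool size.
    static int poolSizeAfterSubmitting(int tasks, int queueCapacity) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 20, 60, TimeUnit.SECONDS,
                new LinkedBlockingQueue<Runnable>(queueCapacity));
        CountDownLatch gate = new CountDownLatch(1);
        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> {
                try { gate.await(); } catch (InterruptedException ignored) { }
            });
        }
        int size = pool.getPoolSize();
        gate.countDown();   // release the blocked tasks
        pool.shutdown();
        return size;
    }

    public static void main(String[] args) {
        // 1 busy thread + 5 queued tasks: queue not yet full, still 1 thread
        System.out.println(poolSizeAfterSubmitting(6, 5)); // prints 1
        // the 7th task finds the queue full, so a 2nd thread is created
        System.out.println(poolSizeAfterSubmitting(7, 5)); // prints 2
    }
}
```

With min_threads=1 and queue_max_size=1000, the same arithmetic means 1000 undelivered messages must accumulate before a second thread appears, which matches the pile-up Brian describes.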

              UDP.oob_thread_pool.max_threads="20". See UDP.thread_pool.min_threads="20" above.

              UDP.oob_thread_pool.rejection_policy="run". Was "Run". This is just consistency in capitalization.

              FD.timeout and max_tries. With the old values, it would take 50 secs to detect a hung node. That's a long time. It was reasonable in AS 4, where the single-threaded channel and lack of an OOB thread pool made it quite possible for FD heartbeats to go unacknowledged for a long time while the single thread was busy doing something else. With the thread pool and OOB messages, there's no reason FD heartbeats should go unacknowledged for so long, so we reduced the timeout period to 30 secs.

              GMS.shun="true". Previous value of false makes no logical sense in conjunction with FD.shun="true". This was discussed on a JBM forum thread a while back.


              TBH, I don't see why any of these would cause the error you reported, but until I see more details I don't really know what the error was.

              • 4. Re: Changes on the JBM stack on JBoss5
                brian.stansberry

                 

                "clebert.suconic@jboss.com" wrote:
                I'm using 2.6.5.GA, as I thought that was the version available at JBoss5.


                Ok, that should be fine. FYI, here's what's in 2.6.6:

                https://jira.jboss.org/jira/secure/IssueNavigator.jspa?reset=true&pid=10053&fixfor=12312848

                The big one is JGRP-849. Does your testsuite concurrently start 2 control channels from the same channel factory?

                It is hard to replicate this error. I have to run the whole testsuite (about 2-3 hours) and it will happen eventually.


                But do you have logs from the failure you reported?

                • 5. Re: Changes on the JBM stack on JBoss5
                  clebert.suconic

                  I will have to find the logs... the whole testsuite takes a while to run, and the logs are huge. I should have it by tomorrow.


          Also, can you please make sure you guys also release the JGroups 2.6.6 CRs and GA to the non-Maven repo?

                  I will not hold the JBM 1 release because of this, but I will open a JIRA on JBAS for this issue.

                  • 6. Re: Changes on the JBM stack on JBoss5
                    brian.stansberry

                    I have the logs. They don't show anything helpful, at least not right around the time of the failure, which is in the server2 log at 17:04:09,119.

                    They do show this test server2 was started about 50 times before this failure occurred. So it seems quite intermittent and likely hard to track down. A key thing is understanding what exactly changed to trigger this; or, if we don't really know, acknowledging that we don't know, so we can avoid chasing the wrong thing. So,

                    1) Did this failure start appearing regularly after a certain point? Or did it just happen once?

                    2) If it's regular, was the change to the stack config the *only* change before it started happening?

                    3) When did you start testing with JGroups 2.6.5? How much testing did you do with JGroups 2.6.5 before this issue appeared?


                    I will not hold the JBM 1 release because of this, but I will open a JIRA on JBAS for this issue.


                    +1. The only thing happening when this failed was an ordinary start of a server. Your tests failed because the server didn't start properly. The AS testsuite starts servers all the time; it's perfectly reasonable to track this issue there.

                    However, if you think the problem will consistently appear over the course of a testsuite run, I may ask for some help in running your testsuite occasionally, to see if changing settings makes the issue go away.

                    • 7. Re: Changes on the JBM stack on JBoss5
                      clebert.suconic

                       

                      "Brian" wrote:

                      1) Did this failure start appearing regularly after a certain point? Or did it just happen once?

                      2) If it's regular, was the change to the stack config the *only* change before it started happening?

                      3) When did you start testing with JGroups 2.6.5? How much testing did you do with JGroups 2.6.5 before this issue appeared?


                      The testsuite started failing right after this change, which was the only change:

                      http://viewvc.jboss.org/cgi-bin/viewvc.cgi/messaging/branches/Branch_1_4/tests/etc/server/default/deploy/mock-channelfactory-stacks.xml?r1=4985&r2=5284

                      I was already using 2.6.5 before that change and all runs were successful. It happens most of the time now. It could happen on any test in the clustering-testsuite.

                      To run the clustering testsuite on JBM you need to:

                      svn co http://anonsvn.jboss.org/repos/messaging/branches/Branch_1_4 jbm-1.4
                      cd jbm-1.4
                      
                      edit build.properties
                      
                      #(make sure it says AS5)
                      
                      cd tests
                      
                      #(you may want to edit build.properties under tests as well and put a local IP there. It should run fine with localhost though, but I have used a local IP just in case)
                      
                      ant clustered-tests
                      



                      • 8. Re: Changes on the JBM stack on JBoss5
                        clebert.suconic

                         

                        "bstansberry@jboss.com" wrote:

                        They do show this test server2 was started about 50 times before this failure occurred. So it seems quite intermittent and likely hard to track down. A key thing is understanding what exactly changed to trigger this; or, if we don't really know, acknowledging that we don't know, so we can avoid chasing the wrong thing. So,



                        This is because each test on the clustered-testsuite will start/stop all the servers.

                        So I believe we could make a test that only starts/stops the servers until this issue appears.

                        We could actually do that by putting a single test to run in a loop. Under /tests/bin you will find a way of doing this. If you contact me offline I will tell you how to use that.

                        • 9. Re: Changes on the JBM stack on JBoss5
                          vblagojevic

                          Hi,

                          You guys correctly point to the source of the problem, which is a group that could not be flushed during the getState call. I first thought that it was a transient problem; however, I was confused when I saw that you are not using FLUSH in your stack, are you?

                          Anyway, the root cause is that the cluster could not be flushed, for whatever underlying reason, and therefore the state retrieval did not proceed. You have the following options:

                          a) keep things as they are and catch the exception on getState. The state will not be transferred to a joining node.
                          b) use another getState method, where the parameter useFlushIfPresent is set to false. The cluster will not be flushed for state transfer in that case.
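For illustration, the two options might look roughly like this in caller code. MockChannel is a hypothetical stand-in, not the real org.jgroups.JChannel; it only mimics the shape of the two getState overloads described above, so the calling patterns can be shown self-contained.

```java
// Hypothetical sketch of options (a) and (b). MockChannel is a stand-in for
// org.jgroups.JChannel, used only to show the two calling patterns.
public class StateRetrievalOptions {

    static class MockChannel {
        private final boolean flushFails;
        MockChannel(boolean flushFails) { this.flushFails = flushFails; }

        // Mimics getState(Address, long): flushes the cluster first.
        boolean getState(Object target, long timeout) {
            if (flushFails) {
                throw new IllegalStateException(
                        "Could not flush the cluster and proceed with state retrieval");
            }
            return true; // state transferred
        }

        // Mimics the overload taking useFlushIfPresent.
        boolean getState(Object target, long timeout, boolean useFlushIfPresent) {
            return useFlushIfPresent ? getState(target, timeout)
                                     : true; // flush skipped, cannot fail on flush
        }
    }

    // Option (a): keep flush, catch the failure; the joiner simply gets no state.
    static boolean optionA(MockChannel ch) {
        try {
            return ch.getState(null, 5000);
        } catch (IllegalStateException flushFailed) {
            return false; // no state transferred; proceed as if first member
        }
    }

    // Option (b): ask for the state without flushing the cluster.
    static boolean optionB(MockChannel ch) {
        return ch.getState(null, 5000, false);
    }

    public static void main(String[] args) {
        MockChannel flaky = new MockChannel(true); // a node whose flush fails
        System.out.println(optionA(flaky)); // false: flush failed but we survived
        System.out.println(optionB(flaky)); // true: no flush attempted
    }
}
```

Option (a) keeps FLUSH's no-messages-during-state-transfer guarantee when it works; option (b) gives up that guarantee in exchange for never failing on flush.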

                          I am not familiar with message ordering and state semantics that have to be satisfied in JMS. Are you familiar with FLUSH? There is a good summary here [1]. You might not need to flush the cluster during state transfer. Let us know.

                          Regards,
                          Vladimir

                          [1] http://www.theserverside.com/tt/articles/article.tss?l=NewFeaturesJGroups

                          • 10. Re: Changes on the JBM stack on JBoss5
                            clebert.suconic

                             

                            I am not familiar with message ordering


                            If you are saying that because of the latest testsuite run that failed on Hudson: this error doesn't have anything to do with message ordering. In the last run it happened on the message ordering test, but it could happen on any clustered test on JBM.

                            am not familiar with .... state semantics that have to be satisfied



                            When we start JBM, we always get the state from the cluster on the jbm-control channel.


                            Look at GroupMember.start method (on the JBM tree):

                             public void start() throws Exception
                             {
                                this.controlChannel = jChannelFactory.createControlChannel();
                                this.dataChannel = jChannelFactory.createDataChannel();

                                // We don't want to receive local messages on any of the channels
                                controlChannel.setOpt(Channel.LOCAL, Boolean.FALSE);
                                dataChannel.setOpt(Channel.LOCAL, Boolean.FALSE);

                                MessageListener messageListener = new ControlMessageListener();
                                MembershipListener membershipListener = new ControlMembershipListener();
                                RequestHandler requestHandler = new ControlRequestHandler();

                                dispatcher = new MessageDispatcher(controlChannel, messageListener, membershipListener, requestHandler, true);

                                Receiver dataReceiver = new DataReceiver();
                                dataChannel.setReceiver(dataReceiver);

                                starting = true;

                                controlChannel.connect(groupName + CONTROL_SUFFIX);

                                if (!((JChannel)controlChannel).flushSupported())
                                {
                                   throw new IllegalStateException("Flush is not supported on the UDP Channel, please check your JGroups UDP stack as Flush is required");
                                }

                                // The first thing that happens after connect is a view change arrives.
                                // Then the state will arrive (if we are not the first member).
                                // Then the control messages will start arriving.
                                // We can guarantee that messages won't arrive until after the state is set
                                // because we use the FLUSH protocol on the control channel.

                                boolean first = !(controlChannel.getState(null, stateTimeout));



                            So, the semantics we expect are that getState always returns the state from the cluster right after connect, guaranteed by FLUSH. Something in the change presented in the first post of this thread altered this behavior, causing the exception somehow.

                            • 11. Re: Changes on the JBM stack on JBoss5
                              brian.stansberry

                               

                              "vblagojevic@jboss.com" wrote:
                              ... I was confused when I saw that you are not using FLUSH in your stack, are you?


                              FLUSH is used. See http://viewvc.jboss.org/cgi-bin/viewvc.cgi/messaging/branches/Branch_1_4/tests/etc/server/default/deploy/mock-channelfactory-stacks.xml?revision=5284&view=markup

                              Anyway, the root cause is that the cluster could not be flushed for whatever underlying reason that might have happened and therefore the state retrieval did not proceed.


                              OK, but we need to figure out why this happens. I looked further at this one, and all that's going on is normal channel startup repeated many times over the course of a testsuite run; eventually one fails. Like you say, intermittent. My gut impression from observing various testsuite runs is that flush issues happen on this kind of normal startup quite infrequently, but still too frequently. In this particular case, the testsuite started 3 nodes about 50 times (stopping all 3 after the start); the failure occurred on the 3rd node in the ~50th cycle.

                              I think Clebert's suggestion on the forum of using some JBM test infrastructure to automate doing this over and over, without wasting time running actual JBM tests, is a good one. Do you have bandwidth to pursue such a task?

                              • 12. Re: Changes on the JBM stack on JBoss5
                                vblagojevic

                                Ok, let's pursue this one until the root cause is found. I will test flush with my usual tests on the 2.6 branch in a continuous loop until a failure like this one shows up.

                                I will test two options: one where connect and state transfer are a single call, and the other where connect and state transfer are separate calls. Do you have a better game plan or a suggestion?

                                Regards,
                                Vladimir

                                • 13. Re: Changes on the JBM stack on JBoss5
                                  vblagojevic

                                  Now that I think about it a bit more, wouldn't it be better to make a test that emulates server startup, where all channels share the underlying transport?
