      • 15. Re: Changes on the JBM stack on JBoss5
        brian.stansberry

        This specific case uses 3 peers connecting two channels each, with the channels not sharing a transport. One is UDP, one is TCP.

        My recommendation is to ignore the 2 channel part, since they are not using a shared transport. So, test multiple peers (i.e. > 2) with a UDP config matching the "jbm-control" config shown at http://viewvc.jboss.org/cgi-bin/viewvc.cgi/messaging/branches/Branch_1_4/tests/etc/server/default/deploy/mock-channelfactory-stacks.xml?revision=5284&view=markup

        The call pattern is connect, then getState.

        Recommend isolating the problem using that setup; once that is done testing can be expanded to include permutations like connect+state transfer and shared transport.
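
        For reference, the connect-then-getState pattern in plain JGroups 2.6 looks roughly like the sketch below. The stack file name and timeout are placeholders rather than JBM's actual values; the group name matches the one visible in the logs later in this thread.

            import org.jgroups.JChannel;
            import org.jgroups.ReceiverAdapter;

            public class ConnectThenGetState {
                public static void main(String[] args) throws Exception {
                    // Placeholder stack file; JBM's real "jbm-control" stack is in the XML linked above.
                    JChannel channel = new JChannel("udp-control.xml");
                    // A Receiver must supply/accept state for getState() to be useful.
                    channel.setReceiver(new ReceiverAdapter() {
                        public byte[] getState() { return new byte[0]; }      // serialize state here
                        public void setState(byte[] state) { /* apply received state */ }
                    });
                    channel.connect("JBM_GRP1-CTRL");            // runs one FLUSH round
                    boolean ok = channel.getState(null, 10000);  // runs a second FLUSH round; null = ask coordinator
                    System.out.println("state transfer " + (ok ? "ok" : "skipped or failed"));
                    channel.close();
                }
            }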

        • 16. Re: Changes on the JBM stack on JBoss5
          clebert.suconic

          We just got this log on Hudson, after enabling DEBUG logging for JGroups:


          11:52:15,344 DEBUG @RMI TCP Connection(2)-10.18.1.111 [GMS] changed role to org.jgroups.protocols.pbcast.ClientGmsImpl
          11:52:15,351 DEBUG @RMI TCP Connection(2)-10.18.1.111 [FRAG2] frag_size=60000, overhead=200, new frag_size=59800
          11:52:15,367 DEBUG @RMI TCP Connection(2)-10.18.1.111 [FRAG2] received CONFIG event: {bind_addr=soa8.qa.atl2.redhat.com/10.18.1.111}
          11:52:15,367 DEBUG @RMI TCP Connection(2)-10.18.1.111 [UDP] received CONFIG event: {bind_addr=soa8.qa.atl2.redhat.com/10.18.1.111}
          11:52:15,369 INFO @RMI TCP Connection(2)-10.18.1.111 [JChannel] JGroups version: 2.6.5.GA
          11:52:15,379 DEBUG @RMI TCP Connection(2)-10.18.1.111 [GMS] changed role to org.jgroups.protocols.pbcast.ClientGmsImpl
          11:52:15,379 DEBUG @RMI TCP Connection(2)-10.18.1.111 [TCP] received CONFIG event: {bind_addr=soa8.qa.atl2.redhat.com/10.18.1.111}
          11:52:15,394 DEBUG @RMI TCP Connection(2)-10.18.1.111 [ThreadPool] created a pool of 128 threads
          11:52:15,395 DEBUG @RMI TCP Connection(2)-10.18.1.111 [UDP] received CONFIG event: {flush_supported=true}
          11:52:15,395 DEBUG @RMI TCP Connection(2)-10.18.1.111 [FRAG2] received CONFIG event: {flush_supported=true}
          11:52:15,399 DEBUG @RMI TCP Connection(2)-10.18.1.111 [UDP] creating sockets and starting threads
          11:52:15,399 DEBUG @RMI TCP Connection(2)-10.18.1.111 [UDP] sockets will use interface 10.18.1.111
          11:52:15,401 DEBUG @RMI TCP Connection(2)-10.18.1.111 [UDP] socket information:
          local_addr=10.18.1.111:38938, mcast_addr=228.11.11.11:45688, bind_addr=soa8.qa.atl2.redhat.com/10.18.1.111, ttl=2
          sock: bound to 10.18.1.111:38938, receive buffer size=131071, send buffer size=131071
          mcast_sock: bound to 10.18.1.111:45688, send buffer size=131071, receive buffer size=131071
          11:52:15,406 DEBUG @RMI TCP Connection(2)-10.18.1.111 [UDP] created unicast receiver thread Thread[UDP ucast,10.18.1.111:38938,5,JGroups]
          11:52:15,412 DEBUG @RMI TCP Connection(2)-10.18.1.111 [UDP] created multicast receiver thread Thread[UDP mcast,10.18.1.111:38938,10,JGroups]
          11:52:15,420 DEBUG @RMI TCP Connection(2)-10.18.1.111 [GMS] initial_mbrs are [[own_addr=10.18.1.111:38936, coord_addr=10.18.1.111:38936, is_server=true]]
          11:52:15,420 DEBUG @RMI TCP Connection(2)-10.18.1.111 [GMS] election results: {10.18.1.111:38936=1}
          11:52:15,420 DEBUG @RMI TCP Connection(2)-10.18.1.111 [GMS] sending handleJoin(10.18.1.111:38938) to 10.18.1.111:38936
          11:52:15,493 DEBUG @RMI TCP Connection(2)-10.18.1.111 [GMS] [10.18.1.111:38938]: JoinRsp=[10.18.1.111:38936|2] [10.18.1.111:38936, 10.18.1.111:38937, 10.18.1.111:38938]
          11:52:15,494 DEBUG @RMI TCP Connection(2)-10.18.1.111 [GMS] new_view=[10.18.1.111:38936|2] [10.18.1.111:38936, 10.18.1.111:38937, 10.18.1.111:38938]
           11:52:15,494 DEBUG @RMI TCP Connection(2)-10.18.1.111 [GMS] [local_addr=10.18.1.111:38938] view is [10.18.1.111:38936|2] [10.18.1.111:38936, 10.18.1.111:38937, 10.18.1.111:38938]
           11:52:15,497 DEBUG @Timer-1,10.18.1.111:38938 [FD_SOCK] VIEW_CHANGE received: [10.18.1.111:38936, 10.18.1.111:38937, 10.18.1.111:38938]
           11:52:15,499 DEBUG @FD_SOCK pinger,JBM_GRP1-CTRL,10.18.1.111:38938 [FD_SOCK] determinePingDest()=10.18.1.111:38936
           11:52:15,500 DEBUG @FD_SOCK pinger,JBM_GRP1-CTRL,10.18.1.111:38938 [FD_SOCK] ping_dest=10.18.1.111:38936, ping_sock=Socket[addr=/10.18.1.111,port=51107,localport=37613], cache={10.18.1.111:38937=10.18.1.111:59241, 10.18.1.111:38938=10.18.1.111:55796, 10.18.1.111:38936=10.18.1.111:51107}
           11:52:15,500 INFO @RMI TCP Connection(2)-10.18.1.111 [GroupMember] org.jboss.messaging.core.impl.postoffice.GroupMember$ControlMembershipListener@425eb9 got new view [10.18.1.111:38936|2] [10.18.1.111:38936, 10.18.1.111:38937, 10.18.1.111:38938], old view is null
           11:52:15,500 INFO @RMI TCP Connection(2)-10.18.1.111 [GroupMember] I am (10.18.1.111:38938)
           11:52:15,500 DEBUG @RMI TCP Connection(2)-10.18.1.111 [MessagingPostOffice] org.jboss.messaging.core.impl.postoffice.MessagingPostOffice@e5f46e: 10.18.1.111:38936 joined
           11:52:15,500 DEBUG @RMI TCP Connection(2)-10.18.1.111 [MessagingPostOffice] org.jboss.messaging.core.impl.postoffice.MessagingPostOffice@e5f46e: 10.18.1.111:38937 joined
           11:52:15,501 DEBUG @RMI TCP Connection(2)-10.18.1.111 [MessagingPostOffice] org.jboss.messaging.core.impl.postoffice.MessagingPostOffice@e5f46e: 10.18.1.111:38938 joined
           11:52:15,501 INFO @RMI TCP Connection(2)-10.18.1.111 [GroupMember] New Members : 3 ([10.18.1.111:38936, 10.18.1.111:38937, 10.18.1.111:38938])
           11:52:15,501 INFO @RMI TCP Connection(2)-10.18.1.111 [GroupMember] All Members : 3 ([10.18.1.111:38936, 10.18.1.111:38937, 10.18.1.111:38938])
           11:52:15,501 DEBUG @RMI TCP Connection(2)-10.18.1.111 [FLUSH] Installing view at 10.18.1.111:38938 view is [10.18.1.111:38936|2] [10.18.1.111:38936, 10.18.1.111:38937, 10.18.1.111:38938]
           11:52:15,502 DEBUG @RMI TCP Connection(2)-10.18.1.111 [GMS] 10.18.1.111:38938 changed role to org.jgroups.protocols.pbcast.ParticipantGmsImpl
           11:52:15,503 DEBUG @Incoming-4,10.18.1.111:38938 [FLUSH] At 10.18.1.111:38938 received STOP_FLUSH, unblocking FLUSH.down() and sending UNBLOCK up
           11:52:15,504 DEBUG @RMI TCP Connection(2)-10.18.1.111 [FLUSH] Received Event[type=SUSPEND, arg=null] at 10.18.1.111:38938. Running FLUSH...
           11:52:15,505 DEBUG @RMI TCP Connection(2)-10.18.1.111 [FLUSH] Flush coordinator 10.18.1.111:38938 is starting FLUSH with participants [10.18.1.111:38936, 10.18.1.111:38937, 10.18.1.111:38938]
           11:52:15,506 DEBUG @Incoming-5,10.18.1.111:38938 [FLUSH] Received START_FLUSH at 10.18.1.111:38938 responded with FLUSH_COMPLETED to 10.18.1.111:38938
           11:52:15,506 DEBUG @Incoming-7,10.18.1.111:38938 [FLUSH] At 10.18.1.111:38938 FLUSH_COMPLETED from 10.18.1.111:38938,completed false,flushCompleted [10.18.1.111:38938]
           11:52:15,508 DEBUG @Incoming-8,10.18.1.111:38938 [FLUSH] At 10.18.1.111:38938 FLUSH_COMPLETED from 10.18.1.111:38936,completed false,flushCompleted [10.18.1.111:38936]
           11:52:18,568 DEBUG @RMI TCP Connection(2)-10.18.1.111 [FLUSH] Retrying FLUSH at 10.18.1.111:38938. Attempts left 4
           11:52:18,569 DEBUG @RMI TCP Connection(2)-10.18.1.111 [FLUSH] Flush coordinator 10.18.1.111:38938 is starting FLUSH with participants [10.18.1.111:38936, 10.18.1.111:38937, 10.18.1.111:38938]
           11:52:18,569 DEBUG @Incoming-10,10.18.1.111:38938 [FLUSH] Rejecting flush at 10.18.1.111:38938, previous flush has to finish first
           11:52:18,570 DEBUG @Incoming-10,10.18.1.111:38938 [FLUSH] Rejecting flush at 10.18.1.111:38938 to flush requester 10.18.1.111:38938 coordinator is null
           11:52:18,572 DEBUG @Incoming-14,10.18.1.111:38938 [FLUSH] At 10.18.1.111:38938 FLUSH_COMPLETED from 10.18.1.111:38937,completed false,flushCompleted [10.18.1.111:38937]
           11:52:21,501 DEBUG @Timer-7,10.18.1.111:38938 [FD] sending are-you-alive msg to 10.18.1.111:38936 (own address=10.18.1.111:38938)
           11:52:21,631 DEBUG @RMI TCP Connection(2)-10.18.1.111 [FLUSH] Retrying FLUSH at 10.18.1.111:38938. Attempts left 3
           11:52:21,632 DEBUG @RMI TCP Connection(2)-10.18.1.111 [FLUSH] Flush coordinator 10.18.1.111:38938 is starting FLUSH with participants [10.18.1.111:38936, 10.18.1.111:38937, 10.18.1.111:38938]
           11:52:21,632 DEBUG @Incoming-15,10.18.1.111:38938 [FLUSH] Rejecting flush at 10.18.1.111:38938, previous flush has to finish first
           11:52:21,632 DEBUG @Incoming-15,10.18.1.111:38938 [FLUSH] Rejecting flush at 10.18.1.111:38938 to flush requester 10.18.1.111:38938 coordinator is null
           11:52:24,694 DEBUG @RMI TCP Connection(2)-10.18.1.111 [FLUSH] Retrying FLUSH at 10.18.1.111:38938. Attempts left 2
           11:52:24,695 DEBUG @RMI TCP Connection(2)-10.18.1.111 [FLUSH] Flush coordinator 10.18.1.111:38938 is starting FLUSH with participants [10.18.1.111:38936, 10.18.1.111:38937, 10.18.1.111:38938]
           11:52:24,695 DEBUG @Incoming-20,10.18.1.111:38938 [FLUSH] Rejecting flush at 10.18.1.111:38938, previous flush has to finish first
           11:52:24,695 DEBUG @Incoming-20,10.18.1.111:38938 [FLUSH] Rejecting flush at 10.18.1.111:38938 to flush requester 10.18.1.111:38938 coordinator is null
           11:52:27,504 DEBUG @Timer-12,10.18.1.111:38938 [FD] sending are-you-alive msg to 10.18.1.111:38936 (own address=10.18.1.111:38938)
           11:52:27,758 DEBUG @RMI TCP Connection(2)-10.18.1.111 [FLUSH] Retrying FLUSH at 10.18.1.111:38938. Attempts left 1
           11:52:27,758 DEBUG @RMI TCP Connection(2)-10.18.1.111 [FLUSH] Flush coordinator 10.18.1.111:38938 is starting FLUSH with participants [10.18.1.111:38936, 10.18.1.111:38937, 10.18.1.111:38938]
           11:52:27,759 DEBUG @Incoming-6,10.18.1.111:38938 [FLUSH] Rejecting flush at 10.18.1.111:38938, previous flush has to finish first
           11:52:27,759 DEBUG @Incoming-6,10.18.1.111:38938 [FLUSH] Rejecting flush at 10.18.1.111:38938 to flush requester 10.18.1.111:38938 coordinator is null
           11:52:27,761 ERROR @RMI TCP Connection(2)-10.18.1.111 [ExceptionUtil] org.jboss.messaging.core.jmx.MessagingPostOfficeService@2d59a3 startService
           java.lang.IllegalStateException: Could not flush the cluster and proceed with state retrieaval
           at org.jgroups.JChannel.getState(JChannel.java:1041)
           at org.jgroups.JChannel.getState(JChannel.java:973)
           at org.jgroups.JChannel.getState(JChannel.java:927)
           at org.jboss.messaging.core.impl.postoffice.GroupMember.start(GroupMember.java:152)
           at org.jboss.messaging.core.impl.postoffice.MessagingPostOffice.start(MessagingPostOffice.java:357)
           at org.jboss.messaging.core.jmx.MessagingPostOfficeService.startService(MessagingPostOfficeService.java:456)
           at org.jboss.system.ServiceMBeanSupport.jbossInternalStart(ServiceMBeanSupport.java:289)
           at org.jboss.system.ServiceMBeanSupport.start(ServiceMBeanSupport.java:196)
           at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
           at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
           at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
           at java.lang.reflect.Method.invoke(Method.java:585)
           at org.jboss.mx.interceptor.ReflectedDispatcher.invoke(ReflectedDispatcher.java:155)
           at org.jboss.mx.server.Invocation.dispatch(Invocation.java:94)
           at org.jboss.mx.interceptor.AbstractInterceptor.invoke(AbstractInterceptor.java:133)
           at org.jboss.mx.server.Invocation.invoke(Invocation.java:88)
           at org.jboss.mx.interceptor.ModelMBeanOperationInterceptor.invoke(ModelMBeanOperationInterceptor.java:142)
           at org.jboss.mx.server.Invocation.invoke(Invocation.java:88)
           at org.jboss.mx.server.AbstractMBeanInvoker.invoke(AbstractMBeanInvoker.java:264)
           at org.jboss.mx.server.MBeanServerImpl.invoke(MBeanServerImpl.java:659)
           at org.jboss.test.messaging.tools.container.ServiceContainer.invoke(ServiceContainer.java:712)
           at org.jboss.test.messaging.tools.container.LocalTestServer.startServerPeer(LocalTestServer.java:349)
           at org.jboss.test.messaging.tools.container.LocalTestServer.start(LocalTestServer.java:137)
           at org.jboss.test.messaging.tools.container.RMITestServer.start(RMITestServer.java:146)
           at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
           at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
           at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
           at java.lang.reflect.Method.invoke(Method.java:585)
           at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:294)
           at sun.rmi.transport.Transport$1.run(Transport.java:153)
           at java.security.AccessController.doPrivileged(Native Method)
           at sun.rmi.transport.Transport.serviceCall(Transport.java:149)
           at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:466)
           at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:707)
           at java.lang.Thread.run(Thread.java:595)


          • 17. Re: Changes on the JBM stack on JBoss5
            brian.stansberry

            Vlad,

            You have a more legible version of that in your e-mail. ;-)

            If the problem here isn't obvious to you once you review the logs, please initially consider whether setting loopback="true" on the UDP transport could be relevant. It was "false" in JBM's previous config. Restoring "false" is one tweak they could make in their (immediately) pending 1.x release.
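
             For context, loopback is an attribute of the UDP transport protocol. In old-style plain-props form the relevant fragment sits like this (a sketch only: the addresses are copied from the log above, and the protocol list is abbreviated, not JBM's full stack):

                 // Illustrative only: the loopback attribute lives on the UDP transport element.
                 String props =
                     "UDP(bind_addr=10.18.1.111;mcast_addr=228.11.11.11;mcast_port=45688;loopback=true):" +
                     "PING:MERGE2:FD_SOCK:FD:VERIFY_SUSPECT:" +
                     "pbcast.NAKACK:UNICAST:pbcast.STABLE:FRAG2:pbcast.GMS:pbcast.FLUSH";
                 JChannel channel = new JChannel(props);

             With loopback="true" a member's own multicasts are handed straight back up its stack instead of being read off the multicast socket.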

            • 18. Re: Changes on the JBM stack on JBoss5
              clebert.suconic

              I just started a run on hudson with loopback changed.

               If it runs OK, that will increase the chances that the problem is the loopback setting, but I would need another run to make sure. (This is intermittent, and the testsuite can sometimes pass even with the bug present.)

              • 19. Re: Changes on the JBM stack on JBoss5
                clebert.suconic

                 I got the same failure with loopback = false.

                • 20. Re: Changes on the JBM stack on JBoss5
                  vblagojevic

                   I was able to recreate this in my own bare JGroups tests. I made a test where 4 members join and each does a separate getState, and the test repeats this 10 times. With loopback true, at least one of these 10 runs fails; with loopback false the test behaves much better. In outline the test does something like the sketch below.
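
                   A minimal sketch of that scenario (stack file, group name and timeout are illustrative placeholders; imports as in the sketch in reply 15):

                       // 10 repetitions of: 4 members join one by one, each doing a
                       // separate getState() right after connect().
                       for (int run = 0; run < 10; run++) {
                           JChannel[] members = new JChannel[4];
                           for (int i = 0; i < members.length; i++) {
                               members[i] = new JChannel("udp.xml");    // illustrative stack file
                               members[i].setReceiver(new ReceiverAdapter() {
                                   public byte[] getState() { return new byte[0]; }
                                   public void setState(byte[] state) { }
                               });
                               members[i].connect("flush-test");
                               members[i].getState(null, 10000);        // each call runs a FLUSH round
                           }
                           for (int i = members.length - 1; i >= 0; i--) {
                               members[i].close();
                           }
                       }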

                   I have isolated the problem and I am testing a fix that should handle it regardless of loopback being turned off or on!


                  • 21. Re: Changes on the JBM stack on JBoss5
                    clebert.suconic

                     I don't think this is related to the loopback, though?

                    I had this error with and without the loopback on (in a test I did last week).

                    • 22. Re: Changes on the JBM stack on JBoss5
                      vblagojevic

                      Ok, might be. It could also be increased concurrency of channel startup in AS.

                       Unfortunately, we don't currently have a way to test concurrent shared-transport startup extensively in JGroups itself, so AS tests are our best option. Why don't you guys make a test that repeatedly does what Brian suggested: join a few channels on two AS nodes, where each node uses a shared transport and each channel does either a combined connect-and-state-transfer or a separate connect and getState (the two variants are sketched below). I hope to have a fix soon.
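
                       For reference, the two variants in JGroups 2.6 look roughly like this (a sketch: stack file, group names and timeouts are placeholders, and the shared transport would come from a singleton_name attribute on the transport in the real stack files):

                           // Variant 1: separate connect and getState (two FLUSH rounds).
                           // (A Receiver supplying get/setState would be installed as in the earlier sketches.)
                           JChannel c1 = new JChannel("shared-udp.xml");
                           c1.connect("group-A");
                           c1.getState(null, 10000);

                           // Variant 2: combined connect + state transfer in a single call (one FLUSH round).
                           JChannel c2 = new JChannel("shared-udp.xml");
                           c2.connect("group-B", null, null, 10000); // null target = coordinator, null state_id = full state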

                      Vladimir

                      • 23. Re: Changes on the JBM stack on JBoss5
                        clebert.suconic

                        I was assuming you replicated the issue...

                        I was just saying this could also happen without the loopback.

                        Maybe Brian wants to add a test similar to yours on JBAS to avoid regressions? Well... I don't know; if you keep that test in the JGroups testsuite, maybe we don't need to do anything?

                        • 24. Re: Changes on the JBM stack on JBoss5
                          brian.stansberry


                          "vblagojevic@jboss.com" wrote:
                          Ok, might be. It could also be increased concurrency of channel startup in AS.


                          These failures are not occurring in the AS. AFAIK, there is no concurrent startup of channels involved here (unless the JBM testsuite is trying to speed up the process by parallelizing the start of the nodes, but I didn't see any evidence of that).

                          • 25. Re: Changes on the JBM stack on JBoss5
                            clebert.suconic

                            We use two channels, Data and Control, and we don't do anything special with them, besides calling getState right after the connect on the control (UDP) channel.

                            • 26. Re: Changes on the JBM stack on JBoss5
                              vblagojevic

                              Hey guys,

                              Are you making any progress on that AS test? I have completed a fix for FLUSH and it looks very good. I have run over a hundred tests with loopback both off and on, and all of them pass - not a single failure.

                              How should we proceed? Clebert, do you want me to give you a jar file so you can give it a spin? If it looks good we can soon proceed to cut 2.6.7.

                              Let me know,
                              Vladimir

                              • 27. Re: Changes on the JBM stack on JBoss5
                                brian.stansberry

                                Sorry, there's not going to be any test of this in the AS; I have no bandwidth, and TBH I don't see why tests of core JGroups functionality belong in the AS testsuite.

                                Recommend you follow the approach you suggested of providing Clebert a jar and seeing if your fix resolves the problem he was seeing.

                                • 28. Re: Changes on the JBM stack on JBoss5
                                  clebert.suconic

                                  I'm running a test on Hudson:

                                  http://hudson.qa.jboss.com/hudson/view/JBM1.4 (JBoss5)/job/JBMS-clustering-mysql-msgci1/62

                                  (Internal RedHat link... sorry for the outsiders :-) )

                                  If this test passes, I will start a test in a loop (about 200 runs) to see if the issue is gone.


                                  And BTW: what was the conclusion on the loopback thing? Should I use it or not?

                                  • 29. Re: Changes on the JBM stack on JBoss5
                                    brian.stansberry

                                    Use loopback="true"; I'll post details as to why on the Design of Clustering forum in a while and add a link to the post here.