      • 15. Re: Changes on the JBM stack on JBoss5
        brian.stansberry

        This specific case uses 3 peers connecting two channels each, with the channels not sharing a transport. One is UDP, one is TCP.

        My recommendation is to ignore the 2 channel part, since they are not using a shared transport. So, test multiple peers (i.e. > 2) with a UDP config matching the "jbm-control" config shown at http://viewvc.jboss.org/cgi-bin/viewvc.cgi/messaging/branches/Branch_1_4/tests/etc/server/default/deploy/mock-channelfactory-stacks.xml?revision=5284&view=markup

        The call pattern is connect, then getState.

        Recommend isolating the problem using that setup; once that is done testing can be expanded to include permutations like connect+state transfer and shared transport.
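
        For reference, the connect-then-getState pattern in plain JGroups 2.6 looks roughly like the sketch below. The stack file name and timeout are placeholders rather than JBM's actual values; the group name matches the one visible in the logs later in this thread.

            import org.jgroups.JChannel;
            import org.jgroups.ReceiverAdapter;

            public class ConnectThenGetState {
                public static void main(String[] args) throws Exception {
                    // Placeholder stack file; JBM's real "jbm-control" stack is in the XML linked above.
                    JChannel channel = new JChannel("udp-control.xml");
                    // A Receiver must supply/accept state for getState() to be useful.
                    channel.setReceiver(new ReceiverAdapter() {
                        public byte[] getState() { return new byte[0]; }      // serialize state here
                        public void setState(byte[] state) { /* apply received state */ }
                    });
                    channel.connect("JBM_GRP1-CTRL");            // runs one FLUSH round
                    boolean ok = channel.getState(null, 10000);  // runs a second FLUSH round; null = ask coordinator
                    System.out.println("state transfer " + (ok ? "ok" : "skipped or failed"));
                    channel.close();
                }
            }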

        • 16. Re: Changes on the JBM stack on JBoss5
          clebert.suconic

          We just got this log on Hudson, after enabling DEBUG logging for JGroups:


          11:52:15,344 DEBUG @RMI TCP Connection(2)-10.18.1.111 [GMS] changed role to org.jgroups.protocols.pbcast.ClientGmsImpl
          11:52:15,351 DEBUG @RMI TCP Connection(2)-10.18.1.111 [FRAG2] frag_size=60000, overhead=200, new frag_size=59800
          11:52:15,367 DEBUG @RMI TCP Connection(2)-10.18.1.111 [FRAG2] received CONFIG event: {bind_addr=soa8.qa.atl2.redhat.com/10.18.1.111}
          11:52:15,367 DEBUG @RMI TCP Connection(2)-10.18.1.111 [UDP] received CONFIG event: {bind_addr=soa8.qa.atl2.redhat.com/10.18.1.111}
          11:52:15,369 INFO @RMI TCP Connection(2)-10.18.1.111 [JChannel] JGroups version: 2.6.5.GA
          11:52:15,379 DEBUG @RMI TCP Connection(2)-10.18.1.111 [GMS] changed role to org.jgroups.protocols.pbcast.ClientGmsImpl
          11:52:15,379 DEBUG @RMI TCP Connection(2)-10.18.1.111 [TCP] received CONFIG event: {bind_addr=soa8.qa.atl2.redhat.com/10.18.1.111}
          11:52:15,394 DEBUG @RMI TCP Connection(2)-10.18.1.111 [ThreadPool] created a pool of 128 threads
          11:52:15,395 DEBUG @RMI TCP Connection(2)-10.18.1.111 [UDP] received CONFIG event: {flush_supported=true}
          11:52:15,395 DEBUG @RMI TCP Connection(2)-10.18.1.111 [FRAG2] received CONFIG event: {flush_supported=true}
          11:52:15,399 DEBUG @RMI TCP Connection(2)-10.18.1.111 [UDP] creating sockets and starting threads
          11:52:15,399 DEBUG @RMI TCP Connection(2)-10.18.1.111 [UDP] sockets will use interface 10.18.1.111
          11:52:15,401 DEBUG @RMI TCP Connection(2)-10.18.1.111 [UDP] socket information:
          local_addr=10.18.1.111:38938, mcast_addr=228.11.11.11:45688, bind_addr=soa8.qa.atl2.redhat.com/10.18.1.111, ttl=2
          sock: bound to 10.18.1.111:38938, receive buffer size=131071, send buffer size=131071
          mcast_sock: bound to 10.18.1.111:45688, send buffer size=131071, receive buffer size=131071
          11:52:15,406 DEBUG @RMI TCP Connection(2)-10.18.1.111 [UDP] created unicast receiver thread Thread[UDP ucast,10.18.1.111:38938,5,JGroups]
          11:52:15,412 DEBUG @RMI TCP Connection(2)-10.18.1.111 [UDP] created multicast receiver thread Thread[UDP mcast,10.18.1.111:38938,10,JGroups]
          11:52:15,420 DEBUG @RMI TCP Connection(2)-10.18.1.111 [GMS] initial_mbrs are [[own_addr=10.18.1.111:38936, coord_addr=10.18.1.111:38936, is_server=true]]
          11:52:15,420 DEBUG @RMI TCP Connection(2)-10.18.1.111 [GMS] election results: {10.18.1.111:38936=1}
          11:52:15,420 DEBUG @RMI TCP Connection(2)-10.18.1.111 [GMS] sending handleJoin(10.18.1.111:38938) to 10.18.1.111:38936
          11:52:15,493 DEBUG @RMI TCP Connection(2)-10.18.1.111 [GMS] [10.18.1.111:38938]: JoinRsp=[10.18.1.111:38936|2] [10.18.1.111:38936, 10.18.1.111:38937, 10.18.1.111:38938]
          11:52:15,494 DEBUG @RMI TCP Connection(2)-10.18.1.111 [GMS] new_view=[10.18.1.111:38936|2] [10.18.1.111:38936, 10.18.1.111:38937, 10.18.1.111:38938]
           11:52:15,494 DEBUG @RMI TCP Connection(2)-10.18.1.111 [GMS] [local_addr=10.18.1.111:38938] view is [10.18.1.111:38936|2] [10.18.1.111:38936, 10.18.1.111:38937, 10.18.1.111:38938]
           11:52:15,497 DEBUG @Timer-1,10.18.1.111:38938 [FD_SOCK] VIEW_CHANGE received: [10.18.1.111:38936, 10.18.1.111:38937, 10.18.1.111:38938]
           11:52:15,499 DEBUG @FD_SOCK pinger,JBM_GRP1-CTRL,10.18.1.111:38938 [FD_SOCK] determinePingDest()=10.18.1.111:38936
           11:52:15,500 DEBUG @FD_SOCK pinger,JBM_GRP1-CTRL,10.18.1.111:38938 [FD_SOCK] ping_dest=10.18.1.111:38936, ping_sock=Socket[addr=/10.18.1.111,port=51107,localport=37613], cache={10.18.1.111:38937=10.18.1.111:59241, 10.18.1.111:38938=10.18.1.111:55796, 10.18.1.111:38936=10.18.1.111:51107}
           11:52:15,500 INFO @RMI TCP Connection(2)-10.18.1.111 [GroupMember] org.jboss.messaging.core.impl.postoffice.GroupMember$ControlMembershipListener@425eb9 got new view [10.18.1.111:38936|2] [10.18.1.111:38936, 10.18.1.111:38937, 10.18.1.111:38938], old view is null
           11:52:15,500 INFO @RMI TCP Connection(2)-10.18.1.111 [GroupMember] I am (10.18.1.111:38938)
           11:52:15,500 DEBUG @RMI TCP Connection(2)-10.18.1.111 [MessagingPostOffice] org.jboss.messaging.core.impl.postoffice.MessagingPostOffice@e5f46e: 10.18.1.111:38936 joined
           11:52:15,500 DEBUG @RMI TCP Connection(2)-10.18.1.111 [MessagingPostOffice] org.jboss.messaging.core.impl.postoffice.MessagingPostOffice@e5f46e: 10.18.1.111:38937 joined
           11:52:15,501 DEBUG @RMI TCP Connection(2)-10.18.1.111 [MessagingPostOffice] org.jboss.messaging.core.impl.postoffice.MessagingPostOffice@e5f46e: 10.18.1.111:38938 joined
           11:52:15,501 INFO @RMI TCP Connection(2)-10.18.1.111 [GroupMember] New Members : 3 ([10.18.1.111:38936, 10.18.1.111:38937, 10.18.1.111:38938])
           11:52:15,501 INFO @RMI TCP Connection(2)-10.18.1.111 [GroupMember] All Members : 3 ([10.18.1.111:38936, 10.18.1.111:38937, 10.18.1.111:38938])
           11:52:15,501 DEBUG @RMI TCP Connection(2)-10.18.1.111 [FLUSH] Installing view at 10.18.1.111:38938 view is [10.18.1.111:38936|2] [10.18.1.111:38936, 10.18.1.111:38937, 10.18.1.111:38938]
           11:52:15,502 DEBUG @RMI TCP Connection(2)-10.18.1.111 [GMS] 10.18.1.111:38938 changed role to org.jgroups.protocols.pbcast.ParticipantGmsImpl
           11:52:15,503 DEBUG @Incoming-4,10.18.1.111:38938 [FLUSH] At 10.18.1.111:38938 received STOP_FLUSH, unblocking FLUSH.down() and sending UNBLOCK up
           11:52:15,504 DEBUG @RMI TCP Connection(2)-10.18.1.111 [FLUSH] Received Event[type=SUSPEND, arg=null] at 10.18.1.111:38938. Running FLUSH...
           11:52:15,505 DEBUG @RMI TCP Connection(2)-10.18.1.111 [FLUSH] Flush coordinator 10.18.1.111:38938 is starting FLUSH with participants [10.18.1.111:38936, 10.18.1.111:38937, 10.18.1.111:38938]
           11:52:15,506 DEBUG @Incoming-5,10.18.1.111:38938 [FLUSH] Received START_FLUSH at 10.18.1.111:38938 responded with FLUSH_COMPLETED to 10.18.1.111:38938
           11:52:15,506 DEBUG @Incoming-7,10.18.1.111:38938 [FLUSH] At 10.18.1.111:38938 FLUSH_COMPLETED from 10.18.1.111:38938,completed false,flushCompleted [10.18.1.111:38938]
           11:52:15,508 DEBUG @Incoming-8,10.18.1.111:38938 [FLUSH] At 10.18.1.111:38938 FLUSH_COMPLETED from 10.18.1.111:38936,completed false,flushCompleted [10.18.1.111:38936]
           11:52:18,568 DEBUG @RMI TCP Connection(2)-10.18.1.111 [FLUSH] Retrying FLUSH at 10.18.1.111:38938. Attempts left 4
           11:52:18,569 DEBUG @RMI TCP Connection(2)-10.18.1.111 [FLUSH] Flush coordinator 10.18.1.111:38938 is starting FLUSH with participants [10.18.1.111:38936, 10.18.1.111:38937, 10.18.1.111:38938]
           11:52:18,569 DEBUG @Incoming-10,10.18.1.111:38938 [FLUSH] Rejecting flush at 10.18.1.111:38938, previous flush has to finish first
           11:52:18,570 DEBUG @Incoming-10,10.18.1.111:38938 [FLUSH] Rejecting flush at 10.18.1.111:38938 to flush requester 10.18.1.111:38938 coordinator is null
           11:52:18,572 DEBUG @Incoming-14,10.18.1.111:38938 [FLUSH] At 10.18.1.111:38938 FLUSH_COMPLETED from 10.18.1.111:38937,completed false,flushCompleted [10.18.1.111:38937]
           11:52:21,501 DEBUG @Timer-7,10.18.1.111:38938 [FD] sending are-you-alive msg to 10.18.1.111:38936 (own address=10.18.1.111:38938)
           11:52:21,631 DEBUG @RMI TCP Connection(2)-10.18.1.111 [FLUSH] Retrying FLUSH at 10.18.1.111:38938. Attempts left 3
           11:52:21,632 DEBUG @RMI TCP Connection(2)-10.18.1.111 [FLUSH] Flush coordinator 10.18.1.111:38938 is starting FLUSH with participants [10.18.1.111:38936, 10.18.1.111:38937, 10.18.1.111:38938]
           11:52:21,632 DEBUG @Incoming-15,10.18.1.111:38938 [FLUSH] Rejecting flush at 10.18.1.111:38938, previous flush has to finish first
           11:52:21,632 DEBUG @Incoming-15,10.18.1.111:38938 [FLUSH] Rejecting flush at 10.18.1.111:38938 to flush requester 10.18.1.111:38938 coordinator is null
           11:52:24,694 DEBUG @RMI TCP Connection(2)-10.18.1.111 [FLUSH] Retrying FLUSH at 10.18.1.111:38938. Attempts left 2
           11:52:24,695 DEBUG @RMI TCP Connection(2)-10.18.1.111 [FLUSH] Flush coordinator 10.18.1.111:38938 is starting FLUSH with participants [10.18.1.111:38936, 10.18.1.111:38937, 10.18.1.111:38938]
           11:52:24,695 DEBUG @Incoming-20,10.18.1.111:38938 [FLUSH] Rejecting flush at 10.18.1.111:38938, previous flush has to finish first
           11:52:24,695 DEBUG @Incoming-20,10.18.1.111:38938 [FLUSH] Rejecting flush at 10.18.1.111:38938 to flush requester 10.18.1.111:38938 coordinator is null
           11:52:27,504 DEBUG @Timer-12,10.18.1.111:38938 [FD] sending are-you-alive msg to 10.18.1.111:38936 (own address=10.18.1.111:38938)
           11:52:27,758 DEBUG @RMI TCP Connection(2)-10.18.1.111 [FLUSH] Retrying FLUSH at 10.18.1.111:38938. Attempts left 1
           11:52:27,758 DEBUG @RMI TCP Connection(2)-10.18.1.111 [FLUSH] Flush coordinator 10.18.1.111:38938 is starting FLUSH with participants [10.18.1.111:38936, 10.18.1.111:38937, 10.18.1.111:38938]
           11:52:27,759 DEBUG @Incoming-6,10.18.1.111:38938 [FLUSH] Rejecting flush at 10.18.1.111:38938, previous flush has to finish first
           11:52:27,759 DEBUG @Incoming-6,10.18.1.111:38938 [FLUSH] Rejecting flush at 10.18.1.111:38938 to flush requester 10.18.1.111:38938 coordinator is null
           11:52:27,761 ERROR @RMI TCP Connection(2)-10.18.1.111 [ExceptionUtil] org.jboss.messaging.core.jmx.MessagingPostOfficeService@2d59a3 startService
           java.lang.IllegalStateException: Could not flush the cluster and proceed with state retrieaval
           at org.jgroups.JChannel.getState(JChannel.java:1041)
           at org.jgroups.JChannel.getState(JChannel.java:973)
           at org.jgroups.JChannel.getState(JChannel.java:927)
           at org.jboss.messaging.core.impl.postoffice.GroupMember.start(GroupMember.java:152)
           at org.jboss.messaging.core.impl.postoffice.MessagingPostOffice.start(MessagingPostOffice.java:357)
           at org.jboss.messaging.core.jmx.MessagingPostOfficeService.startService(MessagingPostOfficeService.java:456)
           at org.jboss.system.ServiceMBeanSupport.jbossInternalStart(ServiceMBeanSupport.java:289)
           at org.jboss.system.ServiceMBeanSupport.start(ServiceMBeanSupport.java:196)
           at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
           at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
           at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
           at java.lang.reflect.Method.invoke(Method.java:585)
           at org.jboss.mx.interceptor.ReflectedDispatcher.invoke(ReflectedDispatcher.java:155)
           at org.jboss.mx.server.Invocation.dispatch(Invocation.java:94)
           at org.jboss.mx.interceptor.AbstractInterceptor.invoke(AbstractInterceptor.java:133)
           at org.jboss.mx.server.Invocation.invoke(Invocation.java:88)
           at org.jboss.mx.interceptor.ModelMBeanOperationInterceptor.invoke(ModelMBeanOperationInterceptor.java:142)
           at org.jboss.mx.server.Invocation.invoke(Invocation.java:88)
           at org.jboss.mx.server.AbstractMBeanInvoker.invoke(AbstractMBeanInvoker.java:264)
           at org.jboss.mx.server.MBeanServerImpl.invoke(MBeanServerImpl.java:659)
           at org.jboss.test.messaging.tools.container.ServiceContainer.invoke(ServiceContainer.java:712)
           at org.jboss.test.messaging.tools.container.LocalTestServer.startServerPeer(LocalTestServer.java:349)
           at org.jboss.test.messaging.tools.container.LocalTestServer.start(LocalTestServer.java:137)
           at org.jboss.test.messaging.tools.container.RMITestServer.start(RMITestServer.java:146)
           at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
           at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
           at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
           at java.lang.reflect.Method.invoke(Method.java:585)
           at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:294)
           at sun.rmi.transport.Transport$1.run(Transport.java:153)
           at java.security.AccessController.doPrivileged(Native Method)
           at sun.rmi.transport.Transport.serviceCall(Transport.java:149)
           at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:466)
           at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:707)
           at java.lang.Thread.run(Thread.java:595)


          • 17. Re: Changes on the JBM stack on JBoss5
            brian.stansberry

            Vlad,

            You have a more legible version of that in your e-mail. ;-)

            If the problem here isn't obvious to you once you review the logs, please initially consider whether setting loopback="true" on the UDP transport could be relevant. It was "false" in JBM's previous config. Restoring "false" is one tweak they could make in their (immediately) pending 1.x release.
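
             For context, loopback is an attribute of the UDP transport protocol. In old-style plain-props form the relevant fragment sits like this (a sketch only: the addresses are copied from the log above, and the protocol list is abbreviated, not JBM's full stack):

                 // Illustrative only: the loopback attribute lives on the UDP transport element.
                 String props =
                     "UDP(bind_addr=10.18.1.111;mcast_addr=228.11.11.11;mcast_port=45688;loopback=true):" +
                     "PING:MERGE2:FD_SOCK:FD:VERIFY_SUSPECT:" +
                     "pbcast.NAKACK:UNICAST:pbcast.STABLE:FRAG2:pbcast.GMS:pbcast.FLUSH";
                 JChannel channel = new JChannel(props);

             With loopback="true" a member's own multicasts are handed straight back up its stack instead of being read off the multicast socket.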

            • 18. Re: Changes on the JBM stack on JBoss5
              clebert.suconic

              I just started a run on hudson with loopback changed.

               If it runs OK, that will increase the chances that the problem is the loopback setting, but I would need another run to make sure. (This is intermittent, and the testsuite can sometimes pass even with the bug present.)

              • 19. Re: Changes on the JBM stack on JBoss5
                clebert.suconic

                 I got the same failure with loopback = false.

                • 20. Re: Changes on the JBM stack on JBoss5
                  vblagojevic

                   I was able to recreate this in my own bare JGroups tests. I made a test where 4 members join and each does a separate getState, and the test repeats this 10 times. With loopback true, at least one of these 10 runs fails; with loopback false the test behaves much better. In outline the test does something like the sketch below.
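
                   A minimal sketch of that scenario (stack file, group name and timeout are illustrative placeholders; imports as in the sketch in reply 15):

                       // 10 repetitions of: 4 members join one by one, each doing a
                       // separate getState() right after connect().
                       for (int run = 0; run < 10; run++) {
                           JChannel[] members = new JChannel[4];
                           for (int i = 0; i < members.length; i++) {
                               members[i] = new JChannel("udp.xml");    // illustrative stack file
                               members[i].setReceiver(new ReceiverAdapter() {
                                   public byte[] getState() { return new byte[0]; }
                                   public void setState(byte[] state) { }
                               });
                               members[i].connect("flush-test");
                               members[i].getState(null, 10000);        // each call runs a FLUSH round
                           }
                           for (int i = members.length - 1; i >= 0; i--) {
                               members[i].close();
                           }
                       }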

                   I have isolated the problem and I am testing a fix that should handle it regardless of loopback being turned off or on!


                  • 21. Re: Changes on the JBM stack on JBoss5
                    clebert.suconic

                     I don't think this is related to the loopback, though?

                    I had this error with and without the loopback on (in a test I did last week).

                    • 22. Re: Changes on the JBM stack on JBoss5
                      vblagojevic

                      Ok, might be. It could also be increased concurrency of channel startup in AS.

                       Unfortunately, we don't currently have a way to test concurrent shared-transport startup extensively in JGroups itself, so AS tests are our best option. Why don't you guys make a test that repeatedly does what Brian suggested: join a few channels on two AS nodes, where each node uses a shared transport and each channel does either a combined connect-and-state-transfer or a separate connect and getState (the two variants are sketched below). I hope to have a fix soon.
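
                       For reference, the two variants in JGroups 2.6 look roughly like this (a sketch: stack file, group names and timeouts are placeholders, and the shared transport would come from a singleton_name attribute on the transport in the real stack files):

                           // Variant 1: separate connect and getState (two FLUSH rounds).
                           // (A Receiver supplying get/setState would be installed as in the earlier sketches.)
                           JChannel c1 = new JChannel("shared-udp.xml");
                           c1.connect("group-A");
                           c1.getState(null, 10000);

                           // Variant 2: combined connect + state transfer in a single call (one FLUSH round).
                           JChannel c2 = new JChannel("shared-udp.xml");
                           c2.connect("group-B", null, null, 10000); // null target = coordinator, null state_id = full state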

                      Vladimir

                      • 23. Re: Changes on the JBM stack on JBoss5
                        clebert.suconic

                        I was assuming you replicated the issue...

                        I was just saying this could also happen without the loopback.

                        Maybe Brian wants to add a test similar to yours on JBAS to avoid regressions? Well... I don't know; if you keep that test in the JGroups testsuite, maybe we don't need to do anything?

                        • 24. Re: Changes on the JBM stack on JBoss5
                          brian.stansberry


                          "vblagojevic@jboss.com" wrote:
                          Ok, might be. It could also be increased concurrency of channel startup in AS.


                          These failures are not occurring in the AS. AFAIK, there is no concurrent startup of channels involved here (unless the JBM testsuite is trying to speed up the process by parallelizing the start of the nodes, but I didn't see any evidence of that).

                          • 25. Re: Changes on the JBM stack on JBoss5
                            clebert.suconic

                            We use two channels, Data and Control, and we don't do anything special with them, besides calling getState right after the connect on the control (UDP) channel.

                            • 26. Re: Changes on the JBM stack on JBoss5
                              vblagojevic

                              Hey guys,

                              Are you making any progress on that AS test? I have completed a fix for FLUSH and it looks very good. I have run over a hundred tests with loopback both off and on, and all of them pass - not a single failure.

                              How should we proceed? Clebert, do you want me to give you a jar file so you can give it a spin? If it looks good we can soon proceed to cut 2.6.7.

                              Let me know,
                              Vladimir

                              • 27. Re: Changes on the JBM stack on JBoss5
                                brian.stansberry

                                Sorry, there's not going to be any test of this in the AS; I have no bandwidth, and TBH I don't see why tests of core JGroups functionality belong in the AS testsuite.

                                Recommend you follow the approach you suggested of providing Clebert a jar and seeing if your fix resolves the problem he was seeing.

                                • 28. Re: Changes on the JBM stack on JBoss5
                                  clebert.suconic

                                  I'm running a test on Hudson:

                                  http://hudson.qa.jboss.com/hudson/view/JBM1.4 (JBoss5)/job/JBMS-clustering-mysql-msgci1/62

                                  (Internal RedHat link... sorry for the outsiders :-) )

                                  If this test passes, I will start a test in a loop (about 200 runs) to see if the issue is gone.


                                  And BTW: what was the conclusion on the loopback thing? Should I use it or not?

                                  • 29. Re: Changes on the JBM stack on JBoss5
                                    brian.stansberry

                                    Use loopback="true"; I'll post details as to why on the Design of Clustering forum in a while and add a link to the post here.