9 Replies Latest reply on Apr 2, 2008 3:58 AM by lerdek

    Failover on clustered queues causes thread leak

    lerdek

      Hi,

      We have a setup of 2 clustered JBoss servers with clustered messaging and ~150 clustered queues (including EQs and DLQs). If we 'kill -9' one of the nodes, the failover mechanism kicks in, but it also creates ~200 new threads, all in WAITING state. These threads keep waiting forever, and the same thing happens on every subsequent failover (so each failover increases the active thread count of the surviving node by ~200).

      Since we still run on a Linux 2.4 kernel, with a practical upper limit of ~600 threads per process, this is a big problem: if a single node fails once or twice, the 'surviving' node dies as a consequence, which basically defeats the purpose of a failover mechanism.
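
      As a side note, that practical ceiling can be measured with a throwaway snippet like the one below, run with the same -Xss and ulimit settings as the server. This is purely an illustrative sketch (the class name is made up), not code from our deployment:

       // Throwaway diagnostic: spawn idle daemon threads until the JVM can no
       // longer create a native thread, to find the practical per-process limit.
       public class ThreadCeilingProbe {
           public static void main(String[] args) {
               int count = 0;
               try {
                   while (true) {
                       Thread t = new Thread(new Runnable() {
                           public void run() {
                               Object lock = new Object();
                               synchronized (lock) {
                                   try {
                                       lock.wait();   // park forever so the thread stays alive
                                   } catch (InterruptedException ignored) {
                                   }
                               }
                           }
                       });
                       t.setDaemon(true);
                       t.start();
                       count++;
                   }
               } catch (OutOfMemoryError oome) {
                   System.out.println("created " + count + " threads before: " + oome);
               }
           }
       }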

      We would be very thankful for suggestions on how to fix what appears to be a thread leak.

      We have observed the following so far:


      Active thread count before failover: ~150
      Active thread count after 1 failover: ~350
      Active thread count after 2 failovers: ~550
      The active thread count does not decrease over time (the threads appear to be blocked forever); see the counting sketch below.
      A failover scenario without clustered JBoss Messaging shows no apparent thread leak.
      A failover scenario with only 3 clustered queues still shows the thread leak, but at a much lower rate (~5 threads per failover), so the leak appears to grow linearly with the number of clustered queues.
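
      To quantify the leak from inside the JVM, something along the following lines can be used (e.g. wired into a simple servlet or MBean); it walks all live threads and counts the ones parked in the QueuedExecutor run loop that shows up in the thread dump further down. Again, just an illustrative sketch, not code from our deployment:

       // Illustrative sketch: count live threads that are WAITING inside
       // EDU.oswego.cs.dl.util.concurrent.QueuedExecutor's run loop.
       import java.util.Map;

       public class LeakedThreadCounter {

           public static int countQueuedExecutorThreads() {
               int leaked = 0;
               Map<Thread, StackTraceElement[]> all = Thread.getAllStackTraces();
               for (Map.Entry<Thread, StackTraceElement[]> entry : all.entrySet()) {
                   if (entry.getKey().getState() != Thread.State.WAITING) {
                       continue;
                   }
                   for (StackTraceElement frame : entry.getValue()) {
                       if (frame.getClassName().endsWith("QueuedExecutor$RunLoop")) {
                           leaked++;
                           break;
                       }
                   }
               }
               return leaked;
           }

           public static void main(String[] args) {
               System.out.println("QueuedExecutor run-loop threads: " + countQueuedExecutorThreads());
           }
       }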


      At the moment it is not an option to upgrade our kernel to 2.6, which would in any case only reduce the frequency of the symptoms, not fix the underlying problem. We also cannot use a multicast protocol stack, because multicast is not supported in our production environment.

      Here's our configuration for both clustered messaging nodes, which are dedicated messaging servers (no other custom applications are deployed):


      JBoss AS 4.2.2.GA
      JBoss Messaging 1.4.0sp3
      JGroups 2.4.1 SP-4


      After a failover, a lot of threads with increasing numbers are left lying around that look like this (only the number in the thread name differs; e.g. after 1 failover it goes up to Thread-274, after 2 failovers to Thread-473):

       Thread: Thread-50 : priority:5, demon:true, threadId:268, threadState:WAITING, lockName:java.lang.Object@62685f
      
       java.lang.Object.wait(Native Method)
       java.lang.Object.wait(Object.java:474)
       EDU.oswego.cs.dl.util.concurrent.LinkedQueue.take(LinkedQueue.java:122)
       EDU.oswego.cs.dl.util.concurrent.QueuedExecutor$RunLoop.run(QueuedExecutor.java:83)
       java.lang.Thread.run(Thread.java:595)
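
      That stack trace is the idle run loop of Doug Lea's QueuedExecutor from the concurrent.jar shipped with JBoss: each such executor owns a single worker thread that blocks in LinkedQueue.take() until the executor is explicitly shut down, so an executor that is created and then simply dropped leaves its worker waiting forever. It looks as if one executor per clustered queue is created during failover and never shut down. A minimal sketch of the behaviour (assuming concurrent.jar on the classpath; method names are from the oswego API):

       // Minimal sketch: a QueuedExecutor keeps one background thread parked in
       // LinkedQueue.take() until the executor is shut down.
       import EDU.oswego.cs.dl.util.concurrent.QueuedExecutor;

       public class QueuedExecutorDemo {
           public static void main(String[] args) throws InterruptedException {
               QueuedExecutor executor = new QueuedExecutor();

               // Submitting a task starts the executor's single worker thread.
               executor.execute(new Runnable() {
                   public void run() {
                       System.out.println("task ran on " + Thread.currentThread().getName());
                   }
               });

               Thread.sleep(1000);
               // A thread dump taken here shows one extra thread waiting in
               // LinkedQueue.take(), exactly like the Thread-50 trace above.

               // Only an explicit shutdown releases that thread; dropping the
               // executor without shutting it down leaves the worker waiting forever.
               executor.shutdownNow();
           }
       }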
      
      


      Here's the output of the JMX bean jboss.system/ServerInfo on the surviving node after 2 consecutive failovers:

      HostAddress = 192.168.86.9
      AvailableProcessors = 1
      OSArch = i386
      OSVersion = 2.4.27-vmware-k7-nosmp
      HostName = xxx
      JavaVendor = Sun Microsystems Inc.
      JavaVMName = Java HotSpot(TM) Server VM
      FreeMemory = 87777680
      ActiveThreadGroupCount = 10
      TotalMemory = 166629376
      JavaVMVersion = 1.5.0_07-b03
      ActiveThreadCount = 548
      JavaVMVendor = Sun Microsystems Inc.
      OSName = Linux
      JavaVersion = 1.5.0_07
      MaxMemory = 265486336
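
      To track ActiveThreadCount between failovers without clicking through the jmx-console, a small client can poll the ServerInfo MBean through the standard JBoss 4.x RMIAdaptor (bound under jmx/invoker/RMIAdaptor). A rough sketch, with jbossall-client.jar assumed on the classpath and the host/port as placeholders for the surviving node:

       // Rough sketch: poll jboss.system:type=ServerInfo / ActiveThreadCount
       // through the JBoss 4.x RMIAdaptor. Requires jbossall-client.jar.
       import java.util.Properties;
       import javax.management.MBeanServerConnection;
       import javax.management.ObjectName;
       import javax.naming.Context;
       import javax.naming.InitialContext;

       public class ThreadCountPoller {
           public static void main(String[] args) throws Exception {
               Properties env = new Properties();
               env.put(Context.INITIAL_CONTEXT_FACTORY, "org.jnp.interfaces.NamingContextFactory");
               env.put(Context.PROVIDER_URL, "jnp://192.168.86.201:1099");   // surviving node

               InitialContext ctx = new InitialContext(env);
               // In JBoss 4.x the RMIAdaptor implements MBeanServerConnection.
               MBeanServerConnection server =
                       (MBeanServerConnection) ctx.lookup("jmx/invoker/RMIAdaptor");
               ObjectName serverInfo = new ObjectName("jboss.system:type=ServerInfo");

               while (true) {
                   Object count = server.getAttribute(serverInfo, "ActiveThreadCount");
                   System.out.println(System.currentTimeMillis() + " ActiveThreadCount=" + count);
                   Thread.sleep(10000);   // poll every 10 seconds
               }
           }
       }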
      


      Our JGroups protocol stack is the following (used for both the JBoss clustering channel and the JBoss Messaging post office data and control channels):

      STATE_TRANSFER
      use_flush=false
      up_thread=false
      down_thread=false
      
      GMS
      shun=true
      print_local_addr=true
      up_thread=false
      view_bundling=true
      join_timeout=3000
      join_retry_timeout=2000
      down_thread=false
      
      STABLE
      max_bytes=400000
      up_thread=false
      stability_delay=1000
      desired_avg_gossip=50000
      down_thread=false
      
      NAKACK
      max_xmit_size=60000
      up_thread=false
      retransmit_timeout=300,600,1200,2400,4800
      use_mcast_xmit=false
      discard_delivered_msgs=true
      down_thread=false
      gc_lag=0
      
      VERIFY_SUSPECT
      up_thread=false
      timeout=1500
      down_thread=false
      
      FD
      max_tries=5
      shun=true
      up_thread=false
      timeout=10000
      down_thread=false
      
      FD_SOCK
      up_thread=false
      down_thread=false
      
      MERGE2
      max_interval=10000
      up_thread=false
      down_thread=false
      min_interval=2000
      
      TCPPING
      port_range=3
      num_initial_members=13
      up_thread=false
      initial_hosts=j2msgtest1[8800],j2msgtest2[8800]
      timeout=3000
      down_thread=false
      
      TCP
      discard_incompatible_packets=true
      sock_conn_timeout=300
      enable_bundling=false
      bind_addr=192.168.86.201
      max_bundle_size=64000
      use_outgoing_packet_handler=false
      use_send_queues=false
      down_thread=false
      start_port=8800
      recv_buf_size=20000000
      skip_suspected_members=true
      send_buf_size=640000
      use_incoming_packet_handler=true
      loopback=true
      up_thread=false
      tcp_nodelay=true
      max_bundle_timeout=30
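
      For completeness: the same TCP-based stack can be exercised outside JBoss with a tiny JGroups test client, which is handy to rule out plain connectivity problems between the two hosts. A sketch using the old-style property string (abridged; the remaining protocols and properties from the listing above follow the same PROTOCOL(name=value;...) syntax):

       // Sketch: bring up a bare JGroups channel on the same TCP/TCPPING transport
       // to verify that the two hosts can form a view outside of JBoss.
       import org.jgroups.JChannel;

       public class StackSanityCheck {
           public static void main(String[] args) throws Exception {
               String props =
                   "TCP(bind_addr=192.168.86.201;start_port=8800;loopback=true):" +
                   "TCPPING(initial_hosts=j2msgtest1[8800],j2msgtest2[8800];port_range=3;" +
                       "timeout=3000;num_initial_members=13):" +
                   "MERGE2(min_interval=2000;max_interval=10000):" +
                   "FD_SOCK:" +
                   "VERIFY_SUSPECT(timeout=1500):" +
                   "pbcast.NAKACK(retransmit_timeout=300,600,1200,2400,4800;gc_lag=0):" +
                   "pbcast.STABLE(desired_avg_gossip=50000):" +
                   "pbcast.GMS(join_timeout=3000;shun=true;print_local_addr=true)";

               JChannel channel = new JChannel(props);
               channel.connect("stack-sanity-check");   // arbitrary test group name
               System.out.println("current view: " + channel.getView());
               Thread.sleep(30000);                     // give the second node time to join
               channel.close();
           }
       }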
      


      Thank you in advance!

      With regards,

      Thijs Reus
      Click&Buy Services AG

        • 1. Re: Failover on clustered queues causes thread leak
          timfox

          How many nodes did you start with, and how many did you fail over?

          • 2. Re: Failover on clustered queues causes thread leak
            timfox

            One more question:

            How many connections and sessions do you have attached to each node when failover occurs?

            Do you see the same problem if you fail over with no JMS connections using the nodes?

            • 3. Re: Failover on clustered queues causes thread leak
              lerdek

              Thank you for your quick reply!

               We have a cluster of 2 nodes and have killed/started 1 node a couple of times. We ran the following scenario:

              1. Start node 1
              2. Start node 2 after node 1 has started completely
              3. Once node 2 has started completely, 'kill -9' node 2 (first failover)
              4. Start node 2
              5. Once node 2 has started completely, 'kill -9' node 2 (second failover)
              6. etc.

               At the moment we have no JMS connections or sessions on either node, because we're testing the failover scenario 'pure', without any additional load.
               Please note that the JMS functionality as such works fine from the client perspective in the clustered setup (sending, receiving, reconnecting once a node is gone, etc.). Our problem at the moment only concerns the messaging-node failover and the thread increase described above.

               After killing node 2 several times (how many depends on the amount of available OS memory), we get the following exception (the stack trace varies, but all of them are 'unable to create new native thread' OOMEs):

              2008-03-26 10:23:03,967 ERROR [org.jgroups] uncaught exception in Thread[ServerSocket acceptor thread (channel=DefaultPartition-EntityCache),5,JGroups threa
              ds] (thread group=org.jgroups.util.Util$1[name=JGroups threads,maxpri=10] )
              java.lang.OutOfMemoryError: unable to create new native thread
               at java.lang.Thread.start0(Native Method)
               at java.lang.Thread.start(Thread.java:574)
               at org.jgroups.protocols.FD_SOCK$ServerSocketHandler.run(FD_SOCK.java:1089)
               at java.lang.Thread.run(Thread.java:595)
              


               Here's an excerpt from the node 1 log, from the moment it has started completely until the failover has finished:

              2008-03-26 10:01:46,914 INFO [org.jboss.system.server.Server] JBoss (MX MicroKernel) [4.2.2.GA (build: SVNTag=JBoss_4_2_2_GA date=200710221139)] Started in 59s:380ms
              2008-03-26 10:02:45,989 INFO [org.jboss.cache.TreeCache] viewAccepted(): [192.168.86.201:45958|1] [192.168.86.201:45958, 192.168.86.202:45969]
              2008-03-26 10:02:46,041 INFO [org.jboss.cache.statetransfer.StateTransferGenerator_140] returning the state for tree rooted in /(1024 bytes)
              2008-03-26 10:02:53,732 INFO [org.jboss.ha.framework.interfaces.HAPartition.lifecycle.devel] New cluster view for partition devel (id: 1, delta: 1) : [192.168.86.201:1099, 192.168.86.202:1099]
              2008-03-26 10:02:53,734 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.devel] I am (192.168.86.201:1099) received membershipChanged event:
              2008-03-26 10:02:53,734 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.devel] Dead members: 0 ([])
              2008-03-26 10:02:53,734 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.devel] New Members : 1 ([192.168.86.202:1099])
              2008-03-26 10:02:53,734 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.devel] All Members : 2 ([192.168.86.201:1099, 192.168.86.202:1099])
              2008-03-26 10:02:56,742 INFO [org.jboss.cache.TreeCache] viewAccepted(): [192.168.86.201:45964|1] [192.168.86.201:45964, 192.168.86.202:45972]
              2008-03-26 10:02:59,322 INFO [org.jboss.cache.TreeCache] viewAccepted(): [192.168.86.201:45966|1] [192.168.86.201:45966, 192.168.86.202:45974]
              2008-03-26 10:03:35,464 WARN [org.jgroups.blocks.ConnectionTable] peer closed connection, trying to re-send msg
              2008-03-26 10:03:35,464 ERROR [org.jgroups.blocks.ConnectionTable] 2nd attempt to send data failed too
              2008-03-26 10:03:35,465 WARN [org.jgroups.blocks.ConnectionTable] peer closed connection, trying to re-send msg
              2008-03-26 10:03:35,465 ERROR [org.jgroups.blocks.ConnectionTable] 2nd attempt to send data failed too
              2008-03-26 10:03:35,466 WARN [org.jgroups.blocks.ConnectionTable] peer closed connection, trying to re-send msg
              2008-03-26 10:03:35,466 ERROR [org.jgroups.blocks.ConnectionTable] 2nd attempt to send data failed too
              2008-03-26 10:03:35,633 ERROR [org.jboss.remoting.transport.socket.SocketClientInvoker] Got marshalling exception, exiting
              java.net.SocketException: end of file
               at org.jboss.remoting.transport.socket.MicroSocketClientInvoker.transport(MicroSocketClientInvoker.java:604)
               at org.jboss.remoting.transport.bisocket.BisocketClientInvoker.transport(BisocketClientInvoker.java:418)
               at org.jboss.remoting.MicroRemoteClientInvoker.invoke(MicroRemoteClientInvoker.java:122)
               at org.jboss.remoting.ConnectionValidator.doCheckConnection(ConnectionValidator.java:133)
               at org.jboss.remoting.ConnectionValidator.run(ConnectionValidator.java:308)
               at java.util.TimerThread.mainLoop(Timer.java:512)
               at java.util.TimerThread.run(Timer.java:462)
              2008-03-26 10:03:36,972 INFO [org.jboss.ha.framework.interfaces.HAPartition.lifecycle.devel] Suspected member: 192.168.86.202:8800
              2008-03-26 10:03:37,031 INFO [org.jboss.ha.framework.interfaces.HAPartition.lifecycle.devel] New cluster view for partition devel (id: 2, delta: -1) : [192.168.86.201:1099]
              2008-03-26 10:03:37,032 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.devel] I am (192.168.86.201:1099) received membershipChanged event:
              2008-03-26 10:03:37,032 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.devel] Dead members: 1 ([192.168.86.202:1099])
              2008-03-26 10:03:37,032 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.devel] New Members : 0 ([])
              2008-03-26 10:03:37,032 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.devel] All Members : 1 ([192.168.86.201:1099])
              2008-03-26 10:03:37,039 INFO [org.jboss.messaging.core.impl.postoffice.MessagingPostOffice] JBoss Messaging is failing over for failed node 2. If there are many messages to reload this may take some time...
              2008-03-26 10:03:37,043 INFO [org.jboss.cache.TreeCache] viewAccepted(): [192.168.86.201:45958|2] [192.168.86.201:45958]
              2008-03-26 10:03:37,047 INFO [org.jboss.cache.TreeCache] viewAccepted(): [192.168.86.201:45964|2] [192.168.86.201:45964]
              2008-03-26 10:03:37,048 INFO [org.jboss.cache.TreeCache] viewAccepted(): [192.168.86.201:45966|2] [192.168.86.201:45966]
              2008-03-26 10:03:37,511 WARN [org.jboss.remoting.LeasePinger] LeasePinger[SocketClientInvoker[19985cb, bisocket://192.168.86.202:4457](5c4o2e9-na0je7-fe9q826g-1-fe9qb3hb-j)] failed to ping to server: Can not get connection to server. Problem establishing socket connection for InvokerLocator [bisocket://192.168.86
              .202:4457/?JBM_clientMaxPoolSize=200&clientLeasePeriod=10000&clientSocketClass=org.jboss.jms.client.remoting.ClientSocketWrapper&dataType=jms&
              marshaller=org.jboss.jms.wireformat.JMSWireFormat&numberOfCallRetries=1&numberOfRetries=10&pingFrequency=214748364&pingWindowFactor=10&socket.check_connection=false&timeout=
              0&unmarshaller=org.jboss.jms.wireformat.JMSWireFormat]
              2008-03-26 10:03:37,985 WARN [org.jboss.remoting.Client] unable to remove remote callback handler: Can not get connection to server. Problem establishing socket connection for InvokerLocator [bisocket://192.168.86.202:4457/?JBM_clientMaxPoolSize=200&clientLeasePeriod=10000&clientSocketClass=org.jboss.jms.client.r
              emoting.ClientSocketWrapper&dataType=jms&marshaller=org.jboss.jms.wireformat.JMSWireFormat&numberOfCallRetries=1&numberOfRetries=10&pingFrequency=214748364&pingWindowFactor=10&
              socket.check_connection=false&timeout=0&unmarshaller=org.jboss.jms.wireformat.JMSWireFormat]
              2008-03-26 10:03:37,985 WARN [org.jboss.remoting.LeasePinger] LeasePinger[SocketClientInvoker[19985cb, bisocket://192.168.86.202:4457](5c4o2e9-na0je7-fe9q826g-1-fe9qb3hb-j)] failed sending disconnect for client lease for client with session ID 5c4o2e9-na0je7-fe9q826g-1-fe9qb3ha-h
              2008-03-26 10:03:37,986 ERROR [org.jboss.remoting.MicroRemoteClientInvoker] error shutting down lease pinger
              2008-03-26 10:03:37,990 INFO [org.jboss.messaging.core.impl.postoffice.MessagingPostOffice] JBoss Messaging failover completed
              2008-03-26 10:03:57,382 WARN [org.jboss.jms.server.connectionmanager.SimpleConnectionManager] ConnectionManager[14790be] cannot look up remoting session ID 5c4o2e9-lf5156-fe9q9x46-1-fe9qb3cg-e
              2008-03-26 10:03:57,382 WARN [org.jboss.jms.server.connectionmanager.SimpleConnectionManager] A problem has been detected with the connection to remote client 5c4o2e9-lf5156-fe9q9x46-1-fe9qb3cg-e, jmsClientID=null. It is possible the client has exited without closing its connection(s) or the network has failed. All connection resources corresponding to that client process will now be removed.
              


              After we start node 2 again, the log on node 1 looks like this:

              2008-03-26 10:04:24,333 INFO [org.jboss.cache.TreeCache] viewAccepted(): [192.168.86.201:45958|3] [192.168.86.201:45958, 192.168.86.202:45976]
              2008-03-26 10:04:24,370 INFO [org.jboss.cache.statetransfer.StateTransferGenerator_140] returning the state for tree rooted in /(1024 bytes)
              2008-03-26 10:04:31,502 INFO [org.jboss.ha.framework.interfaces.HAPartition.lifecycle.devel] New cluster view for partition devel (id: 3, delta: 1) : [192.168.86.201:1099, 192.168.86.202:1099]
              2008-03-26 10:04:31,502 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.devel] I am (192.168.86.201:1099) received membershipChanged event:
              2008-03-26 10:04:31,502 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.devel] Dead members: 0 ([])
              2008-03-26 10:04:31,502 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.devel] New Members : 1 ([192.168.86.202:1099])
              2008-03-26 10:04:31,502 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.devel] All Members : 2 ([192.168.86.201:1099, 192.168.86.202:1099])
              2008-03-26 10:04:34,472 INFO [org.jboss.cache.TreeCache] viewAccepted(): [192.168.86.201:45964|3] [192.168.86.201:45964, 192.168.86.202:45979]
              2008-03-26 10:04:36,692 INFO [org.jboss.cache.TreeCache] viewAccepted(): [192.168.86.201:45966|3] [192.168.86.201:45966, 192.168.86.202:45981]
              


               In the meantime, we've configured all our EQs and DLQs as non-clustered, reducing the thread increase to ~75 new threads per failover instead of ~200. This allows us to fail over a couple of extra times before the OOME pops up, but it still doesn't solve the real problem...

               We've also tried to decrease the initial thread stack size with the -Xss JVM parameter, but on our 2.4 kernel this doesn't appear to make a difference in the maximum number of threads the JVM can start and keep alive.

              • 4. Re: Failover on clustered queues causes thread leak
                timfox

                 Can I confirm that you have no MDBs or any other consumers on the nodes?

                 Also, can you confirm which version of JBoss Remoting you are using?

                • 5. Re: Failover on clustered queues causes thread leak
                  andrei.martchouk

                  There aren't any consumers deployed on the messaging nodes. Only the queues are deployed.

                   JBoss Remoting 2.2.2sp4

                   (Thijs and I are working on this problem together.)

                  • 6. Re: Failover on clustered queues causes thread leak
                    timfox

                    Ok, I think I know what is going on here. I should have it sorted by the end of the week.

                    • 7. Re: Failover on clustered queues causes thread leak
                      timfox
                      • 8. Re: Failover on clustered queues causes thread leak
                        lerdek

                        Thanks, we're looking forward to the fix!

                        • 9. Re: Failover on clustered queues causes thread leak
                          lerdek

                           We've downloaded, built, and tested the fix, and it appears to work very well. The thread count no longer goes up in failover scenarios. We will test some more and report any peculiarities.

                          Thank you very much for your quick response on this issue!

                          With regards,

                          Thijs Reus
                          Click&Buy Services AG