3 Replies Latest reply on Aug 16, 2011 1:27 PM by gernot.bauer

Problems with DIST-SYNC Cluster when restarting nodes

gernot.bauer Aug 11, 2011 9:13 AM

Hi,

I am running a DIST-SYNC Infinispan cluster with a Hot Rod server, a BDBJE cache store and Infinispan 4.2.1-FINAL on EC2. When the cluster crashes or is shut down, the nodes do not rejoin correctly after I restart them.

The steps I perform are (IP addresses are masked - I use the machine's public IP address for startup):

Start node1 (./startServer.sh -r hotrod -p 11222 -l 0.0.0.1 -c ec2InfinispanConfig.xml)
Wait until this node is up and running (= all caches have been loaded and the server is bound to port 11222)
Start node2 (./startServer.sh -r hotrod -p 11222 -l 0.0.0.2 -c ec2InfinispanConfig.xml)

The log shows that the second node tries to join the cluster, but soon after startup I see log messages like the following on node 1:

ERROR [org.jgroups.protocols.TCP] (OOB-2,infinispan-cluster-set,ip-0-0-0-1-493) failed sending message to ip-0-0-0-2-60013 (60108 bytes): java.lang.IllegalStateException: Queue full, cause: null

Does anyone have any idea what is going wrong? Right now, my only "workaround" is to clear the cache store. This is acceptable for dev, but unfortunately not for production.

1. Re: Problems with DIST-SYNC Cluster when restarting nodes

galder.zamarreno Aug 12, 2011 11:30 AM (in response to gernot.bauer)

I've pinged the JGroups guys to see if they can help.
2. Re: Problems with DIST-SYNC Cluster when restarting nodes

belaban Aug 13, 2011 2:32 AM (in response to gernot.bauer)

The exception you're seeing is a bug in JGroups; I've created [1] to fix it. However, something else must be wrong for that many messages to get queued up...

As a workaround you can set TCP.use_send_queues=false.

[1] https://issues.jboss.org/browse/JGRP-1351
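
For readers applying the workaround above, here is a minimal sketch of where the setting might go. It assumes the cluster uses a JGroups 2.x TCP-based stack defined in an XML file that the Infinispan configuration points to. The bind_port value is a placeholder and is not taken from this thread. The only change suggested above is use_send_queues="false"; every other attribute and protocol in the existing stack should stay as it is.

```xml
<config>
    <!-- Assumption: this is your existing TCP transport element. -->
    <!-- bind_port="7800" is a placeholder, not a value from this thread. -->
    <TCP bind_port="7800"
         use_send_queues="false" />
    <!-- ... remaining protocols of the existing stack, unchanged ... -->
</config>
```

The change has to be made on every node, and the nodes restarted, before it takes effect across the cluster.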
3. Re: Problems with DIST-SYNC Cluster when restarting nodes

gernot.bauer Aug 16, 2011 1:27 PM (in response to gernot.bauer)

I've spent a whole afternoon trying to reproduce the problem in various ways (including out-of-memory conditions), but I did not succeed - everything seems to work fine. I can only guess that the issue was due to the inhomogeneous server environment or some kind of misconfiguration.

Anyway, thanks for your help and the pointer to the workaround if the problem arises another time.