Introduction
The aim of this wiki is to explain the symptoms, root causes and methodology for debugging issues related to large state and/or packet transfers between clustered nodes.
Symptoms
Such issues tend to show signs like these:
"Initial state transfer failed: Channel.getState() returned false" Exception
In some cases, when an Nth (but not the 1st) node in the cluster is started up, you see exceptions like this:
16:41:45,101 INFO [DefaultPartition] Fetching state (will wait for 30000 milliseconds):
java.lang.IllegalStateException: Initial state transfer failed: Channel.getState() returned false
    at org.jboss.ha.framework.server.HAPartitionImpl.fetchState(HAPartitionImpl.java:356)
    at org.jboss.ha.framework.server.HAPartitionImpl.startPartition(HAPartitionImpl.java:282)
    ...
or like this:
09:35:38,988 WARN [ServiceController] Problem starting service jboss.cache:service=TomcatClusteringCache
org.jboss.cache.CacheException: Initial state transfer failed: Channel.getState() returned false
    at org.jboss.cache.TreeCache.fetchStateOnStartup(TreeCache.java:3191)
    at org.jboss.cache.TreeCache.startService(TreeCache.java:1429)
    ...
Both exceptions mean the same thing: an Nth (but not the 1st) node requested state from the coordinator of the cluster but never got it (in the first case, state related to the main cluster channel, i.e. EJB proxy deployment information; in the second case, state related to the HTTP session replication cache). So, after a certain timeout (60 seconds in the case of the main cluster channel, and the value of the InitialStateRetrievalTimeout attribute in the case of a cache instance), you get an exception like the ones above.
"ClusterFileTransfer$ClusterFileTransferException: Did not receive response..." Exception
Other times, when an Nth (but not the 1st) node in the cluster is started up, you see exceptions like this:
2008-05-07 16:22:15,544 ERROR [FarmMemberService] org.jboss.ha.framework.server.ClusterFileTransfer$ClusterFileTransferException: Did not receive response from remote machine trying to open file 'farm/xyz.ear'. Check remote machine error log.
This is pretty much the same issue applied to farming file transfer: an Nth (but not the 1st) node requested the farmed deployment archive to be sent but did not receive it within a certain timeout, which in this case is 60 seconds by default.
Possible Causes
In no particular order and based on past experience:
Cause 1. The state or farmed deployment is too big, or takes too long to send due to low bandwidth. Note: this is hardly ever the case.
Cause 2. A firewall drops or blocks UDP packets because:
- the UDP packet exceeds a certain size,
- the UDP packet length and the size on the wire do not match (the packet looks fake or malformed), or
- fragmented UDP packets are not allowed.
Cause 3. The MTU differs between the network interfaces that the different cluster nodes use for cluster traffic.
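A quick way to check for Causes 2 and 3 from the command line is sketched below, assuming Linux, a cluster interface named eth0 and the node addresses used in the examples further down (all three are assumptions; adjust them to your environment):

# Compare the MTU of the cluster-facing interface on every node; the values should match
/sbin/ip link show eth0

# Test whether an unfragmented packet close to a 1500-byte MTU gets through
# (-M do forbids fragmentation; 1472 = 1500 minus 28 bytes of IP/ICMP headers)
ping -M do -s 1472 -c 3 192.168.10.12

# Test whether fragmented packets are allowed through at all
ping -s 20000 -c 3 192.168.10.12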
Debugging Methodology
Step 1. The easiest thing to try is increasing the state transfer timeout (in milliseconds) and seeing whether that resolves the issue, in case the initial transfer took longer than the timeout allowed. It's worth noting that this is hardly ever the case and the default state transfer timeouts are generally good enough. In any case, if you want to try increasing these timeouts, you can do so this way (a combined sketch follows this list):
- If the state transfer issue affects a JBoss Cache instance, change the state transfer timeout with something like this:
<attribute name="InitialStateRetrievalTimeout">75000</attribute>
- If the state transfer issue is due to a "Channel.getState() returned false" affecting the main cluster partition, you can change the default state transfer timeout by adding the following attribute to the jboss:service=${jboss.partition.name:DefaultPartition} MBean in deploy/cluster-service.xml:
<attribute name="StateTransferTimeout">75000</attribute>
- If the issue is related to farming, you can increase the default 60000 millisecond cluster RPC method call timeout by adding the following attribute to the jboss:service=${jboss.partition.name:DefaultPartition} MBean in deploy/cluster-service.xml (please note that changing this parameter affects the timeout of all cluster-wide RPC calls):
<attribute name="MethodCallTimeout">75000</attribute>
Step 2. Run JGroups' LargeState test using the configuration of the channel that is showing issues. This replicates a similar scenario by starting a provider that sends a state of a certain size to a state requester. It doesn't matter which of the symptoms you're trying to debug; the aim is to test sending something fairly big over the network. Quite often this test fails, which helps narrow down the culprit. Example (assumes the state provider's IP address is 192.168.10.11 and the state receiver's is 192.168.10.12):
Take the JGroups <Config> element section from the relevant descriptor (deploy/cluster-service.xml, deploy/jboss-web-cluster.sar/META-INF/jboss-service.xml, etc.) and put it in a separate file called config.xml at the root of the AS/EAP installation.
Go to the root of AS/EAP in the cluster node that is supposed to provide the state or the farmed file and start a 10MB state provider:
java -Djava.net.preferIPv4Stack=true -Dbind.address=192.168.10.11 -cp server/all/lib/jgroups.jar:lib/commons-logging.jar:lib/concurrent.jar org.jgroups.tests.LargeState -props ./config.xml -provider -size 10000000
Go to the root of AS/EAP in the cluster node that is supposed to receive the state or the farmed file and start a state requester:
java -Djava.net.preferIPv4Stack=true -Dbind.address=192.168.10.12 -cp server/all/lib/jgroups.jar:lib/commons-logging.jar:lib/concurrent.jar org.jgroups.tests.LargeState -props ./config.xml
If things work correctly, you should see 192.168.10.11 and 192.168.10.12 forming a cluster and 192.168.10.11 sending a state of 10 MB to 192.168.10.12:
In 192.168.10.11
-------------------------------------------------------
GMS: address is 192.168.10.11:xxxxx
-------------------------------------------------------
-- view: [192.168.10.11:xxxxx|0] [192.168.10.11:xxxxx]
-- connected to channel
Waiting for other members to join and fetch large state
-- view: [192.168.10.11:xxxxx|1] [192.168.10.11:xxxxx, 192.168.10.12:yyyyy]
creating state of 10000000 bytes
--> returning state: 10000000 bytes
-- view: [192.168.10.11:xxxxx|2] [192.168.10.11:xxxxx]
In 192.168.10.12
-------------------------------------------------------
GMS: address is 192.168.10.12:yyyyy
-------------------------------------------------------
-- connected to channel
Getting state
getState(), rc=true
<-- Received state, size=10000000 (took NNNNNms)
If the test fails or the state is not received, try these things:
- If config.xml uses FRAG or FRAG2, reduce the frag_size gradually and run the test again (see the sketch after this list). You might see that after reducing it to a certain level, the test actually passes. In that case, the root cause is likely to be Cause 2 or Cause 3, and it is highly recommended that you talk to the network team within your company and explain the situation to them. They should be able to help you fix the issue.
- If sending 10MB fails, try gradually smaller states and see whether the transfer works below a certain size. If so, the root cause is again likely to be Cause 2 or Cause 3, so involve your network team as above.
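As an example of the frag_size reduction mentioned above: if config.xml contains a stock FRAG2 definition, halve the value and re-run the test, repeating until the test passes or the value gets implausibly small. The 60000 starting value matches the default UDP stacks shipped with the AS/EAP; 30000 is just an illustrative first step:

<!-- original -->
<FRAG2 frag_size="60000" down_thread="false" up_thread="false"/>
<!-- reduced for the test -->
<FRAG2 frag_size="30000" down_thread="false" up_thread="false"/>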
Step 3. If the large state test passes, there could be some deadlock or contention within the AS/EAP code, in which case you need to generate thread dumps from the node that is supposed to serve the state or the farmed file while the transfer should be happening. This gives you an idea of where the code might be getting stuck.
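Two standard ways to capture such a thread dump, assuming <jboss_pid> is the process id of the JVM that should be serving the state (find it with ps or jps):

# Option 1: SIGQUIT makes the JVM print a full thread dump to its console/log
kill -3 <jboss_pid>
# Option 2: jstack, where the JDK ships it, writes the dump wherever you redirect it
jstack <jboss_pid> > threads.txt

Take several dumps a few seconds apart while the transfer is pending; threads stuck in the same place across dumps are the interesting ones.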
Step 4. If none of this works, try using a TCP based JGroups configuration rather than the default UDP one.
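A minimal sketch of what a TCP based <Config> element looks like, assuming two known hosts; the addresses, ports and timeout values here are illustrative, and the stock AS/EAP distributions ship complete TCP example stacks that you should prefer:

<Config>
   <TCP bind_addr="192.168.10.11" start_port="7800" loopback="true"/>
   <TCPPING initial_hosts="192.168.10.11[7800],192.168.10.12[7800]"
            port_range="3" timeout="3500" num_initial_members="2"/>
   <FD timeout="10000" max_tries="5" shun="true"/>
   <VERIFY_SUSPECT timeout="1500"/>
   <pbcast.NAKACK gc_lag="100" retransmit_timeout="300,600,1200,2400,4800"/>
   <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000" max_bytes="400000"/>
   <pbcast.GMS print_local_addr="true" join_timeout="3000" shun="true"/>
   <pbcast.STATE_TRANSFER/>
</Config>

Unlike UDP, TCP performs its own segmentation and retransmission at the transport level, which is why it often sidesteps the fragmentation problems described under Cause 2.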