3 Replies Latest reply on Feb 8, 2008 12:31 PM by brian.stansberry

StateTransferTimeout appears to be ignored

dcolwell Jan 30, 2008 6:05 PM

I have deployed JBoss 4.2.1GA across a two node cluster. It's configured for asynchronous, non-transactional replication of user session data using UDP. Application state is exclusively servlet based with no EJBs. After starting up the first node I fire up the second and have observed it pause from 1 to 10 minutes after logging:

Fetching state (will wait for 30000 milliseconds)

The next line in my server.log file appears successful, just long:

state was retrieved successfully (in 189067 milliseconds)

The 30000 millis is configured by setting a StateTransferTimeout attribute, but different values appear to be equally ignored. Does anyone have thoughts on what may be causing this? The amount of user data potentially active on the first node is rather small (< 100k) if any.

Thanks much!

1. Re: StateTransferTimeout appears to be ignored

brian.stansberry Feb 4, 2008 10:16 PM (in response to dcolwell)

In the future, please post the full log message, it will save time. :-)

The code logging what you posted is not the web session replication service. It's the jboss:service=DefaultPartition mbean from cluster-service.xml.

Please enable DEBUG logging for category org.jboss.ha. You should get more information on what's happening during the delay.

The state transfer timeout param controls how long the service will wait for a state tranfer response from the other node. After receiving the response it still needs to integrate the received state. That can take time, although several minutes certainly seems excessive. The debug logging should help show what the problem is.
Actions
2. Re: StateTransferTimeout appears to be ignored

dcolwell Feb 8, 2008 12:17 PM (in response to dcolwell)

The following is my log4j trace of org.jboss.ha category. Unfortunately it does not appear to contain useful information during the 3+ minute pause after the line "Fetching State". Is there other information in this log that could give a clue or would you have other thoughts on how I should investigate?

Thank you,
Dave

2008-02-08 11:04:59,787 [main] DEBUG [ClusterPartition] Setting JGProps from xml to: UDP(discard_incompatible_packets=true;down_thread=false;enable_bundling=false;ip_ttl=2;loopback=false;max_
bundle_size=64000;max_bundle_timeout=30;mcast_addr=231.1.2.2;mcast_port=47102;mcast_recv_buf_size=25000000;mcast_send_buf_size=640000;tos=8;ucast_recv_buf_size=20000000;ucast_send_buf_size=64
0000;up_thread=false;use_incoming_packet_handler=true;use_outgoing_packet_handler=false):PING(down_thread=false;num_initial_members=3;timeout=2000;up_thread=false):MERGE2(down_thread=false;ma
x_interval=100000;min_interval=20000;up_thread=false):FD_SOCK(down_thread=false;up_thread=false):FD(down_thread=false;max_tries=5;shun=true;timeout=10000;up_thread=false):VERIFY_SUSPECT(down_
thread=false;timeout=1500;up_thread=false):pbcast.NAKACK(discard_delivered_msgs=true;down_thread=false;gc_lag=0;max_xmit_size=60000;retransmit_timeout=300,600,1200,2400,4800;up_thread=false;u
se_mcast_xmit=false):UNICAST(down_thread=false;timeout=300,600,1200,2400,3600;up_thread=false):pbcast.STABLE(desired_avg_gossip=50000;down_thread=false;max_bytes=400000;stability_delay=1000;u
p_thread=false):pbcast.GMS(down_thread=false;join_retry_timeout=2000;join_timeout=3000;print_local_addr=true;shun=true;up_thread=false;view_bundling=true):FRAG2(down_thread=false;frag_size=60
000;up_thread=false):pbcast.STATE_TRANSFER(down_thread=false;up_thread=false;use_flush=false)
2008-02-08 11:05:00,014 [main] DEBUG [ClusterPartition] Creating jboss:service=MSPTEST_emtrack
2008-02-08 11:05:00,014 [main] DEBUG [ClusterPartition] Creating JGroups JChannel
2008-02-08 11:05:00,279 [main] DEBUG [ClusterPartition] Creating HAPartition
2008-02-08 11:05:00,308 [main] DEBUG [ClusterPartition] Initing HAPartition: org.jboss.ha.framework.server.HAPartitionImpl@18837f1
2008-02-08 11:05:00,308 [main] INFO [MSPTEST_emtrack] Initializing
2008-02-08 11:05:00,309 [main] DEBUG [MSPTEST_emtrack] setMembershipListener
2008-02-08 11:05:00,309 [main] DEBUG [MSPTEST_emtrack] setMessageListener
2008-02-08 11:05:00,309 [main] DEBUG [MSPTEST_emtrack] create replicant manager
2008-02-08 11:05:00,337 [main] DEBUG [MSPTEST_emtrack] init replicant manager
2008-02-08 11:05:00,337 [main] DEBUG [MSPTEST_emtrack] registerRPCHandler
2008-02-08 11:05:00,337 [main] DEBUG [MSPTEST_emtrack] subscribeToStateTransferEvents
2008-02-08 11:05:00,337 [main] DEBUG [MSPTEST_emtrack] registerMembershipListener
2008-02-08 11:05:00,342 [main] DEBUG [MSPTEST_emtrack] bind replicant manager
2008-02-08 11:05:00,342 [main] DEBUG [MSPTEST_emtrack] create distributed state
2008-02-08 11:05:00,354 [main] DEBUG [MSPTEST_emtrack] init distributed state service
2008-02-08 11:05:00,359 [main] DEBUG [MSPTEST_emtrack] bind distributed state service
2008-02-08 11:05:00,366 [main] DEBUG [MSPTEST_emtrack] done initing.
2008-02-08 11:05:00,366 [main] DEBUG [ClusterPartition] HAPartition initialized
2008-02-08 11:05:00,366 [main] DEBUG [ClusterPartition] Created jboss:service=MSPTEST_emtrack
2008-02-08 11:05:00,367 [main] DEBUG [HASessionStateService] Creating jboss:service=HASessionState
2008-02-08 11:05:00,385 [main] DEBUG [HASessionStateService] Created jboss:service=HASessionState
2008-02-08 11:05:00,386 [main] DEBUG [HANamingService] Creating jboss:service=HAJNDI
2008-02-08 11:05:00,386 [main] DEBUG [HANamingService] Initializing HAJNDI server on partition: MSPTEST_emtrack
2008-02-08 11:05:00,397 [main] DEBUG [HANamingService] initialize HAJNDI
2008-02-08 11:05:00,397 [main] DEBUG [HAJNDI] subscribeToStateTransferEvents
2008-02-08 11:05:00,397 [main] DEBUG [TreeHead] registerRPCHandler
2008-02-08 11:05:00,398 [main] DEBUG [HANamingService] Created jboss:service=HAJNDI
2008-02-08 11:05:00,401 [main] DEBUG [ClusterPartition] Starting jboss:service=MSPTEST_emtrack
2008-02-08 11:05:00,404 [main] DEBUG [ClusterPartition] Starting ClusterPartition: MSPTEST_emtrack
2008-02-08 11:05:02,490 [main] DEBUG [MSPTEST_emtrack] ViewAccepted: initial members set
2008-02-08 11:05:02,494 [main] DEBUG [ClusterPartition] Starting channel
2008-02-08 11:05:02,495 [main] DEBUG [MSPTEST_emtrack] get nodeName
2008-02-08 11:05:02,495 [main] DEBUG [MSPTEST_emtrack] Get current members
2008-02-08 11:05:02,496 [main] INFO [MSPTEST_emtrack] Number of cluster members: 2
2008-02-08 11:05:02,496 [main] INFO [MSPTEST_emtrack] Other members: 1
2008-02-08 11:05:02,496 [main] INFO [MSPTEST_emtrack] Fetching state (will wait for 30000 milliseconds):
2008-02-08 11:05:02,502 [IncomingPacketHandler (channel=MSPTEST_emtrack)] DEBUG [MSPTEST_emtrack] setState called
2008-02-08 11:08:11,545 [IncomingPacketHandler (channel=MSPTEST_emtrack)] DEBUG [MSPTEST_emtrack] setState for DistributedReplicantManager
2008-02-08 11:08:11,549 [IncomingPacketHandler (channel=MSPTEST_emtrack)] DEBUG [MSPTEST_emtrack] setState for HASessionStateTransfer
2008-02-08 11:08:11,549 [DRM Async Publisher#0] DEBUG [MSPTEST_emtrack] DRM: Sleeping before re-publishing for 50ms just in case
2008-02-08 11:08:11,550 [IncomingPacketHandler (channel=MSPTEST_emtrack)] DEBUG [/HASessionState/Default] Receiving state of HASessionState
2008-02-08 11:08:11,551 [IncomingPacketHandler (channel=MSPTEST_emtrack)] DEBUG [MSPTEST_emtrack] setState for HAJNDI
2008-02-08 11:08:11,551 [IncomingPacketHandler (channel=MSPTEST_emtrack)] DEBUG [TreeHead] registerRPCHandler
2008-02-08 11:08:11,551 [IncomingPacketHandler (channel=MSPTEST_emtrack)] DEBUG [MSPTEST_emtrack] setState for DistributedState
2008-02-08 11:08:11,552 [IncomingPacketHandler (channel=MSPTEST_emtrack)] DEBUG [MSPTEST_emtrack] received a state of 1466 bytes; expanded memory by 4915968 bytes (used memory before: 40976664, used memory after: 45892632)
2008-02-08 11:08:11,552 [main] INFO [MSPTEST_emtrack] state was retrieved successfully (in 189056 milliseconds)
Actions
3. Re: StateTransferTimeout appears to be ignored

brian.stansberry Feb 8, 2008 12:31 PM (in response to dcolwell)
Here's the code that's being executed during the logging gap:

log.debug("setState called"); if (obj == null) { log.debug("state is null"); return; } long used_mem_before, used_mem_after; int state_size=obj != null? obj.length : 0; Runtime rt=Runtime.getRuntime(); used_mem_before=rt.totalMemory() - rt.freeMemory(); HashMap state = (HashMap)objectFromByteBuffer(obj); java.util.Iterator keys = state.keySet().iterator(); while (keys.hasNext()) { String key = (String)keys.next(); log.debug("setState for " + key);

Here's the objectFromByteBuffer() code:

public static Object objectFromByteBuffer (byte[] buffer) throws Exception { if(buffer == null) return null; ByteArrayInputStream bais = new ByteArrayInputStream(buffer); MarshalledValueInputStream mvis = new MarshalledValueInputStream(bais); return mvis.readObject(); }

This is all in the org.jboss.ha.framework.server.HAPartitionImpl class, found in the cluster module of the AS source.

It's possible there is some issue with the MarshalledValueInputStream.readObject() call although I doubt it. In this case it's basically just an ordinary java.io.ObjectInputStream.readObject() call. And as you can see from the final log message, the byte[] it's deserializing is only 1466 bytes; hardly anything. If that were having issues I expect I'd hear more about it from others.

Perhaps something the memory calculation business? That at least is somewhat environment-specific. Just a wild guess.
Actions

Go to original post