AS 6.1 - HTTP Replication on Startup Not Working
brandonfuller Mar 22, 2012 11:25 PMSummary
I have 1 server running. User is accessing the server just fine. I bring up another server and add it to the cluster. Via the load balancer, I send the user's next request to the new server. The user receives an error because their session is invalid (not found) on that server. I had assumed bringing a new server up and joining the cluster would transfer all the state to the new server so it could take requests.
Components
+ JBoss AS 6.1 servers
+ Upgraded to JGroups 2.12.2.FINAL (to get JDBC_PING)
+ Running in Amazon EC2
+ Using TCP stack (not UDP issues)
Description
If I bring the 2 servers up and then let the user login, I see the requests load balanced in the cluster as I would expect. I get no session errors of any kind. Basically, the cluster operates just fine in that mode so I believe everythign is configured properly.
The problem is specifically about when I bring a new server into the mix. The new server doesn't seem to have the state from the other servers. I am allowing that server to come up and giving it minutes before I try to access it. I was allowing time for the state transfer to occur. I believe it is according to the log. However, when the user makes that next request that goes to this server it does not work. It seems like the state transfer didn't happen correctly.
Here is a log snippet from the 1st server (existing) for when the 2nd server comes online and asks it for a state transfer:
{code} 2012-03-23 00:05:25,881 DEBUG [org.jgroups.protocols.pbcast.GMS] (ViewHandler,DefaultPartition-HAPartition,ip-10-32-45-187.ec2.internal:1099) new=[ip-10-62-83-172.ec2.internal:1099], suspected=[], leaving=[], new view: [ip-10-32-45-187.ec2.internal:1099|1] [ip-10-32-45-187.ec2.internal:1099, ip-10-62-83-172.ec2.internal:1099] 2012-03-23 00:05:25,894 DEBUG [org.jboss.ha.framework.server.ClusterPartition.DefaultPartition] (Incoming-1,null) Block processed at ip-10-32-45-187.ec2.internal:1099 2012-03-23 00:05:25,900 DEBUG [org.jgroups.protocols.pbcast.FLUSH] (ViewHandler,DefaultPartition-HAPartition,ip-10-32-45-187.ec2.internal:1099) ip-10-32-45-187.ec2.internal:1099: flush coordinator is starting FLUSH with participants [ip-10-32-45-187.ec2.internal:1099] 2012-03-23 00:05:25,909 DEBUG [org.jgroups.protocols.pbcast.FLUSH] (Incoming-1,null) ip-10-32-45-187.ec2.internal:1099: received START_FLUSH, responded with FLUSH_COMPLETED to ip-10-32-45-187.ec2.internal:1099 2012-03-23 00:05:25,914 DEBUG [org.jgroups.protocols.pbcast.FLUSH] (Incoming-3,null) ip-10-32-45-187.ec2.internal:1099: FLUSH_COMPLETED from ip-10-32-45-187.ec2.internal:1099, completed true, flushMembers [ip-10-32-45-187.ec2.internal:1099], flushCompleted [ip-10-32-45-187.ec2.internal:1099] 2012-03-23 00:05:25,915 DEBUG [org.jgroups.protocols.pbcast.STABLE] (ViewHandler,DefaultPartition-HAPartition,ip-10-32-45-187.ec2.internal:1099) suspending message garbage collection 2012-03-23 00:05:25,919 DEBUG [org.jgroups.protocols.pbcast.FLUSH] (Incoming-3,null) ip-10-32-45-187.ec2.internal:1099: all FLUSH_COMPLETED received 2012-03-23 00:05:25,920 DEBUG [org.jgroups.protocols.pbcast.STABLE] (ViewHandler,DefaultPartition-HAPartition,ip-10-32-45-187.ec2.internal:1099) resume task started, max_suspend_time=33000 2012-03-23 00:05:25,926 DEBUG [org.jgroups.protocols.pbcast.NAKACK] (Incoming-5,null) [setDigest()] existing digest: ip-10-32-45-187.ec2.internal:1099: [241 : 246 (246)] new digest: ip-10-32-45-187.ec2.internal:1099: [241 : 245 (245)], ip-10-62-83-172.ec2.internal:1099: [0 : 0] resulting digest: ip-10-32-45-187.ec2.internal:1099: [241 : 246 (246)], ip-10-62-83-172.ec2.internal:1099: [0 : 0 (0)] 2012-03-23 00:05:25,926 DEBUG [org.jgroups.protocols.pbcast.GMS] (Incoming-5,null) ip-10-32-45-187.ec2.internal:1099: view is [ip-10-32-45-187.ec2.internal:1099|1] [ip-10-32-45-187.ec2.internal:1099, ip-10-62-83-172.ec2.internal:1099] 2012-03-23 00:05:25,930 DEBUG [org.jgroups.protocols.FD_SOCK] (Incoming-5,null) VIEW_CHANGE received: [ip-10-32-45-187.ec2.internal:1099, ip-10-62-83-172.ec2.internal:1099] 2012-03-23 00:05:25,937 DEBUG [org.jgroups.protocols.FD_SOCK] (FD_SOCK pinger,DefaultPartition-HAPartition,ip-10-32-45-187.ec2.internal:1099) ping_dest is ip-10-62-83-172.ec2.internal:1099, pingable_mbrs=[ip-10-32-45-187.ec2.internal:1099, ip-10-62-83-172.ec2.internal:1099] 2012-03-23 00:05:25,946 DEBUG [org.jgroups.protocols.pbcast.STABLE] (Incoming-5,null) [ergonomics] setting max_bytes to 800KB (2 members) 2012-03-23 00:05:25,947 INFO [org.jboss.ha.framework.server.ClusterPartition.lifecycle.DefaultPartition] (Incoming-5,null) New cluster view for partition DefaultPartition (id: 1, delta: 1, merge: false) : [ip-10-32-45-187.ec2.internal:1099, ip-10-62-83-172.ec2.internal:1099] 2012-03-23 00:05:25,948 DEBUG [org.jboss.ha.framework.server.ClusterPartition.DefaultPartition] (Incoming-5,null) dead members: [] 2012-03-23 00:05:25,948 DEBUG [org.jboss.ha.framework.server.ClusterPartition.DefaultPartition] (Incoming-5,null) membership changed from 1 to 2 2012-03-23 00:05:25,949 DEBUG [org.jboss.ha.framework.server.ClusterPartition.DefaultPartition] (AsynchViewChangeHandler Thread) Begin notifyListeners, viewID: 1 2012-03-23 00:05:25,950 INFO [org.jboss.ha.core.framework.server.DistributedReplicantManagerImpl.DefaultPartition] (AsynchViewChangeHandler Thread) I am (ip-10-32-45-187.ec2.internal:1099) received membershipChanged event: 2012-03-23 00:05:25,950 INFO [org.jboss.ha.core.framework.server.DistributedReplicantManagerImpl.DefaultPartition] (AsynchViewChangeHandler Thread) Dead members: 0 ([]) 2012-03-23 00:05:25,951 INFO [org.jboss.ha.core.framework.server.DistributedReplicantManagerImpl.DefaultPartition] (AsynchViewChangeHandler Thread) New Members : 1 ([ip-10-62-83-172.ec2.internal:1099]) 2012-03-23 00:05:25,952 INFO [org.jboss.ha.core.framework.server.DistributedReplicantManagerImpl.DefaultPartition] (AsynchViewChangeHandler Thread) All Members : 2 ([ip-10-32-45-187.ec2.internal:1099, ip-10-62-83-172.ec2.internal:1099]) 2012-03-23 00:05:25,952 DEBUG [org.jboss.ha.framework.server.ClusterPartition.DefaultPartition] (AsynchViewChangeHandler Thread) End notifyListeners, viewID: 1 2012-03-23 00:05:25,953 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-5,null) Received new cluster view: [ip-10-32-45-187.ec2.internal:1099|1] [ip-10-32-45-187.ec2.internal:1099, ip-10-62-83-172.ec2.internal:1099] 2012-03-23 00:05:25,956 DEBUG [org.jgroups.protocols.pbcast.FLUSH] (Incoming-5,null) ip-10-32-45-187.ec2.internal:1099: installing view [ip-10-32-45-187.ec2.internal:1099|1] [ip-10-32-45-187.ec2.internal:1099, ip-10-62-83-172.ec2.internal:1099] 2012-03-23 00:05:26,054 DEBUG [org.jgroups.protocols.pbcast.STABLE] (ViewHandler,DefaultPartition-HAPartition,ip-10-32-45-187.ec2.internal:1099) resuming message garbage collection 2012-03-23 00:05:26,057 DEBUG [org.jgroups.protocols.pbcast.GMS] (ViewHandler,DefaultPartition-HAPartition,ip-10-32-45-187.ec2.internal:1099) ip-10-32-45-187.ec2.internal:1099: sending RESUME event 2012-03-23 00:05:26,057 DEBUG [org.jgroups.protocols.pbcast.FLUSH] (ViewHandler,DefaultPartition-HAPartition,ip-10-32-45-187.ec2.internal:1099) ip-10-32-45-187.ec2.internal:1099: received RESUME, sending STOP_FLUSH to all 2012-03-23 00:05:26,062 DEBUG [org.jgroups.protocols.pbcast.FLUSH] (Incoming-9,null) ip-10-32-45-187.ec2.internal:1099: received STOP_FLUSH, unblocking FLUSH.down() and sending UNBLOCK up 2012-03-23 00:05:26,063 DEBUG [org.jboss.ha.framework.server.ClusterPartition.DefaultPartition] (Incoming-9,null) Unblock processed at ip-10-32-45-187.ec2.internal:1099 2012-03-23 00:05:26,237 DEBUG [org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER] (Incoming-11,null) Responding to state requester ip-10-62-83-172.ec2.internal:1099 with address 10.32.45.187:35530 and digest null 2012-03-23 00:05:26,239 DEBUG [org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER] (STREAMING_STATE_TRANSFER server socket acceptor,DefaultPartition-HAPartition,ip-10-32-45-187.ec2.internal:1099) StateProviderThreadSpawner listening at 10.32.45.187:35530... 2012-03-23 00:05:26,248 DEBUG [org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER] (STREAMING_STATE_TRANSFER server socket acceptor,DefaultPartition-HAPartition,ip-10-32-45-187.ec2.internal:1099) StateProviderThreadSpawner listening at 10.32.45.187:35530... 2012-03-23 00:05:26,251 DEBUG [org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER] (STREAMING_STATE_TRANSFER-sender-1,DefaultPartition-HAPartition,ip-10-32-45-187.ec2.internal:1099) Accepted request for state transfer from /10.62.83.172:55185 handing of to PooledExecutor thread 2012-03-23 00:05:26,257 DEBUG [org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER] (STREAMING_STATE_TRANSFER-sender-1,DefaultPartition-HAPartition,ip-10-32-45-187.ec2.internal:1099) Running on Thread[STREAMING_STATE_TRANSFER-sender-1,DefaultPartition-HAPartition,ip-10-32-45-187.ec2.internal:1099,5,JGroups]. Accepted request for state transfer from /10.62.83.172:55185, original buffer size was 8192 and was reset to 8192, passing outputstream up... 2012-03-23 00:05:26,261 DEBUG [org.infinispan.statetransfer.StateTransferManagerImpl] (STREAMING_STATE_TRANSFER-sender-1,DefaultPartition-HAPartition,ip-10-32-45-187.ec2.internal:1099) Generating state. Can provide? true 2012-03-23 00:05:26,374 DEBUG [org.infinispan.statetransfer.StateTransferManagerImpl] (STREAMING_STATE_TRANSFER-sender-1,DefaultPartition-HAPartition,ip-10-32-45-187.ec2.internal:1099) Writing 1 StoredEntries to stream 2012-03-23 00:05:26,833 DEBUG [org.infinispan.statetransfer.StateTransferManagerImpl] (STREAMING_STATE_TRANSFER-sender-1,DefaultPartition-HAPartition,ip-10-32-45-187.ec2.internal:1099) State generated, closing object stream 2012-03-23 00:05:26,834 DEBUG [org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER] (STREAMING_STATE_TRANSFER-sender-1,DefaultPartition-HAPartition,ip-10-32-45-187.ec2.internal:1099) State writer is closing the socket 2012-03-23 00:05:26,969 DEBUG [org.jboss.ha.framework.server.ClusterPartition.DefaultPartition] (Incoming-15,null) Block processed at ip-10-32-45-187.ec2.internal:1099 2012-03-23 00:05:26,970 DEBUG [org.jgroups.protocols.pbcast.FLUSH] (Incoming-15,null) ip-10-32-45-187.ec2.internal:1099: received START_FLUSH, responded with FLUSH_COMPLETED to ip-10-62-83-172.ec2.internal:1099 2012-03-23 00:05:26,973 DEBUG [org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER] (Incoming-17,null) Responding to state requester ip-10-62-83-172.ec2.internal:1099 with address 10.32.45.187:35530 and digest null 2012-03-23 00:05:26,975 DEBUG [org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER] (STREAMING_STATE_TRANSFER server socket acceptor,DefaultPartition-HAPartition,ip-10-32-45-187.ec2.internal:1099) StateProviderThreadSpawner listening at 10.32.45.187:35530... 2012-03-23 00:05:26,975 DEBUG [org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER] (STREAMING_STATE_TRANSFER-sender-1,DefaultPartition-HAPartition,ip-10-32-45-187.ec2.internal:1099) Accepted request for state transfer from /10.62.83.172:32951 handing of to PooledExecutor thread 2012-03-23 00:05:26,976 DEBUG [org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER] (STREAMING_STATE_TRANSFER-sender-1,DefaultPartition-HAPartition,ip-10-32-45-187.ec2.internal:1099) Running on Thread[STREAMING_STATE_TRANSFER-sender-1,DefaultPartition-HAPartition,ip-10-32-45-187.ec2.internal:1099,5,JGroups]. Accepted request for state transfer from /10.62.83.172:32951, original buffer size was 8192 and was reset to 8192, passing outputstream up... 2012-03-23 00:05:26,978 DEBUG [org.jboss.ha.framework.server.ClusterPartition.DefaultPartition] (STREAMING_STATE_TRANSFER-sender-1,DefaultPartition-HAPartition,ip-10-32-45-187.ec2.internal:1099) getState called for service DistributedReplicantManager 2012-03-23 00:05:27,000 DEBUG [org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER] (STREAMING_STATE_TRANSFER-sender-1,DefaultPartition-HAPartition,ip-10-32-45-187.ec2.internal:1099) State writer is closing the socket 2012-03-23 00:05:27,156 DEBUG [org.jgroups.protocols.pbcast.FLUSH] (Incoming-19,null) ip-10-32-45-187.ec2.internal:1099: received STOP_FLUSH, unblocking FLUSH.down() and sending UNBLOCK up 2012-03-23 00:05:27,157 DEBUG [org.jboss.ha.framework.server.ClusterPartition.DefaultPartition] (Incoming-19,null) Unblock processed at ip-10-32-45-187.ec2.internal:1099 2012-03-23 00:05:27,254 DEBUG [org.jboss.ha.singleton.HASingletonSupport$HASingletonService] (AsynchKeyChangeHandler Thread) election result =true, electionPolicy=DistributedReplicantManager 2012-03-23 00:05:27,255 DEBUG [org.jboss.ha.singleton.HASingletonSupport$HASingletonService] (AsynchKeyChangeHandler Thread) partitionTopologyChanged, isElectedNewMaster=true, isMasterNode=true, viewID=356682581, partition=DefaultPartition 2012-03-23 00:05:27,344 DEBUG [org.jboss.ha.framework.server.HARMIServerImpl$RefreshProxiesHATarget] (AsynchKeyChangeHandler Thread) replicantsChanged 'HAJNDI' to 2 (intra-view id: 356682581) 2012-03-23 00:05:27,577 DEBUG [org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER] (Incoming-5,null) Responding to state requester ip-10-62-83-172.ec2.internal:1099 with address 10.32.45.187:35530 and digest null 2012-03-23 00:05:27,579 DEBUG [org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER] (STREAMING_STATE_TRANSFER server socket acceptor,DefaultPartition-HAPartition,ip-10-32-45-187.ec2.internal:1099) StateProviderThreadSpawner listening at 10.32.45.187:35530... 2012-03-23 00:05:27,579 DEBUG [org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER] (STREAMING_STATE_TRANSFER-sender-1,DefaultPartition-HAPartition,ip-10-32-45-187.ec2.internal:1099) Accepted request for state transfer from /10.62.83.172:39044 handing of to PooledExecutor thread 2012-03-23 00:05:27,581 DEBUG [org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER] (STREAMING_STATE_TRANSFER-sender-1,DefaultPartition-HAPartition,ip-10-32-45-187.ec2.internal:1099) Running on Thread[STREAMING_STATE_TRANSFER-sender-1,DefaultPartition-HAPartition,ip-10-32-45-187.ec2.internal:1099,5,JGroups]. Accepted request for state transfer from /10.62.83.172:39044, original buffer size was 8192 and was reset to 8192, passing outputstream up... 2012-03-23 00:05:27,582 DEBUG [org.infinispan.statetransfer.StateTransferManagerImpl] (STREAMING_STATE_TRANSFER-sender-1,DefaultPartition-HAPartition,ip-10-32-45-187.ec2.internal:1099) Generating state. Can provide? true 2012-03-23 00:05:27,582 DEBUG [org.infinispan.statetransfer.StateTransferManagerImpl] (STREAMING_STATE_TRANSFER-sender-1,DefaultPartition-HAPartition,ip-10-32-45-187.ec2.internal:1099) Writing 2 StoredEntries to stream 2012-03-23 00:05:27,589 DEBUG [org.infinispan.statetransfer.StateTransferManagerImpl] (STREAMING_STATE_TRANSFER-sender-1,DefaultPartition-HAPartition,ip-10-32-45-187.ec2.internal:1099) State generated, closing object stream 2012-03-23 00:05:27,596 DEBUG [org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER] (STREAMING_STATE_TRANSFER-sender-1,DefaultPartition-HAPartition,ip-10-32-45-187.ec2.internal:1099) State writer is closing the socket 2012-03-23 00:05:27,844 DEBUG [org.jboss.ha.singleton.HASingletonElectionPolicySimple] (AsynchKeyChangeHandler Thread) election with 2 nodes and position=0, will choose node[0] picked node ip=ip-10-32-45-187.ec2.internal/10.32.45.187 port=60062 2012-03-23 00:05:27,845 DEBUG [org.jboss.modcluster.ha.HAModClusterService] (AsynchKeyChangeHandler Thread) election result =true, electionPolicy=org.jboss.ha.singleton.HASingletonElectionPolicySimple 2012-03-23 00:05:27,845 DEBUG [org.jboss.modcluster.ha.HAModClusterService] (AsynchKeyChangeHandler Thread) partitionTopologyChanged, isElectedNewMaster=true, isMasterNode=true, viewID=356682581, partition=DefaultPartition 2012-03-23 00:05:27,859 DEBUG [org.jboss.cache.invalidation.bridges.JGCacheInvalidationBridge] (AsynchKeyChangeHandler Thread) The list of replicants for the JG bridge has changed, computing and updating local info... 2012-03-23 00:05:31,946 DEBUG [org.jgroups.protocols.FD] (Timer-3,) sending are-you-alive msg to ip-10-62-83-172.ec2.internal:1099 (own address=ip {code}
Here is my jgroups-channelfactory-stacks.xml contents (with JDBC creds removed):
{code:xml}<?xml version="1.0" encoding="UTF-8"?>
<protocol_stacks>
<stack name="tcp"
description="TCP based stack, with flow control and message bundling.
TCP stacks are usually used when IP multicasting cannot
be used in a network, e.g. because it is disabled (e.g.
routers discard multicast)">
<config>
<TCP
singleton_name="tcp"
start_port="${jboss.jgroups.tcp.tcp_port:7600}"
tcp_nodelay="true"
loopback="true"
recv_buf_size="20000000"
send_buf_size="640000"
discard_incompatible_packets="true"
max_bundle_size="64000"
max_bundle_timeout="30"
use_incoming_packet_handler="true"
enable_bundling="false"
use_send_queues="false"
sock_conn_timeout="300"
skip_suspected_members="true"
timer.num_threads="12"
enable_diagnostics="${jboss.jgroups.enable_diagnostics:true}"
diagnostics_addr="${jboss.jgroups.diagnostics_addr:224.0.75.75}"
diagnostics_port="${jboss.jgroups.diagnostics_port:7500}"
thread_pool.enabled="true"
thread_pool.min_threads="20"
thread_pool.max_threads="200"
thread_pool.keep_alive_time="5000"
thread_pool.queue_enabled="true"
thread_pool.queue_max_size="1000"
thread_pool.rejection_policy="discard"
oob_thread_pool.enabled="true"
oob_thread_pool.min_threads="20"
oob_thread_pool.max_threads="200"
oob_thread_pool.keep_alive_time="1000"
oob_thread_pool.queue_enabled="false"
oob_thread_pool.rejection_policy="discard"/>
<JDBC_PING connection_driver="com.mysql.jdbc.Driver" connection_username="username" connection_password="password" connection_url="jdbc:mysql://server:3306/db" />
<MERGE2 max_interval="100000" min_interval="20000"/>
<FD_SOCK start_port="${jboss.jgroups.tcp.fd_sock_port:57600}"/>
<FD timeout="6000" max_tries="5"/>
<VERIFY_SUSPECT timeout="1500"/>
<BARRIER/>
<pbcast.NAKACK use_mcast_xmit="false" gc_lag="0"
retransmit_timeout="300,600,1200,2400,4800"
discard_delivered_msgs="true"/>
<UNICAST timeout="300,600,1200,2400,3600"/>
<pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"
max_bytes="400000"/>
<VIEW_SYNC avg_send_interval="10000"/>
<pbcast.GMS print_local_addr="true" join_timeout="3000"
view_bundling="true"
view_ack_collection_timeout="5000"
resume_task_timeout="7500"/>
<UFC max_credits="2000000" ignore_synchronous_response="true"/>
<MFC max_credits="2000000" ignore_synchronous_response="true"/>
<FRAG2 frag_size="60000"/>
<pbcast.STREAMING_STATE_TRANSFER/>
<pbcast.FLUSH timeout="0" start_flush_timeout="10000"/>
</config>
</stack>
</protocol_stacks>
{code}
Anyone have an ideas on what I missing?
It seems to be related to https://issues.jboss.org/browse/ISPN-835