JBoss 4.2.3.GA - Issue when a node is removed from the LAN
eduardo_thp Dec 8, 2010 12:00 PM
Hello,
I'm using JBoss 4.2.3.GA in a TCP clustered environment (the issue I'm describing has been seen on a cluster with 2 nodes and on a cluster with 4 nodes).
I have the following scenario:
- no loadbalancers
- a cluster of two or four JBoss servers (shouldn't matter, the issue was seen in both environments)
- a web application that has a pushlet connection for pushing events from the server to the browser
- On the web server side of the application I have a thread running for monitoring the state of the pushlet connection
- On the client side I'm also monitoring the state of the pushlet connection (JavaScript)
* If the client side (browser) detects that the pushlet connection has been lost, it tries to reconnect to the same server; if it can't, it connects to another server in the cluster (same domain, so no browser security issues)
* If the server side (thread) detects that the pushlet connection has been lost, it starts a timeout; if the connection hasn't been re-established when the timeout expires, the thread invalidates the user session.
Notice:
(my code also stores in the session an attribute that identifies the server the pushlet is currently connected to)
(Application failover works without problems when the failover is caused by a server or JBoss shutdown/restart)
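The server-side check described above can be pictured roughly as follows. This is only a minimal sketch under assumptions: the actual application code is not shown in this post, and the class name, the `SessionHandle` interface, and the attribute name `pushletServer` are all hypothetical stand-ins.

```java
/** Hypothetical sketch of the server-side pushlet monitor; not the real application code. */
class PushletMonitorSketch {

    /** Stand-in for the HTTP session: only the operations this sketch needs. */
    interface SessionHandle {
        Object getAttribute(String name);
        void invalidate();
    }

    /** Hypothetical name of the attribute recording which server holds the pushlet. */
    static final String SERVER_ATTR = "pushletServer";

    private final SessionHandle session;
    private final long timeoutMillis;
    private long lostSinceMillis = -1;   // -1 means the connection is currently up
    private boolean invalidated = false;

    PushletMonitorSketch(SessionHandle session, long timeoutMillis) {
        this.session = session;
        this.timeoutMillis = timeoutMillis;
    }

    /** Server that the session attribute currently points at, as seen by this node. */
    String connectedServer() {
        return (String) session.getAttribute(SERVER_ATTR);
    }

    /**
     * Called periodically by the monitoring thread.
     * Returns true only on the call that invalidates the session.
     */
    boolean check(boolean connectionUp, long nowMillis) {
        if (invalidated) {
            return false;                 // nothing left to do
        }
        if (connectionUp) {
            lostSinceMillis = -1;         // connection is fine, reset the timer
            return false;
        }
        if (lostSinceMillis < 0) {
            lostSinceMillis = nowMillis;  // first time the loss is noticed
            return false;
        }
        if (nowMillis - lostSinceMillis >= timeoutMillis) {
            session.invalidate();         // not re-established in time
            invalidated = true;
            return true;
        }
        return false;
    }
}
```

Note that a check like this only knows about the pushlet connection on its own node. If the browser has already reconnected to a different server, the original node's timer still runs out unless something tells it otherwise.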
--------------------
The issue:
- Server A is up and Server B is up
- The user has opened the browser and connected to Server A (pushlet is up, my monitoring thread starts running)
- Server A has its LAN cable disconnected from the network
- Browser code detects the failure, starts trying a reconnection, reconnects to Server B (Failover Successful)
- User uses the app without problems
- After about 5 minutes Server A has its LAN cable reconnected to the network
***** now the problem starts *****
There seem to be no merge issues:
ServerALog:
2010-12-07 21:54:17,792 INFO [org.jboss.cache.TreeCache] viewAccepted(): MergeView::[161.134.28.20:7810|3] [161.134.28.20:7810, 161.134.28.21:7810], subgroups=[[161.134.28.20:7810|2] [161.134.28.20:7810], [161.134.28.21:7810|2] [161.134.28.21:7810]]
ServerBLog:
2010-12-07 21:54:17,819 INFO [org.jboss.cache.TreeCache] viewAccepted(): MergeView::[161.134.28.20:7810|3] [161.134.28.20:7810, 161.134.28.21:7810], subgroups=[[161.134.28.20:7810|2] [161.134.28.20:7810], [161.134.28.21:7810|2] [161.134.28.21:7810]]
FD_SOCK SUSPECT messages on both servers:
ServerALog:
2010-12-07 21:54:38,178 WARN [org.jgroups.protocols.FD_SOCK] I was suspected by 161.134.28.21:7810; ignoring the SUSPECT message
ServerBLog:
2010-12-07 21:54:38,031 WARN [org.jgroups.protocols.FD_SOCK] I was suspected by 161.134.28.20:7810; ignoring the SUSPECT message
My monitor retrieves different values on each server for the attribute that is stored in the session:
ServerA:
... SessionMinder] [] [] **** SESSION SERVER: 161.134.28.20
ServerB:
... SessionMinder] [] [] **** SESSION SERVER: 161.134.28.21
The monitoring thread on Server A invalidates the session after the timeout,
and on Server B I see the following messages:
2010-12-07 21:56:08,167 INFO [org.jboss.web.tomcat.service.session.CacheListener] Possible concurrency problem: Replicated version id 50 matches in-memory version for session 8K3xQotPH-OjVp91acqZRw**
2010-12-07 21:56:08,167 DEBUG [org.jboss.web.tomcat.service.session.ClusteredSession] The session has expired with id: 8K3xQotPH-OjVp91acqZRw** -- is it local? true
2010-12-07 21:56:08,167 DEBUG [org.jboss.cache.TreeCache] Performing a real remove for node /JSESSION/localhost/HISWebUI/8K3xQotPH-OjVp91acqZRw**, marked for removal.
The user is then redirected to the logon page.
My cluster configuration is the default that ships with JBoss; the only change I made was to use TCP instead of UDP:
...
<attribute name="CacheMode">REPL_ASYNC</attribute>
<attribute name="UseRegionBasedMarshalling">false</attribute>
...
...
<config>
  <TCP bind_addr="${jboss.bind.address}" start_port="7810" loopback="true"
       tcp_nodelay="true"
       recv_buf_size="20000000"
       send_buf_size="640000"
       discard_incompatible_packets="true"
       enable_bundling="true"
       max_bundle_size="64000"
       max_bundle_timeout="30"
       use_incoming_packet_handler="true"
       use_outgoing_packet_handler="false"
       down_thread="false" up_thread="false"
       use_send_queues="false"
       sock_conn_timeout="300"
       skip_suspected_members="true"/>
  <TCPPING initial_hosts="${jboss.bind.address}[7810]${jboss.cluster.members}" port_range="3"
           timeout="3000"
           down_thread="false" up_thread="false"
           num_initial_members="3"/>
  <MERGE2 max_interval="100000"
          down_thread="false" up_thread="false" min_interval="20000"/>
  <FD_SOCK down_thread="false" up_thread="false"/>
  <FD timeout="10000" max_tries="5" down_thread="false" up_thread="false" shun="true"/>
  <VERIFY_SUSPECT timeout="1500" down_thread="false" up_thread="false"/>
  <pbcast.NAKACK max_xmit_size="60000"
                 use_mcast_xmit="false" gc_lag="0"
                 retransmit_timeout="300,600,1200,2400,4800"
                 down_thread="false" up_thread="false"
                 discard_delivered_msgs="true"/>
  <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"
                 down_thread="false" up_thread="false"
                 max_bytes="400000"/>
  <pbcast.GMS print_local_addr="true" join_timeout="3000"
              down_thread="false" up_thread="false"
              join_retry_timeout="2000" shun="true"
              view_bundling="true"/>
  <FC max_credits="2000000" down_thread="false" up_thread="false"
      min_threshold="0.10"/>
  <FRAG2 frag_size="60000" down_thread="false" up_thread="false"/>
  <pbcast.STATE_TRANSFER down_thread="false" up_thread="false" use_flush="false"/>
</config>
....
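For reference when reading the timestamps in the logs, the failure-detection values in this stack can be turned into a rough detection delay. This is a back-of-the-envelope sketch only, assuming that FD suspects a member after `max_tries` unanswered heartbeats spaced `timeout` ms apart and that VERIFY_SUSPECT then waits its own `timeout` before the suspicion is confirmed. FD_SOCK may react sooner or later than this, and the class and method names are made up for illustration.

```java
/** Rough timing estimate from the FD and VERIFY_SUSPECT settings; names are hypothetical. */
class FailureDetectionTiming {

    /**
     * Approximate time in milliseconds before an unreachable member is
     * confirmed as suspected, under the assumptions stated above.
     */
    static long approxSuspectMillis(long fdTimeoutMillis, int fdMaxTries,
                                    long verifySuspectTimeoutMillis) {
        // max_tries missed heartbeats, each waited for fdTimeoutMillis,
        // followed by one VERIFY_SUSPECT waiting period
        return fdTimeoutMillis * fdMaxTries + verifySuspectTimeoutMillis;
    }
}
```

With the values from the configuration above, `FailureDetectionTiming.approxSuspectMillis(10000, 5, 1500)` gives 51500 ms, i.e. roughly 51.5 seconds. That is well under the 5 minutes the cable was unplugged, which is consistent with each server sitting in its own single-member view before the MergeView appears in the logs.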
Any idea what could be happening, or how I can obtain more information on what's going on?
I'd rather not switch to REPL_SYNC. I also saw something about configuring the cache to do invalidation instead of replication; how do I configure that?
Thanks,
Eddie
Additional Info:
* Tried changing the configuration to use REPL_SYNC, and that didn't resolve the problem
* We are using AIX
* Looking at the logs, I also noticed that our pushlet only has its outputStream closed once the server has its LAN cable reconnected to the network.
It seems that on a LAN failure the streams aren't properly closed and stay open for quite some time... not sure if that could be causing problems for the replication code as well.
Is it possible that changing an OS setting would give us a different result when a LAN disconnection happens?