1 Reply Latest reply on Dec 8, 2010 10:24 PM by eduardo_thp

    JBoss 4.2.3.GA - Issue when a node is removed from the LAN

    eduardo_thp

      Hello,

       

           I'm using JBoss 4.2.3.GA in a TCP-clustered environment (the issue described below has been seen on a cluster with 2 nodes and on a cluster with 4 nodes).

       

           I have the following scenario:

       

           - no load balancers

           - a cluster of two or four JBoss servers (the number shouldn't matter; the issue was seen in both environments)

           - a web application that uses a pushlet connection to push events from the server to the browser

           - on the server side, a thread that monitors the state of the pushlet connection

           - on the client side, JavaScript code that also monitors the state of the pushlet connection

       

           * If the client side (browser) detects that the pushlet connection has been lost, it tries to reconnect to the same server; if it can't, it connects to another server in the cluster (same domain, so no browser security issues).

           * If the server side (thread) detects that the pushlet connection has been lost, it starts a timeout; if the connection hasn't been re-established when the timeout expires, the thread invalidates the user session.
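The server-side rule above (start a timeout on disconnect, invalidate the session if nothing reconnects in time) can be sketched as follows. This is a hypothetical illustration, not the poster's code: the class `DisconnectWatchdog`, its methods, and the `stillDisconnected` guard are invented names, and the invalidation is passed in as a plain `Runnable` (for example `() -> session.invalidate()`) so the sketch depends only on the JDK.

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

/**
 * Hypothetical server-side watchdog for one user session. When the pushlet
 * connection drops, it schedules an invalidation after a grace period; a
 * reconnect reported before the period expires cancels it.
 */
final class DisconnectWatchdog {
    private final ScheduledExecutorService scheduler;
    private final long graceMillis;
    private final BooleanSupplier stillDisconnected; // re-checked when the timer fires
    private final Runnable invalidateSession;        // e.g. () -> session.invalidate()
    private ScheduledFuture<?> pending;              // guarded by "this"

    DisconnectWatchdog(ScheduledExecutorService scheduler, long graceMillis,
                       BooleanSupplier stillDisconnected, Runnable invalidateSession) {
        this.scheduler = scheduler;
        this.graceMillis = graceMillis;
        this.stillDisconnected = stillDisconnected;
        this.invalidateSession = invalidateSession;
    }

    /** Called when the monitor sees the pushlet connection drop. */
    synchronized void onDisconnect() {
        if (pending == null || pending.isDone()) {
            pending = scheduler.schedule(() -> {
                // Invalidate only if the guard still reports the user as disconnected.
                if (stillDisconnected.getAsBoolean()) {
                    invalidateSession.run();
                }
            }, graceMillis, TimeUnit.MILLISECONDS);
        }
    }

    /** Called when the pushlet is re-established; returns true if a timer was cancelled. */
    synchronized boolean onReconnect() {
        if (pending == null) {
            return false;
        }
        boolean cancelled = pending.cancel(false);
        pending = null;
        return cancelled;
    }
}
```

A watchdog like this only learns about reconnects reported on its own node. If the browser fails over to another server, the timer on the original server still fires unless the guard consults state that the other node updates, such as the replicated session attribute naming the current server; the differing SESSION SERVER values logged further down suggest the two nodes disagreed on exactly that attribute.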

       

           Notes:

                (my code also stores a session attribute that records the server the pushlet is currently connected to)

                (application failover works without problems when it is triggered by a server or JBoss shutdown/restart)

       

      --------------------

       

          The issue:

       

           - Server A is up and Server B is up

           - The user opens the browser and connects to Server A (the pushlet is up and my monitoring thread starts running)

           - Server A's LAN cable is disconnected from the network

           - The browser code detects the failure, retries, and reconnects to Server B (failover successful)

           - The user keeps using the app without problems

           - After about 5 minutes, Server A's LAN cable is reconnected to the network

       

           ***** now the problem starts *****

       

           There seem to be no merge issues:

      ServerALog:

      2010-12-07 21:54:17,792 INFO  [org.jboss.cache.TreeCache] viewAccepted(): MergeView::[161.134.28.20:7810|3] [161.134.28.20:7810, 161.134.28.21:7810], subgroups=[[161.134.28.20:7810|2] [161.134.28.20:7810], [161.134.28.21:7810|2] [161.134.28.21:7810]]

      ServerBLog:

      2010-12-07 21:54:17,819 INFO  [org.jboss.cache.TreeCache] viewAccepted(): MergeView::[161.134.28.20:7810|3] [161.134.28.20:7810, 161.134.28.21:7810], subgroups=[[161.134.28.20:7810|2] [161.134.28.20:7810], [161.134.28.21:7810|2] [161.134.28.21:7810]]

       

       

           FD_SOCK suspect messages on both servers:

      ServerALog:

      2010-12-07 21:54:38,178 WARN  [org.jgroups.protocols.FD_SOCK] I was suspected by 161.134.28.21:7810; ignoring the SUSPECT message

      ServerBLog:

      2010-12-07 21:54:38,031 WARN  [org.jgroups.protocols.FD_SOCK] I was suspected by 161.134.28.20:7810; ignoring the SUSPECT message

       

          My monitor reads a different value for the session attribute on each server:

      ServerA:

      ... SessionMinder] [] [] **** SESSION SERVER: 161.134.28.20

      ServerB:

      ... SessionMinder] [] [] **** SESSION SERVER: 161.134.28.21

       

           The monitoring thread on Server A invalidates the session after the timeout, and on Server B I see the following messages:

       

      2010-12-07 21:56:08,167 INFO  [org.jboss.web.tomcat.service.session.CacheListener] Possible concurrency problem: Replicated version id 50 matches in-memory version for session 8K3xQotPH-OjVp91acqZRw**
      2010-12-07 21:56:08,167 DEBUG [org.jboss.web.tomcat.service.session.ClusteredSession] The session has expired with id: 8K3xQotPH-OjVp91acqZRw** -- is it local? true
      2010-12-07 21:56:08,167 DEBUG [org.jboss.cache.TreeCache] Performing a real remove for node /JSESSION/localhost/HISWebUI/8K3xQotPH-OjVp91acqZRw**, marked for removal.

       

           The user is redirected to the logon page.

       

           My cluster configuration is the default one shipped with JBoss; the only change I made was to use TCP instead of UDP:

       

      ...

      <attribute name="CacheMode">REPL_ASYNC</attribute>

      <attribute name="UseRegionBasedMarshalling">false</attribute>

      ...

       

      ...

      <config>
          <TCP bind_addr="${jboss.bind.address}" start_port="7810" loopback="true"
               tcp_nodelay="true"
               recv_buf_size="20000000"
               send_buf_size="640000"
               discard_incompatible_packets="true"
               enable_bundling="true"
               max_bundle_size="64000"
               max_bundle_timeout="30"
               use_incoming_packet_handler="true"
               use_outgoing_packet_handler="false"
               down_thread="false" up_thread="false"
               use_send_queues="false"
               sock_conn_timeout="300"
               skip_suspected_members="true"/>
          <TCPPING initial_hosts="${jboss.bind.address}[7810]${jboss.cluster.members}" port_range="3"
                   timeout="3000"
                   down_thread="false" up_thread="false"
                   num_initial_members="3"/>
          <MERGE2 max_interval="100000"
                  down_thread="false" up_thread="false" min_interval="20000"/>
          <FD_SOCK down_thread="false" up_thread="false"/>
          <FD timeout="10000" max_tries="5" down_thread="false" up_thread="false" shun="true"/>
          <VERIFY_SUSPECT timeout="1500" down_thread="false" up_thread="false"/>
          <pbcast.NAKACK max_xmit_size="60000"
                         use_mcast_xmit="false" gc_lag="0"
                         retransmit_timeout="300,600,1200,2400,4800"
                         down_thread="false" up_thread="false"
                         discard_delivered_msgs="true"/>
          <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"
                         down_thread="false" up_thread="false"
                         max_bytes="400000"/>
          <pbcast.GMS print_local_addr="true" join_timeout="3000"
                      down_thread="false" up_thread="false"
                      join_retry_timeout="2000" shun="true"
                      view_bundling="true"/>
          <FC max_credits="2000000" down_thread="false" up_thread="false"
              min_threshold="0.10"/>
          <FRAG2 frag_size="60000" down_thread="false" up_thread="false"/>
          <pbcast.STATE_TRANSFER down_thread="false" up_thread="false" use_flush="false"/>
      </config>

      ....
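The failure-detection settings in this stack roughly determine how long Server A stays in the view after its cable is pulled, and how often a merge is attempted. The arithmetic below is a sketch under stated assumptions: that JGroups FD suspects a member after `max_tries` missed heartbeats of `timeout` ms each, that VERIFY_SUSPECT then adds its own `timeout`, and that MERGE2 checks for partitions at a random interval between `min_interval` and `max_interval`. FD_SOCK reacts to a closed TCP socket, which a pulled cable may not produce promptly, so the heartbeat-based FD is likely what eventually suspects the node. The class name `StackTiming` is hypothetical.

```java
/**
 * Back-of-the-envelope timings for the JGroups stack above (approximate;
 * assumes the FD, VERIFY_SUSPECT and MERGE2 semantics described in the text).
 */
final class StackTiming {
    /** Approximate time before FD's suspicion of a silent member is confirmed. */
    static long suspectMillis(long fdTimeout, int fdMaxTries, long verifySuspectTimeout) {
        return fdTimeout * fdMaxTries + verifySuspectTimeout;
    }

    public static void main(String[] args) {
        // Values copied from the <FD>, <VERIFY_SUSPECT> and <MERGE2> elements.
        long suspect = suspectMillis(10_000L, 5, 1_500L);
        long mergeMin = 20_000L;
        long mergeMax = 100_000L;
        System.out.println("Member suspected after roughly " + suspect + " ms"); // 51500
        System.out.println("MERGE2 checks for partitions every " + mergeMin + " to " + mergeMax + " ms");
    }
}
```

With these values the cable was out long enough (about 5 minutes) for both sides to form separate views, which matches the two-subgroup MergeView in the logs.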

       

       

           Any idea what could be happening, or how I can get more information on what's going on?

       

           I'd rather not switch to REPL_SYNC. I also saw something about configuring the cache to use invalidation instead of replication; how would I configure that?

       

      Thanks,

      Eddie

       

      Additional Info:

       

      * I tried switching the configuration to REPL_SYNC, and that didn't resolve the problem.

       

      * We are running on AIX.

       

      * Looking at the logs, I also noticed that our pushlet's output stream is only closed when the server's LAN cable is reconnected to the network.

       

      It seems that on a LAN failure the streams aren't closed properly and stay open for quite some time; I'm not sure whether that could also be causing problems for the replication code.

       

      Could changing an OS configuration setting give a different result when a LAN disconnection happens?
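The observation that the pushlet's output stream is only closed once the cable is reconnected fits how TCP behaves when a link disappears silently: no FIN or RST arrives, and writes can keep succeeding into the local send buffer for a while. One way to avoid depending on that is an application-level liveness check. The sketch below is hypothetical: `PushletLiveness` and `recordClientAck` are invented names, and it assumes the browser periodically acknowledges pushed events or sends a ping (the original setup may not have such a channel). It treats the connection as dead when nothing has been heard for `timeoutMillis`, whatever the state of the output stream.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/** Hypothetical liveness tracker based on client acknowledgements, not on write success. */
final class PushletLiveness {
    private final LongSupplier clock;     // injected so the logic can be tested without sleeping
    private final long timeoutMillis;
    private final AtomicLong lastHeardFrom;

    PushletLiveness(LongSupplier clock, long timeoutMillis) {
        this.clock = clock;
        this.timeoutMillis = timeoutMillis;
        this.lastHeardFrom = new AtomicLong(clock.getAsLong());
    }

    /** Call whenever the browser sends a ping or an ack for a pushed event. */
    void recordClientAck() {
        lastHeardFrom.set(clock.getAsLong());
    }

    /** A successful write to the output stream does not count; only client acks do. */
    boolean isAlive() {
        return clock.getAsLong() - lastHeardFrom.get() <= timeoutMillis;
    }
}
```

In a servlet this might be constructed as `new PushletLiveness(System::currentTimeMillis, 30_000)`, with the monitoring thread polling `isAlive()` instead of waiting for the stream to fail.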

        • 1. Re: JBoss 4.2.3.GA - Issue when a node is removed from the LAN
          eduardo_thp

          I noticed some places in my code where an InterruptedException was being swallowed.

          After handling it properly (Thread.currentThread().interrupt()), I re-ran the tests, and now JBossCacheService on Server A throws an exception when the nodes try to merge.

           

          I'm not sure whether that could be related to the AIX JVM implementation or ....

           

          2010-12-08 20:53:42,276 DEBUG [org.jboss.web.tomcat.service.session.JBossCacheManager] processSessionRepl(): failed with exception
          java.lang.RuntimeException: JBossCacheService: exception occurred in cache put ...
                  at org.jboss.web.tomcat.service.session.JBossCacheWrapper.put(JBossCacheWrapper.java:147)
                  at org.jboss.web.tomcat.service.session.JBossCacheService.putSession(JBossCacheService.java:325)
                  at org.jboss.web.tomcat.service.session.JBossCacheClusteredSession.processSessionRepl(JBossCacheClusteredSession.java:123)
                  at org.jboss.web.tomcat.service.session.JBossCacheManager.processSessionRepl(JBossCacheManager.java:1127)
                  at org.jboss.web.tomcat.service.session.JBossCacheManager.storeSession(JBossCacheManager.java:682)
                  at org.jboss.web.tomcat.service.session.InstantSnapshotManager.snapshot(InstantSnapshotManager.java:49)
                  at org.jboss.web.tomcat.service.session.ClusteredSessionValve.invoke(ClusteredSessionValve.java:108)
                  at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:432)
                  at org.jboss.web.tomcat.security.JaccContextValve.invoke(JaccContextValve.java:84)
                  at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
                  at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
                  at org.jboss.web.tomcat.service.jca.CachedConnectionValve.invoke(CachedConnectionValve.java:157)
                  at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
                  at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:262)
                  at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844)
                  at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
                  at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:446)
                  at java.lang.Thread.run(Thread.java:736)
          Caused by:
          java.lang.RuntimeException: java.lang.InterruptedException
                  at org.jboss.cache.TreeCache.invokeMethod(TreeCache.java:5931)
                  at org.jboss.cache.TreeCache.put(TreeCache.java:3784)
                  at sun.reflect.GeneratedMethodAccessor137.invoke(Unknown Source)
                  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
                  at java.lang.reflect.Method.invoke(Method.java:600)
                  at org.jboss.mx.interceptor.ReflectedDispatcher.invoke(ReflectedDispatcher.java:155)
                  at org.jboss.mx.server.Invocation.dispatch(Invocation.java:94)
                  at org.jboss.mx.server.Invocation.invoke(Invocation.java:86)
                  at org.jboss.mx.server.AbstractMBeanInvoker.invoke(AbstractMBeanInvoker.java:193)
                  at org.jboss.mx.server.MBeanServerImpl.invoke(MBeanServerImpl.java:659)
                  at org.jboss.mx.util.MBeanProxyExt.invoke(MBeanProxyExt.java:210)
                  at $Proxy58.put(Unknown Source)
                  at org.jboss.web.tomcat.service.session.JBossCacheWrapper.put(JBossCacheWrapper.java:138)
                  ... 17 more
          Caused by:
          java.lang.InterruptedException
                  at org.jboss.cache.lock.ReadWriteLockWithUpgrade$ReaderLock.attempt(ReadWriteLockWithUpgrade.java:303)
                  at org.jboss.cache.lock.IdentityLock.acquireReadLock(IdentityLock.java:252)
                  at org.jboss.cache.Node.acquireReadLock(Node.java:545)
                  at org.jboss.cache.Node.acquire(Node.java:507)
                  at org.jboss.cache.interceptors.PessimisticLockInterceptor.acquireNodeLock(PessimisticLockInterceptor.java:410)
                  at org.jboss.cache.interceptors.PessimisticLockInterceptor.lock(PessimisticLockInterceptor.java:322)
                  at org.jboss.cache.interceptors.PessimisticLockInterceptor.invoke(PessimisticLockInterceptor.java:189)
                  at org.jboss.cache.interceptors.Interceptor.invoke(Interceptor.java:68)
                  at org.jboss.cache.interceptors.UnlockInterceptor.invoke(UnlockInterceptor.java:32)
                  at org.jboss.cache.interceptors.Interceptor.invoke(Interceptor.java:68)
                  at org.jboss.cache.interceptors.ReplicationInterceptor.invoke(ReplicationInterceptor.java:39)
                  at org.jboss.cache.interceptors.Interceptor.invoke(Interceptor.java:68)
                  at org.jboss.cache.interceptors.TxInterceptor.handleNonTxMethod(TxInterceptor.java:379)
                  at org.jboss.cache.interceptors.TxInterceptor.invoke(TxInterceptor.java:174)
                  at org.jboss.cache.interceptors.Interceptor.invoke(Interceptor.java:68)
                  at org.jboss.cache.interceptors.CacheMgmtInterceptor.invoke(CacheMgmtInterceptor.java:167)
                  at org.jboss.cache.TreeCache.invokeMethod(TreeCache.java:5919)
                  ... 29 more
          2010-12-08 20:53:42,278 WARN  [org.jboss.web.tomcat.service.session.InstantSnapshotManager./HISWebUI] Failed to replicate session LqAw8zAQHwaQH04WQvb2HQ**
          java.lang.RuntimeException: JBossCacheService: exception occurred in cache put ...
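The trace shows the InterruptedException raised in `ReaderLock.attempt` on a Tomcat worker thread inside `ClusteredSessionValve`, that is, during the session replication performed at the end of the request. One possible reading is that once `Thread.currentThread().interrupt()` restores the flag on the request thread, the flag is still set when the valve replicates the session, so the cache's interruptible lock wait fails at once. The sketch below reproduces that pattern using `java.util.concurrent.locks.ReentrantLock.tryLock(long, TimeUnit)` as a stand-in for JBoss Cache's `ReadWriteLockWithUpgrade` (an assumption: they are different classes, but both throw InterruptedException when the waiting thread is interrupted). `InterruptFlagDemo` and its methods are hypothetical.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

final class InterruptFlagDemo {
    // Stand-in for a cache node lock acquired during replication.
    private static final ReentrantLock CACHE_LOCK = new ReentrantLock();

    /** Pretend pushlet wait: restores the interrupt flag instead of swallowing it. */
    static void waitForEvent() {
        try {
            Thread.sleep(10_000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for later code
        }
    }

    /** Pretend end-of-request replication: an interruptible lock wait on the same thread. */
    static String replicate() {
        try {
            if (CACHE_LOCK.tryLock(1, TimeUnit.SECONDS)) {
                try {
                    return "replicated";
                } finally {
                    CACHE_LOCK.unlock();
                }
            }
            return "timed out";
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "failed: InterruptedException";
        }
    }

    public static void main(String[] args) {
        Thread.currentThread().interrupt(); // simulate whatever interrupts the request thread
        waitForEvent();                     // returns at once; the flag is restored
        System.out.println(replicate());    // prints "failed: InterruptedException"
    }
}
```

If this reading is right, whether to restore or clear the flag depends on who interrupted the request thread and why: restoring it is the usual advice for code that doesn't own the thread, but here the same thread goes on to do the cache put.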