15 Replies, latest reply on May 19, 2014 9:39 AM by rhusar

    clustered Wildfly unexpected nodes SuspectException

    hielkehoeve

      Our current production system is a cluster of 2 nodes running WildFly 8.0.1. We used the "ha" profile and "ha" socket-binding-group as a baseline to create the profile for the cluster. The cluster has a REST application deployed. This application uses 2 Infinispan cache containers (hibernate and application). In total these containers hold 5 local caches, 3 invalidation caches and 7 replicated caches. The cache containers use udp as transport. Our hibernate container is the default config as provided by the "ha" profile.

       

      Our environment is Ubuntu 13.04 and a JBoss WildFly 8.0.1 snapshot (Infinispan 6.0.1.Final, JGroups 3.4.3.Final).

       

      On Thursday the 10th and Tuesday the 15th, one of the nodes started throwing a SuspectException during a Hibernate flush. The applications had only a few active users.

       

       

      Node 1 says:

      -----------------

      2014-04-15 17:40:53,380 ERROR [org.jboss.as.ejb3.invocation] (default task-22) JBAS014134: EJB Invocation failed on component ConversatieResourceRESTService for method public javax.ws.rs.core.Response ...ConversatieResourceRESTService.setStatus(...event.ConversatieStatusObject): javax.ejb.EJBTransactionRolledbackException: Transaction rolled back
      ...
      ...
              at org.infinispan.commands.AbstractVisitor.visitPutKeyValueCommand(AbstractVisitor.java:32) [infinispan-core-6.0.1.Final.jar:6.0.1.Final]
              at org.infinispan.commands.write.PutKeyValueCommand.acceptVisitor(PutKeyValueCommand.java:70) [infinispan-core-6.0.1.Final.jar:6.0.1.Final]
              at org.infinispan.interceptors.InterceptorChain.invoke(InterceptorChain.java:333) [infinispan-core-6.0.1.Final.jar:6.0.1.Final]
              at org.infinispan.CacheImpl.executeCommandAndCommitIfNeeded(CacheImpl.java:1306) [infinispan-core-6.0.1.Final.jar:6.0.1.Final]
              at org.infinispan.CacheImpl.putInternal(CacheImpl.java:878) [infinispan-core-6.0.1.Final.jar:6.0.1.Final]
              at org.infinispan.CacheImpl.put(CacheImpl.java:870) [infinispan-core-6.0.1.Final.jar:6.0.1.Final]
              at org.infinispan.DecoratedCache.put(DecoratedCache.java:401) [infinispan-core-6.0.1.Final.jar:6.0.1.Final]
              at org.infinispan.AbstractDelegatingCache.put(AbstractDelegatingCache.java:276) [infinispan-core-6.0.1.Final.jar:6.0.1.Final]
        at org.hibernate.cache.infinispan.access.TransactionalAccessDelegate.update(TransactionalAccessDelegate.java:192) [hibernate-infinispan-4.3.4.Final.jar:4.3.4.Final]
              at org.hibernate.cache.infinispan.entity.TransactionalAccess.update(TransactionalAccess.java:89) [hibernate-infinispan-4.3.4.Final.jar:4.3.4.Final]
              at org.hibernate.action.internal.EntityUpdateAction.cacheUpdate(EntityUpdateAction.java:234) [hibernate-core-4.3.4.Final.jar:4.3.4.Final]
              at org.hibernate.action.internal.EntityUpdateAction.execute(EntityUpdateAction.java:209) [hibernate-core-4.3.4.Final.jar:4.3.4.Final]
              at org.hibernate.engine.spi.ActionQueue.executeActions(ActionQueue.java:461) [hibernate-core-4.3.4.Final.jar:4.3.4.Final]
              at org.hibernate.engine.spi.ActionQueue.executeActions(ActionQueue.java:347) [hibernate-core-4.3.4.Final.jar:4.3.4.Final]
        at org.hibernate.event.internal.AbstractFlushingEventListener.performExecutions(AbstractFlushingEventListener.java:350) [hibernate-core-4.3.4.Final.jar:4.3.4.Final]
              at org.hibernate.event.internal.DefaultFlushEventListener.onFlush(DefaultFlushEventListener.java:56) [hibernate-core-4.3.4.Final.jar:4.3.4.Final]
              at org.hibernate.internal.SessionImpl.flush(SessionImpl.java:1222) [hibernate-core-4.3.4.Final.jar:4.3.4.Final]
              at org.hibernate.internal.SessionImpl.managedFlush(SessionImpl.java:425) [hibernate-core-4.3.4.Final.jar:4.3.4.Final]
        at org.hibernate.engine.transaction.synchronization.internal.SynchronizationCallbackCoordinatorNonTrackingImpl.beforeCompletion(SynchronizationCallbackCoordinatorNonTrackingImpl.java:110) [hibernate-core-4.3.4.Final.jar:4.3.4.Final]
              ... 117 more
      Caused by: SuspectedException
              at org.jgroups.blocks.MessageDispatcher.sendMessage(MessageDispatcher.java:406)
        at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.processSingleCall(CommandAwareRpcDispatcher.java:353) [infinispan-core-6.0.1.Final.jar:6.0.1.Final]
        at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommand(CommandAwareRpcDispatcher.java:167) [infinispan-core-6.0.1.Final.jar:6.0.1.Final]
              ... 161 more
      2014-04-15 17:40:53,707 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-7,shared=udp) ISPN000094: Received new cluster view: [node1:rest1-even/hibernate-even|2] (1) [node1:rest1-even/hibernate-even]
      2014-04-15 17:40:53,707 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-14,shared=udp) ISPN000094: Received new cluster view: [node1:rest1-even/application-even|2] (1) [node1:rest1-even/application-even]
      

       

      Node 2 says:

      -----------------

      2014-04-15 17:40:17,934 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-19,shared=udp) ISPN000094: Received new cluster view: [node2:rest2-even/hibernate-even|2] (1) [node2:rest2-even/hibernate-even]
      2014-04-15 17:40:17,934 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-17,shared=udp) ISPN000094: Received new cluster view: [node2:rest2-even/application-even|2] (1) [node2:rest2-even/application-even]
      ...
      ...
      2014-04-15 17:40:50,737 WARN  [org.jgroups.protocols.UDP] (Timer-2,shared=udp) JGRP000032: null: no physical address for node1:rest1-even/hibernate-even, dropping message
      2014-04-15 17:40:52,543 WARN  [org.jgroups.protocols.UDP] (Timer-5,shared=udp) JGRP000032: null: no physical address for node1:rest1-even/application-even, dropping message
      (repeats 100 times)
      ...
      

       

      Is there any way I can get more logging than the WARNs above? Does anyone have pointers on how or when this SuspectException is thrown?

        • 1. Re: clustered Wildfly unexpected nodes SuspectException
          wdfink

          "no physical address for node1" sounds like a network issue.

          Is this a permanent issue?

           

          Also, I have moved the thread to the WildFly forum as it is not really related to Infinispan.

          • 2. Re: clustered Wildfly unexpected nodes SuspectException
            rhusar

            What is the problem besides the logs? Can you verify that there wasn't actually an outage in the lab? The SuspectException is expected to happen when the network fails or a node is unresponsive.

             

            This may be another manifestation of [WFLY-2632] JGroups drops unicast messages after shutdown/restart - JBoss Issue Tracker

            • 3. Re: clustered Wildfly unexpected nodes SuspectException
              hielkehoeve

              This issue occurs about 4-6 hours after every restart. I have set the firewall to allow any traffic from/to the internal network. I have not found any error/warn/block log entries in the firewall logs.

               

              It does not seem to be just log entries as suggested in the linked bug report. After the SuspectException the Hibernate caches are out of sync and the 2 nodes throw EntityNotFoundExceptions.

               

              I will see if I can set up a standalone app which sends messages back and forth for about a day.

              • 4. Re: clustered Wildfly unexpected nodes SuspectException
                rhusar

                Scott Marlow, any ideas here?

                • 5. Re: clustered Wildfly unexpected nodes SuspectException
                  smarlow

                  Sure, try asking Bela

                   

                  Some of my questions:

                  What is the CPU utilization on the clustered application server nodes prior to/during/after the failure? Perhaps the timeout needs to be increased if the CPU utilization is too high. If the CPU utilization on these machines is too high (e.g. pegged at 100%), it would be better to either throttle the workload or add more machines than to increase the timeouts too much. The downside of increasing timeouts is that it takes longer to notice when other nodes leave the group.
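
                  For reference, the failure-detection timeout in the default "ha" stack is presumably the one on the FD_ALL protocol in the jgroups subsystem. A minimal sketch of how it could be tuned; the values here are purely illustrative, not recommendations:

                  <protocol type="FD_ALL">
                      <!-- suspect a member after this many ms without a heartbeat (illustrative value) -->
                      <property name="timeout">60000</property>
                      <!-- interval in ms between heartbeat multicasts (illustrative value) -->
                      <property name="interval">10000</property>
                  </protocol>

                  As noted above, raising these only delays noticing that a node has left the group, so the defaults are usually the safer choice.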

                   

                  Are you customizing the JGroups configuration stack?  What is the JGroups configuration stack in use on each machine?

                   

                  You might also verify that each cluster node is initially visible to the other node. 

                  • 6. Re: Re: clustered Wildfly unexpected nodes SuspectException
                    hielkehoeve

                    The CPU load is only a bit high (0.5 load) at night when data is being imported from a different application. The load has not passed 0.5 in the last 4 weeks; the median is 0.07. I would rather not raise the timeout; I assume the default is the best configuration for most applications.

                     

                    The nodes find each other during startup:

                    2014-04-23 20:03:37,397 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (ServerService Thread Pool -- 58) ISPN000094: Received new cluster view: [node1:rest1-even/ejb|1] (2) [node1:rest1-even/ejb, node2:rest2-even/ejb]
                    2014-04-23 20:03:47,860 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (ServerService Thread Pool -- 61) ISPN000094: Received new cluster view: [node1:rest1-even/hibernate-even|1] (2) [node1:rest1-even/hibernate-even, node2:rest2-even/hibernate-even]
                    2014-04-23 20:04:22,173 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (MSC service thread 1-6) ISPN000094: Received new cluster view: [node1:rest1-even/application-even|1] (2) [node1:rest1-even/application-even, node2:rest2-even/application-even]
                    

                     

                    Then they lose each other during the night and morning:

                    2014-04-24 00:05:39,572 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-8,shared=udp) ISPN000094: Received new cluster view: [node2:rest2-even/ejb|2] (1) [node2:rest2-even/ejb]
                    2014-04-24 12:20:40,749 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-9,shared=udp) ISPN000094: Received new cluster view: [node2:rest2-even/hibernate-even|2] (1) [node2:rest2-even/hibernate-even]
                    2014-04-24 12:20:42,753 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-8,shared=udp) ISPN000094: Received new cluster view: [node2:rest2-even/application-even|2] (1) [node2:rest2-even/application-even]
                    

                     

                    When clustered, the load is divided evenly. When the nodes lose each other, logging in is no longer possible because the OAuth tokens are cached in a cache container and requests are balanced very nicely :-) (node1 created the token, node2 denies the OAuth login because it has no idea the token exists).

                     

                     

                    The JBoss configuration is nearly the same as the default HA profile.

                    JGroups:

                    <subsystem xmlns="urn:jboss:domain:jgroups:2.0" default-stack="udp">
                        <stack name="udp">
                            <transport type="UDP" socket-binding="jgroups-udp"/>
                            <protocol type="PING"/>
                            <protocol type="MERGE3"/>
                            <protocol type="FD_SOCK" socket-binding="jgroups-udp-fd"/>
                            <protocol type="FD_ALL"/>
                            <protocol type="VERIFY_SUSPECT"/>
                            <protocol type="pbcast.NAKACK2"/>
                            <protocol type="UNICAST3"/>
                            <protocol type="pbcast.STABLE"/>
                            <protocol type="pbcast.GMS"/>
                            <protocol type="UFC"/>
                            <protocol type="MFC"/>
                            <protocol type="FRAG2"/>
                            <protocol type="RSVP"/>
                        </stack>
                    </subsystem>
                    
                    

                     

                    Socket Bindings:

                    <socket-binding-groups>
                            <socket-binding-group name="ha-sockets" default-interface="public">
                            <socket-binding name="ajp" interface="internal" port="${jboss.ajp.port:8009}"/>
                            <socket-binding name="http" interface="internal" port="${jboss.http.port:8080}"/>
                            <socket-binding name="https" interface="internal" port="${jboss.https.port:8443}"/>
                            <socket-binding name="jgroups-mping" interface="internal" port="0" multicast-address="${jboss.default.multicast.address:230.0.0.4}" multicast-port="45700"/>
                            <socket-binding name="jgroups-udp" interface="internal" port="55200" multicast-address="${jboss.default.multicast.address:230.0.0.4}" multicast-port="45688"/>
                            <socket-binding name="jgroups-udp-fd" interface="internal" port="54200"/>
                            <socket-binding name="modcluster" interface="internal" port="0" multicast-address="224.0.1.105" multicast-port="23364"/>
                            <socket-binding name="txn-recovery-environment" port="4712"/>
                            <socket-binding name="txn-status-manager" port="4713"/>
                            <outbound-socket-binding name="mail-smtp">
                                <remote-destination host="localhost" port="25"/>
                            </outbound-socket-binding>
                        </socket-binding-group>
                    </socket-binding-groups>
                    
                    

                     

                    Infinispan:

                    <subsystem xmlns="urn:jboss:domain:infinispan:2.0">
                        <cache-container name="server" default-cache="default" module="org.wildfly.clustering.server" aliases="singleton cluster">
                            <transport lock-timeout="60000"/>
                            <replicated-cache name="default" batching="true" mode="SYNC">
                                <locking isolation="REPEATABLE_READ"/>
                            </replicated-cache>
                        </cache-container>
                        <cache-container name="web" default-cache="dist" module="org.wildfly.clustering.web.infinispan">
                            <transport lock-timeout="60000"/>
                            <distributed-cache name="dist" batching="true" mode="ASYNC" owners="4" l1-lifespan="0">
                                <file-store/>
                            </distributed-cache>
                        </cache-container>
                        <cache-container name="ejb" default-cache="dist" module="org.wildfly.clustering.ejb.infinispan" aliases="sfsb">
                            <transport lock-timeout="60000"/>
                            <distributed-cache name="dist" batching="true" mode="ASYNC" owners="4" l1-lifespan="0">
                                <file-store/>
                            </distributed-cache>
                        </cache-container>
                        <cache-container name="hibernate" default-cache="local-query" module="org.hibernate">
                            <transport stack="udp" cluster="hibernate-odd" lock-timeout="60000"/>
                            <local-cache name="local-query">
                                <transaction mode="NONE"/>
                                <eviction strategy="LRU" max-entries="10000"/>
                                <expiration max-idle="100000"/>
                            </local-cache>
                            <invalidation-cache name="entity" mode="SYNC">
                                <transaction mode="NONE"/>
                                <eviction strategy="LRU" max-entries="10000"/>
                                <expiration max-idle="100000"/>
                            </invalidation-cache>
                            <replicated-cache name="timestamps" mode="ASYNC">
                                <transaction mode="NONE"/>
                                <eviction strategy="NONE"/>
                            </replicated-cache>
                        </cache-container>
                        <cache-container name="application" default-cache="default" jndi-name="java:jboss/infinispan/application" start="EAGER">
                            <transport stack="udp" cluster="application-odd" lock-timeout="60000"/>
                            <local-cache name="default">
                                <transaction mode="NONE"/>
                                <eviction strategy="LRU" max-entries="10000"/>
                                <expiration max-idle="100000"/>
                            </local-cache>
                            <local-cache name="PERMISSION_CATEGORY">
                                <transaction mode="NONE"/>
                                <eviction strategy="LIRS" max-entries="20000"/>
                                <expiration lifespan="60000"/>
                            </local-cache>
                            <local-cache name="IMPLIES">
                                <transaction mode="NONE"/>
                                <eviction strategy="LIRS" max-entries="100000"/>
                                <expiration lifespan="60000"/>
                            </local-cache>
                            <local-cache name="PERMISSIONS">
                                <transaction mode="NONE"/>
                                <eviction strategy="LIRS" max-entries="20000"/>
                                <expiration lifespan="60000"/>
                            </local-cache>
                            <invalidation-cache name="AUTHENTICATION" mode="SYNC">
                                <transaction mode="NONE"/>
                                <eviction strategy="LIRS" max-entries="1000"/>
                                <expiration lifespan="600000"/>
                            </invalidation-cache>
                            <invalidation-cache name="AUTHORIZATION" mode="ASYNC">
                                <transaction mode="NONE"/>
                                <eviction strategy="LIRS" max-entries="1000"/>
                                <expiration lifespan="600000"/>
                            </invalidation-cache>
                            <replicated-cache name="AUTHENTICATION_FAILURES" mode="ASYNC">
                                <transaction mode="NONE"/>
                                <eviction strategy="LRU" max-entries="50000"/>
                                <expiration lifespan="600000"/>
                            </replicated-cache>
                            <replicated-cache name="REMOTE_IDENTITY_TOKENS" mode="SYNC">
                                <transaction mode="NONE"/>
                                <expiration lifespan="600000"/>
                            </replicated-cache>
                            <replicated-cache name="OPENID_ASSOCIATIONS" mode="SYNC">
                                <transaction mode="NONE"/>
                                <expiration lifespan="3600000"/>
                            </replicated-cache>
                            <replicated-cache name="OPENID_NONCES" mode="SYNC">
                                <transaction mode="NONE"/>
                                <expiration lifespan="60000"/>
                            </replicated-cache>
                            <replicated-cache name="TEMP_ACCOUNTS" mode="SYNC">
                                <transaction mode="NONE"/>
                                <expiration max-idle="60000" lifespan="3600000"/>
                            </replicated-cache>
                            <replicated-cache name="TRANSLOADIT_COLLECTEDUPLOADS" mode="SYNC">
                                <transaction mode="NONE"/>
                                <expiration lifespan="3600000"/>
                            </replicated-cache>
                        </cache-container>
                    </subsystem>
                    
                    

                     

                    Both instances use the following server-group:

                    <server-groups>
                            <server-group name="rest-group-odd" profile="rest-odd">
                                <jvm name="default">
                                    <heap size="1g" max-size="3g"/>
                                    <permgen size="512m" max-size="768m"/>
                                </jvm>
                                <socket-binding-group ref="ha-sockets"/>
                                <deployments>
                                    <deployment name="product.ear" runtime-name="product.ear"/>
                                    <deployment name="product-version.ear" runtime-name="product-version.ear"/>
                                </deployments>
                            </server-group>
                    </server-groups>
                    
                    
                    • 7. Re: Re: clustered Wildfly unexpected nodes SuspectException
                      smarlow

                      belaban, any suggestions for these guys?

                       

                      I wonder if the firewall is configured to close connections after 4-5 hours (or less).

                      • 8. Re: clustered Wildfly unexpected nodes SuspectException
                        rhusar

                        Wolf-Dieter Fink wrote:

                         

                        "no physical address for node1" sounds like a network issue.

                        Is this a permanent issue?

                        Well, sort of a network issue: the warning is caused by node A attempting to send a unicast message to node B while the physical (IP) address of B is not in the cache (TP.logical_addr_cache), which can happen around cluster recovery.

                         

                        Is there any way I can get more logging than the WARNs above?

                        Well sure, just add a logger for that category and set it to DEBUG or TRACE level. But then those profiles might become unusable due to the huge amount of logging.
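
                        For example (a minimal sketch; the handler name and file path are just placeholders), a TRACE logger for the org.jgroups category can be pointed at its own file in the logging subsystem so that server.log stays usable:

                        <subsystem xmlns="urn:jboss:domain:logging:2.0">
                            <!-- existing handlers/formatters elided -->
                            <periodic-rotating-file-handler name="JGROUPS" autoflush="true">
                                <level name="TRACE"/>
                                <formatter>
                                    <named-formatter name="PATTERN"/>
                                </formatter>
                                <file relative-to="jboss.server.log.dir" path="jgroups-trace.log"/>
                                <suffix value=".yyyy-MM-dd"/>
                                <append value="true"/>
                            </periodic-rotating-file-handler>
                            <!-- TRACE only the JGroups category and keep it out of the default handlers -->
                            <logger category="org.jgroups" use-parent-handlers="false">
                                <level name="TRACE"/>
                                <handlers>
                                    <handler name="JGROUPS"/>
                                </handlers>
                            </logger>
                        </subsystem>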

                         

                        Does anyone have pointers on how or when this SuspectException is thrown?

                        Well, it's thrown because the operations fail to execute in time. You can see that the cluster split into multiple partitions.

                         

                        So this all hints at the servers of the cluster simply being overloaded.

                         

                        Also, can you try disabling the 2nd-level cache? Are you sure that your data access pattern can use the 2nd-level cache effectively?

                        • 9. Re: Re: clustered Wildfly unexpected nodes SuspectException
                          belaban

                          When you see a warning like "no physical address found for P", it means that the local cache maintained by each node, which maps logical addresses (P) to physical addresses (e.g. 192.168.1.5:7803), didn't contain an entry for P, and therefore the transport didn't know how to send the message to P.

                          If this happens, the discovery protocol (PING) tries to fetch the physical address for P. Next time the message is retransmitted (typically 500 ms later), the physical address for P should be found and the message sent.
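
                          Entries in this cache can also expire over time; if you wanted to experiment with keeping them around longer, the transport exposes a property for this (assuming JGroups 3.4's TP.logical_addr_cache_expiration attribute; the value below is illustrative only):

                          <transport type="UDP" socket-binding="jgroups-udp">
                              <!-- how long (ms) to keep logical-to-physical address mappings (illustrative value) -->
                              <property name="logical_addr_cache_expiration">600000</property>
                          </transport>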

                          You can look at the local caches of the nodes via JMX or probe: there's a method printLogicalAddressCache in UDP which can be invoked via jconsole, or simpler, using probe:

                          probe.sh uuids

                          This will dump the caches for all nodes in the network.

                          If P is missing, we need to enable tracing on UNICAST3 to see who's sending a message to P and what message it is. This can also be done via jconsole or probe:

                          probe.sh op=UNICAST3.setLevel["trace"]

                          You need to have logging configured and a valid appender.

                          If P left and there was a spurious message, then it should stop after UNICAST3.max_retransmit_time + UNICAST3.conn_close_timeout milliseconds, as the connection to the non-member P will be closed.
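
                          For reference, both of those UNICAST3 knobs can be set as protocol properties in the jgroups subsystem; a sketch with illustrative values only:

                          <protocol type="UNICAST3">
                              <!-- give up retransmitting to an unresponsive target after this many ms (illustrative) -->
                              <property name="max_retransmit_time">60000</property>
                              <!-- remove a connection marked for closing after this many ms (illustrative) -->
                              <property name="conn_close_timeout">10000</property>
                          </protocol>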

                          • 10. Re: clustered Wildfly unexpected nodes SuspectException
                            hielkehoeve

                            Thanks for all the tips guys!

                             

                            I have enabled logging to a separate file on a similar environment, where I can mess about and see if the issue can be reproduced (same hosting provider, same Ubuntu, same WildFly, etc.).

                            I have also set up a WildFly 8.1 environment with the same configuration. I'll leave both on for a day (the production cluster never lasts more than half a day) and see what happens.

                            • 11. Re: clustered Wildfly unexpected nodes SuspectException
                              rhusar

                              Also, can you describe what you are doing in the import, and how?

                               

                              The CPU load is only a bit high (0.5 load) at night when data is being imported from a different application.

                              Is this some timer task that imports data into the cache, or something else?

                              • 12. Re: clustered Wildfly unexpected nodes SuspectException
                                hielkehoeve

                                Data is read from CSV files into Hibernate entities using the batching framework. We update and create entities. This is partly done in parallel, which causes somewhat higher load.

                                • 13. Re: clustered Wildfly unexpected nodes SuspectException
                                  hielkehoeve

                                  I have tried reproducing the bug on a replicated environment, but alas, it works fine there. Reading the logs is quite tedious, as the log files are 18GB per hour. I have found no evidence as to why the nodes would suddenly disconnect. I'll keep you posted.

                                  • 14. Re: clustered Wildfly unexpected nodes SuspectException
                                    hielkehoeve

                                    Hey guys, it's been a while. I have upgraded our WildFly install on our production servers to a version built on the 30th of April, and the nodes have stayed connected since last Friday. It seems that the JGroups version, 3.4.3.Final, is handling things much better. Thanks for the help in searching for the problem!
