4 Replies Latest reply on Feb 11, 2016 3:19 PM by thiago.presa

    Troubleshooting JGroups/Infinispan timeouts?

    thiago.presa

      With a certain frequency (say, every 1-2 weeks) we run into trouble when an application is redeployed or updated in our domain cluster with HA (JGroups over TCP). What I'd like to know is: how can I gather more information about it? My impression is that the timeout happens when the server tries to rejoin the JGroups/Infinispan cluster. It may well be the case that our network is the real problem, but I don't know how to gather evidence to prove or disprove that.
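
      For reference, a common first step for gathering evidence on this kind of timeout is to raise the log levels of the clustering components. A minimal sketch using jboss-cli (the logger names are the standard JGroups/Infinispan categories; the `ha` profile path is an assumption and should match your domain profile):

```shell
# Sketch: raise clustering log levels in a managed domain via jboss-cli.
# Assumption: the profile in use is named "ha"; adjust to your setup.
$JBOSS_HOME/bin/jboss-cli.sh --connect <<'EOF'
# TRACE on org.jgroups logs membership views, failure-detection suspicions,
# and message retransmission, which usually shows whether the network or
# a blocked member caused the timeout
/profile=ha/subsystem=logging/logger=org.jgroups:add(level=TRACE)
# DEBUG on org.infinispan logs state transfer and remote command timing
/profile=ha/subsystem=logging/logger=org.infinispan:add(level=DEBUG)
EOF
```

      TRACE output from JGroups is voluminous, so it is best enabled temporarily around a planned redeployment rather than left on permanently.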

        • 1. Re: Troubleshooting JGroups/Infinispan timeouts?
          rhusar

          Can you paste some stack trace?

          • 2. Re: Troubleshooting JGroups/Infinispan timeouts?
            thiago.presa

            I meant it as a general question, but here's a particular case:

             

            15:00:10,107 WARN  [org.infinispan.statetransfer.StateConsumerImpl] (ServerService Thread Pool -- 68) ISPN000286: Issue when retrieving cluster listeners from <slave-node>:<app-name>: org.infinispan.util.concurrent.TimeoutException: Replication timeout for <slave-node>:<app-name>

                at org.infinispan.remoting.transport.jgroups.JGroupsTransport.checkRsp(JGroupsTransport.java:755)

                at org.infinispan.remoting.transport.jgroups.JGroupsTransport.lambda$invokeRemotelyAsync$80(JGroupsTransport.java:589)

                at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:602)

                at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)

                at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)

                at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962)

                at org.infinispan.remoting.transport.jgroups.SingleResponseFuture.call(SingleResponseFuture.java:46)

                at org.infinispan.remoting.transport.jgroups.SingleResponseFuture.call(SingleResponseFuture.java:17)

                at java.util.concurrent.FutureTask.run(FutureTask.java:266)

                at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)

                at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)

                at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

                at java.lang.Thread.run(Thread.java:745)

             

            This is the only stack trace that seems related to the timeout issue. It comes before the AS timeout:

             

            15:00:54,649 ERROR [org.jboss.as.controller.management-operation] (Controller Boot Thread) WFLYCTL0348: Timeout after [300] seconds waiting for service container stability. Operation will roll back. Step that first updated the service container was 'add' at address '[("interface" => "unsecure")]'

             

            And after the timeout there are many stack traces, probably related to the server shutdown. For instance:

             

            15:01:19,909 ERROR [org.jboss.as.controller.management-operation] (Controller Boot Thread) WFLYCTL0190: Step handler org.jboss.as.controller.AbstractAddStepHandler$1@7e599053 for operation {"operation" => "add","address" => [("socket-binding-group" => "ha-sockets"),("remote-destination-outbound-socket-binding" => "mc-prox1p")],"host" => "<ip-address>","port" => 6666,"source-interface" => undefined,"source-port" => undefined,"fixed-source-port" => undefined} at address [

                ("socket-binding-group" => "ha-sockets"),

                ("remote-destination-outbound-socket-binding" => "mc-prox1p")

            ] failed handling operation rollback -- java.util.concurrent.TimeoutException: java.util.concurrent.TimeoutException

                at org.jboss.as.controller.OperationContextImpl.waitForRemovals(OperationContextImpl.java:506)

                at org.jboss.as.controller.AbstractOperationContext$Step.handleResult(AbstractOperationContext.java:1369)

                at org.jboss.as.controller.AbstractOperationContext$Step.finalizeInternal(AbstractOperationContext.java:1328)

                at org.jboss.as.controller.AbstractOperationContext$Step.finalizeStep(AbstractOperationContext.java:1311)

                at org.jboss.as.controller.AbstractOperationContext$Step.access$300(AbstractOperationContext.java:1185)

                at org.jboss.as.controller.AbstractOperationContext.executeResultHandlerPhase(AbstractOperationContext.java:767)

                at org.jboss.as.controller.AbstractOperationContext.processStages(AbstractOperationContext.java:644)

                at org.jboss.as.controller.AbstractOperationContext.executeOperation(AbstractOperationContext.java:370)

                at org.jboss.as.controller.OperationContextImpl.executeOperation(OperationContextImpl.java:1336)

                at org.jboss.as.controller.ModelControllerImpl.boot(ModelControllerImpl.java:485)

                at org.jboss.as.controller.AbstractControllerService.boot(AbstractControllerService.java:387)

                at org.jboss.as.controller.AbstractControllerService.boot(AbstractControllerService.java:349)

                at org.jboss.as.server.ServerService.boot(ServerService.java:392)

                at org.jboss.as.server.ServerService.boot(ServerService.java:365)

                at org.jboss.as.controller.AbstractControllerService$1.run(AbstractControllerService.java:299)

                at java.lang.Thread.run(Thread.java:745)
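
            As an aside, the [300] seconds in the WFLYCTL0348 message above is the management blocking timeout. While investigating, it can be raised so that a slow cluster rejoin rolls back less aggressively; a sketch, assuming the standard `jboss.as.management.blocking.timeout` system property is set in the host JVM options:

```shell
# Sketch: raise the service-container stability timeout from the default
# 300 s to 600 s. Assumption: JVM options are configured in domain.conf
# (or standalone.conf for a standalone server).
JAVA_OPTS="$JAVA_OPTS -Djboss.as.management.blocking.timeout=600"
```

            This does not fix the underlying replication timeout, but it separates "the rejoin is slow" from "the rejoin never completes".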

             

            It seems to me that if I upgrade to WF10 Final, I'll get the fix for this bug [1], which will probably give me more information on why the timeout happened.

             

            [1] - [ISPN-5613] Replication timeouts should show more information of the affected entries - JBoss Issue Tracker
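
            Until then, the replication timeout itself (the ISPN000286 warning above) can be raised as a stopgap so that a briefly overloaded member does not trip it; a sketch via jboss-cli, in which the profile, cache-container, and cache names are all assumptions to be replaced with the actual ones from the profile in use:

```shell
# Sketch: raise the remote-timeout (in milliseconds) of a clustered cache.
# Assumptions: profile "ha", cache-container "web", cache "dist".
$JBOSS_HOME/bin/jboss-cli.sh --connect <<'EOF'
/profile=ha/subsystem=infinispan/cache-container=web/distributed-cache=dist:write-attribute(name=remote-timeout,value=30000)
EOF
```

            Raising the timeout only buys headroom; the TRACE logs are still needed to find out why replication stalls in the first place.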

            • 3. Re: Troubleshooting JGroups/Infinispan timeouts?
              rhusar

              A significant number of issues have been resolved, and component upgrades with fixes have landed in WF 10, with more in the upcoming version (see master and the PR queue on GitHub). So please test on the latest version (at least the latest released one), because it's unmanageable to investigate bugs that were likely already resolved.

              • 4. Re: Troubleshooting JGroups/Infinispan timeouts?
                thiago.presa

                Yes, I'm already working on upgrading our Wildfly clusters. Thanks!