4 Replies Latest reply on Feb 11, 2016 3:19 PM by thiago.presa

    Troubleshooting JGroups/Infinispan timeouts?

    thiago.presa

      With a certain frequency (say, every 1-2 weeks) we run into trouble when an application is redeployed or updated in our domain cluster with HA (JGroups over TCP). What I'd like to know is: how can I gather more information about it? My impression is that the timeout happens when the server tries to rejoin the JGroups/Infinispan cluster. It may well be the case that our network is the real problem, but I don't know how to gather evidence to prove or disprove that.
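
      For reference, a common first step for gathering evidence on this kind of timeout is to raise the log levels of the clustering components. A minimal sketch using jboss-cli (the logger names are the standard JGroups/Infinispan categories; the `ha` profile path is an assumption and should match your domain profile):

```shell
# Sketch: raise clustering log levels in a managed domain via jboss-cli.
# Assumption: the profile in use is named "ha"; adjust to your setup.
$JBOSS_HOME/bin/jboss-cli.sh --connect <<'EOF'
# TRACE on org.jgroups logs membership views, failure-detection suspicions,
# and message retransmission, which usually shows whether the network or
# a blocked member caused the timeout
/profile=ha/subsystem=logging/logger=org.jgroups:add(level=TRACE)
# DEBUG on org.infinispan logs state transfer and remote command timing
/profile=ha/subsystem=logging/logger=org.infinispan:add(level=DEBUG)
EOF
```

      TRACE output from JGroups is voluminous, so it is best enabled temporarily around a planned redeployment rather than left on permanently.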

        • 1. Re: Troubleshooting JGroups/Infinispan timeouts?
          rhusar

          Can you paste some stack trace?

          • 2. Re: Troubleshooting JGroups/Infinispan timeouts?
            thiago.presa

            I meant it as a general question, but here's a particular case:

             

            15:00:10,107 WARN  [org.infinispan.statetransfer.StateConsumerImpl] (ServerService Thread Pool -- 68) ISPN000286: Issue when retrieving cluster listeners from <slave-node>:<app-name>: org.infinispan.util.concurrent.TimeoutException: Replication timeout for <slave-node>:<app-name>

                at org.infinispan.remoting.transport.jgroups.JGroupsTransport.checkRsp(JGroupsTransport.java:755)

                at org.infinispan.remoting.transport.jgroups.JGroupsTransport.lambda$invokeRemotelyAsync$80(JGroupsTransport.java:589)

                at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:602)

                at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)

                at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)

                at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962)

                at org.infinispan.remoting.transport.jgroups.SingleResponseFuture.call(SingleResponseFuture.java:46)

                at org.infinispan.remoting.transport.jgroups.SingleResponseFuture.call(SingleResponseFuture.java:17)

                at java.util.concurrent.FutureTask.run(FutureTask.java:266)

                at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)

                at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)

                at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

                at java.lang.Thread.run(Thread.java:745)

             

            This is the only stack trace that seems related to the timeout issue. It comes before the AS timeout:

             

            15:00:54,649 ERROR [org.jboss.as.controller.management-operation] (Controller Boot Thread) WFLYCTL0348: Timeout after [300] seconds waiting for service container stability. Operation will roll back. Step that first updated the service container was 'add' at address '[("interface" => "unsecure")]'

             

            And after the timeout there are many stack traces, probably related to the server shutdown. For instance:

             

            15:01:19,909 ERROR [org.jboss.as.controller.management-operation] (Controller Boot Thread) WFLYCTL0190: Step handler org.jboss.as.controller.AbstractAddStepHandler$1@7e599053 for operation {"operation" => "add","address" => [("socket-binding-group" => "ha-sockets"),("remote-destination-outbound-socket-binding" => "mc-prox1p")],"host" => "<ip-address>","port" => 6666,"source-interface" => undefined,"source-port" => undefined,"fixed-source-port" => undefined} at address [

                ("socket-binding-group" => "ha-sockets"),

                ("remote-destination-outbound-socket-binding" => "mc-prox1p")

            ] failed handling operation rollback -- java.util.concurrent.TimeoutException: java.util.concurrent.TimeoutException

                at org.jboss.as.controller.OperationContextImpl.waitForRemovals(OperationContextImpl.java:506)

                at org.jboss.as.controller.AbstractOperationContext$Step.handleResult(AbstractOperationContext.java:1369)

                at org.jboss.as.controller.AbstractOperationContext$Step.finalizeInternal(AbstractOperationContext.java:1328)

                at org.jboss.as.controller.AbstractOperationContext$Step.finalizeStep(AbstractOperationContext.java:1311)

                at org.jboss.as.controller.AbstractOperationContext$Step.access$300(AbstractOperationContext.java:1185)

                at org.jboss.as.controller.AbstractOperationContext.executeResultHandlerPhase(AbstractOperationContext.java:767)

                at org.jboss.as.controller.AbstractOperationContext.processStages(AbstractOperationContext.java:644)

                at org.jboss.as.controller.AbstractOperationContext.executeOperation(AbstractOperationContext.java:370)

                at org.jboss.as.controller.OperationContextImpl.executeOperation(OperationContextImpl.java:1336)

                at org.jboss.as.controller.ModelControllerImpl.boot(ModelControllerImpl.java:485)

                at org.jboss.as.controller.AbstractControllerService.boot(AbstractControllerService.java:387)

                at org.jboss.as.controller.AbstractControllerService.boot(AbstractControllerService.java:349)

                at org.jboss.as.server.ServerService.boot(ServerService.java:392)

                at org.jboss.as.server.ServerService.boot(ServerService.java:365)

                at org.jboss.as.controller.AbstractControllerService$1.run(AbstractControllerService.java:299)

                at java.lang.Thread.run(Thread.java:745)
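
            As an aside, the [300] seconds in the WFLYCTL0348 message above is the management blocking timeout. While investigating, it can be raised so that a slow cluster rejoin rolls back less aggressively; a sketch, assuming the standard `jboss.as.management.blocking.timeout` system property is set in the host JVM options:

```shell
# Sketch: raise the service-container stability timeout from the default
# 300 s to 600 s. Assumption: JVM options are configured in domain.conf
# (or standalone.conf for a standalone server).
JAVA_OPTS="$JAVA_OPTS -Djboss.as.management.blocking.timeout=600"
```

            This does not fix the underlying replication timeout, but it separates "the rejoin is slow" from "the rejoin never completes".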

             

            It seems to me that if I upgrade to WF10 Final, I'll get the fix for this bug [1], which will probably give me more information on why the timeout happened.

             

            [1] - [ISPN-5613] Replication timeouts should show more information of the affected entries - JBoss Issue Tracker
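
            Until then, the replication timeout itself (the ISPN000286 warning above) can be raised as a stopgap so that a briefly overloaded member does not trip it; a sketch via jboss-cli, in which the profile, cache-container, and cache names are all assumptions to be replaced with the actual ones from the profile in use:

```shell
# Sketch: raise the remote-timeout (in milliseconds) of a clustered cache.
# Assumptions: profile "ha", cache-container "web", cache "dist".
$JBOSS_HOME/bin/jboss-cli.sh --connect <<'EOF'
/profile=ha/subsystem=infinispan/cache-container=web/distributed-cache=dist:write-attribute(name=remote-timeout,value=30000)
EOF
```

            Raising the timeout only buys headroom; the TRACE logs are still needed to find out why replication stalls in the first place.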

            • 3. Re: Troubleshooting JGroups/Infinispan timeouts?
              rhusar

              A significant number of issues have been resolved, and component upgrades with fixes have landed in WF 10, with more in the upcoming version (see master and the PR queue on GitHub). So please test on the latest version (at least the latest released one), because it's unmanageable to investigate bugs that were likely already resolved.

              • 4. Re: Troubleshooting JGroups/Infinispan timeouts?
                thiago.presa

                Yes, I'm already working on upgrading our Wildfly clusters. Thanks!