11 Replies Latest reply on Jul 8, 2016 12:32 PM by mohilkhare

Infinispan Singleton silently dies in wildfly 9 cluster setup

mohilkhare Jun 10, 2016 4:34 PM

Hello,

I am running WildFly 9 in a three-node cluster (standalone-full-ha.xml) and use a singleton service for some of our operations. Sometimes, under heavy load, the singleton service silently dies without logging any error or exception. There is not even a message such as "Failed to get quorum..".

When the load (number of concurrent requests) on WildFly drops, the cluster still does not recover, i.e. no node reactivates the singleton. The only way to start the singleton again is to restart WildFly manually.

Following is my standalone-full-ha.xml config for Infinispan and JGroups.

10.0.1.32[7600],10.0.1.38[7600],10.0.1.39[7600]</property>

</property>

</protocol>

5000

</property>

</protocol>

</stack>

<cache-container aliases="singleton cluster" default-cache="default" module="org.wildfly.clustering.server" name="server">
    <replicated-cache mode="ASYNC" name="default">
        <state-transfer enabled="true" timeout="300000"/>
    </replicated-cache>
</cache-container>
<cache-container default-cache="session" module="org.wildfly.clustering.web.infinispan" name="web">
    <replicated-cache mode="ASYNC" name="session">
        <state-transfer enabled="true" timeout="300000"/>
    </replicated-cache>
</cache-container>

</subsystem>

Following is the Java code that we use to activate and start the singleton in the cluster:

public class SingletonServiceActivator implements ServiceActivator {

    public static final ServiceName SINGLETON_SERVICE_NAME =
            ServiceName.JBOSS.append("ha", "singleton");

    // Cache container and cache that back the singleton (see the config above)
    private static final String CONTAINER_NAME = "server";
    private static final String CACHE_NAME = "default";

    @Override
    public void activate(ServiceActivatorContext context) throws ServiceRegistryException {
        int quorum = 2;
        InjectedValue<ServerEnvironment> env = new InjectedValue<>();
        SingletonServiceClient srv = new SingletonServiceClient(env);

        // Look up the singleton builder factory bound to the "server" container / "default" cache
        ServiceController<?> factoryService = context.getServiceRegistry().getRequiredService(
                SingletonServiceBuilderFactory.SERVICE_NAME.append(CONTAINER_NAME, CACHE_NAME));
        SingletonServiceBuilderFactory factory = (SingletonServiceBuilderFactory) factoryService.getValue();
        SingletonElectionPolicy policy = new SimpleSingletonElectionPolicy(0);

        factory.createSingletonServiceBuilder(SINGLETON_SERVICE_NAME, srv)
                .requireQuorum(quorum)
                .electionPolicy(policy)
                .build(new DelegatingServiceContainer(context.getServiceTarget(), context.getServiceRegistry()))
                .addDependency(ServerEnvironmentService.SERVICE_NAME, ServerEnvironment.class, env)
                .setInitialMode(ServiceController.Mode.ACTIVE)
                .install();
    }
}

public final class SingletonServiceClient extends AbstractService<Serializable> {

    private final Value<ServerEnvironment> env;

    public SingletonServiceClient(Value<ServerEnvironment> env) {
        this.env = env;
    }

    @Override
    public void start(StartContext startContext) {
        // startContext.
        log("SingletonService started");
        // do work
    }

    @Override
    public void stop(StopContext stopContext) {
        log("SingletonService stopped"); // THIS NEVER GETS CALLED
        // stop
    }
}

Is there something wrong in the config or in the way I am activating and starting the singleton?

I thought there could be a connectivity issue between the cluster nodes that prevents them from reaching the quorum needed to start the singleton. Just to experiment, I changed the quorum to 1, but I still sometimes see this issue under heavy load.
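For readers unfamiliar with the two settings used above, here is a minimal, self-contained sketch of what requireQuorum(2) and SimpleSingletonElectionPolicy(0) amount to conceptually. It is a simplified model, not WildFly's implementation: the class and method names are hypothetical, and the candidate list is assumed to be the set of nodes currently known to provide the service, which need not match the full cluster view.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical model of a quorum-gated, position-based singleton election.
final class QuorumElectionSketch {

    /**
     * Returns the elected provider, or empty when there are no candidates
     * or fewer candidates than the required quorum.
     */
    static Optional<String> elect(List<String> candidates, int quorum, int position) {
        int size = candidates.size();
        if (size == 0 || size < quorum) {
            return Optional.empty(); // no provider is elected without quorum
        }
        // Normalise the position so negative or oversized values wrap around.
        int index = ((position % size) + size) % size;
        return Optional.of(candidates.get(index));
    }

    public static void main(String[] args) {
        List<String> three = Arrays.asList("node1", "node2", "node3");
        List<String> one = Arrays.asList("node3");

        System.out.println(elect(three, 2, 0)); // quorum met, first candidate wins
        System.out.println(elect(one, 2, 0));   // quorum not met, nobody is elected
        System.out.println(elect(one, 1, 0));   // quorum of 1 is met by a lone node
    }
}
```

In this model a quorum of 1 can always be met by a lone node, so lowering the quorum only helps if the candidate list itself is kept up to date after a split and merge.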

Also, is there a way to monitor the state of the singleton from application code, and to trigger it from there?

I will really appreciate some help or suggestions on this issue.

Thanks

Mohil

1. Re: Infinispan Singleton silently dies in wildfly 9 cluster setup

pferraro Jun 13, 2016 2:08 PM (in response to mohilkhare)

First, you don't need to use DelegatingServiceContainer to build your singleton service. Just use the ServiceTarget from the ServiceActivatorContext.

Second, what do you need from ServerEnvironment? The wildfly-server module, containing this class, is not public.

Third, can you paste your log?
2. Re: Infinispan Singleton silently dies in wildfly 9 cluster setup

mohilkhare Jun 13, 2016 3:06 PM (in response to pferraro)

Sure, we can get rid of DelegatingServiceContainer. We added ServerEnvironment for future use cases, in case we want to extract any values from it.
But do you think this is the reason we are seeing the issue with the singleton? Does the config for JGroups (we are using TCP) and Infinispan look fine to you? Do you think we should use MERGE3 instead of MERGE2? As per the documentation MERGE3 is better than MERGE2, but it is not that efficient for TCP.

Anyway, just to experiment, I ran the 3-node cluster with the merge protocol in JGroups set to MERGE3. It ran fine for some hours, then the singleton suddenly stopped. Luckily, this time I have some logs showing that the callback for stopping the singleton was triggered. From the logs it looks like, within a short span of time, the nodes first failed to meet quorum (around 13-Jun-2016 10:32:05), then Node1 was elected as the singleton provider with consensus from all nodes in the cluster (around 13-Jun-2016 10:37:43), then they all failed to reach quorum again and the singleton stopped (around 13-Jun-2016 10:42:24). It did not start again.

Cluster Node IDs are: 62508e13-31f1-45b5-a80b-b50319158ce1, c1564e13-c3ff-4adb-a008-bfff42c9a23d, d57132d8-63f6-4291-93ac-f1911ba8aa06

Node1 (c1564e13-c3ff-4adb-a008-bfff42c9a23d):
13-Jun-2016 10:32:05,687 INFO [JGroupsTransport] (Incoming-16,ee,c1564e13-c3ff-4adb-a008-bfff42c9a23d) ISPN000094: Received new cluster view for channel server: [62508e13-31f1-45b5-a80b-b50319158ce1|5] (2) [62508e13-31f1-45b5-a80b-b50319158ce1, c1564e13-c3ff-4adb-a008-bfff42c9a23d]
13-Jun-2016 10:32:05,688 INFO [JGroupsTransport] (Incoming-16,ee,c1564e13-c3ff-4adb-a008-bfff42c9a23d) ISPN000094: Received new cluster view for channel web: [62508e13-31f1-45b5-a80b-b50319158ce1|5] (2) [62508e13-31f1-45b5-a80b-b50319158ce1, c1564e13-c3ff-4adb-a008-bfff42c9a23d]
13-Jun-2016 10:32:05,701 ERROR [server] (notification-thread--p2-t1) WFLYCLSV0006: Failed to reach quorum of 2 for jboss.ha.singleton service. No singleton master will be elected.
13-Jun-2016 10:32:05,702 INFO [server] (notification-thread--p2-t1) WFLYCLSV0002: This node will no longer operate as the singleton provider of the jboss.ha.singleton service
13-Jun-2016 10:32:05,729 INFO [SingletonServiceInterface] (MSC service thread 1-3) SingletonService stopped

13-Jun-2016 10:37:43,045 WARN [server] (Incoming-5,ee,c1564e13-c3ff-4adb-a008-bfff42c9a23d) WFLYCLSV0007: Just reached required quorum of 2 for jboss.ha.singleton service. If this cluster loses another member, no node will be chosen to provide this service.
13-Jun-2016 10:37:43,046 INFO [server] (Incoming-5,ee,c1564e13-c3ff-4adb-a008-bfff42c9a23d) WFLYCLSV0003: c1564e13-c3ff-4adb-a008-bfff42c9a23d elected as the singleton provider of the jboss.ha.singleton service
13-Jun-2016 10:37:43,046 INFO [server] (Incoming-5,ee,c1564e13-c3ff-4adb-a008-bfff42c9a23d) WFLYCLSV0001: This node will now operate as the singleton provider of the jboss.ha.singleton service
13-Jun-2016 10:37:43,061 INFO [VDaemonMessageProcessor] (vdaemon-message-processor-4) [5143] Received message from vdaemon for device IP : [172.18.244.164] Status : [DOWN] Host name: [vedge5884] Personality: [VEDGE] UUID: [195b1d3b-8f3f-481e-9a6f-0c50b0052001] Is stand alone vbond: [false] First time after ZTP: [false] Pseudo Commit Status: [NONE] Commit after control down:[false] Instance Id: [6] Transaction Id: [707] Software version: 15.4.6-116

13-Jun-2016 10:37:43,085 INFO [SingletonServiceInterface] (MSC service thread 1-5) SingletonService started

13-Jun-2016 10:42:24,648 INFO [JGroupsTransport] (Incoming-13,ee,c1564e13-c3ff-4adb-a008-bfff42c9a23d) ISPN000094: Received new cluster view for channel server: [62508e13-31f1-45b5-a80b-b50319158ce1|7] (2) [62508e13-31f1-45b5-a80b-b50319158ce1, c1564e13-c3ff-4adb-a008-bfff42c9a23d]
13-Jun-2016 10:42:24,650 INFO [JGroupsTransport] (Incoming-13,ee,c1564e13-c3ff-4adb-a008-bfff42c9a23d) ISPN000094: Received new cluster view for channel web: [62508e13-31f1-45b5-a80b-b50319158ce1|7] (2) [62508e13-31f1-45b5-a80b-b50319158ce1, c1564e13-c3ff-4adb-a008-bfff42c9a23d]
13-Jun-2016 10:42:24,683 ERROR [server] (notification-thread--p2-t1) WFLYCLSV0006: Failed to reach quorum of 2 for jboss.ha.singleton service. No singleton master will be elected.
13-Jun-2016 10:42:24,684 INFO [server] (notification-thread--p2-t1) WFLYCLSV0002: This node will no longer operate as the singleton provider of the jboss.ha.singleton service
13-Jun-2016 10:42:24,684 INFO [SingletonServiceInterface] (MSC service thread 1-4) SingletonService stopped

Node2 (62508e13-31f1-45b5-a80b-b50319158ce1):

13-Jun-2016 10:32:13,550 INFO [JGroupsTransport] (Incoming-14,ee,62508e13-31f1-45b5-a80b-b50319158ce1) ISPN000094: Received new cluster view for channel server: [62508e13-31f1-45b5-a80b-b50319158ce1|5] (2) [62508e13-31f1-45b5-a80b-b50319158ce1, c1564e13-c3ff-4adb-a008-bfff42c9a23d]
13-Jun-2016 10:32:13,571 INFO [JGroupsTransport] (Incoming-14,ee,62508e13-31f1-45b5-a80b-b50319158ce1) ISPN000094: Received new cluster view for channel web: [62508e13-31f1-45b5-a80b-b50319158ce1|5] (2) [62508e13-31f1-45b5-a80b-b50319158ce1, c1564e13-c3ff-4adb-a008-bfff42c9a23d]
13-Jun-2016 10:32:13,639 ERROR [server] (Incoming-16,ee,62508e13-31f1-45b5-a80b-b50319158ce1) WFLYCLSV0006: Failed to reach quorum of 2 for jboss.ha.singleton service. No singleton master will be elected.

13-Jun-2016 10:37:50,945 WARN [server] (Incoming-17,ee,62508e13-31f1-45b5-a80b-b50319158ce1) WFLYCLSV0007: Just reached required quorum of 2 for jboss.ha.singleton service. If this cluster loses another member, no node will be chosen to provide this service.
13-Jun-2016 10:37:50,974 INFO [server] (Incoming-17,ee,62508e13-31f1-45b5-a80b-b50319158ce1) WFLYCLSV0003: c1564e13-c3ff-4adb-a008-bfff42c9a23d elected as the singleton provider of the jboss.ha.singleton service

13-Jun-2016 10:42:32,566 INFO [JGroupsTransport] (Incoming-13,ee,62508e13-31f1-45b5-a80b-b50319158ce1) ISPN000094: Received new cluster view for channel server: [62508e13-31f1-45b5-a80b-b50319158ce1|7] (2) [62508e13-31f1-45b5-a80b-b50319158ce1, c1564e13-c3ff-4adb-a008-bfff42c9a23d]
13-Jun-2016 10:42:32,567 INFO [JGroupsTransport] (Incoming-13,ee,62508e13-31f1-45b5-a80b-b50319158ce1) ISPN000094: Received new cluster view for channel web: [62508e13-31f1-45b5-a80b-b50319158ce1|7] (2) [62508e13-31f1-45b5-a80b-b50319158ce1, c1564e13-c3ff-4adb-a008-bfff42c9a23d]
13-Jun-2016 10:42:32,601 ERROR [server] (Incoming-15,ee,62508e13-31f1-45b5-a80b-b50319158ce1) WFLYCLSV0006: Failed to reach quorum of 2 for jboss.ha.singleton service. No singleton master will be elected.
13-Jun-2016 10:42:35,381 WARNING [NAKACK2] (Incoming-20,ee,62508e13-31f1-45b5-a80b-b50319158ce1) JGRP000011: 62508e13-31f1-45b5-a80b-b50319158ce1: dropped message 320 from non-member d57132d8-63f6-4291-93ac-f1911ba8aa06 (view=[62508e13-31f1-45b5-a80b-b50319158ce1|7] (2) [62508e13-31f1-45b5-a80b-b50319158ce1, c1564e13-c3ff-4adb-a008-bfff42c9a23d])
13-Jun-2016 10:43:05,186 INFO [JGroupsTransport] (Incoming-4,ee,62508e13-31f1-45b5-a80b-b50319158ce1) ISPN000093: Received new, MERGED cluster view for channel server: MergeView::[d57132d8-63f6-4291-93ac-f1911ba8aa06|8] (3) [d57132d8-63f6-4291-93ac-f1911ba8aa06, c1564e13-c3ff-4adb-a008-bfff42c9a23d, 62508e13-31f1-45b5-a80b-b50319158ce1], 2 subgroups: [62508e13-31f1-45b5-a80b-b50319158ce1|7] (2) [62508e13-31f1-45b5-a80b-b50319158ce1, c1564e13-c3ff-4adb-a008-bfff42c9a23d], [62508e13-31f1-45b5-a80b-b50319158ce1|6] (3) [62508e13-31f1-45b5-a80b-b50319158ce1, c1564e13-c3ff-4adb-a008-bfff42c9a23d, d57132d8-63f6-4291-93ac-f1911ba8aa06]
13-Jun-2016 10:43:05,187 INFO [JGroupsTransport] (Incoming-4,ee,62508e13-31f1-45b5-a80b-b50319158ce1) ISPN000093: Received new, MERGED cluster view for channel web: MergeView::[d57132d8-63f6-4291-93ac-f1911ba8aa06|8] (3) [d57132d8-63f6-4291-93ac-f1911ba8aa06, c1564e13-c3ff-4adb-a008-bfff42c9a23d, 62508e13-31f1-45b5-a80b-b50319158ce1], 2 subgroups: [62508e13-31f1-45b5-a80b-b50319158ce1|7] (2) [62508e13-31f1-45b5-a80b-b50319158ce1, c1564e13-c3ff-4adb-a008-bfff42c9a23d], [62508e13-31f1-45b5-a80b-b50319158ce1|6] (3) [62508e13-31f1-45b5-a80b-b50319158ce1, c1564e13-c3ff-4adb-a008-bfff42c9a23d, d57132d8-63f6-4291-93ac-f1911ba8aa06]

Node3 (d57132d8-63f6-4291-93ac-f1911ba8aa06): No logs present here for Failed to reach quorum

13-Jun-2016 10:37:51,045 WARN [server] (ServerService Thread Pool -- 74) WFLYCLSV0007: Just reached required quorum of 2 for jboss.ha.singleton service. If this cluster loses another member, no node will be chosen to provide this service.
13-Jun-2016 10:37:51,046 INFO [server] (ServerService Thread Pool -- 74) WFLYCLSV0003: c1564e13-c3ff-4adb-a008-bfff42c9a23d elected as the singleton provider of the jboss.ha.singleton service

Looking forward to hearing from you soon.

Thanks
Mohil
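The quorum timeline described above was read off the logs by hand. As a rough sketch of how the same sequence could be extracted automatically, the following self-contained Java program picks out the WFLYCLSV singleton messages from [server] log lines. The class name, the regular expression, and the short descriptions are assumptions made for illustration, based only on the five message codes visible in the logs above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts singleton-related events from server log lines.
final class SingletonLogTimeline {

    // Captures the timestamp and the WFLYCLSV message code of a [server] log line.
    private static final Pattern EVENT = Pattern.compile(
            "^(\\S+ \\S+)\\s+\\w+\\s+\\[server\\]\\s+\\(.*?\\)\\s+(WFLYCLSV\\d{4}):");

    // Short labels paraphrasing the messages seen in the logs above.
    static String describe(String code) {
        switch (code) {
            case "WFLYCLSV0001": return "this node became the singleton provider";
            case "WFLYCLSV0002": return "this node stopped being the singleton provider";
            case "WFLYCLSV0003": return "a provider was elected";
            case "WFLYCLSV0006": return "quorum not reached";
            case "WFLYCLSV0007": return "quorum just reached";
            default: return "other clustering message";
        }
    }

    // Returns one "timestamp code description" entry per matching line, in input order.
    static List<String> timeline(List<String> logLines) {
        List<String> events = new ArrayList<>();
        for (String line : logLines) {
            Matcher m = EVENT.matcher(line);
            if (m.find()) {
                events.add(m.group(1) + " " + m.group(2) + " " + describe(m.group(2)));
            }
        }
        return events;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "13-Jun-2016 10:32:05,688 INFO [JGroupsTransport] (Incoming-16,ee,c1564e13-c3ff-4adb-a008-bfff42c9a23d) ISPN000094: Received new cluster view for channel web",
                "13-Jun-2016 10:32:05,701 ERROR [server] (notification-thread--p2-t1) WFLYCLSV0006: Failed to reach quorum of 2 for jboss.ha.singleton service. No singleton master will be elected.",
                "13-Jun-2016 10:37:43,046 INFO [server] (Incoming-5,ee,c1564e13-c3ff-4adb-a008-bfff42c9a23d) WFLYCLSV0001: This node will now operate as the singleton provider of the jboss.ha.singleton service");
        timeline(sample).forEach(System.out::println);
    }
}
```

Running it over each node's log and comparing the resulting timelines side by side makes it easier to spot a node, such as Node3 here, that never logged the quorum failure.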
3. Re: Infinispan Singleton silently dies in wildfly 9 cluster setup

pferraro Jun 13, 2016 3:54 PM (in response to mohilkhare)

You've encountered [WFLY-4748] Singleton service fails to start after repetitive cluster split with "Failed to reach quorum of 1" - JBoss I…
which was fixed in WF10.
4. Re: Infinispan Singleton silently dies in wildfly 9 cluster setup

mohilkhare Jun 13, 2016 5:56 PM (in response to pferraro)

Thanks a lot Paul for your prompt reply. I understand there is a bug that has been fixed with wildfly10.
We are planning to upgrade wildfly 9 to wildfly 10 in our next product cycle. Meanwhile can you suggest some workaround for wildfly 9 ? Also can you comment on MERGE2 vs MERGE3 that I asked before ?

Looking forward to hearing from you soon.

Thanks
Mohil
5. Re: Infinispan Singleton silently dies in wildfly 9 cluster setup

pferraro Jun 13, 2016 6:30 PM (in response to mohilkhare)

The issue is that the ServiceProviderRegistry service isn't listening to merge events from Infinispan. If cherry-picking the associated commit from the master branch is not an option, the only workaround I can think of is to prevent the cluster from splitting in the event of CPU starvation. You can do this by removing FD from your protocol stack. FD_SOCK will still detect crashed members, but not hung members (a CPU-starved node appears hung to other members) or other failures that don't close the socket (e.g. kernel panic, switch failure, etc).
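To illustrate the trade-off described above, here is a deliberately simplified Java model of the two detection styles: a heartbeat detector that suspects a member once it has been silent for too long, and a socket detector that suspects a member only when its connection closes. This is not JGroups code. The class, method names, and numeric values are hypothetical, chosen only to show why a CPU-starved node trips the first check but not the second.

```java
// Hypothetical model contrasting heartbeat-based and socket-based failure detection.
final class FailureDetectionSketch {

    /** Heartbeat style: suspect a member that has not answered within timeout * maxTries. */
    static boolean heartbeatSuspects(long millisSinceLastAck, long timeoutMillis, int maxTries) {
        return millisSinceLastAck > timeoutMillis * maxTries;
    }

    /** Socket style: suspect a member only when its connection has been closed. */
    static boolean socketSuspects(boolean socketClosed) {
        return socketClosed;
    }

    public static void main(String[] args) {
        long timeoutMillis = 3000; // assumed value for illustration
        int maxTries = 5;          // assumed value for illustration

        // CPU-starved node: the process is alive and the socket is open, but it is silent for 20 s.
        System.out.println("starved, heartbeat: " + heartbeatSuspects(20_000, timeoutMillis, maxTries));
        System.out.println("starved, socket:    " + socketSuspects(false));

        // Crashed process: the operating system closes the socket.
        System.out.println("crashed, socket:    " + socketSuspects(true));
    }
}
```

With only the socket check in place, the starved node stays in the view and no split occurs, at the cost of not noticing failures that leave the socket open.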
6. Re: Infinispan Singleton silently dies in wildfly 9 cluster setup

mohilkhare Jun 13, 2016 9:21 PM (in response to pferraro)

Thanks a lot Paul for your help..
7. Re: Infinispan Singleton silently dies in wildfly 9 cluster setup

mohilkhare Jun 23, 2016 1:44 PM (in response to mohilkhare)

Hello Paul,

I upgraded WildFly to version 10.0.0 and ran into the following exceptions:

Caused by: org.infinispan.commons.CacheException: ISPN000242: Missing foreign externalizer with id=270, either externalizer was not configured by client, or module lifecycle implementation adding externalizer was not loaded properly
        at org.infinispan.marshall.core.ExternalizerTable.readObject(ExternalizerTable.java:221)
        at org.infinispan.marshall.core.JBossMarshaller$ExternalizerTableProxy.readObject(JBossMarshaller.java:153)
        at org.jboss.marshalling.river.RiverUnmarshaller.doReadObject(RiverUnmarshaller.java:354)
        at org.jboss.marshalling.river.RiverUnmarshaller.doReadObject(RiverUnmarshaller.java:209)
        at org.jboss.marshalling.AbstractObjectInput.readObject(AbstractObjectInput.java:41)
        at org.infinispan.marshall.exts.ReplicableCommandExternalizer.readParameters(ReplicableCommandExternalizer.java:101)
        at org.infinispan.marshall.exts.CacheRpcCommandExternalizer.readObject(CacheRpcCommandExternalizer.java:154)
        at org.infinispan.marshall.exts.CacheRpcCommandExternalizer.readObject(CacheRpcCommandExternalizer.java:65)
        at org.infinispan.marshall.core.ExternalizerTable$ExternalizerAdapter.readObject(ExternalizerTable.java:436)
        at org.infinispan.marshall.core.ExternalizerTable.readObject(ExternalizerTable.java:227)
        at org.infinispan.marshall.core.JBossMarshaller$ExternalizerTableProxy.readObject(JBossMarshaller.java:153)
        at org.jboss.marshalling.river.RiverUnmarshaller.doReadObject(RiverUnmarshaller.java:354)
        at org.jboss.marshalling.river.RiverUnmarshaller.doReadObject(RiverUnmarshaller.java:209)
        at org.jboss.marshalling.AbstractObjectInput.readObject(AbstractObjectInput.java:41)
        at org.infinispan.commons.marshall.jboss.AbstractJBossMarshaller.objectFromObjectStream(AbstractJBossMarshaller.java:134)
        at org.infinispan.marshall.core.VersionAwareMarshaller.objectFromByteBuffer(VersionAwareMarshaller.java:101)
        at org.infinispan.commons.marshall.AbstractDelegatingMarshaller.objectFromByteBuffer(AbstractDelegatingMarshaller.java:80)
        at org.infinispan.remoting.transport.jgroups.MarshallerAdapter.objectFromBuffer(MarshallerAdapter.java:28)
        at org.jboss.as.clustering.infinispan.ChannelTransport$1.objectFromBuffer(ChannelTransport.java:75)
        at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.handle(CommandAwareRpcDispatcher.java:298)
        ... 27 more

22-Jun-2016 18:17:56,080 WARN [StateConsumerImpl] (transport-thread--p15-t12) ISPN000209: Failed to retrieve transactions for segments [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59] of cache vmanage.war from node 2c88d8e8-7794-4526-80e7-da3447cc4bfd: org.infinispan.remoting.RemoteException: ISPN000217: Received exception from 2c88d8e8-7794-4526-80e7-da3447cc4bfd, see cause for remote stack trace

        at org.infinispan.remoting.transport.AbstractTransport.checkResponse(AbstractTransport.java:44)
        at org.infinispan.remoting.transport.jgroups.JGroupsTransport.checkRsp(JGroupsTransport.java:760)
        at org.infinispan.remoting.transport.jgroups.JGroupsTransport.lambda$invokeRemotelyAsync$72(JGroupsTransport.java:599)
        at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:602) [rt.jar:1.8.0_72]
        at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577) [rt.jar:1.8.0_72]
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) [rt.jar:1.8.0_72]
        at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962) [rt.jar:1.8.0_72]
        at org.infinispan.remoting.transport.jgroups.SingleResponseFuture.futureDone(SingleResponseFuture.java:30)
        at org.jgroups.blocks.Request.checkCompletion(Request.java:169)
        at org.jgroups.blocks.UnicastRequest.receiveResponse(UnicastRequest.java:83)
        at org.jgroups.blocks.RequestCorrelator.receiveMessage(RequestCorrelator.java:398)
        at org.jgroups.blocks.RequestCorrelator.receive(RequestCorrelator.java:250)
        at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.up(MessageDispatcher.java:684)
        at org.jgroups.JChannel.up(JChannel.java:738)
...

        at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_72]

Caused by: org.infinispan.commons.CacheException: Problems invoking command.
        at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.handle(CommandAwareRpcDispatcher.java:318)
        at org.jgroups.blocks.RequestCorrelator.handleRequest(RequestCorrelator.java:460)
        at org.jgroups.blocks.RequestCorrelator.receiveMessage(RequestCorrelator.java:377)
        at org.jgroups.blocks.RequestCorrelator.receive(RequestCorrelator.java:250)
        at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.up(MessageDispatcher.java:675)
        at org.jgroups.JChannel.up(JChannel.java:739)
        at org.jgroups.fork.ForkProtocolStack.up(ForkProtocolStack.java:118)
        at org.jgroups.stack.Protocol.up(Protocol.java:374)
        at org.jgroups.protocols.FORK.up(FORK.java:103)
        at org.jgroups.protocols.RSVP.up(RSVP.java:201)
        at org.jgroups.protocols.FRAG2.up(FRAG2.java:165)
        at org.jgroups.protocols.FlowControl.up(FlowControl.java:394)
        at org.jgroups.protocols.pbcast.GMS.up(GMS.java:1045)
        at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:234)
        at org.jgroups.protocols.UNICAST3.deliverMessage(UNICAST3.java:1064)
        at org.jgroups.protocols.UNICAST3.handleDataReceived(UNICAST3.java:779)
        at org.jgroups.protocols.UNICAST3.up(UNICAST3.java:426)
        at org.jgroups.protocols.pbcast.NAKACK2.up(NAKACK2.java:652)
        at org.jgroups.protocols.VERIFY_SUSPECT.up(VERIFY_SUSPECT.java:155)
        at org.jgroups.protocols.FD.up(FD.java:260)
        at org.jgroups.protocols.FD_SOCK.up(FD_SOCK.java:311)

I am using the following Infinispan and JGroups configuration in my standalone-full-ha.xml:

<subsystem xmlns="urn:jboss:domain:infinispan:4.0">
            <cache-container name="server" aliases="singleton cluster" default-cache="default" module="org.wildfly.clustering.server">
                <transport lock-timeout="120000"/>
                <replicated-cache name="default" mode="ASYNC">
                    <transaction locking="OPTIMISTIC" mode="BATCH"/>
                    <state-transfer timeout="300000"/>
                </replicated-cache>
            </cache-container>
            <cache-container name="web" default-cache="session" module="org.wildfly.clustering.web.infinispan">
                <transport lock-timeout="120000"/>
                <replicated-cache name="session" mode="ASYNC">
                    <locking isolation="READ_COMMITTED"/>
                    <transaction locking="PESSIMISTIC" mode="BATCH"/>
                    <state-transfer timeout="180000"/>
                </replicated-cache>
            </cache-container>
</subsystem>

<subsystem xmlns="urn:jboss:domain:jgroups:4.0">
            <channels default="ee">
                <channel name="ee" stack="tcp"/>
            </channels>
            <stacks default="tcp">
                <stack name="tcp">
                    <transport type="TCP" socket-binding="jgroups-tcp"/>
                    <protocol type="TCPPING">
                        <property name="initial_hosts">
                            10.0.1.38[7600],10.0.1.32[7600],10.0.1.39[7600]
                        </property>
                        <property name="port_range">
                            0
                        </property>
                    </protocol>
                    <protocol type="MERGE2"/>
                    <protocol type="FD_SOCK" socket-binding="jgroups-tcp-fd"/>
                    <protocol type="FD"/>
                    <protocol type="VERIFY_SUSPECT"/>
                    <protocol type="pbcast.NAKACK2"/>
                    <protocol type="UNICAST3"/>
                    <protocol type="pbcast.STABLE"/>
                    <protocol type="pbcast.GMS">
                        <property name="join_timeout">
                            5000
                        </property>
                    </protocol>
                    <protocol type="MFC"/>
                    <protocol type="FRAG2"/>
                    <protocol type="RSVP"/>
                </stack>
            </stacks>
        </subsystem>

Am I missing something in my configuration, or is some parameter tuning required?

Looking forward to hearing from you soon.

Thanks
Mohil
8. Re: Infinispan Singleton silently dies in wildfly 9 cluster setup

pferraro Jun 24, 2016 10:50 AM (in response to mohilkhare)
I can think of a couple of reasons why you might run into this exception:
1. There is a member of your cluster that is still running an old WF version. All cluster members must use the same version.
2. You have passivated session state that was persisted using an old WF version. You should clear any state prior to starting your upgraded servers.
9. Re: Infinispan Singleton silently dies in wildfly 9 cluster setup

mohilkhare Jul 7, 2016 11:20 PM (in response to pferraro)

Thanks a lot, Paul. We ran into this issue because some nodes were running the old WF version while others were running the new one.

We are still testing WildFly 10 in our local environment and, as I mentioned before, we can't upgrade production from WildFly 9 to WildFly 10 until the next release cycle. In the meantime I tried your suggestion on WildFly 9, i.e. removing FD, and it seems to be working fine so far. I also changed the merge protocol from MERGE2 to MERGE3, as it is better, i.e.:

<subsystem xmlns="urn:jboss:domain:jgroups:4.0">
            <channels default="ee">
                <channel name="ee" stack="tcp"/>
            </channels>
            <stacks default="tcp">
                <stack name="tcp">
                    <transport type="TCP" socket-binding="jgroups-tcp"/>
                    <protocol type="TCPPING">
                        <property name="initial_hosts">
                            10.0.1.38[7600],10.0.1.32[7600],10.0.1.39[7600]
                        </property>
                        <property name="port_range">
                            0
                        </property>
                    </protocol>
                    <protocol type="MERGE3"/>
                    <protocol type="FD_SOCK" socket-binding="jgroups-tcp-fd"/>
                    <protocol type="VERIFY_SUSPECT"/>
                    <protocol type="pbcast.NAKACK2"/>
                    <protocol type="UNICAST3"/>
                    <protocol type="pbcast.STABLE"/>
                    <protocol type="pbcast.GMS">
                        <property name="join_timeout">
                            5000
                        </property>
                    </protocol>
                    <protocol type="MFC"/>
                    <protocol type="FRAG2"/>
                    <protocol type="RSVP"/>
                </stack>
            </stacks>
        </subsystem>

However, some of the cluster nodes running WildFly 9 are now hitting the following stack trace, which prevents any login or other activity on the application server. Basically the node just gets locked. I hope this is not because I removed FD and changed the merge protocol from MERGE2 to MERGE3.

/var/log/nms/vmanage-server.log.3:07-Jul-2016 16:52:34,138 ERROR [InvocationContextInterceptor] (default task-182) ISPN000136: Execution error: org.infinispan.util.concurrent.TimeoutException: ISPN000299: Unable to acquire lock after 15 seconds for key ti1_RSFpouZcoDwSOUHCk1XQep0p8I0Py5yc0QQv and requestor GlobalTransaction:<c1564e13-c3ff-4adb-a008-bfff42c9a23d>:2188872:local. Lock is held by GlobalTransaction:<c1564e13-c3ff-4adb-a008-bfff42c9a23d>:2188718:local, while request came from local
/var/log/nms/vmanage-server.log.3:07-Jul-2016 16:52:34,136 ERROR [InvocationContextInterceptor] (default task-131) ISPN000136: Execution error: org.infinispan.util.concurrent.TimeoutException: ISPN000299: Unable to acquire lock after 15 seconds for key XBQC-N9AKGDt8nc11bYhDWgAzvZoi5M6-EFb66fs and requestor GlobalTransaction:<c1564e13-c3ff-4adb-a008-bfff42c9a23d>:2188915:local. Lock is held by GlobalTransaction:<c1564e13-c3ff-4adb-a008-bfff42c9a23d>:2188722:local, while request came from local
07-Jul-2016 16:36:07,703 WARN [infinispan] (default task-385) WFLYCLWEBINF0006: Failed to schedule expiration/passivation of session 1RK0J_jLSEXbZi4e8-QgdgDNy6VEbQ3NDTjoqTxs on primary owner.: java.util.concurrent.ExecutionException: java.util.concurrent.TimeoutException: timeout sending message to d57132d8-63f6-4291-93ac-f1911ba8aa06
        at org.jgroups.blocks.UnicastRequest.getValue(UnicastRequest.java:203)
        at org.jgroups.blocks.UnicastRequest.get(UnicastRequest.java:212)
        at org.wildfly.clustering.server.dispatcher.ChannelCommandDispatcher.executeOnNode(ChannelCommandDispatcher.java:151)
        at org.wildfly.clustering.web.infinispan.session.InfinispanSessionManager$2.call(InfinispanSessionManager.java:193)
        at org.wildfly.clustering.web.infinispan.session.InfinispanSessionManager$2.call(InfinispanSessionManager.java:188)
        at org.wildfly.clustering.ee.infinispan.RetryingInvoker.invoke(RetryingInvoker.java:69)
        at org.wildfly.clustering.web.infinispan.session.InfinispanSessionManager.executeOnPrimaryOwner(InfinispanSessionManager.java:196)
        at org.wildfly.clustering.web.infinispan.session.InfinispanSessionManager.schedule(InfinispanSessionManager.java:181)
        at org.wildfly.clustering.web.infinispan.session.InfinispanSessionManager$SchedulableSession.close(InfinispanSessionManager.java:453)
        at org.wildfly.clustering.web.undertow.session.DistributableSession.requestDone(DistributableSession.java:77)
        at io.undertow.servlet.spec.ServletContextImpl.updateSessionAccessTime(ServletContextImpl.java:765) [undertow-servlet-1.2.9.Final.jar:1.2.9.Final]
        at io.undertow.servlet.spec.HttpServletResponseImpl.responseDone(HttpServletResponseImpl.java:548) [undertow-servlet-1.2.9.Final.jar:1.2.9.Final]
        at io.undertow.servlet.handlers.ServletInitialHandler.handleFirstRequest(ServletInitialHandler.java:329) [undertow-servlet-1.2.9.Final.jar:1.2.9.Final]
        at io.undertow.servlet.handlers.ServletInitialHandler.dispatchRequest(ServletInitialHandler.java:261) [undertow-servlet-1.2.9.Final.jar:1.2.9.Final]
        at io.undertow.servlet.handlers.ServletInitialHandler.access$000(ServletInitialHandler.java:80) [undertow-servlet-1.2.9.Final.jar:1.2.9.Final]
        at io.undertow.servlet.handlers.ServletInitialHandler$1.handleRequest(ServletInitialHandler.java:172) [undertow-servlet-1.2.9.Final.jar:1.2.9.Final]
        at io.undertow.server.Connectors.executeRootHandler(Connectors.java:199) [undertow-core-1.2.9.Final.jar:1.2.9.Final]
        at io.undertow.server.HttpServerExchange$1.run(HttpServerExchange.java:774) [undertow-core-1.2.9.Final.jar:1.2.9.Final]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_72]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_72]
        at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_72]
Caused by: java.util.concurrent.TimeoutException: timeout sending message to d57132d8-63f6-4291-93ac-f1911ba8aa06
        ... 21 more

07-Jul-2016 16:36:07,703 WARN [infinispan] (default task-375) WFLYCLWEBINF0006: Failed to schedule expiration/passivation of session 0psukwiO29DjcpLecIPQldb1NRiTz0zWBMn1O3t0 on primary owner.: java.util.concurrent.ExecutionException: java.util.concurrent.TimeoutException: timeout sending message to d57132d8-63f6-4291-93ac-f1911ba8aa06
        at org.jgroups.blocks.UnicastRequest.getValue(UnicastRequest.java:203)
        at org.jgroups.blocks.UnicastRequest.get(UnicastRequest.java:212)
        at org.wildfly.clustering.server.dispatcher.ChannelCommandDispatcher.executeOnNode(ChannelCommandDispatcher.java:151)
        at org.wildfly.clustering.web.infinispan.session.InfinispanSessionManager$2.call(InfinispanSessionManager.java:193)
        at org.wildfly.clustering.web.infinispan.session.InfinispanSessionManager$2.call(InfinispanSessionManager.java:188)
        at org.wildfly.clustering.ee.infinispan.RetryingInvoker.invoke(RetryingInvoker.java:69)
        at org.wildfly.clustering.web.infinispan.session.InfinispanSessionManager.executeOnPrimaryOwner(InfinispanSessionManager.java:196)
        at org.wildfly.clustering.web.infinispan.session.InfinispanSessionManager.schedule(InfinispanSessionManager.java:181)
        at org.wildfly.clustering.web.infinispan.session.InfinispanSessionManager$SchedulableSession.close(InfinispanSessionManager.java:453)
        at org.wildfly.clustering.web.undertow.session.DistributableSession.requestDone(DistributableSession.java:77)
        at io.undertow.servlet.spec.ServletContextImpl.updateSessionAccessTime(ServletContextImpl.java:765) [undertow-servlet-1.2.9.Final.jar:1.2.9.Final]
        at io.undertow.servlet.spec.HttpServletResponseImpl.responseDone(HttpServletResponseImpl.java:548) [undertow-servlet-1.2.9.Final.jar:1.2.9.Final]
        at io.undertow.servlet.handlers.ServletInitialHandler.handleFirstRequest(ServletInitialHandler.java:329) [undertow-servlet-1.2.9.Final.jar:1.2.9.Final]
        at io.undertow.servlet.handlers.ServletInitialHandler.dispatchRequest(ServletInitialHandler.java:261) [undertow-servlet-1.2.9.Final.jar:1.2.9.Final]
        at io.undertow.servlet.handlers.ServletInitialHandler.access$000(ServletInitialHandler.java:80) [undertow-servlet-1.2.9.Final.jar:1.2.9.Final]
        at io.undertow.servlet.handlers.ServletInitialHandler$1.handleRequest(ServletInitialHandler.java:172) [undertow-servlet-1.2.9.Final.jar:1.2.9.Final]
        at io.undertow.server.Connectors.executeRootHandler(Connectors.java:199) [undertow-core-1.2.9.Final.jar:1.2.9.Final]
        at io.undertow.server.HttpServerExchange$1.run(HttpServerExchange.java:774) [undertow-core-1.2.9.Final.jar:1.2.9.Final]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_72]
/Execution error: org.infinispan.util.concurrent.TimeoutException
        at org.infinispan.interceptors.TxInterceptor.visitPrepareCommand(TxInterceptor.java:125) [infinispan-core-7.2.3.Final.jar:7.2.3.Final]
        at org.infinispan.commands.tx.PrepareCommand.acceptVisitor(PrepareCommand.java:123) [infinispan-core-7.2.3.Final.jar:7.2.3.Final]
        at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:97) [infinispan-core-7.2.3.Final.jar:7.2.3.Final]
        at org.infinispan.interceptors.base.CommandInterceptor.handleDefault(CommandInterceptor.java:111) [infinispan-core-7.2.3.Final.jar:7.2.3.Final]
        at org.infinispan.commands.AbstractVisitor.visitPrepareCommand(AbstractVisitor.java:123) [infinispan-core-7.2.3.Final.jar:7.2.3.Final]
        at org.infinispan.statetransfer.TransactionSynchronizerInterceptor.visitPrepareCommand(TransactionSynchronizerInterceptor.java:42) [infinispan-core-7.2.3.Final.jar:7.2.3.Final]
        at org.infinispan.commands.tx.PrepareCommand.acceptVisitor(PrepareCommand.java:123) [infinispan-core-7.2.3.Final.jar:7.2.3.Final]
        at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:97) [infinispan-core-7.2.3.Final.jar:7.2.3.Final]
        at org.infinispan.statetransfer.StateTransferInterceptor.handleTxCommand(StateTransferInterceptor.java:200) [infinispan-core-7.2.3.Final.jar:7.2.3.Final]
        at org.infinispan.statetransfer.StateTransferInterceptor.visitPrepareCommand(StateTransferInterceptor.java:88) [infinispan-core-7.2.3.Final.jar:7.2.3.Final]
        at org.infinispan.commands.tx.PrepareCommand.acceptVisitor(PrepareCommand.java:123) [infinispan-core-7.2.3.Final.jar:7.2.3.Final]
        at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:97) [infinispan-core-7.2.3.Final.jar:7.2.3.Final]
        at org.infinispan.interceptors.base.CommandInterceptor.handleDefault(CommandInterceptor.java:111) [infinispan-core-7.2.3.Final.jar:7.2.3.Final]
        at org.infinispan.commands.AbstractVisitor.visitPrepareCommand(AbstractVisitor.java:123) [infinispan-core-7.2.3.Final.jar:7.2.3.Final]
        at org.infinispan.commands.tx.PrepareCommand.acceptVisitor(PrepareCommand.java:123) [infinispan-core-7.2.3.Final.jar:7.2.3.Final]
        at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:97) [infinispan-core-7.2.3.Final.jar:7.2.3.Final]
        at org.infinispan.interceptors.InvocationContextInterceptor.handleAll(InvocationContextInterceptor.java:102) [infinispan-core-7.2.3.Final.jar:7.2.3.Final]
        at org.infinispan.interceptors.InvocationContextInterceptor.handleDefault(InvocationContextInterceptor.java:71) [infinispan-core-7.2.3.Final.jar:7.2.3.Final]
        at org.infinispan.commands.AbstractVisitor.visitPrepareCommand(AbstractVisitor.java:123) [infinispan-core-7.2.3.Final.jar:7.2.3.Final]
        at org.infinispan.commands.tx.PrepareCommand.acceptVisitor(PrepareCommand.java:123) [infinispan-core-7.2.3.Final.jar:7.2.3.Final]
        at org.infinispan.interceptors.InterceptorChain.invoke(InterceptorChain.java:336) [infinispan-core-7.2.3.Final.jar:7.2.3.Final]
        at org.infinispan.transaction.impl.TransactionCoordinator.commit(TransactionCoordinator.java:157) [infinispan-core-7.2.3.Final.jar:7.2.3.Final]
        at org.infinispan.transaction.xa.TransactionXaAdapter.commit(TransactionXaAdapter.java:112) [infinispan-core-7.2.3.Final.jar:7.2.3.Final]
        at org.infinispan.transaction.tm.DummyTransaction.finishResource(DummyTransaction.java:367) [infinispan-core-7.2.3.Final.jar:7.2.3.Final]
        at org.infinispan.transaction.tm.DummyTransaction.commitResources(DummyTransaction.java:413) [infinispan-core-7.2.3.Final.jar:7.2.3.Final]
        at org.infinispan.transaction.tm.DummyTransaction.runCommit(DummyTransaction.java:303) [infinispan-core-7.2.3.Final.jar:7.2.3.Final]
        at org.infinispan.transaction.tm.DummyTransaction.commit(DummyTransaction.java:104) [infinispan-core-7.2.3.Final.jar:7.2.3.Final]
        at org.infinispan.transaction.tm.DummyBaseTransactionManager.commit(DummyBaseTransactionManager.java:73) [infinispan-core-7.2.3.Final.jar:7.2.3.Final]
        at org.wildfly.clustering.ee.infinispan.ActiveTransactionBatch.close(ActiveTransactionBatch.java:48)
        at org.wildfly.clustering.web.undertow.session.DistributableSession.requestDone(DistributableSession.java:78)
        at io.undertow.servlet.spec.ServletContextImpl.updateSessionAccessTime(ServletContextImpl.java:765) [undertow-servlet-1.2.9.Final.jar:1.2.9.Final]
        at io.undertow.servlet.spec.HttpServletResponseImpl.responseDone(HttpServletResponseImpl.java:548) [undertow-servlet-1.2.9.Final.jar:1.2.9.Final]
        at io.undertow.servlet.handlers.ServletInitialHandler.handleFirstRequest(ServletInitialHandler.java:329) [undertow-servlet-1.2.9.Final.jar:1.2.9.Final]
        at io.undertow.servlet.handlers.ServletInitialHandler.dispatchRequest(ServletInitialHandler.java:261) [undertow-servlet-1.2.9.Final.jar:1.2.9.Final]
        at io.undertow.servlet.handlers.ServletInitialHandler.access$000(ServletInitialHandler.java:80) [undertow-servlet-1.2.9.Final.jar:1.2.9.Final]
        at io.undertow.servlet.handlers.ServletInitialHandler$1.handleRequest(ServletInitialHandler.java:172) [undertow-servlet-1.2.9.Final.jar:1.2.9.Final]
        at io.undertow.server.Connectors.executeRootHandler(Connectors.java:199) [undertow-core-1.2.9.Final.jar:1.2.9.Final]
        at io.undertow.server.HttpServerExchange$1.run(HttpServerExchange.java:774) [undertow-core-1.2.9.Final.jar:1.2.9.Final]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_72]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_72]
        at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_72]

I see there are a couple of bugs open in this area. Is there a workaround?

I am using the following Infinispan config:

            <cache-container aliases="singleton cluster" default-cache="default" module="org.wildfly.clustering.server" name="server">
                <transport lock-timeout="120000"/>
                <replicated-cache mode="ASYNC" name="default">
                    <state-transfer enabled="true" timeout="300000"/>
                    <transaction locking="OPTIMISTIC" mode="BATCH"/>
                </replicated-cache>
            </cache-container>
            <cache-container default-cache="session" module="org.wildfly.clustering.web.infinispan" name="web">
                <transport lock-timeout="120000"/>
                <replicated-cache mode="ASYNC" name="session">
                    <state-transfer enabled="true" timeout="300000"/>
                    <locking isolation="READ_COMMITTED"/>
                    <transaction locking="OPTIMISTIC" mode="BATCH"/>
                </replicated-cache>
            </cache-container>

Should I set acquire-timeout on the locking element to some value greater than 15 seconds?

Looking forward to hearing from you soon. This issue is really critical for us.

Thanks
Mohil

PS: Although our application only uses the two cache containers above, I have not deleted the following cache containers, which come by default in standalone-viptela.xml, from the config.

            <cache-container default-cache="default" module="org.wildfly.clustering.server" name="vmanage">
                <transport lock-timeout="60000"/>
                <replicated-cache mode="ASYNC" name="deviceconnection">
                    <locking isolation="READ_COMMITTED"/>
                    <transaction locking="OPTIMISTIC" mode="BATCH"/>
                </replicated-cache>
            </cache-container>
            <cache-container aliases="sfsb" default-cache="dist" module="org.wildfly.clustering.ejb.infinispan" name="ejb">
                <transport lock-timeout="60000"/>
                <distributed-cache l1-lifespan="0" mode="ASYNC" name="dist" owners="2">
                    <locking isolation="REPEATABLE_READ"/>
                    <transaction mode="BATCH"/>
                    <file-store/>
                </distributed-cache>
            </cache-container>
            <cache-container default-cache="local-query" module="org.hibernate.infinispan" name="hibernate">
                <transport lock-timeout="60000"/>
                <local-cache name="local-query">
                    <eviction max-entries="10000" strategy="LRU"/>
                    <expiration max-idle="100000"/>
                </local-cache>
                <invalidation-cache mode="SYNC" name="entity">
                    <transaction mode="NON_XA"/>
                    <eviction max-entries="10000" strategy="LRU"/>
                    <expiration max-idle="100000"/>
                </invalidation-cache>
                <replicated-cache mode="ASYNC" name="timestamps"/>
            </cache-container>
10. Re: Infinispan Singleton silently dies in wildfly 9 cluster setup

pferraro Jul 8, 2016 9:39 AM (in response to mohilkhare)

The first stack trace is a WARNing, not an ERROR. It indicates that a request was received on a node which does not own the associated session. When this happens, session expiration is scheduled on the node that owns the session, not the node that handled the request. However, in this case, there was an issue communicating with the node that owns the session. This could be due to a number of reasons, but most likely, the session ownership is in the process of migrating to another node, due to a topology change. In this situation, eager session expiration is skipped - and the session will expire lazily. In summary, this message is harmless.
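
The fallback described above can be sketched roughly as follows. This is only an illustration built on plain `java.util.concurrent`, not WildFly's actual implementation: `ExpirationScheduling`, `scheduleOnOwner`, and `Outcome` are hypothetical names, and the `CompletableFuture` merely stands in for the remote call to the node that owns the session.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: NOT WildFly code, only an illustration of the fallback.
final class ExpirationScheduling {

    enum Outcome { SCHEDULED_EAGERLY, LEFT_TO_LAZY_EXPIRATION }

    // remoteCall stands in for the request sent to the node that owns the session.
    static Outcome scheduleOnOwner(CompletableFuture<Void> remoteCall, long timeoutMillis) {
        try {
            remoteCall.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return Outcome.SCHEDULED_EAGERLY;
        } catch (TimeoutException | ExecutionException e) {
            // The owner could not be reached in time (for example, ownership is migrating).
            // Warn and carry on: the request that triggered this is not failed.
            System.out.println("WARN: failed to schedule expiration on owner: " + e);
            return Outcome.LEFT_TO_LAZY_EXPIRATION;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Outcome.LEFT_TO_LAZY_EXPIRATION;
        }
    }

    public static void main(String[] args) {
        // Owner answers immediately.
        System.out.println(scheduleOnOwner(CompletableFuture.completedFuture(null), 100));
        // Owner never answers within the timeout.
        System.out.println(scheduleOnOwner(new CompletableFuture<>(), 100));
    }
}
```

The point of the sketch is that the timeout path returns normally after logging, which is why the message shows up as a warning rather than as a failed request.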

The "org.infinispan.util.concurrent.TimeoutException" in the second stack trace looks like an issue with Infinispan exception handling during tx prepare. There were a number of fixes in this area to the Infinispan 8.x branch which are included in WildFly 10.
11. Re: Infinispan Singleton silently dies in wildfly 9 cluster setup

mohilkhare Jul 8, 2016 12:32 PM (in response to pferraro)

Thanks Paul for the prompt reply.

Do you mean that the fixes for the following exception were made in WildFly 10 and are NOT available in WildFly 9?

ISPN000136: Execution error: org.infinispan.util.concurrent.TimeoutException: ISPN000299: Unable to acquire lock after 15 seconds for key ti1_RSFpouZcoDwSOUHCk1XQep0p8I0Py5yc0QQv and requestor GlobalTransaction:<c1564e13-c3ff-4adb-a008-bfff42c9a23d>:2188872:local. Lock is held by GlobalTransaction:<c1564e13-c3ff-4adb-a008-bfff42c9a23d>:2188718:local, while request came from local

Since we are currently seeing this more often with WildFly 9, what could have triggered it? Do you suspect some recent config changes I made, or is it due to the reasons you mentioned, such as communication failures between nodes in the cluster? When this happens, clients are unable to log in or do anything with WildFly. Everything just gets stuck.
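
For reference, the mechanics behind the ISPN000299 message (one local transaction holds the lock for a key, and a second one gives up after the acquire timeout) can be reproduced in miniature with plain JDK locks. This is a hedged sketch, not Infinispan's lock manager: `LockAcquireTimeout`, `tryWithTimeout`, and `runScenario` are hypothetical names, threads stand in for transactions, and the timeout is shortened from 15 seconds to keep the run quick.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch: plain JDK locks, NOT Infinispan's lock manager.
final class LockAcquireTimeout {

    // Returns true if the lock was acquired within the timeout (and releases it again).
    static boolean tryWithTimeout(ReentrantLock lock, long timeoutMillis) {
        try {
            if (!lock.tryLock(timeoutMillis, TimeUnit.MILLISECONDS)) {
                return false; // analogous to "Unable to acquire lock after N seconds"
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        lock.unlock();
        return true;
    }

    // Returns {acquiredWhileHeld, acquiredAfterRelease}.
    static boolean[] runScenario(long timeoutMillis) {
        ReentrantLock keyLock = new ReentrantLock();
        CountDownLatch held = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);

        // First "transaction": takes the lock and keeps it until told to release.
        Thread holder = new Thread(() -> {
            keyLock.lock();
            try {
                held.countDown();
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                keyLock.unlock();
            }
        });
        holder.start();
        try {
            held.await();
            // Second "transaction": the holder outlives the timeout, so this gives up.
            boolean whileHeld = tryWithTimeout(keyLock, timeoutMillis);
            release.countDown();
            holder.join();
            boolean afterRelease = tryWithTimeout(keyLock, timeoutMillis);
            return new boolean[] { whileHeld, afterRelease };
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            release.countDown(); // never leave the holder thread blocked
        }
    }

    public static void main(String[] args) {
        boolean[] result = runScenario(100);
        System.out.println("acquired while held: " + result[0]);
        System.out.println("acquired after release: " + result[1]);
    }
}
```

The first attempt fails and the second succeeds. The sketch says nothing about why the first holder keeps the lock so long in a real cluster; it only shows that a longer timeout delays the failure rather than removing the contention.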

Looking forward to hearing from you soon.

Thanks and Regards
Mohil