12 Replies Latest reply on Apr 11, 2018 1:53 PM by Tristan Tarrant

    Configuration for three parallel Infinispan cluster

    Alexander Diedler Newbie

      Hello,

      We have three cluster environments, each with two nodes. We assigned a distinct Hot Rod port per environment: 11122 for ENV1, 11132 for ENV2 and 11142 for ENV3. But we see that the nodes find each other across environments, which must be avoided, because this is a three-tier setup with Integration, Pre-Production and Production.

      How can we separate/isolate each (replicated) Infinispan cluster and prevent replication to the other environments? On the other environments we see the message "node xxx joined", and if we stop the node, "node xxx has left the cluster".

      We changed the ports in clustered.xml.

      Here are the default settings. My idea was to increment all ports by 1 (or by 2 for the production system) to separate the environments from each other, but it does not work.

       

       <socket-binding-group name="standard-sockets" default-interface="public" port-offset="${jboss.socket.binding.port-offset:0}">
              <socket-binding name="management-http" interface="management" port="${jboss.management.http.port:9990}"/>
              <socket-binding name="management-https" interface="management" port="${jboss.management.https.port:9993}"/>
              <socket-binding name="hotrod" port="11222"/>
              <socket-binding name="hotrod-internal" port="11223"/>
              <socket-binding name="hotrod-multi-tenancy" port="11224"/>
              <socket-binding name="jgroups-mping" port="0" multicast-address="${jboss.default.multicast.address:234.99.54.14}" multicast-port="45700"/>
              <socket-binding name="jgroups-tcp" port="7600"/>
              <socket-binding name="jgroups-tcp-fd" port="57600"/>
              <socket-binding name="jgroups-udp" port="55200" multicast-address="${jboss.default.multicast.address:234.99.54.14}" multicast-port="45688"/>
              <socket-binding name="jgroups-udp-fd" port="54200"/>
              <socket-binding name="memcached" port="11211"/>
              <socket-binding name="rest" port="8080"/>
              <socket-binding name="rest-multi-tenancy" port="8081"/>
              <socket-binding name="rest-ssl" port="8443"/>
              <socket-binding name="txn-recovery-environment" port="4712"/>
              <socket-binding name="txn-status-manager" port="4713"/>
              <socket-binding name="websocket" port="8181"/>
              <outbound-socket-binding name="remote-store-hotrod-server">
                  <remote-destination host="remote-host" port="11222"/>
              </outbound-socket-binding>
              <outbound-socket-binding name="remote-store-rest-server">
                  <remote-destination host="remote-host" port="8080"/>
              </outbound-socket-binding>
          </socket-binding-group>
      
        • 1. Re: Configuration for three parallel Infinispan cluster
          Tristan Tarrant Master

          Nodes find each other using the "*PING" family of JGroups protocols. You should therefore ensure that each cluster uses a dedicated port/address combination. In particular, if you want to use multicast discovery, start your server with:

           

          -Djboss.default.multicast.address=a.b.c.d

           

          where a.b.c.d is unique to your cluster.
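          A sketch of what this could look like per environment (the multicast addresses and offsets below are hypothetical; pick values unique to each environment). Both properties appear in the posted socket-binding-group:

```shell
# Hypothetical per-environment startup: a unique multicast address keeps
# JGroups discovery isolated, and a port offset separates all socket bindings.
./standalone.sh -c clustered.xml \
    -Djboss.default.multicast.address=234.99.54.14 \
    -Djboss.socket.binding.port-offset=0     # ENV1 (Integration)

./standalone.sh -c clustered.xml \
    -Djboss.default.multicast.address=234.99.54.15 \
    -Djboss.socket.binding.port-offset=100   # ENV2 (Pre-Production)

./standalone.sh -c clustered.xml \
    -Djboss.default.multicast.address=234.99.54.16 \
    -Djboss.socket.binding.port-offset=200   # ENV3 (Production)
```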

          • 2. Re: Configuration for three parallel Infinispan cluster
            Alexander Diedler Newbie

            Hello, yes, that was part of the answer. By using the port-offset flag we now give each environment its own ports, and it works. In addition we chose dedicated multicast ports in the clustered.xml file, and we renamed the cache-containers to dedicated names.
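            For example (hypothetical port values), the second environment's clustered.xml could use its own multicast ports on top of the port offset, mirroring the default bindings posted above:

```xml
<!-- ENV2: dedicated multicast address/ports so discovery cannot cross environments -->
<socket-binding name="jgroups-mping" port="0"
                multicast-address="${jboss.default.multicast.address:234.99.54.15}"
                multicast-port="45710"/>
<socket-binding name="jgroups-udp" port="55200"
                multicast-address="${jboss.default.multicast.address:234.99.54.15}"
                multicast-port="45698"/>
```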

            • 3. Re: Configuration for three parallel Infinispan cluster
              Alexander Diedler Newbie

              Hello, the separation was successful, but the communication between the nodes dies from time to time ("Request timeout to get response from ...") and I have no idea what the problem could be.

              Memory? I did not see any OutOfMemoryError messages in the Infinispan server.log.

              • 4. Re: Configuration for three parallel Infinispan cluster
                Alexander Diedler Newbie

                "If you want to use..." I don´t know If I want to use it. I want only a simple and robust 2 node cluster for replication of simple and complex stored values (Arrays and Structs). For the moment it was very unstable sometimes the nodes discover on startup, sometimes not. Then Timeouts happens in communication and the clustered.xml was used "out-of-the-box" from me, only and single modification is to raise the offset for the ports to seperate the clusters from each others.

                • 5. Re: Configuration for three parallel Infinispan cluster
                  Radim Vansa Master

                  Have you checked GC logs? Maybe you have long GC pauses that correlate with these failures.

                  • 6. Re: Configuration for three parallel Infinispan cluster
                    Alexander Diedler Newbie

                    Hello,

                     Thank you all for your tips, but I don't think it is related to GC or memory. I made a fresh installation of Infinispan 9.1.4 on the Red Hat server and started it with ./standalone.sh -c clustered.xml, with no modifications. I configured my application and connector to put my values into the default replicated cache. The connector uses the Hot Rod protocol.

                     In server.log I see different error messages, and I am not sure which one is the root cause:

                     

                     Sometimes, when we restart the 2nd node, I see on the first node:

                    2018-04-09 14:48:20,108 ERROR [org.infinispan.CLUSTER] (transport-thread--p4-t17) ISPN000196: Failed to recover cluster state after the current node became the coordinator (or after merge): java.util.concurrent.ExecutionException: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 10 from wwhelapp0120
                     at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
                     at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
                     at org.infinispan.util.concurrent.CompletableFutures.await(CompletableFutures.java:82)
                     at org.infinispan.topology.ClusterTopologyManagerImpl.executeOnClusterSync(ClusterTopologyManagerImpl.java:620)
                     at org.infinispan.topology.ClusterTopologyManagerImpl.recoverClusterStatus(ClusterTopologyManagerImpl.java:484)
                     at org.infinispan.topology.ClusterTopologyManagerImpl.becomeCoordinator(ClusterTopologyManagerImpl.java:359)
                     at org.infinispan.topology.ClusterTopologyManagerImpl.handleClusterView(ClusterTopologyManagerImpl.java:338)
                     at org.infinispan.topology.ClusterTopologyManagerImpl.access$500(ClusterTopologyManagerImpl.java:83)
                     at org.infinispan.topology.ClusterTopologyManagerImpl$ClusterViewListener.lambda$handleViewChange$0(ClusterTopologyManagerImpl.java:765)
                     at org.infinispan.executors.LimitedExecutor.runTasks(LimitedExecutor.java:144)
                     at org.infinispan.executors.LimitedExecutor.access$100(LimitedExecutor.java:33)
                     at org.infinispan.executors.LimitedExecutor$Runner.run(LimitedExecutor.java:174)
                     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
                     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
                     at java.lang.Thread.run(Thread.java:745)
                    Caused by: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 10 from wwhelapp0120
                     at org.infinispan.remoting.transport.impl.MultiTargetRequest.onTimeout(MultiTargetRequest.java:163)
                     at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:86)
                     at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:21)
                     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
                     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
                     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
                     ... 3 more
                    2018-04-09 17:09:42,952 FATAL [org.infinispan.CLUSTER] (transport-thread--p4-t17) ISPN100004: After merge (or coordinator change), the coordinator failed to recover cluster. Cluster members are [wwhelapp0119, wwhelapp0120].

                     

                    2018-04-09 12:52:13,903 WARN  [org.infinispan.server.hotrod.Decoder2x] (HotRod-ServerWorker-5-4) ISPN006011: Operation 'REMOVE' forced to return previous value should be used on transactional caches, otherwise data inconsistency issues could arise under failure situations
                    2018-04-09 12:52:28,913 ERROR [org.infinispan.interceptors.impl.InvocationContextInterceptor] (timeout-thread--p3-t1) ISPN000136: Error executing command RemoveCommand, writing keys [WrappedByteArray{bytes=[B0x033E104445534B54..[19], hashCode=-1420082805}]: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 18 from wwhelapp0120
                     at org.infinispan.remoting.transport.impl.MultiTargetRequest.onTimeout(MultiTargetRequest.java:163)
                     at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:86)
                     at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:21)
                     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
                     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
                     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
                     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
                     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
                     at java.lang.Thread.run(Thread.java:745)

                     

                    2018-04-09 13:15:06,446 ERROR [org.infinispan.interceptors.impl.InvocationContextInterceptor] (jgroups-16,wwhelapp0119) ISPN000136: Error executing command RemoveCommand, writing keys [WrappedByteArray{bytes=[B0x033E104445534B54..[19], hashCode=-1420082805}]: org.infinispan.remoting.RemoteException: ISPN000217: Received exception from wwhelapp0120, see cause for remote stack trace
                     at org.infinispan.remoting.transport.ResponseCollectors.wrapRemoteException(ResponseCollectors.java:27)
                     at org.infinispan.remoting.transport.ValidSingleResponseCollector.withException(ValidSingleResponseCollector.java:41)
                     at org.infinispan.remoting.transport.ValidSingleResponseCollector.addResponse(ValidSingleResponseCollector.java:25)
                     at org.infinispan.remoting.transport.impl.SingleTargetRequest.receiveResponse(SingleTargetRequest.java:51)
                     at org.infinispan.remoting.transport.impl.SingleTargetRequest.onResponse(SingleTargetRequest.java:35)
                     at org.infinispan.remoting.transport.impl.RequestRepository.addResponse(RequestRepository.java:53)
                     at org.infinispan.remoting.transport.jgroups.JGroupsTransport.processResponse(JGroupsTransport.java:1328)
                     at org.infinispan.remoting.transport.jgroups.JGroupsTransport.processMessage(JGroupsTransport.java:1238)
                     at org.infinispan.remoting.transport.jgroups.JGroupsTransport.access$200(JGroupsTransport.java:121)
                     at org.infinispan.remoting.transport.jgroups.JGroupsTransport$ChannelCallbacks.receive(JGroupsTransport.java:1366)
                     at org.jgroups.JChannel.up(JChannel.java:819)
                     at org.jgroups.fork.ForkProtocolStack.up(ForkProtocolStack.java:134)
                     at org.jgroups.stack.Protocol.up(Protocol.java:340)
                     at org.jgroups.protocols.FORK.up(FORK.java:134)
                     at org.jgroups.protocols.FRAG3.up(FRAG3.java:171)
                     at org.jgroups.protocols.FlowControl.up(FlowControl.java:343)
                     at org.jgroups.protocols.FlowControl.up(FlowControl.java:343)
                     at org.jgroups.protocols.pbcast.GMS.up(GMS.java:864)
                     at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:240)
                     at org.jgroups.protocols.UNICAST3.deliverMessage(UNICAST3.java:1002)
                     at org.jgroups.protocols.UNICAST3.handleDataReceived(UNICAST3.java:728)
                     at org.jgroups.protocols.UNICAST3.up(UNICAST3.java:383)
                     at org.jgroups.protocols.pbcast.NAKACK2.up(NAKACK2.java:600)
                     at org.jgroups.protocols.VERIFY_SUSPECT.up(VERIFY_SUSPECT.java:119)
                     at org.jgroups.protocols.FD_ALL.up(FD_ALL.java:199)
                     at org.jgroups.protocols.FD_SOCK.up(FD_SOCK.java:252)
                     at org.jgroups.protocols.MERGE3.up(MERGE3.java:276)
                     at org.jgroups.protocols.Discovery.up(Discovery.java:267)
                     at org.jgroups.protocols.TP.passMessageUp(TP.java:1229)
                     at org.jgroups.util.SubmitToThreadPool$SingleMessageHandler.run(SubmitToThreadPool.java:87)
                     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
                     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
                     at java.lang.Thread.run(Thread.java:745)
                    Caused by: org.infinispan.util.concurrent.TimeoutException: ISPN000299: Unable to acquire lock after 10 seconds for key WrappedByteArray{bytes=[B0x033E104445534B54..[19], hashCode=-1420082805} and requestor CommandInvocation:wwhelapp0119:764. Lock is held by CommandInvocation:wwhelapp0119:763
                     at org.infinispan.util.concurrent.locks.impl.DefaultLockManager$KeyAwareExtendedLockPromise.lock(DefaultLockManager.java:253)
                     at org.infinispan.interceptors.locking.AbstractLockingInterceptor.lockAndRecord(AbstractLockingInterceptor.java:269)
                     at org.infinispan.interceptors.locking.AbstractLockingInterceptor.visitNonTxDataWriteCommand(AbstractLockingInterceptor.java:130)
                     at org.infinispan.interceptors.locking.NonTransactionalLockingInterceptor.visitDataWriteCommand(NonTransactionalLockingInterceptor.java:38)
                     at org.infinispan.interceptors.locking.AbstractLockingInterceptor.visitRemoveCommand(AbstractLockingInterceptor.java:105)
                     at org.infinispan.commands.write.RemoveCommand.acceptVisitor(RemoveCommand.java:63)
                     at org.infinispan.interceptors.BaseAsyncInterceptor.invokeNext(BaseAsyncInterceptor.java:58)
                     at org.infinispan.statetransfer.StateTransferInterceptor.handleNonTxWriteCommand(StateTransferInterceptor.java:306)
                     at org.infinispan.statetransfer.StateTransferInterceptor.handleWriteCommand(StateTransferInterceptor.java:252)
                     at org.infinispan.statetransfer.StateTransferInterceptor.visitRemoveCommand(StateTransferInterceptor.java:108)
                     at org.infinispan.commands.write.RemoveCommand.acceptVisitor(RemoveCommand.java:63)
                     at org.infinispan.interceptors.BaseAsyncInterceptor.invokeNext(BaseAsyncInterceptor.java:58)
                     at org.infinispan.interceptors.impl.CacheMgmtInterceptor.visitRemoveCommand(CacheMgmtInterceptor.java:214)
                     at org.infinispan.commands.write.RemoveCommand.acceptVisitor(RemoveCommand.java:63)
                     at org.infinispan.interceptors.BaseAsyncInterceptor.invokeNextAndExceptionally(BaseAsyncInterceptor.java:127)
                     at org.infinispan.interceptors.impl.InvocationContextInterceptor.visitCommand(InvocationContextInterceptor.java:96)
                     at org.infinispan.interceptors.BaseAsyncInterceptor.invokeNext(BaseAsyncInterceptor.java:60)
                     at org.infinispan.interceptors.DDAsyncInterceptor.handleDefault(DDAsyncInterceptor.java:54)
                     at org.infinispan.interceptors.DDAsyncInterceptor.visitRemoveCommand(DDAsyncInterceptor.java:65)
                     at org.infinispan.commands.write.RemoveCommand.acceptVisitor(RemoveCommand.java:63)
                     at org.infinispan.interceptors.DDAsyncInterceptor.visitCommand(DDAsyncInterceptor.java:50)
                     at org.infinispan.interceptors.impl.AsyncInterceptorChainImpl.invokeAsync(AsyncInterceptorChainImpl.java:234)
                     at org.infinispan.commands.remote.BaseRpcInvokingCommand.processVisitableCommandAsync(BaseRpcInvokingCommand.java:63)
                     at org.infinispan.commands.remote.SingleRpcCommand.invokeAsync(SingleRpcCommand.java:57)
                     at org.infinispan.remoting.inboundhandler.BasePerCacheInboundInvocationHandler.invokeCommand(BasePerCacheInboundInvocationHandler.java:102)
                     at org.infinispan.remoting.inboundhandler.BaseBlockingRunnable.invoke(BaseBlockingRunnable.java:99)
                     at org.infinispan.remoting.inboundhandler.BaseBlockingRunnable.runAsync(BaseBlockingRunnable.java:71)
                     at org.infinispan.remoting.inboundhandler.BaseBlockingRunnable.run(BaseBlockingRunnable.java:40)
                     ... 3 more
                    • 7. Re: Configuration for three parallel Infinispan cluster
                      Radim Vansa Master

                      If this is not happening under heavy load (where timeouts are to be expected) and the cluster is at least partially operational, I am out of advice. At that point you need to enable trace logging (including org.jgroups) and try to find out what is happening to the messages.

                       

                      Just to be sure, could you try turning off the firewall?

                      • 8. Re: Configuration for three parallel Infinispan cluster
                        Alexander Diedler Newbie

                        Hello, this is now on the test system. What happens: the first call of a page fills the cache with 2000 elements in one block; reads are then fast for a few minutes, and then the communication between the nodes seems to break and never comes back.

                        I am not sure what to think about Infinispan: it does not work "out-of-the-box" and there are no specialists available. All the people I ask for (paid) support tell me they have no deep experience with Infinispan as a cluster (although they advertise Infinispan installation support on their websites). Is this cluster framework just beta quality, or does it work for enterprise, large-scale infrastructures?

                        Is there no good example configuration available? Does nobody here have deep experience with Infinispan as a cluster who can help me build a stable system?

                         

                        What is the best practice for this scenario?

                        an application based on a two-node cluster

                        on each node there is a local Infinispan installation

                        the two cluster nodes should operate in replication mode, meaning that if one node fails, the other node keeps operating without any impact

                        this two-node cluster exists three times: one test cluster, one validation cluster and one production cluster, all in the same network
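                        For reference, a minimal sketch of what the cache definition in clustered.xml might look like for this scenario (the subsystem schema version and the container name may differ in your installation):

```xml
<!-- Sketch: a synchronously replicated cache inside the clustered cache container.
     With two nodes and SYNC replication, each node holds a full copy, so the
     surviving node keeps serving data if the other one fails. -->
<cache-container name="clustered" default-cache="default">
    <transport lock-timeout="60000"/>
    <replicated-cache name="default" mode="SYNC"/>
</cache-container>
```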

                        • 9. Re: Configuration for three parallel Infinispan cluster
                          Radim Vansa Master

                          I don't know whom you're getting support from, but Infinispan as a project is not supported (besides this forum and IRC). Most of the developers are Red Hat employees and the project is productized as Red Hat JBoss Data Grid (it's also used inside WildFly -> EAP, Keycloak -> RH SSO and others). So this is where you should be looking for paid support. On this forum you'll find answers from the core developers, so I'd say these are people who have experience with clustering: basically all the features are developed as clustered. And yes, there are setups with hundreds of clustered nodes in production.

                           

                          I am sorry it does not work out of the box, and it seems that you are not hitting any of the common configuration issues (like misconfiguring IPv4/IPv6 and such), but with the information you have given it is hard to see what's happening. Stack traces aren't enough in distributed systems.

                          • 10. Re: Configuration for three parallel Infinispan cluster
                            Tristan Tarrant Master

                            As Radim said, Infinispan is definitely used in production in single node and clustered configurations up to 100s of nodes.

                            Clustering is a complex thing to get right, so you cannot expect things to just work without putting some effort into understanding discovery, transports and the type of network you're dealing with.

                            If your cluster is breaking up, we need to understand why. This could be due to Infinispan/JGroups misconfiguration, network/switch issues, operating system networking, etc.

                            In particular, enabling debug logs for JGroups would help.

                            • 11. Re: Configuration for three parallel Infinispan cluster
                              Alexander Diedler Newbie

                              Thank you for your clear words. I am very impatient to get things flying, because we have been configuring this cluster for weeks now and it seems to end in a total outage every time, which annoys me. I have spent hours and hours on this topic.

                              What seems to help: I have now configured the default stack to TCP instead of UDP, and this seems to be stable in general, for hours now.

                              But as far as I know, TCP with unicast is "expensive" in network communication and UDP with multicast should be preferred, shouldn't it?
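                              For the record, the switch to TCP was done like this (assuming clustered.xml exposes the default stack via the jboss.default.jgroups.stack system property, as my copy does; otherwise edit the channel's stack attribute directly):

```shell
# Select the TCP stack instead of the default UDP stack at startup.
# Note: TCP still needs a discovery protocol (e.g. MPING or TCPPING) in its stack definition.
./standalone.sh -c clustered.xml -Djboss.default.jgroups.stack=tcp
```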

                               

                              Once we get this reference system working, we will certainly use Infinispan in all future projects, because I am very pleased with the management interface and with the functionality in general.

                              I don't know how to enable trace mode in Infinispan; where is the switch for that?

                              • 12. Re: Configuration for three parallel Infinispan cluster
                                Tristan Tarrant Master

                                TCP is not necessarily more expensive than UDP: it depends on the size of the cluster, distributed vs. replicated mode, the size of the entries, etc. Also, it is always wiser to trade a little performance for stability.

                                 

                                As for logging, if you are using clustered.xml, look at the logging subsystem and add a relevant logger, e.g.:

                                 

                                <logger category="org.jgroups">

                                     <level name="DEBUG"/>

                                </logger>

                                 

                                Since our server is based on WildFly, you can also tune loggers at runtime using the CLI:

                                 

                                How To - WildFly 10 - Project Documentation Editor
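A sketch of the runtime approach (the CLI script name and management port are assumptions based on a default Infinispan server install; the logging-subsystem operations themselves are standard WildFly management operations):

```shell
# Connect to the running server's management interface and raise JGroups logging,
# without editing clustered.xml or restarting the server.
./ispn-cli.sh --connect controller=127.0.0.1:9990 \
    --command="/subsystem=logging/logger=org.jgroups:add(level=TRACE)"

# Later, lower the level again once you have captured the problem:
./ispn-cli.sh --connect controller=127.0.0.1:9990 \
    --command="/subsystem=logging/logger=org.jgroups:write-attribute(name=level, value=INFO)"
```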