Issue clustering embedded Infinispan (Spring Boot) with Infinispan Server (9.4.4)
soupy989 Feb 26, 2019 6:08 PM

Hi all,

I have been trying to cluster an embedded Spring Boot Infinispan instance with a remote standalone server node, with no luck so far, and was hoping someone here could help.
The error I'm seeing on the Infinispan Server side (NODE_1):
17:12:26,408 INFO [org.jboss.as] (Controller Boot Thread) WFLYSRV0025: Infinispan Server 9.4.4.Final (WildFly Core 6.0.2.Final) started in 10731ms - Started 274 of 322 services (163 services are lazy, passive or on-demand)
17:19:49,480 INFO [org.infinispan.CLUSTER] (jgroups-8,{NODE_1}) ISPN000094: Received new cluster view for channel cluster: [{NODE_1}|1] (2) [{NODE_1}, {NODE_2}-52218]
17:19:49,483 INFO [org.infinispan.CLUSTER] (jgroups-8,{NODE_1}) ISPN100000: Node {NODE_2}-52218 joined the cluster
17:20:49,644 WARN [org.infinispan.topology.ClusterTopologyManagerImpl] (transport-thread--p4-t23) ISPN000197: Error updating cluster member list: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 1 from {NODE_2}-52218
	at org.infinispan.remoting.transport.impl.MultiTargetRequest.onTimeout(MultiTargetRequest.java:167)
	at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:87)
	at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:22)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
	Suppressed: org.infinispan.util.logging.TraceException
		at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:87)
		at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:22)
		at java.util.concurrent.FutureTask.run(FutureTask.java:266)
		at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
		at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
		at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
		at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
		at java.lang.Thread.run(Thread.java:748)
	Suppressed: org.infinispan.util.logging.TraceException
		at org.infinispan.remoting.transport.Transport.invokeRemotely(Transport.java:75)
		at org.infinispan.topology.ClusterTopologyManagerImpl.confirmMembersAvailable(ClusterTopologyManagerImpl.java:525)
		at org.infinispan.topology.ClusterTopologyManagerImpl.updateCacheMembers(ClusterTopologyManagerImpl.java:508)
		at org.infinispan.topology.ClusterTopologyManagerImpl.handleClusterView(ClusterTopologyManagerImpl.java:321)
The error seen from Spring Boot (NODE_2):
17:19:49.171 [main] DEBUG o.i.manager.DefaultCacheManager - Starting cache manager null
17:19:49.180 [main] INFO o.i.f.GlobalComponentRegistry - ISPN000128: Infinispan version: Infinispan 'Infinity Minus ONE +2' 9.4.4.Final
17:19:49.343 [main] INFO o.i.r.t.jgroups.JGroupsTransport - ISPN000078: Starting JGroups channel cluster
17:19:49.429 [main] DEBUG org.jgroups.stack.Configurator - set property TCP.diagnostics_addr to default value /224.0.75.75
17:19:49.434 [main] DEBUG org.jgroups.protocols.TCP - thread pool min/max/keep-alive: 0/200/60000 use_fork_join=false, internal pool: 0/4/30000 (4 cores available)
17:19:49.439 [main] DEBUG org.jgroups.protocols.MPING - bind_addr=/10.252.132.39, mcast_addr=/234.99.54.14, mcast_port=45700
17:19:49.447 [main] DEBUG org.jgroups.protocols.MPING - receiver thread started
17:19:49.452 [main] DEBUG org.jgroups.protocols.pbcast.GMS - address={NODE_2}-52218, cluster=cluster, physical address=10.252.132.39:47500
17:19:49.471 [main] DEBUG org.jgroups.protocols.pbcast.GMS - {NODE_2}-52218: sending JOIN({NODE_2}-52218) to {NODE_1}
17:19:49.652 [main] DEBUG o.jgroups.protocols.pbcast.NAKACK2 -
[{NODE_2}-52218 setDigest()]
existing digest:  []
new digest:       {NODE_1}: [0 (0)], {NODE_2}-52218: [0 (0)]
resulting digest: {NODE_1}: [0 (0)], {NODE_2}-52218: [0 (0)]
17:19:49.652 [main] DEBUG org.jgroups.protocols.pbcast.GMS - {NODE_2}-52218: installing view [{NODE_1}|1] (2) [{NODE_1}, {NODE_2}-52218]
17:19:49.665 [main] INFO org.infinispan.CLUSTER - ISPN000094: Received new cluster view for channel cluster: [{NODE_1}|1] (2) [{NODE_1}, {NODE_2}-52218]
17:19:49.666 [FD_SOCK pinger-11,{NODE_2}-52218] DEBUG org.jgroups.protocols.FD_SOCK - {NODE_2}-52218: pingable_mbrs=[{NODE_1}, {NODE_2}-52218], ping_dest={NODE_1}
17:19:49.666 [main] DEBUG o.i.r.t.jgroups.JGroupsTransport - Joined: [{NODE_1}, {NODE_2}-52218], Left: []
17:19:49.744 [main] INFO o.i.r.t.jgroups.JGroupsTransport - ISPN000079: Channel cluster local address is {NODE_2}-52218, physical addresses are [10.252.132.39:47500]
17:20:49.756 [main] DEBUG o.i.t.ClusterTopologyManagerImpl - Timed out waiting for rebalancing status from coordinator, trying again
The JGroups XML configuration used from Spring Boot (pulled from the 9.4.4 infinispan-core jar):
<config xmlns="urn:org:jgroups"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/jgroups-4.0.xsd">
   <TCP bind_addr="${jgroups.tcp.address:127.0.0.1}"
        bind_port="${jgroups.tcp.port:7800}"
        enable_diagnostics="false"
        thread_naming_pattern="pl"
        send_buf_size="640k"
        sock_conn_timeout="1000"
        bundler_type="no-bundler"
        thread_pool.min_threads="${jgroups.thread_pool.min_threads:0}"
        thread_pool.max_threads="${jgroups.thread_pool.max_threads:200}"
        thread_pool.keep_alive_time="60000"
   />
   <TCPPING initial_hosts="{HOST_1}[47500],{HOST_2}[47500]"/>
   <MPING bind_addr="${jgroups.tcp.address:127.0.0.1}"
          mcast_addr="234.99.54.14"
          mcast_port="45700"
          ip_ttl="${jgroups.udp.ip_ttl:2}"/>
   <MERGE3 min_interval="10000"
           max_interval="30000"
   />
   <FD_SOCK start_port="57600"/>
   <!-- Suspect node `timeout` to `timeout + timeout_check_interval` millis after the last heartbeat -->
   <FD_ALL timeout="10000"
           interval="2000"
           timeout_check_interval="1000"
   />
   <VERIFY_SUSPECT timeout="1500"/>
   <pbcast.NAKACK2 use_mcast_xmit="false"
                   xmit_interval="100"
                   xmit_table_num_rows="50"
                   xmit_table_msgs_per_row="1024"
                   xmit_table_max_compaction_time="30000"
                   resend_last_seqno="true"
   />
   <UNICAST3 xmit_interval="100"
             xmit_table_num_rows="50"
             xmit_table_msgs_per_row="1024"
             xmit_table_max_compaction_time="30000"
   />
   <pbcast.STABLE stability_delay="500"
                  desired_avg_gossip="5000"
                  max_bytes="1M"
   />
   <pbcast.GMS print_local_addr="false"
               join_timeout="${jgroups.join_timeout:5000}"
   />
   <MFC max_credits="2m"
        min_threshold="0.40"
   />
   <FRAG3/>
</config>
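For context, the embedded cache manager is pointed at this file roughly like so (a minimal sketch, not the exact application code; the classpath location "jgroups-tcp.xml" is a placeholder for wherever the stack above lives):

```java
import org.infinispan.configuration.global.GlobalConfigurationBuilder;
import org.infinispan.manager.DefaultCacheManager;

GlobalConfigurationBuilder global = GlobalConfigurationBuilder.defaultClusteredBuilder();
global.transport()
      .clusterName("cluster")                                // matches the channel name in the logs above
      .addProperty("configurationFile", "jgroups-tcp.xml");  // placeholder path to the JGroups stack above
DefaultCacheManager cacheManager = new DefaultCacheManager(global.build());
```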
Compared to the JGroups configuration used by the server:
<stack name="tcp">
    <transport type="TCP" socket-binding="jgroups-tcp"/>
    <protocol type="TCPPING">
        <property name="initial_hosts">
            {HOST_2}[47500],{HOST_1}[47500]
        </property>
    </protocol>
    <protocol type="MPING" socket-binding="jgroups-mping"/>
    <protocol type="MERGE3"/>
    <protocol type="FD_SOCK" socket-binding="jgroups-tcp-fd"/>
    <protocol type="FD_ALL"/>
    <protocol type="VERIFY_SUSPECT"/>
    <protocol type="pbcast.NAKACK2">
        <property name="use_mcast_xmit">
            false
        </property>
    </protocol>
    <protocol type="UNICAST3"/>
    <protocol type="pbcast.STABLE"/>
    <protocol type="pbcast.GMS"/>
    <protocol type="MFC_NB"/>
    <protocol type="FRAG3"/>
</stack>
Some background:
- jgroups.tcp.address and jgroups.tcp.port are set at startup through JVM options.
- Infinispan Server version: 9.4.4
- infinispan-spring-boot-starter-embedded: 2.1.1
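For reference, the JVM options look roughly like this (a sketch; the address and port are the ones visible in the logs above, and the jar name is a placeholder):

```
java \
  -Djgroups.tcp.address=10.252.132.39 \
  -Djgroups.tcp.port=47500 \
  -jar app.jar
```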
The oddest part about this is that I am able to cluster two standalone Infinispan instances between NODE_1 and NODE_2 with no problem (using the JGroups TCP config above). I enabled trace logging for both org.jgroups and org.infinispan but found no useful details. Watching the UNICAST traffic (in the trace logs) between the two, I don't see where a request was dropped - all requests from NODE_1 receive an ACK from NODE_2 in under 60 seconds.
Any ideas on cause/resolution would be greatly appreciated.
Thank you.