
    New node fails to join, cluster breaks down

    vdzhuvinov

      We are experiencing problems adding new nodes to an existing distributed cluster (embedded Infinispan 7.2.5), and I'd like to find out how this can be addressed. Basically, with a cluster of 4+ nodes that is under load, adding a new node appears to be almost impossible, so new nodes are currently added by shutting down the entire cluster and starting it anew with the new nodes included. That works every time, but we would like to be able to reliably add nodes without such interruptions.
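
      For context, each node is brought up with embedded Infinispan roughly as in the sketch below. This is a simplified illustration rather than our actual bootstrap code: the class and cluster name are placeholders, and only the settings that show up in the logs further down (the attached JGroups config file and the 240000 ms distributed sync timeout) reflect our real values.

      ```
      import org.infinispan.configuration.global.GlobalConfigurationBuilder;
      import org.infinispan.manager.DefaultCacheManager;
      import org.infinispan.manager.EmbeddedCacheManager;

      public class NodeBootstrapSketch {

          // Simplified sketch of how each node starts and joins the cluster.
          // The cluster name is a placeholder; the JGroups file is the attached
          // config; 240000 ms matches the [CM8013] log line below.
          public static EmbeddedCacheManager start() {
              GlobalConfigurationBuilder global = GlobalConfigurationBuilder.defaultClusteredBuilder();
              global.transport()
                    .clusterName("authz-cluster")                    // placeholder name
                    .addProperty("configurationFile", "jgroups.xml") // the attached JGroups config
                    .distributedSyncTimeout(240000);                 // matches [CM8013]
              // Starting the manager is what makes the node discover and join the cluster
              return new DefaultCacheManager(global.build(), true);
          }
      }
      ```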

       

      Yesterday, when adding a new node with YK profiling enabled (to diagnose app performance), the node wasn't able to join, and on top of that the clustering between the original 7 nodes (which had been running for a while) broke down, so the entire cluster had to be restarted. We are now trying to diagnose what happened. Insights and suggestions will be much appreciated.

       

      Here are some selected log lines from the node that tried to join (some of them are our own application log messages):

       

       

      ```

      [CM8007] Infinispan status: RUNNING

      [CM8016] Infinispan cluster local node is coordinator: false

      [CM8012] Infinispan cluster members: [ip-10-180-243-45-53253, ip-10-180-243-221-4532, ip-10-180-242-197-28571, ip-10-180-243-69-29416, ip-10-180-242-145-50239, ip-10-180-243-140-24676, ip-10-180-243-124-55699]

      [CM8013] Infinispan cluster distributed sync timeout: 240000

      ...

      [CM8006] Started Infinispan DIST_ASYNC cache authzStore.authzCache in 104302 ms

      ...

      // a few more caches start ...

      // then after a few seconds we start seeing trouble

      ...

      WARN  [Incoming-15,ip-10-180-243-124-55699] org.jgroups.protocols.pbcast.NAKACK2 - JGRP000011: ip-10-180-243-124-55699: dropped message 1 from non-member ip-10-180-243-69-29416 (view=MergeView::[ip-10-180-242-145-50239|52] (4) [ip-10-180-242-145-50239, ip-10-180-243-140-24676, ip-10-180-243-124-55699, ip-10-180-242-197-28571], 1 subgroups: [ip-10-180-242-145-50239|51] (6) [ip-10-180-242-145-50239, ip-10-180-243-45-53253, ip-10-180-243-221-4532, ip-10-180-243-69-29416, ip-10-180-243-124-55699, ip-10-180-242-197-28571])

       

      WARN  [OOB-18,ip-10-180-243-124-55699] org.jgroups.protocols.pbcast.NAKACK2 - JGRP000011: ip-10-180-243-124-55699: dropped message 119 from non-member ip-10-180-243-69-29416 (view=MergeView::[ip-10-180-242-145-50239|52] (4) [ip-10-180-242-145-50239, ip-10-180-243-140-24676, ip-10-180-243-124-55699, ip-10-180-242-197-28571], 1 subgroups: [ip-10-180-242-145-50239|51] (6) [ip-10-180-242-145-50239, ip-10-180-243-45-53253, ip-10-180-243-221-4532, ip-10-180-243-69-29416, ip-10-180-243-124-55699, ip-10-180-242-197-28571]) (received 105 identical messages from ip-10-180-243-69-29416 in the last 72510 ms)

       

      WARN  [OOB-8,ip-10-180-243-124-55699] org.jgroups.protocols.pbcast.NAKACK2 - JGRP000011: ip-10-180-243-124-55699: dropped message 115 from non-member ip-10-180-243-69-29416 (view=MergeView::[ip-10-180-242-145-50239|52] (4) [ip-10-180-242-145-50239, ip-10-180-243-140-24676, ip-10-180-243-124-55699, ip-10-180-242-197-28571], 1 subgroups: [ip-10-180-242-145-50239|51] (6) [ip-10-180-242-145-50239, ip-10-180-243-45-53253, ip-10-180-243-221-4532, ip-10-180-243-69-29416, ip-10-180-243-124-55699, ip-10-180-242-197-28571]) (received 105 identical messages from ip-10-180-243-69-29416 in the last 72510 ms)

       

      // more messages like the above, then the SuspectException starts firing...

      Exception in thread "transport-thread--p2-t7" org.infinispan.remoting.transport.jgroups.SuspectException: One or more nodes have left the cluster while replicating command EntryResponseCommand{identifier=52c0f7eb-af24-4dc3-b0d2-ea28f85ee13f, completedSegments=null, inDoubtSegments=null, values=null, topologyId=128, origin=ip-10-180-243-124-55699}

      app_1 | at org.infinispan.remoting.transport.jgroups.JGroupsTransport.invokeRemotely(JGroupsTransport.java:528)

      app_1 | at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:287)

      app_1 | at org.infinispan.iteration.impl.DistributedEntryRetriever$2.handleException(DistributedEntryRetriever.java:309)

      app_1 | at org.infinispan.iteration.impl.DistributedEntryRetriever$3.run(DistributedEntryRetriever.java:479)

      app_1 | at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

      app_1 | at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

      app_1 | at java.lang.Thread.run(Thread.java:745)

       

      // one for each cache

      ```
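
      For completeness, the DIST_ASYNC caches mentioned in the [CM8006] line are defined programmatically, roughly as below. Again a simplified sketch: the real definitions carry more settings, and numOwners is an assumption (we kept the default).

      ```
      import org.infinispan.Cache;
      import org.infinispan.configuration.cache.CacheMode;
      import org.infinispan.configuration.cache.ConfigurationBuilder;
      import org.infinispan.manager.EmbeddedCacheManager;

      public class AuthzCacheSketch {

          // Simplified definition of the authzStore.authzCache from the [CM8006]
          // log line; numOwners(2) is an assumption (the Infinispan default).
          public static Cache<String, Object> defineAndStart(EmbeddedCacheManager manager) {
              manager.defineConfiguration("authzStore.authzCache",
                  new ConfigurationBuilder()
                      .clustering()
                          .cacheMode(CacheMode.DIST_ASYNC) // asynchronous distributed cache
                          .hash().numOwners(2)
                      .build());
              // getCache() triggers the join / state transfer that took ~104 s above
              return manager.getCache("authzStore.authzCache");
          }
      }
      ```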

       

      The coordinator node was seeing the following exceptions:

       

      ```

      ERROR [006205] Timed out waiting for 15 seconds for valid responses from any of [Sender{address=ip-10-180-242-197-28571, responded=false}, Sender{address=ip-10-180-242-145-50239, responded=false}, Sender{address=ip-10-180-243-69-29416, responded=false}].

      ERROR [006205] Timed out waiting for 15 seconds for valid responses from any of [Sender{address=ip-10-180-242-197-28571, responded=false}, Sender{address=ip-10-180-243-69-29416, responded=false}, Sender{address=ip-10-180-242-145-50239, responded=false}].

      ...

      ```
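
      If I am reading those correctly, the 15 seconds matches Infinispan's default synchronous replication timeout (15000 ms), which we have not overridden anywhere. In case raising it turns out to be part of the answer, this is the knob I mean (the 30000 value is just an example, not something we have tested):

      ```
      import org.infinispan.configuration.cache.CacheMode;
      import org.infinispan.configuration.cache.Configuration;
      import org.infinispan.configuration.cache.ConfigurationBuilder;

      public class ReplTimeoutSketch {

          // Example only: raise the replication timeout from the default 15000 ms
          // to 30000 ms for a synchronous distributed cache.
          public static Configuration withLongerReplTimeout() {
              return new ConfigurationBuilder()
                  .clustering()
                      .cacheMode(CacheMode.DIST_SYNC)
                      .sync().replTimeout(30000)
                  .build();
          }
      }
      ```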

       

      I wonder if adjusting some of the JGroups health check timeouts could actually fix this?
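
      To make that question concrete, these are the JGroups properties I have in mind (failure detection, merging, join timeout). They are shown below as a programmatic stack purely to name them; in our deployment they live in the attached XML, the stack shown is a generic UDP one rather than ours, and the values are untested guesses rather than recommendations.

      ```
      import org.jgroups.JChannel;
      import org.jgroups.protocols.FD_ALL;
      import org.jgroups.protocols.FD_SOCK;
      import org.jgroups.protocols.FRAG2;
      import org.jgroups.protocols.MERGE3;
      import org.jgroups.protocols.MFC;
      import org.jgroups.protocols.PING;
      import org.jgroups.protocols.UDP;
      import org.jgroups.protocols.UNICAST3;
      import org.jgroups.protocols.VERIFY_SUSPECT;
      import org.jgroups.protocols.pbcast.GMS;
      import org.jgroups.protocols.pbcast.NAKACK2;
      import org.jgroups.protocols.pbcast.STABLE;

      public class JGroupsTimeoutSketch {

          // Illustrative only: the same properties sit in our attached XML config;
          // the stack and values below are untested guesses, not our real settings.
          public static JChannel buildChannel() throws Exception {
              return new JChannel(
                  new UDP(),
                  new PING(),
                  new MERGE3()
                      .setValue("min_interval", 10000L)
                      .setValue("max_interval", 30000L),
                  new FD_SOCK(),
                  new FD_ALL()
                      .setValue("interval", 15000L)      // heartbeat every 15 s
                      .setValue("timeout", 60000L),      // suspect after 60 s of silence
                  new VERIFY_SUSPECT()
                      .setValue("timeout", 5000L),       // double-check before suspecting
                  new NAKACK2(),
                  new UNICAST3(),
                  new STABLE(),
                  new GMS()
                      .setValue("join_timeout", 30000L), // give joiners more time under load
                  new MFC(),
                  new FRAG2());
          }
      }
      ```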

       

      Also, would migrating to Infinispan 8.2 help in any way with the above?

       

      I have attached the JGroups config.