I'm trying to orchestrate the creation and dynamic scaling of an Infinispan cluster, and I'm struggling to get the nodes to cluster together correctly. I'm using TCPPING for node discovery because the network I'm on does not support UDP multicast and we don't want a dependency on TCPGOSSIP nodes. My goal is to add a new node to an existing cluster without impacting service.
My naive approach was to create the new node and list the IP addresses of all the existing nodes in the initial_hosts parameter for TCPPING (I've also tried listing all the existing nodes plus the new node). I've attached a copy of the configuration below. When I start the new node, it comes up and reports a new JGroups view containing just itself (see the end of the attached server.log.new-node) and never reports the rest of the cluster. On the rest of the cluster (end of server.log.old-node), none of the nodes report the arrival of the new node, but they all start logging a warning every 20 seconds (which I guess is when MERGE2 runs?) saying "no physical address for b6dcb8f0-8311-2e33-01d6-a337eada15f8, dropping message". From reading around, it seems that the new node is partially announcing itself to the cluster, but isn't sending enough information for the other nodes to contact it afterwards.
If I restart the newly added node, it consistently comes up as expected: the full JGroups view is reported on all the nodes, new and old, and there are no warnings in the logs. This makes me suspect that the new node's discovery is timing out the first time it starts up (other services start at the same time, so the CPU is quite busy) and that it therefore never properly announces itself to the old nodes. I tried extending the timeout parameter for TCPPING, but that causes other Infinispan services to fail to come up because JGroups is not yet available.
Can anyone see anything wrong with my setup, or suggest ways of getting diagnostics from the systems? If discovery is timing out, is there any way to extend the timeout without preventing the rest of Infinispan from starting?
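
On the diagnostics side, one basic check that does not depend on JGroups at all is whether each entry in initial_hosts accepts a plain TCP connection from the new node. The sketch below is only an illustration of that idea: it parses a list in the host[port] form that TCPPING uses and attempts a connection to each entry. The class name, the default host list and the 2-second connect timeout are placeholders I made up, not values from the attached configuration, and a successful connection only shows the port is reachable, not that discovery itself works.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of JGroups or Infinispan.
public class InitialHostsCheck {

    // Parses a TCPPING-style list such as "10.0.0.1[7800],10.0.0.2[7800]".
    static List<InetSocketAddress> parse(String initialHosts) {
        List<InetSocketAddress> result = new ArrayList<>();
        for (String entry : initialHosts.split(",")) {
            String e = entry.trim();
            if (e.isEmpty()) {
                continue; // tolerate trailing commas
            }
            int open = e.indexOf('[');
            int close = e.indexOf(']');
            if (open < 1 || close < open) {
                throw new IllegalArgumentException("Expected host[port], got: " + e);
            }
            String host = e.substring(0, open);
            int port = Integer.parseInt(e.substring(open + 1, close));
            // Unresolved here; name resolution happens when connecting.
            result.add(InetSocketAddress.createUnresolved(host, port));
        }
        return result;
    }

    // Returns true if a TCP connection can be opened within timeoutMs.
    static boolean canConnect(InetSocketAddress addr, int timeoutMs) {
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(addr.getHostString(), addr.getPort()), timeoutMs);
            return true;
        } catch (IOException ex) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Placeholder default; pass the real initial_hosts value as the first argument.
        String hosts = args.length > 0 ? args[0] : "127.0.0.1[7800]";
        for (InetSocketAddress a : parse(hosts)) {
            System.out.println(a.getHostString() + ":" + a.getPort()
                    + " reachable=" + canConnect(a, 2000));
        }
    }
}
```

Running it on the new node with the same initial_hosts string as in the configuration, and again on an old node with the new node's address, would at least rule out a firewall or bind-address problem in either direction.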