3 Replies Latest reply on Feb 9, 2015 2:06 PM by il_pizzaiolo

    Only one node refuses to join an Infinispan cluster

    il_pizzaiolo

      This is a real headscratcher, so I'm looking for some help.  My customer has three nodes with more-or-less identical Infinispan and JGroups configs (see below); the application code that uses Infinispan is also the same across all three nodes.  Each node's jgroups-tcp.xml uses that host's own name in '<TCP bind_addr="${jgroups.bind_addr:<FQDN>}"', but the TCPPING 'initial_hosts' list is identical on every node.  I've attached a sample jgroups-tcp.xml.

       

      The behavior is this:

      1)  Start node-1.

      2)  Start node-2.  Node-1 and node-2 see each other and join.

      3)  Start node-3.  Node-1 and node-2 log that they see it:  ISPN000094: Received new cluster view: [<node-1's name>|6] (3) [<node-1's name>, <node-2's name>, <node-3's name>]

      4)  At about this time, node-3 logs the following complaint:

      <TIMESTAMP> - [ERROR] - from org.jgroups.protocols.TCP in OOB-1,shared=tcp

      JGRP000030: null: failed handling incoming message: java.lang.NoSuchFieldError: serializedCreator

      5)  Exactly 4 minutes later, node-3 reports that it can't start Infinispan and goes down:

      <TIMESTAMP> - [ERROR] - from application in main

      Exception occured in InfinispanPlugin.onStartUnable to invoke method public void org.infinispan.statetransfer.StateTransferManagerImpl.start() throws java.lang.Exception on object of type StateTransferManagerImpl

      org.infinispan.commons.CacheException: Unable to invoke method public void org.infinispan.statetransfer.StateTransferManagerImpl.start() throws java.lang.Exception on object of type StateTransferManagerImpl

          at org.infinispan.commons.util.ReflectionUtil.invokeAccessibly(ReflectionUtil.java:185) ~[org.infinispan.infinispan-commons-6.0.2.Final.jar:6.0.2.Final]

          at org.infinispan.factories.AbstractComponentRegistry$PrioritizedMethod.invoke(AbstractComponentRegistry.java:869) ~[org.infinispan.infinispan-core-6.0.2.Final.jar:6.0.2.Final]

          at org.infinispan.factories.AbstractComponentRegistry.invokeStartMethods(AbstractComponentRegistry.java:638) ~[org.infinispan.infinispan-core-6.0.2.Final.jar:6.0.2.Final]

          at org.infinispan.factories.AbstractComponentRegistry.internalStart(AbstractComponentRegistry.java:627) ~[org.infinispan.infinispan-core-6.0.2.Final.jar:6.0.2.Final]

          at org.infinispan.factories.AbstractComponentRegistry.start(AbstractComponentRegistry.java:530) ~[org.infinispan.infinispan-core-6.0.2.Final.jar:6.0.2.Final]

          at org.infinispan.factories.ComponentRegistry.start(ComponentRegistry.java:216) ~[org.infinispan.infinispan-core-6.0.2.Final.jar:6.0.2.Final]

          at org.infinispan.CacheImpl.start(CacheImpl.java:675) ~[org.infinispan.infinispan-core-6.0.2.Final.jar:6.0.2.Final]

          at org.infinispan.manager.DefaultCacheManager.wireAndStartCache(DefaultCacheManager.java:553) ~[org.infinispan.infinispan-core-6.0.2.Final.jar:6.0.2.Final]

          at org.infinispan.manager.DefaultCacheManager.createCache(DefaultCacheManager.java:516) ~[org.infinispan.infinispan-core-6.0.2.Final.jar:6.0.2.Final]

          at org.infinispan.manager.DefaultCacheManager.getCache(DefaultCacheManager.java:398) ~[org.infinispan.infinispan-core-6.0.2.Final.jar:6.0.2.Final]

          ...

      Caused by: org.infinispan.util.concurrent.TimeoutException: Node <node-1's name> timed out

          at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommand(CommandAwareRpcDispatcher.java:174) ~[org.infinispan.infinispan-core-6.0.2.Final.jar:6.0.2.Final]

          at org.infinispan.remoting.transport.jgroups.JGroupsTransport.invokeRemotely(JGroupsTransport.java:521) ~[org.infinispan.infinispan-core-6.0.2.Final.jar:6.0.2.Final]

          at org.infinispan.topology.LocalTopologyManagerImpl.executeOnCoordinator(LocalTopologyManagerImpl.java:287) ~[org.infinispan.infinispan-core-6.0.2.Final.jar:6.0.2.Final]

          at org.infinispan.topology.LocalTopologyManagerImpl.join(LocalTopologyManagerImpl.java:100) ~[org.infinispan.infinispan-core-6.0.2.Final.jar:6.0.2.Final]

          at org.infinispan.statetransfer.StateTransferManagerImpl.start(StateTransferManagerImpl.java:100) ~[org.infinispan.infinispan-core-6.0.2.Final.jar:6.0.2.Final]

          ...

      Caused by: org.jgroups.TimeoutException: timeout sending message to <node-1's name>

          at org.jgroups.blocks.MessageDispatcher.sendMessage(MessageDispatcher.java:419) ~[org.jgroups.jgroups-3.4.1.Final.jar:3.4.1.Final]

          at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.processSingleCall(CommandAwareRpcDispatcher.java:353) ~[org.infinispan.infinispan-core-6.0.2.Final.jar:6.0.2.Final]

          at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommand(CommandAwareRpcDispatcher.java:167) ~[org.infinispan.infinispan-core-6.0.2.Final.jar:6.0.2.Final]

       

      It's not specific to node-1 either: across starts and restarts of all the nodes, node-3 sometimes fails while trying to talk to node-2 instead.

       

      I think the core issue is the initial 'java.lang.NoSuchFieldError: serializedCreator', which points to a problem with node-3 unmarshalling objects from its cluster mates.  However, all three nodes have the same versions of:

      org.infinispan.infinispan-core-6.0.2.Final.jar
      org.jboss.marshalling.jboss-marshalling-river-1.4.4.Final.jar
      org.jboss.marshalling.jboss-marshalling-1.4.4.Final.jar


      I captured concurrent network traces on all three nodes, and they show node-3 communicating with node-1 (in the scenario above).  This problem appeared out of the blue but has been consistent for about a week.  Does anyone have any guesses as to what the issue could be?

        • 1. Re: Only one node refuses to join an Infinispan cluster
          wdfink

          What approach do you use? Is it an Infinispan/JDG server, or is it an application using embedded mode?

          If you use embedded mode, you might have different JGroups versions packed into the application, which can cause errors like this.

          • 2. Re: Only one node refuses to join an Infinispan cluster
            il_pizzaiolo

            We're using Infinispan embedded within an application.  The apps all use org.jgroups.jgroups-3.4.1.Final.jar; I see this in the log messages on all three nodes.  I did a little more digging, and it looks like the initial error,

            org.jgroups.protocols.TCP in OOB-1,shared=tcp  JGRP000030: null: failed handling incoming message: java.lang.NoSuchFieldError: serializedCreator

            comes as the result of an incoming out-of-band message.  Could there be an issue with the way the incoming messages are formatted/mangled, such that the third node can't read the data?  From what I've found, 'serializedCreator' may correspond to a field in org.jboss.marshalling.MarshallingConfiguration or org.jboss.marshalling.cloner.ClonerConfiguration; does anyone know if/how either of these would be involved with this problem, or am I off track?
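A NoSuchFieldError is a linkage error: it is thrown when code compiled against one version of a class runs against another version that lacks the referenced field, so it usually says more about the local classpath than about the bytes on the wire. One way to test that idea on each node is to ask, by reflection, whether the candidate classes actually declare 'serializedCreator'. The following is only a sketch; the class name FieldProbe and its probe method are made up for illustration, and the two candidate class names are the guesses from the post above, not confirmed owners of the field.

```java
import java.lang.reflect.Field;

public class FieldProbe {

    /**
     * Reports whether the named class (or one of its superclasses) declares
     * a field with the given name. Returns "present", "absent", or a string
     * starting with "class not loadable".
     */
    public static String probe(String className, String fieldName) {
        try {
            // initialize=false: load the class without running static initializers
            Class<?> c = Class.forName(className, false, FieldProbe.class.getClassLoader());
            for (Class<?> k = c; k != null; k = k.getSuperclass()) {
                for (Field f : k.getDeclaredFields()) {
                    if (f.getName().equals(fieldName)) {
                        return "present";
                    }
                }
            }
            return "absent";
        } catch (ClassNotFoundException | LinkageError e) {
            return "class not loadable: " + e;
        }
    }

    public static void main(String[] args) {
        // Candidate classes taken from the post; they are guesses, not confirmed.
        String[] candidates = {
            "org.jboss.marshalling.MarshallingConfiguration",
            "org.jboss.marshalling.cloner.ClonerConfiguration"
        };
        for (String name : candidates) {
            System.out.println(name + ": " + probe(name, "serializedCreator"));
        }
    }
}
```

Run it with the same classpath the application uses on each node. If the answers differ between nodes, the nodes are not loading the same jars, whatever the file listing suggests.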

            • 3. Re: Only one node refuses to join an Infinispan cluster
              il_pizzaiolo

              In the end, this came down to a version mismatch of the following jars:  org.infinispan.infinispan-commons, org.infinispan.infinispan-core, org.jboss.marshalling.jboss-marshalling, and org.jboss.marshalling.jboss-marshalling-river.  All of the apps shared a common NFS lib directory, but only the two working nodes picked up the same versions of all four jars.  The non-working nodes were using a mix of other versions, and different combinations resulted in different error messages.  The error above is the result of running the following on a node:

               

              org.infinispan.infinispan-core-6.0.2.Final.jar
              org.jboss.marshalling.jboss-marshalling-1.4.4.Final.jar
              org.jboss.marshalling.jboss-marshalling-river-1.3.18.GA.jar


              which is attempting to connect to a node running:


              org.infinispan.infinispan-core-6.0.0.Final.jar
              org.jboss.marshalling.jboss-marshalling-1.3.18.GA.jar
              org.jboss.marshalling.jboss-marshalling-river-1.3.18.GA.jar
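For anyone chasing a similar mismatch, it can help to print which jar each class is really loaded from instead of trusting the directory listing. Below is a minimal sketch using the standard ProtectionDomain/CodeSource API. The class name ClasspathReport is invented here, and the default class names in main are assumptions about which classes live in the jars listed above; pass your own class names as arguments if they differ.

```java
import java.security.CodeSource;

public class ClasspathReport {

    /** Describes where a loaded class came from and its manifest version, if any. */
    public static String describe(Class<?> c) {
        CodeSource src = c.getProtectionDomain().getCodeSource();
        // JDK bootstrap classes typically have no code source
        String location = (src == null || src.getLocation() == null)
                ? "(bootstrap or unknown)"
                : src.getLocation().toString();
        Package p = c.getPackage();
        // Implementation-Version comes from the jar manifest and may be null
        String version = (p == null) ? null : p.getImplementationVersion();
        return c.getName() + " <- " + location + " [Implementation-Version: " + version + "]";
    }

    /** Same as describe, but tolerates classes that are missing from the classpath. */
    public static String describeByName(String className) {
        try {
            Class<?> c = Class.forName(className, false, ClasspathReport.class.getClassLoader());
            return describe(c);
        } catch (ClassNotFoundException | LinkageError e) {
            return className + " <- not loadable (" + e + ")";
        }
    }

    public static void main(String[] args) {
        // Assumed representative classes, one per jar of interest.
        String[] names = args.length > 0 ? args : new String[] {
            "org.infinispan.Version",
            "org.jgroups.Version",
            "org.jboss.marshalling.MarshallingConfiguration",
            "org.jboss.marshalling.river.RiverMarshaller"
        };
        for (String name : names) {
            System.out.println(describeByName(name));
        }
    }
}
```

Comparing this output across nodes should show at a glance when one of them resolves a different jar from the shared lib directory. If the application uses its own class loaders, the check needs to run inside the application to see what it sees.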


              Since there appears to be a strong dependence on common versions across hosts, I wonder if the versions could be advertised and checked when nodes attempt to sync.  That way, the failure would be immediate and clearly logged.
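To make the suggestion concrete, here is a rough sketch of the kind of check meant: each node advertises a map of library name to version, and the receiver lists every difference before going any further. This is plain Java and not an existing Infinispan or JGroups feature; the class VersionCheck, its method, and the short library names used as keys are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;

public class VersionCheck {

    /** Returns one line per library whose version differs or is missing on either side. */
    public static List<String> mismatches(Map<String, String> local, Map<String, String> remote) {
        List<String> out = new ArrayList<>();
        // Sorted union of library names, so the report order is stable
        Set<String> names = new TreeSet<>(local.keySet());
        names.addAll(remote.keySet());
        for (String name : names) {
            String l = local.get(name);
            String r = remote.get(name);
            if (!Objects.equals(l, r)) {
                out.add(name + ": local=" + l + ", remote=" + r);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // The two jar sets reported in this thread
        Map<String, String> failing = new HashMap<>();
        failing.put("infinispan-core", "6.0.2.Final");
        failing.put("jboss-marshalling", "1.4.4.Final");
        failing.put("jboss-marshalling-river", "1.3.18.GA");

        Map<String, String> peer = new HashMap<>();
        peer.put("infinispan-core", "6.0.0.Final");
        peer.put("jboss-marshalling", "1.3.18.GA");
        peer.put("jboss-marshalling-river", "1.3.18.GA");

        for (String line : mismatches(failing, peer)) {
            System.out.println("VERSION MISMATCH " + line);
        }
    }
}
```

With the two jar sets above, this would flag infinispan-core and jboss-marshalling as soon as the nodes compared notes, instead of surfacing four minutes later as a timeout.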