
    Wildfly 8.1 failover occasionally fails, indicated by JGRP000032 messages

    scarpent

      I'm using Wildfly 8.1 with standalone HA clustering, and a jgroups config like so:


      <stack name="tcp">
          <transport type="TCP" socket-binding="jgroups-tcp"/>
          <protocol type="TCPPING">
              <property name="initial_hosts">${jgroups.tcpping.initial_hosts}</property>
              <property name="port_range">1</property>
              <property name="num_initial_members">3</property>
          </protocol>
          <protocol type="MERGE2"/>
          <protocol type="FD_SOCK" socket-binding="jgroups-tcp-fd"/>
          <protocol type="FD"/>
          <protocol type="VERIFY_SUSPECT"/>
          <protocol type="pbcast.NAKACK2">
              <property name="use_mcast_xmit">false</property>
              <property name="use_mcast_xmit_req">false</property>
          </protocol>
          <protocol type="UNICAST3"/>
          <protocol type="pbcast.STABLE"/>
          <protocol type="pbcast.GMS"/>
          <protocol type="MFC"/>
          <protocol type="FRAG2"/>
          <protocol type="RSVP"/>
      </stack>

      (Using TCP, since multicast isn't available in AWS.)
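
      For completeness, each node gets its peer list from the jgroups.tcpping.initial_hosts system property, in JGroups' host[port] format (with port_range=1, ports 7600-7601 get probed). It can be passed with -D on the command line or set in standalone.xml; here's the standalone.xml form, with placeholder addresses:

      <system-properties>
          <property name="jgroups.tcpping.initial_hosts" value="10.0.1.10[7600],10.0.1.11[7600]"/>
      </system-properties>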


      I have two servers in my dev environment, and normally things work as expected. When I bring up the second server, I see this in the log of the first:


      2015-03-25 13:01:34,367 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-18,shared=tcp) ISPN000094: Received new cluster view: [node-admin-app1/web|5] (2) [node-admin-app1/web, node-admin-app2/web]


      And when shutting down a server, I'll see something like this in the log of the remaining one:


      2015-03-25 13:02:56,796 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-8,shared=tcp) ISPN000094: Received new cluster view: [node-admin-app2/web|6] (1) [node-admin-app2/web]


      With that, I've confirmed in my application that the session failed over properly.
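
      In case it matters: the app is marked distributable, which Wildfly requires before it will replicate sessions. The relevant bit of web.xml, trimmed down:

      <web-app xmlns="http://xmlns.jcp.org/xml/ns/javaee" version="3.1">
          <distributable/>
      </web-app>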


      Occasionally when shutting down a server, I won't see the "new cluster view" entry, and instead will get a bunch of messages like this:


      2015-03-25 12:54:33,269 WARN  [org.jgroups.protocols.TCP] (Timer-2,shared=tcp) JGRP000032: null: no physical address for node-admin-app1/web, dropping message


      They eventually stop. But when I bring the other server back up, I again don't see the expected cluster view message, and failover does not work the next time a server is shut down. The session is borked.
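
      From what I understand, JGRP000032 means the transport couldn't map the logical name node-admin-app1/web to a physical address in its logical address cache, so it drops the message. As an experiment I've been considering raising the cache expiration on the transport; that the cache is the culprit is just a guess on my part:

      <transport type="TCP" socket-binding="jgroups-tcp">
          <!-- logical_addr_cache_expiration is a standard JGroups TP property (in ms);
               raising it above the default is only an experiment, not a known fix -->
          <property name="logical_addr_cache_expiration">360000</property>
      </transport>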


      Please let me know if more information would help. Thank you!