2 Replies Latest reply on Jul 21, 2015 10:34 AM by saranya guna

    Infinispan cache UDP

    saranya guna Newbie

      Hi All,

       

      We faced the issue below in our production environment.

       

      When the application tried to acquire a lock on the cache, we got a replication timeout exception.

       

      We have four nodes in a clustered setup (node1, node2, node3 and node4). Node3 went down and came back up within two minutes. When it went down, the message below was seen only on the coordinator node, i.e. node1.

       

      16:56:24,584 DEBUG [org.jgroups.protocols.pbcast.NAKACK2] (Incoming-6,shared=udp) removed node3:server/test from xmit_table (not member anymore)

       

      The above message did not appear on node2 or node4. We got the error below on the coordinator node.

       

      16:56:29,584 WARN  [org.jgroups.protocols.pbcast.GMS] (ViewHandler,test,node1:server/test) node1:server/test: failed to collect all ACKs (expected=3) for view [node1:server/test|4] after 5000ms, missing ACKs from [node2:server/test, node4:server/test]
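
The 5000ms in this warning presumably corresponds to the GMS `view_ack_collection_timeout` property. A minimal sketch of how that value could be stated explicitly in the stack, assuming the `<property>` child syntax of the jgroups:1.1 subsystem; the value is simply copied from the log line and is not a tuning recommendation:

```xml
<!-- Sketch only: makes the ACK collection timeout seen in the log explicit. -->
<!-- Assumption: 5000 is the effective default here; we have not set it ourselves. -->
<protocol type="pbcast.GMS">
    <property name="view_ack_collection_timeout">5000</property>
</protocol>
```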

       

       

      Likewise, when node3 came back up, the messages below were received only by the coordinator.

       

      16:57:57,642 DEBUG [org.jgroups.protocols.pbcast.GMS] (Incoming-12,shared=udp) node1:server/test installing view [node1:server/test|5] [node1:server/test,node2:server/test, node4:server/test, node3:server/test]

      16:57:57,642 DEBUG [org.jgroups.protocols.FD_SOCK] (Incoming-12,shared=udp) VIEW_CHANGE received: [node1:server/test, node2:server/test, node4:server/test, node3:server/test]

       

      While checking the logs on node2, I could see that FD had detected, through missing heartbeats, that node3 went down:

       

      16:56:59,764 DEBUG [org.jgroups.protocols.FD] (Timer-3,shared=udp) node2:server/test: received no heartbeat from node3:server/test for 5 times (30000 milliseconds), suspecting it
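
The numbers in this log line (5 tries, 30000 milliseconds in total) imply a per-try heartbeat timeout of 30000 / 5 = 6000 ms. A minimal sketch of the equivalent explicit FD settings, assuming the `<property>` child syntax of the jgroups:1.1 subsystem; the values are inferred from the log line and were not set explicitly in our configuration:

```xml
<!-- Sketch only: FD settings inferred from the log line above. -->
<!-- max_tries (5) x timeout (6000 ms) = 30000 ms before a node is suspected. -->
<protocol type="FD">
    <property name="timeout">6000</property>
    <property name="max_tries">5</property>
</protocol>
```

With these values, a node that stops responding is suspected by FD roughly 30 seconds after its last heartbeat, which matches the gap between node3 going down (16:56:24) and node2 suspecting it (16:56:59).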

       

      Below are the cache configurations:

       

      <cache-container name="test" aliases="test" default-cache="test">

                          <transport lock-timeout="60000"/>

                          <replicated-cache name="test" mode="SYNC" start="EAGER" batching="true">

                              <transaction mode="NON_XA" locking="PESSIMISTIC"/>

                              <locking isolation="READ_COMMITTED" striping="false" acquire-timeout="600000"/>

                          </replicated-cache>

      </cache-container>

       

      <subsystem xmlns="urn:jboss:domain:jgroups:1.1" default-stack="udp">

                      <stack name="udp">

                          <transport type="UDP" socket-binding="jgroups-udp"/>

                          <protocol type="PING"/>

                          <protocol type="MERGE3"/>

                          <protocol type="FD_SOCK" socket-binding="jgroups-udp-fd"/>

                          <protocol type="FD"/>

                          <protocol type="VERIFY_SUSPECT"/>

                          <protocol type="pbcast.NAKACK"/>

                          <protocol type="UNICAST2"/>

                          <protocol type="pbcast.STABLE"/>

                          <protocol type="pbcast.GMS"/>

                          <protocol type="UFC"/>

                          <protocol type="MFC"/>

                          <protocol type="FRAG2"/>

                          <protocol type="RSVP"/>

                      </stack>

      </subsystem>