5 Replies Latest reply on Jun 2, 2010 4:00 PM by vblagojevic

    StateTransferException on getCache()

    cbo_

      I have a cluster member attempting to join a replicated cache populated with some 100,000 entries, with fetchInMemoryState=true, and it is getting the following StateTransferException.  Ideas?

       

       

      2010-05-26 14:18:16,917 472255 TRACE [org.jgroups.protocols.UNICAST] (Incoming-45,infinispan-cluster-OHServerHybrid-test16,test16a_GCBC-4894:) test16a_GCBC-4894 <-- ACK(test16b_GCBC-40743: #2)
      2010-05-26 14:18:17,082 472420 TRACE [org.jgroups.protocols.TCP] (Incoming-45,infinispan-cluster-OHServerHybrid-test16,test16a_GCBC-4894:) received [dst: test16a_GCBC-4894, src: test16b_GCBC-40743 (3 headers), size=0 bytes], headers are FLUSH: FLUSH[type=ABORT_FLUSH,viewId=3], UNICAST: DATA, seqno=137, conn_id=2, TCP: [channel_name=infinispan-cluster-OHServerHybrid-test16]
      2010-05-26 14:18:17,082 472420 TRACE [org.jgroups.protocols.UNICAST] (Incoming-45,infinispan-cluster-OHServerHybrid-test16,test16a_GCBC-4894:) test16a_GCBC-4894 <-- DATA(test16b_GCBC-40743: #137, conn_id=2)
      2010-05-26 14:18:17,083 472421 TRACE [org.jgroups.protocols.UNICAST] (Incoming-45,infinispan-cluster-OHServerHybrid-test16,test16a_GCBC-4894:) test16a_GCBC-4894 --> ACK(test16b_GCBC-40743: #151)
      2010-05-26 14:18:17,083 472421 TRACE [org.jgroups.protocols.TCP] (Incoming-45,infinispan-cluster-OHServerHybrid-test16,test16a_GCBC-4894:) sending msg to test16b_GCBC-40743, src=test16a_GCBC-4894, headers are UNICAST: ACK, seqno=151, TCP: [channel_name=infinispan-cluster-OHServerHybrid-test16]
      2010-05-26 14:18:17,083 472421 TRACE [org.jgroups.protocols.TCP] (Incoming-45,infinispan-cluster-OHServerHybrid-test16,test16a_GCBC-4894:) dest=170.137.237.81:44446 (102 bytes)
      2010-05-26 14:18:17,083 472421 DEBUG [org.jgroups.protocols.pbcast.FLUSH] (Incoming-45,infinispan-cluster-OHServerHybrid-test16,test16a_GCBC-4894:) test16a_GCBC-4894: received ABORT_FLUSH from flush coordinator test16b_GCBC-40743,  am I flush participant=true
      2010-05-26 14:18:17,083 472421 DEBUG [org.jgroups.protocols.pbcast.FLUSH] (Incoming-45,infinispan-cluster-OHServerHybrid-test16,test16a_GCBC-4894:) test16a_GCBC-4894: received ABORT_FLUSH from flush coordinator test16b_GCBC-40743,  am I flush participant=true
      2010-05-26 14:18:17,083 472421 DEBUG [org.jgroups.protocols.pbcast.FLUSH] (Incoming-45,infinispan-cluster-OHServerHybrid-test16,test16a_GCBC-4894:) test16a_GCBC-4894: received ABORT_FLUSH from flush coordinator test16b_GCBC-40743,  am I flush participant=true
      2010-05-26 14:18:17,083 472421 DEBUG [org.jgroups.protocols.pbcast.FLUSH] (Incoming-45,infinispan-cluster-OHServerHybrid-test16,test16a_GCBC-4894:) test16a_GCBC-4894: received ABORT_FLUSH from flush coordinator test16b_GCBC-40743,  am I flush participant=true
      2010-05-26 14:18:17,083 472421 DEBUG [org.jgroups.protocols.pbcast.GMS] (Incoming-45,infinispan-cluster-OHServerHybrid-test16,test16a_GCBC-4894:) test16a_GCBC-4894: sending RESUME event
      2010-05-26 14:18:17,083 472421 DEBUG [org.jgroups.protocols.pbcast.FLUSH] (Incoming-45,infinispan-cluster-OHServerHybrid-test16,test16a_GCBC-4894:) test16a_GCBC-4894: received RESUME, sending STOP_FLUSH to all
      2010-05-26 14:18:17,083 472421 TRACE [org.jgroups.protocols.pbcast.NAKACK] (Incoming-45,infinispan-cluster-OHServerHybrid-test16,test16a_GCBC-4894:) sending test16a_GCBC-4894#79
      2010-05-26 14:18:17,083 472421 TRACE [org.jgroups.protocols.TCP] (Incoming-45,infinispan-cluster-OHServerHybrid-test16,test16a_GCBC-4894:) sending msg to null, src=test16a_GCBC-4894, headers are FLUSH: FLUSH[type=STOP_FLUSH,viewId=4], NAKACK: [MSG, seqno=79], TCP: [channel_name=infinispan-cluster-OHServerHybrid-test16]
      2010-05-26 14:18:17,084 472422 TRACE [org.jgroups.protocols.TCP] (Incoming-45,infinispan-cluster-OHServerHybrid-test16,test16a_GCBC-4894:) looping back message [dst: <null>, src: test16a_GCBC-4894 (3 headers), size=0 bytes]
      2010-05-26 14:18:17,085 472423 TRACE [org.jgroups.protocols.TCP] (Incoming-47,infinispan-cluster-OHServerHybrid-test16,test16a_GCBC-4894:) received [dst: <null>, src: test16a_GCBC-4894 (3 headers), size=0 bytes], headers are FLUSH: FLUSH[type=STOP_FLUSH,viewId=4], NAKACK: [MSG, seqno=79], TCP: [channel_name=infinispan-cluster-OHServerHybrid-test16]
      2010-05-26 14:18:17,085 472423 TRACE [org.jgroups.protocols.pbcast.STABLE] (Incoming-45,infinispan-cluster-OHServerHybrid-test16,test16a_GCBC-4894:) test16a_GCBC-4894: resetting digest from NAKACK: [test16a_GCBC-4894#78]
      ...skipping...
      2010-05-26 14:17:30,004 425342 WARN  [org.jgroups.protocols.pbcast.NAKACK] (Incoming-42,infinispan-cluster-OHServerHybrid-test16,test16a_GCBC-4894:) test16a_GCBC-4894: dropped message from test16b_GCBC-40743 (not in xmit_table), keys are [test16a_GCBC-4894], view=[test16a_GCBC-4894|4] [test16a_GCBC-4894]
      2010-05-26 14:17:30,014 425352 DEBUG [org.jgroups.protocols.pbcast.GMS] (ViewHandler,infinispan-cluster-OHServerHybrid-test16,test16a_GCBC-4894:) determining merge leader from [test16a_GCBC-4894, test16b_GCBC-40743]
      2010-05-26 14:17:30,014 425352 DEBUG [org.jgroups.protocols.pbcast.GMS] (ViewHandler,infinispan-cluster-OHServerHybrid-test16,test16a_GCBC-4894:) I (test16a_GCBC-4894) am not the merge leader, waiting for merge leader (test16b_GCBC-40743) to initiate merge
      2010-05-26 14:17:30,215 425553 TRACE [org.jgroups.protocols.TCP] (OOB-116,infinispan-cluster-OHServerHybrid-test16,test16a_GCBC-4894:) received [dst: test16a_GCBC-4894, src: test16b_GCBC-40743 (2 headers), size=0 bytes, flags=OOB], headers are FD: heartbeat, TCP: [channel_name=infinispan-cluster-OHServerHybrid-test16]
      2010-05-26 14:17:30,215 425553 TRACE [org.jgroups.protocols.FD] (OOB-116,infinispan-cluster-OHServerHybrid-test16,test16a_GCBC-4894:) received are-you-alive from test16b_GCBC-40743, sending response
      2010-05-26 14:17:30,215 425553 TRACE [org.jgroups.protocols.TCP] (OOB-116,infinispan-cluster-OHServerHybrid-test16,test16a_GCBC-4894:) sending msg to test16b_GCBC-40743, src=test16a_GCBC-4894, headers are FD: heartbeat ack, TCP: [channel_name=infinispan-cluster-OHServerHybrid-test16]
      2010-05-26 14:17:30,215 425553 TRACE [org.jgroups.protocols.TCP] (OOB-116,infinispan-cluster-OHServerHybrid-test16,test16a_GCBC-4894:) dest=170.137.237.81:44446 (114 bytes)
      2010-05-26 14:17:31,338 426676 ERROR [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-13,infinispan-cluster-OHServerHybrid-test16,test16a_GCBC-4894:) Caught while requesting or applying state

      org.infinispan.statetransfer.StateTransferException: java.io.EOFException: Read past end of file

              at org.infinispan.statetransfer.StateTransferManagerImpl.applyState(StateTransferManagerImpl.java:329)

              at org.infinispan.remoting.InboundInvocationHandlerImpl.applyState(InboundInvocationHandlerImpl.java:78)

              at org.infinispan.remoting.transport.jgroups.JGroupsTransport.setState(JGroupsTransport.java:582)

              at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.handleUpEvent(MessageDispatcher.java:670)

              at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.up(MessageDispatcher.java:730)

              at org.jgroups.JChannel.up(JChannel.java:1427)

              at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:855)

              at org.jgroups.protocols.pbcast.FLUSH.up(FLUSH.java:483)

              at org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER.connectToStateProvider(STREAMING_STATE_TRANSFER.java:524)

              at org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER.handleStateRsp(STREAMING_STATE_TRANSFER.java:463)

              at org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER.up(STREAMING_STATE_TRANSFER.java:225)

              at org.jgroups.protocols.FRAG2.up(FRAG2.java:188)

              at org.jgroups.protocols.FC.up(FC.java:474)

              at org.jgroups.protocols.pbcast.GMS.up(GMS.java:890)

              at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:234)

              at org.jgroups.protocols.UNICAST.handleDataReceived(UNICAST.java:614)

              at org.jgroups.protocols.UNICAST.up(UNICAST.java:294)

              at org.jgroups.protocols.pbcast.NAKACK.up(NAKACK.java:707)

              at org.jgroups.protocols.VERIFY_SUSPECT.up(VERIFY_SUSPECT.java:132)

              at org.jgroups.protocols.FD.up(FD.java:261)

              at org.jgroups.protocols.FD_SOCK.up(FD_SOCK.java:270)

              at org.jgroups.protocols.MERGE2.up(MERGE2.java:210)

              at org.jgroups.protocols.Discovery.up(Discovery.java:277)

              at org.jgroups.protocols.PING.up(PING.java:67)

              at org.jgroups.protocols.MPING.up(MPING.java:175)

              at org.jgroups.protocols.TP.passMessageUp(TP.java:997)

              at org.jgroups.protocols.TP.access$100(TP.java:56)

              at org.jgroups.protocols.TP$IncomingPacket.handleMyMessage(TP.java:1537)

              at org.jgroups.protocols.TP$IncomingPacket.run(TP.java:1519)

              at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)

              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

              at java.lang.Thread.run(Thread.java:619)

      Caused by: java.io.EOFException: Read past end of file

              at org.jboss.marshalling.AbstractUnmarshaller.eofOnRead(AbstractUnmarshaller.java:184)

              at org.jboss.marshalling.AbstractUnmarshaller.readUnsignedByteDirect(AbstractUnmarshaller.java:319)

              at org.jboss.marshalling.AbstractUnmarshaller.readUnsignedByte(AbstractUnmarshaller.java:280)

              at org.jboss.marshalling.river.RiverUnmarshaller.doReadObject(RiverUnmarshaller.java:207)

              at org.jboss.marshalling.AbstractUnmarshaller.readObject(AbstractUnmarshaller.java:85)
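
      For context, this is roughly how the joining member obtains the cache (a simplified sketch of the programmatic equivalent, not my exact code; the timeout value is a placeholder):

      import org.infinispan.Cache;
      import org.infinispan.config.Configuration;
      import org.infinispan.config.GlobalConfiguration;
      import org.infinispan.manager.DefaultCacheManager;

      public class JoiningMember {
         public static void main(String[] args) {
            // Clustered cache manager; the JGroups transport comes from the stack shown below.
            GlobalConfiguration global = GlobalConfiguration.getClusteredDefault();

            // Replicated cache that pulls the existing in-memory state from the cluster on join.
            Configuration cfg = new Configuration();
            cfg.setCacheMode(Configuration.CacheMode.REPL_SYNC);
            cfg.setFetchInMemoryState(true);
            cfg.setStateRetrievalTimeout(120000); // placeholder; generous for ~100,000 entries

            DefaultCacheManager manager = new DefaultCacheManager(global);
            manager.defineConfiguration("OHSOrderCache", cfg);

            // getCache() is where state transfer runs and where the StateTransferException surfaces.
            Cache<String, Object> cache = manager.getCache("OHSOrderCache");
            System.out.println("joined with " + cache.size() + " entries");
         }
      }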

      And here is the relevant JGroups configuration file I am using:
      <config xmlns="urn:org:jgroups"
              xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
              xsi:schemaLocation="urn:org:jgroups file:schema/JGroups-2.8.xsd">
         <TCP loopback="true"
              port_range="30"
              recv_buf_size="20000000"
              send_buf_size="640000"
              discard_incompatible_packets="true"
              max_bundle_size="64000"
              max_bundle_timeout="30"
              enable_bundling="true"
              use_send_queues="true"
              sock_conn_timeout="300"
              enable_diagnostics="false"
              thread_pool.enabled="true"
              thread_pool.min_threads="2"
              thread_pool.max_threads="30"
              thread_pool.keep_alive_time="5000"
              thread_pool.queue_enabled="false"
              thread_pool.queue_max_size="100"
              thread_pool.rejection_policy="Discard"
              oob_thread_pool.enabled="true"
              oob_thread_pool.min_threads="2"
              oob_thread_pool.max_threads="30"
              oob_thread_pool.keep_alive_time="5000"
              oob_thread_pool.queue_enabled="false"
              oob_thread_pool.queue_max_size="100"
              oob_thread_pool.rejection_policy="Discard"
               />
         <MPING receive_on_all_interfaces="true"  break_on_coord_rsp="true"
            mcast_addr="${jgroups.udp.mcast_addr:228.6.7.8}" mcast_port="${jgroups.udp.mcast_port:42610}" ip_ttl="${jgroups.udp.ip_ttl:16}"
            num_initial_members="2" num_ping_requests="1"/>
         <MERGE2 max_interval="30000"
                 min_interval="10000"/>
         <FD_SOCK/>
         <FD timeout="3000" max_tries="3"/>
         <VERIFY_SUSPECT timeout="1500"/>
         <pbcast.NAKACK
               use_mcast_xmit="false" gc_lag="0"
               retransmit_timeout="300,600,1200,2400,4800"
               discard_delivered_msgs="false"/>
         <UNICAST timeout="300,600,1200"/>
         <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"
                        max_bytes="400000"/>
         <pbcast.GMS print_local_addr="false" join_timeout="7000" view_bundling="true"/>
         <FC max_credits="2000000"
             min_threshold="0.10"/>
         <FRAG2 frag_size="60000"/>
         <pbcast.STREAMING_STATE_TRANSFER/>
         <!-- <pbcast.STATE_TRANSFER/> -->
         <pbcast.FLUSH timeout="0"/>
      </config>
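
      In case it matters, this is roughly how that JGroups file gets handed to the cache manager (again just a sketch; the file name is a placeholder for my actual path):

      import java.util.Properties;

      import org.infinispan.config.GlobalConfiguration;
      import org.infinispan.manager.DefaultCacheManager;

      public class TransportSetup {
         public static void main(String[] args) {
            GlobalConfiguration global = GlobalConfiguration.getClusteredDefault();
            global.setClusterName("infinispan-cluster-OHServerHybrid-test16");

            // "configurationFile" is the transport property that JGroupsTransport reads.
            Properties transportProps = new Properties();
            transportProps.setProperty("configurationFile", "jgroups-tcp.xml"); // placeholder file name
            global.setTransportProperties(transportProps);

            DefaultCacheManager manager = new DefaultCacheManager(global);
         }
      }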

        • 1. Re: StateTransferException on getCache()
          cbo_

          Forgot to mention: I am using 4.1 Beta1.

          • 2. Re: StateTransferException on getCache()
            cbo_

            Later, I noticed these additional messages:

             

            2010-05-26 14:23:18,200 41915 WARN  [org.infinispan.remoting.InboundInvocationHandlerImpl] (Incoming-2,infinispan-cluster-OHServerHybrid-test16,test16a_GCBC-12952:) Cache named [OHSOrderCache] exists but isn't in a state to handle invocations.  Its state is INSTANTIATED.
            2010-05-26 14:23:18,275 41990 TRACE [org.jgroups.blocks.RequestCorrelator] (Incoming-2,infinispan-cluster-OHServerHybrid-test16,test16a_GCBC-12952:) sending rsp for 8 to test16b_GCBC-40743
            2010-05-26 14:23:18,275 41990 TRACE [org.jgroups.protocols.FC] (Incoming-2,infinispan-cluster-OHServerHybrid-test16,test16a_GCBC-12952:) sender test16b_GCBC-40743 minus 5 credits, 1999995 remaining
            2010-05-26 14:23:18,276 41991 TRACE [org.jgroups.protocols.UNICAST] (Incoming-2,infinispan-cluster-OHServerHybrid-test16,test16a_GCBC-12952:) test16a_GCBC-12952 --> DATA(test16b_GCBC-40743: #4, conn_id=0)
            2010-05-26 14:23:18,276 41991 TRACE [org.jgroups.protocols.TCP] (Incoming-2,infinispan-cluster-OHServerHybrid-test16,test16a_GCBC-12952:) sending msg to test16b_GCBC-40743, src=test16a_GCBC-12952, headers are RequestCorrelator: id=201, type=RSP, id=8, rsp_expected=false, UNICAST: DATA, seqno=4, TCP: [channel_name=infinispan-cluster-OHServerHybrid-test16]
            2010-05-26 14:23:18,276 41991 TRACE [org.jgroups.protocols.TCP] (Incoming-2,infinispan-cluster-OHServerHybrid-test16,test16a_GCBC-12952:) dest=170.137.237.81:44446 (130 bytes)
            2010-05-26 14:23:18,276 41991 TRACE [org.jgroups.protocols.FC] (Incoming-2,infinispan-cluster-OHServerHybrid-test16,test16a_GCBC-12952:) sender test16b_GCBC-40743 minus 37 credits, 1996700 remaining
            2010-05-26 14:23:18,276 41991 TRACE [org.jgroups.protocols.FC] (Incoming-2,infinispan-cluster-OHServerHybrid-test16,test16a_GCBC-12952:) sender test16b_GCBC-40743 minus 37 credits, 1996700 remaining
            2010-05-26 14:23:18,276 41991 TRACE [org.jgroups.blocks.RequestCorrelator] (Incoming-2,infinispan-cluster-OHServerHybrid-test16,test16a_GCBC-12952:) calling (org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher) with request 9
            2010-05-26 14:23:18,279 41994 TRACE [org.jgroups.protocols.TCP] (Incoming-4,infinispan-cluster-OHServerHybrid-test16,test16a_GCBC-12952:) received [dst: test16a_GCBC-12952, src: test16b_GCBC-40743 (2 headers), size=0 bytes], headers are UNICAST: ACK, seqno=4, TCP: [channel_name=infinispan-cluster-OHServerHybrid-test16]
            2010-05-26 14:23:18,279 41994 TRACE [org.jgroups.protocols.UNICAST] (Incoming-4,infinispan-cluster-OHServerHybrid-test16,test16a_GCBC-12952:) test16a_GCBC-12952 <-- ACK(test16b_GCBC-40743: #4)
            2010-05-26 14:23:20,018 43733 TRACE [org.jgroups.protocols.TCP] (OOB-5,infinispan-cluster-OHServerHybrid-test16,test16a_GCBC-12952:) received [dst: test16a_GCBC-12952, src: test16b_GCBC-40743 (2 headers), size=0 bytes, flags=OOB], headers are FD: heartbeat, TCP: [channel_name=infinispan-cluster-OHServerHybrid-test16]
            • 3. Re: StateTransferException on getCache()
              vblagojevic

              Somehow the stream from the state producer got corrupted. Why and how, I cannot tell from the stack traces. Did anything happen to the state producer node at that very moment of state transfer? For example, it might have died while sending data...

              • 4. Re: StateTransferException on getCache()
                cbo_

                I don't think I have those logs around any longer.  Since putting that post out there, I have had some additional thoughts on what might be happening in this scenario.  I will describe the sequence of events below, and I'd like you to comment on whether there is indeed something there that could cause the exception we saw.

                 

                The use case involves 2 VMs, one on each of two machines, using replication mode.  The use case is HA, with one side the owner and the other side warm (the two sides are not both mutating the cache).

                 

                Sequence:

                 

                1) Master/owner VM comes up and populates the cache from a persistent source

                2) Slave VM comes up and executes getCache().  We have fetchInMemoryState=true, so this will likely begin state retrieval from the master VM.

                3) Master VM does some updates/puts().

                4) Master VM is (by design) brought down and back up (keeping the Master status).

                4.1) This VM will join the cluster by calling getCache().  Since it also has fetchInMemoryState=true, I'm guessing this will begin the process of getting state from the slave VM (btw, this is not desirable... just what I am guessing is happening).

                4.2) This VM (since it intends to still be master) will call cache.clear().  I believe this is being done before the full state retrieval has completed in the background.

                 

                If this makes sense, I am planning to change fetchInMemoryState to false when the VM comes up as Master, to avoid the potential clash with the call to cache.clear().
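
                Roughly what I have in mind (just a sketch; isMaster is a stand-in for our own ownership/failover check):

                import org.infinispan.Cache;
                import org.infinispan.config.Configuration;
                import org.infinispan.manager.DefaultCacheManager;

                public class CacheStartup {
                   static Cache<String, Object> startOrderCache(DefaultCacheManager manager, boolean isMaster) {
                      Configuration cfg = new Configuration();
                      cfg.setCacheMode(Configuration.CacheMode.REPL_SYNC);
                      // Master repopulates from the persistent source, so skip state transfer
                      // and avoid racing cache.clear() against an in-flight state stream.
                      cfg.setFetchInMemoryState(!isMaster);
                      manager.defineConfiguration("OHSOrderCache", cfg);

                      Cache<String, Object> cache = manager.getCache("OHSOrderCache");
                      if (isMaster) {
                         cache.clear();  // safe now: no state transfer was requested for this node
                         // ... repopulate from the persistent source ...
                      }
                      return cache;
                   }
                }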

                • 5. Re: StateTransferException on getCache()
                  vblagojevic

                  What might have been happening is that you were bringing down the master while the state transfer was still in progress (the slave taking state from the master)?

                   

                  4.1) You are right about your assumptions.

                  4.2) If you are going to call cache.clear(), then why transfer state at all?! So you are right here as well.

                   

                  In conclusion, I agree that you should set fetchInMemoryState to false if the VM comes up as Master. Hopefully this application-level logic makes sense in your particular scenario.

                   

                  All the best,

                  Vladimir