5 Replies Latest reply on Feb 22, 2007 4:27 AM by belaban

    NAKACK errors on retransmit when cluster is under heavy load

    hmesha

      Env:
      Hardware: 2 x86-based machines
      OS: Suse Linux Enterprise Server 9 SP3
      Server: JBoss 4.0.5.GA
      JGroups: 2.4.1

      We're running a performance test on the cluster above: 100 concurrent virtual users with a 1-second pause between actions. After some time (about 4 hours) into the test, I observed the following errors on the server console.



      2007-02-16 17:12:32,294 ERROR [org.jgroups.protocols.pbcast.NAKACK:error] (requester=-.-.-.205:33509, local_addr=-.-.-.205:33509) message -.-.-.205:33509::1172743 not found in sent msgs.
      Sent messages: [1172262 - 1172861] (600)
      2007-02-16 17:44:39,599 ERROR [org.jgroups.protocols.pbcast.NAKACK:error] (requester=-.-.-.205:33509, local_addr=-.-.-.205:33509) message -.-.-.205:33509::1628057 not found in sent msgs.
      Sent messages: [1627729 - 1628398] (670)
      


      Note: I omitted the servers' subnet for security reasons.

      The errors above occur infrequently (once or twice every 4 hours) on the servers, but I'd like to understand why they happen. Judging from the JGroups 2.4.1 code base, it looks to me like a concurrency issue in NAKACK.
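
      What makes me suspect a race: in both errors the requested seqno falls inside the reported "Sent messages" range, and the number in parentheses matches the size of that range, so on the face of it the message should have been there. Below is a quick check of that arithmetic. It is plain Java, not JGroups code; the class and method names are my own, and I'm assuming the number in parentheses is the entry count of the sent-messages table.

```java
// Sanity check on the numbers from the log above; this is not JGroups code.
public class SeqnoRangeCheck {
    // True if seqno lies within the inclusive range [low, high].
    static boolean inRange(long seqno, long low, long high) {
        return seqno >= low && seqno <= high;
    }

    // Number of seqnos in the inclusive range [low, high].
    static long rangeSize(long low, long high) {
        return high - low + 1;
    }

    public static void main(String[] args) {
        // {requested seqno, low, high, reported count} for both log entries
        long[][] entries = {
            {1172743L, 1172262L, 1172861L, 600L},
            {1628057L, 1627729L, 1628398L, 670L},
        };
        for (long[] e : entries) {
            System.out.println("seqno " + e[0]
                + " inRange=" + inRange(e[0], e[1], e[2])
                + " contiguous=" + (rangeSize(e[1], e[2]) == e[3]));
        }
    }
}
```

      Both entries come out in range with a matching count, which is why a message that is simply too old to still be in the table doesn't seem to explain it.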

      Can someone explain the error?

      Thanks,