4 Replies Latest reply on Jun 29, 2007 8:02 AM by mrkmrk

    JGroups: Problem with rebooting peers

    mrkmrk

      Hi

      I'm not really sure if this is the right place to ask. I could not find any specific JGroups forums.

      However, let me know if there is a more suitable place.

      We're using JGroups 2.4.1 sp3 in a JBoss 4.0.2. We're using JGroups to send messages between the various servers. It works fine.
      We have one problem, though. When one of the destination servers reboots, the first message from the sender server fails. The log on the sender server says: "2nd attempt to send data failed too". By adding some logging in BasicConnectionTable, we found that a "Socket Closed" exception occurs. The code in _send(byte[] data, int offset, int length) tries once, catches the IOException, and then retries on the same closed connection, which produces the "2nd attempt" message.

      private void _send(byte[] data, int offset, int length) {
          synchronized(send_mutex) {
              try {
                  doSend(data, offset, length);
                  updateLastAccessed();
              }
              catch(IOException io_ex) {
                  if(log.isWarnEnabled())
                      log.warn("peer closed connection, trying to re-send msg");
                  try {
                      doSend(data, offset, length);
                      updateLastAccessed();
                  }
                  catch(IOException io_ex2) {
                      if(log.isErrorEnabled()) log.error("2nd attempt to send data failed too");
                  }
                  catch(Exception ex2) {
                      if(log.isErrorEnabled()) log.error("exception is " + ex2);
                  }
              }
              catch(InterruptedException iex) {}
              catch(Throwable ex) {
                  if(log.isErrorEnabled()) log.error("exception is " + ex);
              }
          }
      }
      


      We really don't want to lose that message.
      Have I missed something? Is there something I can do?

        • 1. Re: JGroups: Problem with rebooting peers
          belaban

          You're not losing those messages, as it is not the transport but either UNICAST or NAKACK which will resend a message until it has been delivered.
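To make the idea of "resend until delivered" concrete, here is a toy sketch of a reliability layer sitting above an unreliable transport. It is not JGroups code: the names `Transport`, `FlakyTransport`, and `ReliableSender` are hypothetical, and a successful send is treated as an acknowledgement for simplicity, whereas a real protocol would wait for an explicit ack and discard duplicates at the receiver.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RetransmitSketch {

    /** Hypothetical stand-in for a transport such as TCP: a send may fail. */
    interface Transport {
        boolean send(long seqno, String payload); // true if the peer received it
    }

    /** Transport whose peer is "rebooting": the first failCount sends are dropped. */
    static class FlakyTransport implements Transport {
        private int remainingFailures;
        final List<String> delivered = new ArrayList<>();

        FlakyTransport(int failCount) {
            this.remainingFailures = failCount;
        }

        public boolean send(long seqno, String payload) {
            if (remainingFailures > 0) {
                remainingFailures--;   // simulates the "Socket Closed" failure
                return false;
            }
            delivered.add(payload);
            return true;
        }
    }

    /** Reliability layer above the transport: keeps each message until it is acked. */
    static class ReliableSender {
        private final Transport transport;
        private final Map<Long, String> unacked = new LinkedHashMap<>();
        private long nextSeqno = 1;

        ReliableSender(Transport transport) {
            this.transport = transport;
        }

        /** Stores the message first, then attempts the initial transmission. */
        void send(String payload) {
            long seqno = nextSeqno++;
            unacked.put(seqno, payload);
            transmit(seqno);
        }

        /** Would be driven by a timer in a real stack; called by hand here. */
        void retransmitPending() {
            for (Long seqno : new ArrayList<>(unacked.keySet())) {
                transmit(seqno);
            }
        }

        private void transmit(long seqno) {
            // Simplification: a successful send counts as the acknowledgement.
            if (transport.send(seqno, unacked.get(seqno))) {
                unacked.remove(seqno);
            }
        }

        int pending() {
            return unacked.size();
        }
    }

    public static void main(String[] args) {
        FlakyTransport transport = new FlakyTransport(2);
        ReliableSender sender = new ReliableSender(transport);
        sender.send("hello");
        while (sender.pending() > 0) {
            sender.retransmitPending();
        }
        System.out.println("delivered: " + transport.delivered);
    }
}
```

The point of the sketch is that a failed transport-level send does not by itself lose the message, as long as some layer above the transport still holds a copy and retries.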

          Mailing lists for JGroups are at
          http://sourceforge.net/mail/?group_id=6081

          • 2. Re: JGroups: Problem with rebooting peers
            mrkmrk

            But we are using TCP.
            Our configuration:
            TCP(start_port=7810)

            Shouldn't that do what UNICAST would do?

            As I understand it, BasicConnectionTable is in the layer called "BuildingBlocks". UNICAST and NAKACK are in lower layers. How can a lower layer recover from something that has gone wrong in a higher layer?



            Best regards,
            Morten


            • 3. Re: JGroups: Problem with rebooting peers
              belaban

              No, the connection table is in the transport layer (TCP) and as such doesn't have to be concerned about retransmission or failed members.
              So, the failure detection layer (FD) will at some point kick in and remove the rebooted node from the cluster. Until this happens, TCP will happily continue trying to send packets to that node, and that's what you might be seeing.
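For reference, a stack in which the layers mentioned in this thread (FD, NAKACK, UNICAST) sit above TCP might look roughly like the sketch below, written in the plain-string format of the JGroups 2.x line. This is an assumption-laden sketch, not a recommended configuration: the host names, ports, and every timeout value are placeholders, and the exact set of protocols and properties should be checked against the tcp.xml shipped with your JGroups version.

```java
import java.util.ArrayList;
import java.util.List;

public class TcpStackSketch {

    // Protocols are listed bottom (transport) first.
    // hostA/hostB and all numeric values are hypothetical placeholders.
    static final String PROPS =
          "TCP(start_port=7810):"
        + "TCPPING(initial_hosts=hostA[7810],hostB[7810];port_range=1;"
        +         "timeout=3000;num_initial_members=2):"
        + "FD(timeout=2000;max_tries=4):"
        + "VERIFY_SUSPECT(timeout=1500):"
        + "pbcast.NAKACK(gc_lag=100;retransmit_timeout=600,1200,2400,4800):"
        + "UNICAST(timeout=600,1200,2400,4800):"
        + "pbcast.STABLE(desired_avg_gossip=20000):"
        + "pbcast.GMS(join_timeout=5000;join_retry_timeout=2000;"
        +            "shun=false;print_local_addr=true)";

    /** Returns the protocol names in stack order, bottom first. */
    static List<String> protocolNames() {
        List<String> names = new ArrayList<>();
        for (String part : PROPS.split(":")) {
            int paren = part.indexOf('(');
            names.add(paren < 0 ? part : part.substring(0, paren));
        }
        return names;
    }

    public static void main(String[] args) {
        // With JGroups on the classpath, the string would be handed to the
        // channel, e.g. new JChannel(PROPS), before connecting.
        System.out.println(protocolNames());
    }
}
```

With a stack consisting only of `TCP(start_port=7810)`, none of the retransmission or failure-detection layers are present, which would be consistent with the message loss described in this thread.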

              • 4. Re: JGroups: Problem with rebooting peers
                mrkmrk

                Our code is very simple, and as such we have no concept of a cluster.

                We basically have a "Channel" upon which we call "send(Message msg)".
                When the method returns we don't know whether the transmission went through or not. Therefore if the socket is closed, the message is lost.