1 Reply Latest reply on Dec 16, 2010 5:27 AM by bensonfungjava

    Get Socket closed exception, the nodes leaves the group and doesn't re-join anymore

    nsemchenkov
      The JBoss server works fine as a cluster member for a while and then all of a sudden gets "Socket closed" exception. After that its not a member of the cluster anymore.
      Please note, we have set shun=false trying to keep it in the cluster but it didnt help
      In the past we had JBoss 4.2 running on the same hardware and had no issues. So It's probably not a networking problem
      Please find below an excerpt from the cluster.log

      cluster.log:
      2010-12-10 09:16:14,160 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to 10.170.1.24:39588 (own address=10.170.1.24:60948)
      2010-12-10 09:16:14,161 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to 10.170.1.24:39588 (own address=10.170.1.24:60948)
      2010-12-10 09:16:14,161 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to 10.170.1.24:39588 (own address=10.170.1.24:60948)
      2010-12-10 09:16:14,548 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to 10.170.1.24:7901 (own address=10.170.1.24:7900)
      2010-12-10 09:17:06,591 ERROR [org.jgroups.blocks.ConnectionTable] failed sending data to 10.170.1.24:7901: java.net.SocketException: Socket closed
      2010-12-10 09:17:06,628 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to 10.170.1.24:47583 (own address=10.170.1.24:38487)
      2010-12-10 09:17:06,638 DEBUG [org.jgroups.protocols.FD] heartbeat missing from 10.170.1.24:47583 (number=0)
      2010-12-10 09:17:06,640 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to 10.170.1.24:7901 (own address=10.170.1.24:7900)
      2010-12-10 09:17:06,644 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to 10.170.1.24:39588 (own address=10.170.1.24:60948)
      2010-12-10 09:17:06,645 DEBUG [org.jgroups.protocols.FD] heartbeat missing from 10.170.1.24:39588 (number=0)
      2010-12-10 09:17:06,645 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to 10.170.1.24:39588 (own address=10.170.1.24:60948)
      2010-12-10 09:17:06,645 DEBUG [org.jgroups.protocols.FD] heartbeat missing from 10.170.1.24:39588 (number=0)
      2010-12-10 09:17:06,645 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to 10.170.1.24:39588 (own address=10.170.1.24:60948)
      2010-12-10 09:17:06,645 DEBUG [org.jgroups.protocols.FD] heartbeat missing from 10.170.1.24:39588 (number=0)
      2010-12-10 09:17:06,645 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to 10.170.1.24:39588 (own address=10.170.1.24:60948)
      2010-12-10 09:17:06,645 DEBUG [org.jgroups.protocols.FD] heartbeat missing from 10.170.1.24:39588 (number=0)
      2010-12-10 09:17:06,646 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to 10.170.1.24:39588 (own address=10.170.1.24:60948)
      2010-12-10 09:17:06,646 DEBUG [org.jgroups.protocols.FD] heartbeat missing from 10.170.1.24:39588 (number=0)
      2010-12-10 09:17:06,646 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to 10.170.1.24:39588 (own address=10.170.1.24:60948)
      2010-12-10 09:17:06,646 DEBUG [org.jgroups.protocols.FD] heartbeat missing from 10.170.1.24:39588 (number=0)
      2010-12-10 09:17:06,852 ERROR [org.jgroups.blocks.ConnectionTable] failed sending data to 10.170.1.25:7900: java.net.SocketException: Socket closed
      2010-12-10 09:17:07,960 ERROR [org.jgroups.blocks.ConnectionTable] failed sending data to 10.170.1.17:7900: java.net.SocketException: Socket closed
      2010-12-10 09:17:08,113 ERROR [org.jgroups.blocks.ConnectionTable] failed sending data to 10.170.1.16:7900: java.net.SocketException: Socket closed
      2010-12-10 09:17:08,167 WARN  [org.jgroups.protocols.FD] I was suspected by 10.170.1.15:54399; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK
      2010-12-10 09:17:08,176 WARN  [org.jgroups.protocols.FD] I was suspected by 10.170.1.15:54399; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK
      2010-12-10 09:17:08,177 WARN  [org.jgroups.protocols.FD] I was suspected by 10.170.1.15:54399; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK
      2010-12-10 09:17:08,177 WARN  [org.jgroups.protocols.FD] I was suspected by 10.170.1.15:54399; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK
      2010-12-10 09:17:08,213 WARN  [org.jgroups.protocols.FD] I was suspected by 10.170.1.15:33585; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK
      2010-12-10 09:17:08,256 WARN  [org.jgroups.protocols.FD] I was suspected by 10.170.1.15:33585; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK
      2010-12-10 09:17:08,297 ERROR [org.jgroups.blocks.ConnectionTable] failed sending data to 10.170.1.15:7901: java.net.SocketException: Broken pipe
      2010-12-10 09:17:08,309 DEBUG [org.jgroups.protocols.pbcast.FLUSH] Received START_FLUSH at 10.170.1.24:60948 but I am not flush participant, not responding
      2010-12-10 09:17:08,309 DEBUG [org.jgroups.protocols.pbcast.FLUSH] Received START_FLUSH at 10.170.1.24:60948 but I am not flush participant, not responding
      2010-12-10 09:17:08,310 DEBUG [org.jgroups.protocols.pbcast.GMS] view=[10.170.1.25:56405|12] [10.170.1.25:56405, 10.170.1.17:39656, 10.170.1.16:33984, 10.170.1.16:40389, 10.170.1.15:54399, 10.170.1.15:33585, 10.170.1.24:39588]
      2010-12-10 09:17:08,310 DEBUG [org.jgroups.protocols.pbcast.GMS] [local_addr=10.170.1.24:60948] view is [10.170.1.25:56405|12] [10.170.1.25:56405, 10.170.1.17:39656, 10.170.1.16:33984, 10.170.1.16:40389, 10.170.1.15:54399, 10.17
      0.1.15:33585, 10.170.1.24:39588]
      2010-12-10 09:17:08,310 WARN  [org.jgroups.protocols.pbcast.GMS] I (10.170.1.24:60948) am not a member of view [10.170.1.25:56405|12] [10.170.1.25:56405, 10.170.1.17:39656, 10.170.1.16:33984, 10.170.1.16:40389, 10.170.1.15:54399
      , 10.170.1.15:33585, 10.170.1.24:39588]; discarding view
      2010-12-10 09:17:08,311 DEBUG [org.jgroups.protocols.pbcast.FLUSH] At 10.170.1.24:60948 received STOP_FLUSH, unblocking FLUSH.down() and sending UNBLOCK up
      2010-12-10 09:17:08,311 DEBUG [org.jgroups.protocols.pbcast.GMS] view=[10.170.1.25:56405|12] [10.170.1.25:56405, 10.170.1.17:39656, 10.170.1.16:33984, 10.170.1.16:40389, 10.170.1.15:54399, 10.170.1.15:33585, 10.170.1.24:39588]
      2010-12-10 09:17:08,311 DEBUG [org.jgroups.protocols.pbcast.GMS] [local_addr=10.170.1.24:60948] view is [10.170.1.25:56405|12] [10.170.1.25:56405, 10.170.1.17:39656, 10.170.1.16:33984, 10.170.1.16:40389, 10.170.1.15:54399, 10.17
      0.1.15:33585, 10.170.1.24:39588]
      2010-12-10 09:17:08,312 WARN  [org.jgroups.protocols.pbcast.GMS] I (10.170.1.24:60948) am not a member of view [10.170.1.25:56405|12] [10.170.1.25:56405, 10.170.1.17:39656, 10.170.1.16:33984, 10.170.1.16:40389, 10.170.1.15:54399
      , 10.170.1.15:33585, 10.170.1.24:39588]; discarding view
      2010-12-10 09:17:08,312 DEBUG [org.jgroups.protocols.pbcast.FLUSH] At 10.170.1.24:60948 received STOP_FLUSH, unblocking FLUSH.down() and sending UNBLOCK up
      2010-12-10 09:17:08,313 DEBUG [org.jgroups.protocols.pbcast.FLUSH] Received START_FLUSH at 10.170.1.24:60948 but I am not flush participant, not responding
      2010-12-10 09:17:08,313 DEBUG [org.jgroups.protocols.pbcast.FLUSH] Received START_FLUSH at 10.170.1.24:60948 but I am not flush participant, not responding
      2010-12-10 09:17:08,313 DEBUG [org.jgroups.protocols.pbcast.FLUSH] Received START_FLUSH at 10.170.1.24:60948 but I am not flush participant, not responding
      2010-12-10 09:17:08,314 DEBUG [org.jgroups.protocols.pbcast.GMS] view=[10.170.1.25:56405|11] [10.170.1.25:56405, 10.170.1.17:39656, 10.170.1.16:33984, 10.170.1.16:40389, 10.170.1.15:33585, 10.170.1.15:54399, 10.170.1.24:39588]
      2010-12-10 09:17:08,314 DEBUG [org.jgroups.protocols.pbcast.GMS] [local_addr=10.170.1.24:60948] view is [10.170.1.25:56405|11] [10.170.1.25:56405, 10.170.1.17:39656, 10.170.1.16:33984, 10.170.1.16:40389, 10.170.1.15:33585, 10.17
      0.1.15:54399, 10.170.1.24:39588]
      2010-12-10 09:17:08,314 WARN  [org.jgroups.protocols.pbcast.GMS] I (10.170.1.24:60948) am not a member of view [10.170.1.25:56405|11] [10.170.1.25:56405, 10.170.1.17:39656, 10.170.1.16:33984, 10.170.1.16:40389, 10.170.1.15:33585
      , 10.170.1.15:54399, 10.170.1.24:39588]; discarding view
      2010-12-10 09:17:08,315 DEBUG [org.jgroups.protocols.pbcast.GMS] view=[10.170.1.25:56405|12] [10.170.1.25:56405, 10.170.1.17:39656, 10.170.1.16:33984, 10.170.1.16:40389, 10.170.1.15:33585, 10.170.1.15:54399, 10.170.1.24:39588]
      2010-12-10 09:17:08,315 DEBUG [org.jgroups.protocols.pbcast.GMS] view=[10.170.1.25:56405|12] [10.170.1.25:56405, 10.170.1.17:39656, 10.170.1.16:33984, 10.170.1.16:40389, 10.170.1.15:33585, 10.170.1.15:54399, 10.170.1.24:39588]
      2010-12-10 09:17:08,315 DEBUG [org.jgroups.protocols.pbcast.GMS] [local_addr=10.170.1.24:60948] view is [10.170.1.25:56405|12] [10.170.1.25:56405, 10.170.1.17:39656, 10.170.1.16:33984, 10.170.1.16:40389, 10.170.1.15:33585, 10.17
      0.1.15:54399, 10.170.1.24:39588]
      2010-12-10 09:17:08,315 DEBUG [org.jgroups.protocols.pbcast.GMS] [local_addr=10.170.1.24:60948] view is [10.170.1.25:56405|12] [10.170.1.25:56405, 10.170.1.17:39656, 10.170.1.16:33984, 10.170.1.16:40389, 10.170.1.15:33585, 10.17
      0.1.15:54399, 10.170.1.24:39588]
      2010-12-10 09:17:08,315 WARN  [org.jgroups.protocols.pbcast.GMS] I (10.170.1.24:60948) am not a member of view [10.170.1.25:56405|12] [10.170.1.25:56405, 10.170.1.17:39656, 10.170.1.16:33984, 10.170.1.16:40389, 10.170.1.15:33585
      , 10.170.1.15:54399, 10.170.1.24:39588]; discarding view
      2010-12-10 09:17:08,315 WARN  [org.jgroups.protocols.pbcast.GMS] I (10.170.1.24:60948) am not a member of view [10.170.1.25:56405|12] [10.170.1.25:56405, 10.170.1.17:39656, 10.170.1.16:33984, 10.170.1.16:40389, 10.170.1.15:33585
      , 10.170.1.15:54399, 10.170.1.24:39588]; discarding view
      2010-12-10 09:17:08,316 DEBUG [org.jgroups.protocols.pbcast.FLUSH] At 10.170.1.24:60948 received STOP_FLUSH, unblocking FLUSH.down() and sending UNBLOCK up
      2010-12-10 09:17:08,316 DEBUG [org.jgroups.protocols.pbcast.FLUSH] At 10.170.1.24:60948 received STOP_FLUSH, unblocking FLUSH.down() and sending UNBLOCK up
      2010-12-10 09:17:08,317 DEBUG [org.jgroups.protocols.pbcast.FLUSH] At 10.170.1.24:60948 received STOP_FLUSH, unblocking FLUSH.down() and sending UNBLOCK up
      2010-12-10 09:17:08,313 DEBUG [org.jgroups.protocols.pbcast.FLUSH] Received START_FLUSH at 10.170.1.24:60948 but I am not flush participant, not responding
      2010-12-10 09:17:08,378 DEBUG [org.jgroups.protocols.pbcast.GMS] view=[10.170.1.25:56405|12] [10.170.1.25:56405, 10.170.1.17:39656, 10.170.1.16:33984, 10.170.1.16:40389, 10.170.1.15:33585, 10.170.1.15:54399, 10.170.1.24:39588]
      2010-12-10 09:17:08,378 DEBUG [org.jgroups.protocols.pbcast.GMS] [local_addr=10.170.1.24:60948] view is [10.170.1.25:56405|12] [10.170.1.25:56405, 10.170.1.17:39656, 10.170.1.16:33984, 10.170.1.16:40389, 10.170.1.15:33585, 10.17
      0.1.15:54399, 10.170.1.24:39588]
      2010-12-10 09:17:08,379 WARN  [org.jgroups.protocols.pbcast.GMS] I (10.170.1.24:60948) am not a member of view [10.170.1.25:56405|12] [10.170.1.25:56405, 10.170.1.17:39656, 10.170.1.16:33984, 10.170.1.16:40389, 10.170.1.15:33585
      , 10.170.1.15:54399, 10.170.1.24:39588]; discarding view
      2010-12-10 09:17:08,379 DEBUG [org.jgroups.protocols.pbcast.FLUSH] At 10.170.1.24:60948 received STOP_FLUSH, unblocking FLUSH.down() and sending UNBLOCK up
      2010-12-10 09:17:08,678 ERROR [org.jgroups.blocks.ConnectionTable] failed sending data to 10.170.1.15:7900: java.net.SocketException: Socket closed
      2010-12-10 09:17:08,877 ERROR [org.jgroups.blocks.ConnectionTable] failed sending data to 10.170.1.16:7901: java.net.SocketException: Socket closed
      2010-12-10 09:17:12,126 WARN  [org.jgroups.protocols.VIEW_SYNC] discarding view as I (10.170.1.24:60948) am not member of view ([10.170.1.25:56405|12] [10.170.1.25:56405, 10.170.1.17:39656, 10.170.1.16:33984, 10.170.1.16:40389,
      10.170.1.15:33585, 10.170.1.15:54399, 10.170.1.24:39588])
      2010-12-10 09:17:12,648 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to 10.170.1.24:39588 (own address=10.170.1.24:60948)
      
      

       

      We use default JGroups configuration from JBoss 5.1 server.

      Can anybody advise how we can investigate this problem?

        • 1. Re: Get Socket closed exception, the nodes leaves the group and doesn't re-join anymore
          bensonfungjava

          I also have the same issue.  The JBOSS cluster with 4 nodes runs well.  However, somehow it suddenly happened with the following error message :

           

          2010-12-16  00:00:05,814 WARN  [org.jgroups.protocols.pbcast.NAKACK]  (OOB-20,10.49.109.114:32778) 10.49.109.114:32778] discarded message from  non-member 10.49.109.117:32777, my view is  MergeView::[10.49.109.114:32778|108] [10.49.109.114:32778,  10.49.109.115:32779], subgroups=[[10.49.109.115:32779|107]  [10.49.109.115:32779], [10.49.109.114:32778|0] [10.49.109.114:32778]]

           

          2010-12-16  00:05:23,098 WARN  [org.jgroups.protocols.pbcast.GMS]  (ViewHandler,PRODJBOSS1-JMS-DATA,10.49.109.114:7900) 10.49.109.114:7900  failed to collect all ACKs (1) for mcasted view [10.49.109.114:7900|40]  [10.49.109.114:7900, 10.49.109.115:7900, 10.49.109.117:7900] after  2000ms, missing ACKs from [10.49.109.117:7900],  local_addr=10.49.109.114:7900

           

          2010-12-16  00:06:39,137 WARN  [org.jgroups.blocks.ConnectionTable]  (ConnectionTable.AcceptorThread,10.49.109.114:7900) timed out waiting  for peer address, closing connection <10.49.109.114:7900 -->  10.49.109.117:42356> (1 secs old): java.net.SocketTimeoutException:  Read timed out

           

          2010-12-16  00:16:37,147 WARN  [org.jgroups.protocols.FD]  (OOB-13,10.49.109.114:32778) I was suspected by 10.49.109.115:32779;  ignoring the SUSPECT message and sending back a HEARTBEAT_ACK

           

          2010-12-16  00:16:40,214 ERROR [org.jgroups.blocks.ConnectionTable]  (OOB-56,10.49.109.114:7900) failed sending data to 10.49.109.116:7900:  java.net.SocketException: Socket closed

           

          2010-12-16  00:16:51,089 ERROR [org.jgroups.protocols.UNICAST]  (OOB-62,10.49.109.114:7900) 10.49.109.114:7900: sender window for  10.49.109.115:7900 not found

           

           

          Please help