6 Replies Latest reply on Nov 13, 2012 9:10 PM by sathish.alwar

    Server stops responding

    sathish.alwar

      Hi,

       

      We observe that communication between the two nodes in our cluster fails after the following error messages. Some time later, the server hangs and JMS messages are no longer moved between the cluster nodes.

       

      Could you please explain why the socket closes automatically after some days, and how we can resolve this issue?

       

      JBoss version: jboss-eap-5.0.1

       

      Logs in Server-1

      ------------------------

      09:32:16,404 WARN  [BisocketServerInvoker] org.jboss.remoting.transport.bisocket.BisocketServerInvoker$ControlMonitorTimerTask@158e783: detected failure on control connection Thread[control: Socket[addr=mars02.oss.covad.com/172.31.6.208,port=15239,localport=59817],5,jboss] (4sv63z-80v5k1-h8h6owvc-1-h8h77m6v-ee: requesting new control connection

      09:32:16,404 ERROR [ConnectionTable] failed sending data to 172.31.6.208:7900: java.net.SocketException: Broken pipe

      09:32:16,667 WARN  [GMS] I (172.31.6.143:45049) am not a member of view [172.31.6.208:59663|19] [172.31.6.208:59663], shunning myself and leaving the group (prev_members are [172.31.6.208:41261, 172.31.6.143:45049, 172.31.6.208:59663], current view is [172.31.6.143:45049|18] [172.31.6.143:45049, 172.31.6.208:59663])

      09:32:16,660 WARN  [FD] I was suspected by 172.31.6.208:59663; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK

      09:32:16,614 WARN  [FD] I was suspected by 172.31.6.208:59663; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK

      09:32:16,689 WARN  [FD] I was suspected by 172.31.6.208:59663; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK

      09:32:16,685 WARN  [GMS] I (172.31.6.143:45049) am not a member of view [172.31.6.208:59663|23] [172.31.6.208:59663], shunning myself and leaving the group (prev_members are [172.31.6.208:41261, 172.31.6.143:45049, 172.31.6.208:59663], current view is [172.31.6.143:45049|22] [172.31.6.143:45049, 172.31.6.208:59663])

      09:32:16,730 ERROR [ConnectionTable] failed sending data to 172.31.6.208:7900: java.net.SocketException: Socket closed

      09:32:16,745 ERROR [ConnectionTable] failed sending data to 172.31.6.208:7900: java.net.SocketException: Socket closed

      09:32:16,747 WARN  [GMS] I (172.31.6.143:45049) am not a member of view [172.31.6.208:59663|22] [172.31.6.208:59663], shunning myself and leaving the group (prev_members are [172.31.6.208:41261, 172.31.6.143:45049, 172.31.6.208:59663], current view is [172.31.6.143:45049|21] [172.31.6.143:45049, 172.31.6.208:59663])

      09:32:17,584 ERROR [UNICAST] 172.31.6.143:45049: sender window for 172.31.6.208:59663 not found

      09:32:18,216 ERROR [UNICAST] 172.31.6.143:45049: sender window for 172.31.6.208:59663 not found

             

      09:32:44,537 ERROR [JChannel] failed auto-fetching state

      java.lang.IllegalStateException: Node 172.31.6.143:43877 could not flush the cluster for state retrieval

              at org.jgroups.JChannel.getState(JChannel.java:1106)

              at org.jgroups.JChannel.getState(JChannel.java:1031)

              at org.jgroups.JChannel.getState(JChannel.java:975)

       

      Thanks

      A.SathishKumar

        • 1. Re: Server stops responding
          rhusar

          What do the logs on the second server show? It seems the issue lies there.

          • 2. Re: Server stops responding
            sathish.alwar

            Hi,

             

            Thanks for responding to my query.

             

            Please find the logs from the second server below.

             

            09:32:08,245 INFO  [RPCManagerImpl] Received new cluster view: [172.31.6.208:59663|23] [172.31.6.208:59663]

            09:32:08,245 INFO  [GroupMember] org.jboss.messaging.core.impl.postoffice.GroupMember$ControlMembershipListener@103c660 got new view [172.31.6.208:59663|19] [172.31.6.208:59663], old view is [172.31.6.143:45049|18] [172.31.6.143:45049, 172.31.6.208:59663]

            09:32:08,245 INFO  [GroupMember] I am (172.31.6.208:59663)

            09:32:08,257 INFO  [MessagingPostOffice] JBoss Messaging is failing over for failed node 1. If there are many messages to reload this may take some time...

            09:32:08,477 INFO  [MessagingPostOffice] JBoss Messaging failover completed

            09:32:08,478 INFO  [GroupMember] Dead members: 1 ([172.31.6.143:45049])

            09:32:08,478 INFO  [GroupMember] All Members : 1 ([172.31.6.208:59663])

            09:32:09,192 INFO  [MARS-PARTITION] Suspected member: 172.31.6.143:45049

            09:32:09,327 INFO  [MARS-PARTITION] New cluster view for partition MARS-PARTITION (id: 22, delta: -1) : [172.31.6.208:1099]

            09:32:09,565 INFO  [MARS-PARTITION] I am (172.31.6.208:1099) received membershipChanged event:

            09:32:09,567 INFO  [MARS-PARTITION] Dead members: 1 ([172.31.6.143:1099])

            09:32:09,567 INFO  [MARS-PARTITION] New Members : 0 ([])

            09:32:09,567 INFO  [MARS-PARTITION] All Members : 1 ([172.31.6.208:1099])

            09:32:09,568 INFO  [ProxyFactory] Bound EJB Home 'MarsTask' to jndi 'ejb/MarsTask'

            09:32:09,568 INFO  [ProxyFactory] Bound EJB Home 'MarsLookup' to jndi 'ejb/MarsLookup'

            09:32:09,616 INFO  [ProxyFactory] Bound EJB Home 'ApprovalTransaction' to jndi 'ejb/ApprovalTransaction'

            09:32:09,616 INFO  [ProxyFactory] Bound EJB Home 'MarsFilter' to jndi 'ejb/MarsFilter'

            09:32:09,617 INFO  [ProxyFactory] Bound EJB Home 'MarsCOInformation' to jndi 'ejb/MarsCOInformation'

            09:32:09,618 INFO  [ProxyFactory] Bound EJB Home 'MarsUpdateTask' to jndi 'ejb/MarsUpdateTask'

            09:32:09,619 INFO  [ProxyFactory] Bound EJB Home 'MarsProject' to jndi 'ejb/MarsProject'

            09:32:09,643 INFO  [ProxyFactory] Bound EJB Home 'MarsOnCallTech' to jndi 'ejb/MarsOnCallTech'

            09:32:16,609 WARN  [NAKACK] 172.31.6.208:59663] discarded message from non-member 172.31.6.143:45049, my view is [172.31.6.208:59663|23] [172.31.6.208:59663]

            09:32:16,631 WARN  [NAKACK] 172.31.6.208:59663] discarded message from non-member 172.31.6.143:45049, my view is [172.31.6.208:59663|22] [172.31.6.208:59663]

            09:32:16,631 WARN  [NAKACK] 172.31.6.208:7900] discarded message from non-member 172.31.6.143:7900, my view is [172.31.6.208:7900|18] [172.31.6.208:7900]

            09:32:16,660 WARN  [NAKACK] 172.31.6.208:59663] discarded message from non-member 172.31.6.143:45049, my view is [172.31.6.208:59663|19] [172.31.6.208:59663]

            09:32:16,667 WARN  [NAKACK] 172.31.6.208:59663] discarded message from non-member 172.31.6.143:45049, my view is [172.31.6.208:59663|22] [172.31.6.208:59663]

            09:32:16,690 WARN  [NAKACK] 172.31.6.208:7900] discarded message from non-member 172.31.6.143:7900, my view is [172.31.6.208:7900|18] [172.31.6.208:7900]

            09:32:17,094 WARN  [NAKACK] 172.31.6.208:7900] discarded message from non-member 172.31.6.143:7900, my view is [172.31.6.208:7900|18] [172.31.6.208:7900]

            09:32:18,629 WARN  [NAKACK] 172.31.6.208:7900] discarded message from non-member 172.31.6.143:7900, my view is [172.31.6.208:7900|18] [172.31.6.208:7900]

            09:32:18,642 ERROR [NAKACK] sender 172.31.6.143:7900 not found in xmit_table

            09:32:18,642 ERROR [NAKACK] range is null

            09:32:18,876 INFO  [GroupMember] org.jboss.messaging.core.impl.postoffice.GroupMember$ControlMembershipListener@103c660 got new view [172.31.6.208:59663|20] [172.31.6.208:59663, 172.31.6.143:43877], old view is [172.31.6.208:59663|19] [172.31.6.208:59663]

            09:32:18,884 INFO  [GroupMember] I am (172.31.6.208:59663)

            09:32:18,884 INFO  [GroupMember] New Members : 1 ([172.31.6.143:43877])

            09:32:18,884 INFO  [GroupMember] All Members : 2 ([172.31.6.208:59663, 172.31.6.143:43877])

            09:32:19,026 WARN  [GMS] queue is suspended; request JOIN(172.31.6.143:43877) is discarded

            09:32:19,432 WARN  [GMS] queue is suspended; request JOIN(172.31.6.143:43877) is discarded

            09:32:22,035 WARN  [GMS] queue is suspended; request JOIN(172.31.6.143:43877) is discarded

            09:32:22,438 WARN  [GMS] queue is suspended; request JOIN(172.31.6.143:43877) is discarded

            09:32:34,421 WARN  [GMS] GMS flush by coordinator at 172.31.6.208:59663 failed

            09:32:34,519 INFO  [RPCManagerImpl] Received new cluster view: [172.31.6.208:59663|25] [172.31.6.208:59663, 172.31.6.143:43877]

            09:32:40,455 WARN  [GMS] GMS flush by coordinator at 172.31.6.208:59663 failed

            09:32:40,526 INFO  [MARS-PARTITION] New cluster view for partition MARS-PARTITION (id: 24, delta: 1) : [172.31.6.208:1099, 172.31.6.143:1099]

            09:32:40,530 INFO  [MARS-PARTITION] I am (172.31.6.208:1099) received membershipChanged event:

            09:32:40,530 INFO  [MARS-PARTITION] Dead members: 0 ([])

            09:32:40,530 INFO  [MARS-PARTITION] New Members : 1 ([172.31.6.143:1099])

            09:32:40,530 INFO  [MARS-PARTITION] All Members : 2 ([172.31.6.208:1099, 172.31.6.143:1099])

            11:32:41,066 WARN  [BisocketServerInvoker] org.jboss.remoting.transport.bisocket.BisocketServerInvoker$ControlMonitorTimerTask@f666e7: detected failure on control connection Thread[control: Socket[addr=mars01.oss.covad.com/172.31.6.143,port=63865,localport=53914],5,jboss] (4sv65s-c9i6xo-h8h74d1t-1-h9a9pbkg-ytq: requesting new control connection

            11:32:46,406 WARN  [BisocketServerInvoker] org.jboss.remoting.transport.bisocket.BisocketServerInvoker$ControlMonitorTimerTask@1272a0a: detected failure on control connection Thread[control: Socket[addr=mars01.oss.covad.com/172.31.6.143,port=63865,localport=53604],5,jboss] (4sv65s-c9i6xo-h8h74d1t-1-h9a9mjbk-ytj: requesting new control connection

            11:33:16,457 WARN  [BisocketServerInvoker] org.jboss.remoting.transport.bisocket.BisocketServerInvoker$ControlMonitorTimerTask@1272a0a: detected failure on control connection Thread[control: Socket[addr=mars01.oss.covad.com/172.31.6.143,port=63865,localport=54793],5,] (4sv65s-c9i6xo-h8h74d1t-1-h9a9mjbk-ytj: requesting new control connection

            11:33:21,147 WARN  [BisocketServerInvoker] org.jboss.remoting.transport.bisocket.BisocketServerInvoker$ControlMonitorTimerTask@f666e7: detected failure on control connection Thread[control: Socket[addr=mars01.oss.covad.com/172.31.6.143,port=63865,localport=54792],5,] (4sv65s-c9i6xo-h8h74d1t-1-h9a9pbkg-ytq: requesting new control connection

            11:33:46,495 WARN  [BisocketServerInvoker] org.jboss.remoting.transport.bisocket.BisocketServerInvoker$ControlMonitorTimerTask@1272a0a: detected failure on control connection Thread[control: Socket[addr=mars01.oss.covad.com/172.31.6.143,port=63865,localport=54887],5,jboss] (4sv65s-c9i6xo-h8h74d1t-1-h9a9mjbk-ytj: requesting new control connection

            11:33:51,205 WARN  [BisocketServerInvoker] org.jboss.remoting.transport.bisocket.BisocketServerInvoker$ControlMonitorTimerTask@f666e7: detected failure on control connection Thread[control: Socket[addr=mars01.oss.covad.com/172.31.6.143,port=63865,localport=54914],5,jboss] (4sv65s-c9i6xo-h8h74d1t-1-h9a9pbkg-ytq: requesting new control connection

            11:34:16,545 WARN  [BisocketServerInvoker] org.jboss.remoting.transport.bisocket.BisocketServerInvoker$ControlMonitorTimerTask@1272a0a: detected failure on control connection Thread[control: Socket[addr=mars01.oss.covad.com/172.31.6.143,port=63865,localport=54973],5,jboss] (4sv65s-c9i6xo-h8h74d1t-1-h9a9mjbk-ytj: requesting new control connection

            11:34:21,265 WARN  [BisocketServerInvoker] org.jboss.remoting.transport.bisocket.BisocketServerInvoker$ControlMonitorTimerTask@f666e7: detected failure on control connection Thread[control: Socket[addr=mars01.oss.covad.com/172.31.6.143,port=63865,localport=54989],5,jboss] (4sv65s-c9i6xo-h8h74d1t-1-h9a9pbkg-ytq: requesting new control connection

            11:34:46,608 WARN  [BisocketServerInvoker] org.jboss.remoting.transport.bisocket.BisocketServerInvoker$ControlMonitorTimerTask@1272a0a: detected failure on control connection Thread[control: Socket[addr=mars01.oss.covad.com/172.31.6.143,port=63865,localport=55058],5,jboss] (4sv65s-c9i6xo-h8h74d1t-1-h9a9mjbk-ytj: requesting new control connection

            11:34:51,325 WARN  [BisocketServerInvoker] org.jboss.remoting.transport.bisocket.BisocketServerInvoker$ControlMonitorTimerTask@f666e7: detected failure on control connection Thread[control: Socket[addr=mars01.oss.covad.com/172.31.6.143,port=63865,localport=55069],5,jboss] (4sv65s-c9i6xo-h8h74d1t-1-h9a9pbkg-ytq: requesting new control connection

            11:35:16,654 WARN  [BisocketServerInvoker] org.jboss.remoting.transport.bisocket.BisocketServerInvoker$ControlMonitorTimerTask@1272a0a: detected failure on control connection Thread[control: Socket[addr=mars01.oss.covad.com/172.31.6.143,port=63865,localport=55120],5,jboss] (4sv65s-c9i6xo-h8h74d1t-1-h9a9mjbk-ytj: requesting new control connection

            11:35:21,384 WARN  [BisocketServerInvoker] org.jboss.remoting.transport.bisocket.BisocketServerInvoker$ControlMonitorTimerTask@f666e7: detected failure on control connection Thread[control: Socket[addr=mars01.oss.covad.com/172.31.6.143,port=63865,localport=55127],5,jboss] (4sv65s-c9i6xo-h8h74d1t-1-h9a9pbkg-ytq: requesting new control connection

            11:35:46,704 WARN  [BisocketServerInvoker] org.jboss.remoting.transport.bisocket.BisocketServerInvoker$ControlMonitorTimerTask@1272a0a: detected failure on control connection Thread[control: Socket[addr=mars01.oss.covad.com/172.31.6.143,port=63865,localport=55205],5,jboss] (4sv65s-c9i6xo-h8h74d1t-1-h9a9mjbk-ytj: requesting new control connection

            11:35:51,444 WARN  [BisocketServerInvoker] org.jboss.remoting.transport.bisocket.BisocketServerInvoker$ControlMonitorTimerTask@f666e7: detected failure on control connection Thread[control: Socket[addr=mars01.oss.covad.com/172.31.6.143,port=63865,localport=55215],5,jboss] (4sv65s-c9i6xo-h8h74d1t-1-h9a9pbkg-ytq: requesting new control connection
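
            When comparing the logs from the two servers, it can help to pull out the cluster views (coordinator, view ID, member list) that appear in the GMS, NAKACK, and GroupMember lines. The following is only a minimal sketch in Python, assuming the view format seen in the logs above, `[coordinator|id] [member, member, ...]`. The names `VIEW_RE` and `parse_views` are hypothetical helpers for illustration and are not part of JBoss or JGroups.

```python
import re

# Assumed view format, taken from the log lines in this thread:
#   [coordinator_ip:port|view_id] [member, member, ...]
VIEW_RE = re.compile(r"\[(\d+(?:\.\d+){3}:\d+)\|(\d+)\]\s*\[([^\]]*)\]")


def parse_views(line):
    """Return a list of (coordinator, view_id, members) found in one log line."""
    views = []
    for coord, view_id, members in VIEW_RE.findall(line):
        # Split the member list on commas and drop empty entries.
        member_list = [m.strip() for m in members.split(",") if m.strip()]
        views.append((coord, int(view_id), member_list))
    return views


# Example line copied from the second server's log.
line = ("09:32:08,245 INFO  [RPCManagerImpl] Received new cluster view: "
        "[172.31.6.208:59663|23] [172.31.6.208:59663]")
print(parse_views(line))
```

            Running this over both log files and sorting by timestamp should make it easier to see at which point each node's view stops including the other node.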

            • 3. Re: Server stops responding
              rhusar

              discarded message from non-member

              This is a typical sign that you didn't set the initial_members on both nodes. It is not enough to set it on one. Can you check that you are using the same config on both?

              • 4. Re: Server stops responding
                sathish.alwar

                Hi,

                 

                Thanks for your quick response. I am new to JBoss. Could you please let me know where to configure this attribute?

                 

                Thanks and Regards

                A.SathishKumar

                • 5. Re: Server stops responding
                  rhusar

                  Oh, sorry, I mixed this up with another thread.

                   

                  Your problem is described on the customer portal here: https://access.redhat.com/knowledge/solutions/64231

                   

                  Rado

                  • 6. Re: Server stops responding
                    sathish.alwar

                    Hi,

                     

                    I can see the issue described there, but I don't find the solution. Could you please let me know how to solve this? We are badly affected by this issue.

                     

                    Thanks and Regards

                    A.SathishKumar