0 Replies Latest reply on May 27, 2014 11:52 AM by blue666man

    1 of 4 nodes goes down in Domain cluster with jgroups using TCPPING

    blue666man

      Not sure if this should be here or under jgroups; apologies in advance.  My setup is:

       

      Two RHEL 6 virtual machines

      JBoss EAP 6.1.0 Final set up in domain mode; each host controller, including the domain controller, starts two app server processes

      JBoss Teiid 8.6.0 Final

       

      Of the four app server processes:  "teiid-prod-server-one" and "teiid-prod-server-two" are on the slave host controller "cdtssoa110p".  "teiid-prod-server-three" and "teiid-prod-server-four" are on the domain controller "cdtssoa111p".

       

      "teiid-prod-server-three" always goes down (the process is actually stopped) about 23-26 hours after it is launched.  The only ERRORs we can find in the host-controller.log files are:

       

      1: The slave host controller's log:

      11:12:52,873 INFO  [org.jboss.as.host.controller] (domain-connection-threads - 20) JBAS010916: Reconnected to master
      11:14:01,982 INFO  [org.jboss.as.domain.controller.mgmt] (Remoting "cdtssoa110p.rxcorp.com:MANAGEMENT" task-1) JBAS010920: Server [Server:teiid-prod-server-one] registered using connection [Channel ID 131b6cdb (inbound) of Remoting connection 1e6a7c88 to /162.44.29.158:59540]
      11:15:31,502 ERROR [org.jboss.remoting.remote.connection] (Remoting "cdtssoa110p.rxcorp.com:MANAGEMENT" read-1) JBREM000200: Remote connection failed: org.xnio.channels.ReadTimeoutException: Read timed out
      11:15:53,160 INFO  [org.jboss.as.domain.controller.mgmt] (Remoting "cdtssoa110p.rxcorp.com:MANAGEMENT" task-3) JBAS010920: Server [Server:teiid-prod-server-two] registered using connection [Channel ID 29f7afd2 (inbound) of Remoting connection 669c9f59 to /162.44.29.158:4884]
      11:26:46,379 ERROR [org.jboss.remoting.remote.connection] (Remoting "cdtssoa110p.rxcorp.com:MANAGEMENT" read-1) JBREM000200: Remote connection failed: org.xnio.channels.ReadTimeoutException: Read timed out
      11:26:46,379 ERROR [org.jboss.remoting.remote.connection] (Remoting "cdtssoa110p.rxcorp.com:MANAGEMENT" read-1) JBREM000200: Remote connection failed: org.xnio.channels.ReadTimeoutException: Read timed out
      11:26:56,132 ERROR [org.jboss.remoting.remote.connection] (Remoting "cdtssoa110p.rxcorp.com:MANAGEMENT" read-1) JBREM000200: Remote connection failed: org.xnio.channels.ReadTimeoutException: Read timed out
      11:26:56,133 WARN  [org.jboss.as.host.controller] (Remoting "cdtssoa110p.rxcorp.com:MANAGEMENT" read-1) JBAS010914: Connection to remote host-controller closed. Trying to reconnect.
      11:27:25,155 INFO  [org.jboss.as.host.controller] (domain-connection-threads - 30) JBAS010916: Reconnected to master
      13:06:34,807 ERROR [org.jboss.remoting.remote.connection] (Remoting "cdtssoa110p.rxcorp.com:MANAGEMENT" read-1) JBREM000200: Remote connection failed: java.io.IOException: Connection reset by peer
      13:06:56,905 WARN  [org.jboss.as.host.controller] (Remoting "cdtssoa110p.rxcorp.com:MANAGEMENT" task-1) JBAS010914: Connection to remote host-controller closed. Trying to reconnect.
      13:07:06,374 INFO  [org.jboss.as.host.controller] (domain-connection-threads - 43) JBAS010916: Reconnected to master
      16:07:10,521 ERROR [org.jboss.remoting.remote.connection] (Remoting "cdtssoa110p.rxcorp.com:MANAGEMENT" read-1) JBREM000200: Remote connection failed: org.xnio.channels.ReadTimeoutException: Read timed out
      16:09:38,624 INFO  [org.jboss.as.domain.controller.mgmt] (Remoting "cdtssoa110p.rxcorp.com:MANAGEMENT" task-3) JBAS010920: Server [Server:teiid-prod-server-two] registered using connection [Channel ID 600b846a (inbound) of Remoting connection 22d63665 to /162.44.29.158:32246]
      17:33:58,971 ERROR [org.jboss.remoting.remote.connection] (Remoting "cdtssoa110p.rxcorp.com:MANAGEMENT" read-1) JBREM000200: Remote connection failed: org.xnio.channels.ReadTimeoutException: Read timed out
      17:34:10,317 INFO  [org.jboss.as.domain.controller.mgmt] (Remoting "cdtssoa110p.rxcorp.com:MANAGEMENT" task-4) JBAS010920: Server [Server:teiid-prod-server-one] registered using connection [Channel ID 030d45a6 (inbound) of Remoting connection 6f6ed124 to /162.44.29.158:39415]
      10:26:56,131 ERROR [org.jboss.remoting.remote.connection] (Remoting "cdtssoa110p.rxcorp.com:MANAGEMENT" read-1) JBREM000200: Remote connection failed: org.xnio.channels.ReadTimeoutException: Read timed out
      10:27:19,992 INFO  [org.jboss.as.domain.controller.mgmt] (Remoting "cdtssoa110p.rxcorp.com:MANAGEMENT" task-1) JBAS010920: Server [Server:teiid-prod-server-one] registered using connection [Channel ID 642e8c7f (inbound) of Remoting connection 5d42bd74 to /162.44.29.158:60220]
      02:23:58,737 ERROR [org.jboss.remoting.remote.connection] (Remoting "cdtssoa110p.rxcorp.com:MANAGEMENT" read-1) JBREM000200: Remote connection failed: java.io.IOException: Connection reset by peer
      02:23:58,738 WARN  [org.jboss.as.host.controller] (Remoting "cdtssoa110p.rxcorp.com:MANAGEMENT" read-1) JBAS010914: Connection to remote host-controller closed. Trying to reconnect.
      02:27:06,273 ERROR [org.jboss.remoting.remote.connection] (Remoting "cdtssoa110p.rxcorp.com:MANAGEMENT" read-1) JBREM000200: Remote connection failed: java.io.IOException: Connection reset by peer
      02:27:13,459 ERROR [org.jboss.remoting.remote.connection] (Remoting "cdtssoa110p.rxcorp.com:MANAGEMENT" read-1) JBREM000200: Remote connection failed: java.io.IOException: JBREM000202: Abrupt close on Remoting connection 0523eae6 to cdtssoa111p/162.44.29.159:9999
      02:27:13,794 ERROR [org.jboss.remoting.remote.connection] (Remoting "cdtssoa110p.rxcorp.com:MANAGEMENT" read-1) JBREM000200: Remote connection failed: java.io.IOException: JBREM000202: Abrupt close on Remoting connection 7912a4ac to cdtssoa111p/162.44.29.159:9999
      02:29:34,892 ERROR [org.jboss.remoting.remote.connection] (Remoting "cdtssoa110p.rxcorp.com:MANAGEMENT" read-1) JBREM000200: Remote connection failed: java.io.IOException: Connection reset by peer
      02:31:13,232 INFO  [org.jboss.as.host.controller] (domain-connection-threads - 45) JBAS010916: Reconnected to master
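
To see at a glance how often each kind of failure shows up in a log like the one above, the lines can be tallied by message code and, for JBREM000200, by the exception class that follows "Remote connection failed:". This is only a rough sketch: the regular expression is an assumption based on the line layout shown here, and the names `LINE_RE` and `summarize` are made up for illustration.

```python
import re
from collections import Counter

# Assumed layout, taken from the excerpt above:
#   HH:MM:SS,mmm LEVEL [logger] (thread) CODE: message
LINE_RE = re.compile(
    r"^\s*(?P<time>\d{2}:\d{2}:\d{2},\d{3})\s+(?P<level>[A-Z]+)\s+"
    r"\[(?P<logger>[^\]]+)\]\s+\((?P<thread>.*?)\)\s+"
    r"(?P<code>[A-Z]+\d+):\s*(?P<msg>.*)$"
)


def summarize(lines):
    """Count log lines per message code and JBREM000200 lines per cause."""
    codes = Counter()
    causes = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            continue  # skip blank lines, stack traces, and anything unrecognized
        codes[m["code"]] += 1
        if m["code"] == "JBREM000200":
            # Message looks like "Remote connection failed: <exception class>: <detail>"
            parts = m["msg"].split(": ")
            if len(parts) >= 2:
                causes[parts[1]] += 1
    return codes, causes


if __name__ == "__main__":
    import sys

    # Usage (hypothetical): python summarize_log.py < host-controller.log
    codes, causes = summarize(sys.stdin)
    for code, count in codes.most_common():
        print(code, count)
    for cause, count in causes.most_common():
        print(cause, count)
```

Splitting the counts by cause separates the read timeouts from the "Connection reset by peer" failures, which may point to different problems.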
      
      

       

      2:  The Domain controller's log:

      Remote connection failed: org.xnio.channels.ReadTimeoutException: Read timed out
      07:33:08,694 ERROR [stderr] (main) java.io.IOException: JBAS012175: Channel closed
      [Server:teiid-prod-server-three] 07:33:08,696 ERROR [stderr] (main)  at org.jboss.as.server.mgmt.domain.HostControllerConnection.getChannel(HostControllerConnection.java:100)
      [Server:teiid-prod-server-three] 07:33:08,697 ERROR [stderr] (main)  at org.jboss.as.protocol.mgmt.ManagementChannelHandler.executeRequest(ManagementChannelHandler.java:115)
      [Server:teiid-prod-server-three] 07:33:08,697 ERROR [stderr] (main)  at org.jboss.as.protocol.mgmt.ManagementChannelHandler.executeRequest(ManagementChannelHandler.java:98)
      [Server:teiid-prod-server-three] 07:33:08,704 ERROR [stderr] (main)  at org.jboss.as.server.mgmt.domain.HostControllerConnection.reConnect(HostControllerConnection.java:168)
      [Server:teiid-prod-server-three] 07:33:08,704 ERROR [stderr] (main)  at org.jboss.as.server.mgmt.domain.HostControllerClient.reconnect(HostControllerClient.java:98)
      [Server:teiid-prod-server-three] 07:33:08,708 ERROR [stderr] (main)  at org.jboss.as.server.DomainServerMain.main(DomainServerMain.java:138)
      [Server:teiid-prod-server-three] 07:33:08,708 ERROR [stderr] (main)  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      [Server:teiid-prod-server-three] 07:33:08,709 ERROR [stderr] (main)  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
      [Server:teiid-prod-server-three] 07:33:08,710 ERROR [stderr] (main)  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      [Server:teiid-prod-server-three] 07:33:08,713 ERROR [stderr] (main)  at java.lang.reflect.Method.invoke(Method.java:601)
      [Server:teiid-prod-server-three] 07:33:08,723 ERROR [stderr] (main)  at org.jboss.modules.Module.run(Module.java:270)
      [Server:teiid-prod-server-three] 07:33:08,724 ERROR [stderr] (main)  at org.jboss.modules.Main.main(Main.java:411)
      [Server:teiid-prod-server-three] 07:33:08,803 INFO  [org.teiid.RUNTIME] (MSC service thread 1-1) TEIID50010 Translator "postgresql" removed
      [Server:teiid-prod-server-three] 07:33:08,801 INFO  [org.teiid.RUNTIME] (MSC service thread 1-4) TEIID50010 Translator "teiid" removed
      [.... bunch more logs about shutting down server-three ....]
      
      

       

      The jgroups-specific configuration on the domain controller:

       

      <subsystem xmlns="urn:jboss:domain:jgroups:1.1" default-stack="tcp">
                      <stack name="udp">
                          <transport type="UDP" socket-binding="jgroups-udp"/>
                          <protocol type="PING"/>
                          <protocol type="MERGE3"/>
                          <protocol type="FD_SOCK" socket-binding="jgroups-udp-fd"/>
                          <protocol type="FD"/>
                          <protocol type="VERIFY_SUSPECT"/>
                          <protocol type="pbcast.NAKACK"/>
                          <protocol type="UNICAST2"/>
                          <protocol type="pbcast.STABLE"/>
                          <protocol type="pbcast.GMS"/>
                          <protocol type="UFC"/>
                          <protocol type="MFC"/>
                          <protocol type="FRAG2"/>
                          <protocol type="RSVP"/>
                      </stack>
                      <stack name="tcp">
                          <transport type="TCP" socket-binding="jgroups-tcp"/>
                          <protocol type="TCPPING">
                              <property name="initial_hosts">
                                  cdtssoa111p[40000],cdtssoa110p[40000]
                              </property>
                              <property name="num_initial_members">2</property>
                              <property name="port_range">1</property>
                              <property name="timeout">5000</property>
                              <property name="break_on_coord_rsp">true</property>
                              <property name="level">debug</property>
                          </protocol>
                          <protocol type="MERGE2"/>
                          <protocol type="FD_SOCK" socket-binding="jgroups-tcp-fd"/>
                          <protocol type="FD"/>
                          <protocol type="VERIFY_SUSPECT"/>
                          <protocol type="BARRIER"/>
                          <protocol type="pbcast.NAKACK"/>
                          <protocol type="UNICAST2"/>
                          <protocol type="pbcast.STABLE"/>
                          <protocol type="pbcast.GMS"/>
                          <protocol type="UFC"/>
                          <protocol type="MFC"/>
                          <protocol type="FRAG2"/>
                          <protocol type="RSVP"/>
                      </stack>
                  </subsystem>
      
      ...
      <socket-binding-group name="ha-sockets" default-interface="public">
                  <socket-binding name="ajp" port="8009"/>
                  <socket-binding name="http" port="8080"/>
                  <socket-binding name="https" port="8443"/>
                  <socket-binding name="jgroups-mping" port="0" multicast-address="${jboss.default.multicast.address:230.0.0.4}" multicast-port="45700"/>
                  <socket-binding name="jgroups-tcp" port="40000"/>
                  <socket-binding name="jgroups-tcp-fd" port="57600"/>
                  <socket-binding name="jgroups-udp" port="55200" multicast-address="${jboss.default.multicast.address:230.0.0.4}" multicast-port="45688"/>
                  <socket-binding name="jgroups-udp-fd" port="54200"/>
                  <socket-binding name="modcluster" port="0" multicast-address="224.0.1.105" multicast-port="23364"/>
                  <socket-binding name="remoting" port="4447"/>
                  <socket-binding name="txn-recovery-environment" port="4712"/>
                  <socket-binding name="txn-status-manager" port="4713"/>
                  <socket-binding name="teiid-jdbc" port="31000"/>
                  <socket-binding name="teiid-odbc" port="35432"/>
                  <outbound-socket-binding name="mail-smtp">
                      <remote-destination host="localhost" port="25"/>
                  </outbound-socket-binding>
              </socket-binding-group>
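
As a sanity check on the TCPPING settings above, it can help to list exactly which host/port pairs discovery would probe. The sketch below assumes, as I understand the JGroups documentation, that each `host[port]` entry in `initial_hosts` is probed on `port` through `port + port_range`. The function name `expand_initial_hosts` is made up for illustration and is not part of JGroups.

```python
import re


def expand_initial_hosts(initial_hosts, port_range):
    """Expand a TCPPING-style initial_hosts string into (host, port) pairs.

    Assumption: every entry host[port] is probed on ports
    port, port + 1, ..., port + port_range.
    """
    targets = []
    for entry in initial_hosts.split(","):
        entry = entry.strip()
        if not entry:
            continue
        m = re.fullmatch(r"([^\[\]\s]+)\[(\d+)\]", entry)
        if m is None:
            raise ValueError("expected host[port], got: %r" % entry)
        host, base_port = m.group(1), int(m.group(2))
        for offset in range(port_range + 1):
            targets.append((host, base_port + offset))
    return targets


if __name__ == "__main__":
    # Values copied from the configuration above.
    for host, port in expand_initial_hosts(
        "cdtssoa111p[40000],cdtssoa110p[40000]", 1
    ):
        print(host, port)
```

With the values above this lists ports 40000 and 40001 on each of the two hosts. Since two servers run on each host, it may be worth comparing that list against the port each server's "jgroups-tcp" binding actually ends up on once any socket-binding port offset (not shown in this post) is applied.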
      

       

      What I suspect is that the XNIO read timeouts (ReadTimeoutException) and connection failures between the slave HC and the DC are somehow causing server three to be shut down, but I'm unclear as to why.  Both the slave HC and the DC are started with:

       

      -Djboss.host.server.connection.timeout=90000 and

      -Djboss.host.domain.connection.timeout=90000

       

      Is there something in the jgroups configuration that is incorrect?

      Thanks,

      John