1 of 4 nodes goes down in Domain cluster with jgroups using TCPPING
blue666man May 27, 2014 11:52 AM
Not sure if this should be here or under jgroups; apologies in advance. My setup is:
Two RHEL 6 virtual machines
JBoss EAP 6.1.0 Final setup in domain mode; each host controller, including the domain controller, fires up 2 app server processes
JBoss Teiid 8.6.0 Final
Of the four app server processes: "teiid-prod-server-one" and "teiid-prod-server-two" are on the slave host controller "cdtssoa110p". "teiid-prod-server-three" and "teiid-prod-server-four" are on the domain controller "cdtssoa111p".
"teiid-prod-server-three" always goes down (the process is actually stopped) about 23-26 hours after it is launched. The only ERROR entries we can find in the host-controller.log files are:
1: The slave host controller's log:
11:12:52,873 INFO [org.jboss.as.host.controller] (domain-connection-threads - 20) JBAS010916: Reconnected to master
11:14:01,982 INFO [org.jboss.as.domain.controller.mgmt] (Remoting "cdtssoa110p.rxcorp.com:MANAGEMENT" task-1) JBAS010920: Server [Server:teiid-prod-server-one] registered using connection [Channel ID 131b6cdb (inbound) of Remoting connection 1e6a7c88 to /162.44.29.158:59540]
11:15:31,502 ERROR [org.jboss.remoting.remote.connection] (Remoting "cdtssoa110p.rxcorp.com:MANAGEMENT" read-1) JBREM000200: Remote connection failed: org.xnio.channels.ReadTimeoutException: Read timed out
11:15:53,160 INFO [org.jboss.as.domain.controller.mgmt] (Remoting "cdtssoa110p.rxcorp.com:MANAGEMENT" task-3) JBAS010920: Server [Server:teiid-prod-server-two] registered using connection [Channel ID 29f7afd2 (inbound) of Remoting connection 669c9f59 to /162.44.29.158:4884]
11:26:46,379 ERROR [org.jboss.remoting.remote.connection] (Remoting "cdtssoa110p.rxcorp.com:MANAGEMENT" read-1) JBREM000200: Remote connection failed: org.xnio.channels.ReadTimeoutException: Read timed out
11:26:46,379 ERROR [org.jboss.remoting.remote.connection] (Remoting "cdtssoa110p.rxcorp.com:MANAGEMENT" read-1) JBREM000200: Remote connection failed: org.xnio.channels.ReadTimeoutException: Read timed out
11:26:56,132 ERROR [org.jboss.remoting.remote.connection] (Remoting "cdtssoa110p.rxcorp.com:MANAGEMENT" read-1) JBREM000200: Remote connection failed: org.xnio.channels.ReadTimeoutException: Read timed out
11:26:56,133 WARN [org.jboss.as.host.controller] (Remoting "cdtssoa110p.rxcorp.com:MANAGEMENT" read-1) JBAS010914: Connection to remote host-controller closed. Trying to reconnect.
11:27:25,155 INFO [org.jboss.as.host.controller] (domain-connection-threads - 30) JBAS010916: Reconnected to master
13:06:34,807 ERROR [org.jboss.remoting.remote.connection] (Remoting "cdtssoa110p.rxcorp.com:MANAGEMENT" read-1) JBREM000200: Remote connection failed: java.io.IOException: Connection reset by peer
13:06:56,905 WARN [org.jboss.as.host.controller] (Remoting "cdtssoa110p.rxcorp.com:MANAGEMENT" task-1) JBAS010914: Connection to remote host-controller closed. Trying to reconnect.
13:07:06,374 INFO [org.jboss.as.host.controller] (domain-connection-threads - 43) JBAS010916: Reconnected to master
16:07:10,521 ERROR [org.jboss.remoting.remote.connection] (Remoting "cdtssoa110p.rxcorp.com:MANAGEMENT" read-1) JBREM000200: Remote connection failed: org.xnio.channels.ReadTimeoutException: Read timed out
16:09:38,624 INFO [org.jboss.as.domain.controller.mgmt] (Remoting "cdtssoa110p.rxcorp.com:MANAGEMENT" task-3) JBAS010920: Server [Server:teiid-prod-server-two] registered using connection [Channel ID 600b846a (inbound) of Remoting connection 22d63665 to /162.44.29.158:32246]
17:33:58,971 ERROR [org.jboss.remoting.remote.connection] (Remoting "cdtssoa110p.rxcorp.com:MANAGEMENT" read-1) JBREM000200: Remote connection failed: org.xnio.channels.ReadTimeoutException: Read timed out
17:34:10,317 INFO [org.jboss.as.domain.controller.mgmt] (Remoting "cdtssoa110p.rxcorp.com:MANAGEMENT" task-4) JBAS010920: Server [Server:teiid-prod-server-one] registered using connection [Channel ID 030d45a6 (inbound) of Remoting connection 6f6ed124 to /162.44.29.158:39415]
10:26:56,131 ERROR [org.jboss.remoting.remote.connection] (Remoting "cdtssoa110p.rxcorp.com:MANAGEMENT" read-1) JBREM000200: Remote connection failed: org.xnio.channels.ReadTimeoutException: Read timed out
10:27:19,992 INFO [org.jboss.as.domain.controller.mgmt] (Remoting "cdtssoa110p.rxcorp.com:MANAGEMENT" task-1) JBAS010920: Server [Server:teiid-prod-server-one] registered using connection [Channel ID 642e8c7f (inbound) of Remoting connection 5d42bd74 to /162.44.29.158:60220]
02:23:58,737 ERROR [org.jboss.remoting.remote.connection] (Remoting "cdtssoa110p.rxcorp.com:MANAGEMENT" read-1) JBREM000200: Remote connection failed: java.io.IOException: Connection reset by peer
02:23:58,738 WARN [org.jboss.as.host.controller] (Remoting "cdtssoa110p.rxcorp.com:MANAGEMENT" read-1) JBAS010914: Connection to remote host-controller closed. Trying to reconnect.
02:27:06,273 ERROR [org.jboss.remoting.remote.connection] (Remoting "cdtssoa110p.rxcorp.com:MANAGEMENT" read-1) JBREM000200: Remote connection failed: java.io.IOException: Connection reset by peer
02:27:13,459 ERROR [org.jboss.remoting.remote.connection] (Remoting "cdtssoa110p.rxcorp.com:MANAGEMENT" read-1) JBREM000200: Remote connection failed: java.io.IOException: JBREM000202: Abrupt close on Remoting connection 0523eae6 to cdtssoa111p/162.44.29.159:9999
02:27:13,794 ERROR [org.jboss.remoting.remote.connection] (Remoting "cdtssoa110p.rxcorp.com:MANAGEMENT" read-1) JBREM000200: Remote connection failed: java.io.IOException: JBREM000202: Abrupt close on Remoting connection 7912a4ac to cdtssoa111p/162.44.29.159:9999
02:29:34,892 ERROR [org.jboss.remoting.remote.connection] (Remoting "cdtssoa110p.rxcorp.com:MANAGEMENT" read-1) JBREM000200: Remote connection failed: java.io.IOException: Connection reset by peer
02:31:13,232 INFO [org.jboss.as.host.controller] (domain-connection-threads - 45) JBAS010916: Reconnected to master
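To get a feel for how often the management connection drops versus how often it recovers, the slave log can be tallied by JBoss message ID (JBREM000200, JBAS010914, JBAS010916 and so on). The following is only a rough Python sketch, not part of the setup: the function name `tally_message_ids` and the regular expression are my own assumptions, and the sample text is three entries copied from the slave log.

```python
import re
from collections import Counter

# Assumed pattern for a JBoss message ID: "JB", a letter code, six digits
# (for example JBREM000200 or JBAS010914).
MESSAGE_ID = re.compile(r"\b(JB[A-Z]+\d{6})\b")


def tally_message_ids(log_text):
    """Count how often each JBoss message ID appears in the given log text."""
    return Counter(MESSAGE_ID.findall(log_text))


# Three entries copied from the slave host controller's log.
sample = """\
11:26:56,132 ERROR [org.jboss.remoting.remote.connection] (Remoting "cdtssoa110p.rxcorp.com:MANAGEMENT" read-1) JBREM000200: Remote connection failed: org.xnio.channels.ReadTimeoutException: Read timed out
11:26:56,133 WARN [org.jboss.as.host.controller] (Remoting "cdtssoa110p.rxcorp.com:MANAGEMENT" read-1) JBAS010914: Connection to remote host-controller closed. Trying to reconnect.
11:27:25,155 INFO [org.jboss.as.host.controller] (domain-connection-threads - 30) JBAS010916: Reconnected to master
"""

# Print one line per message ID with its count.
for message_id, count in sorted(tally_message_ids(sample).items()):
    print(message_id, count)
```

Run against the full host-controller.log, a tally like this would show whether each "closed, trying to reconnect" warning (JBAS010914) is matched by a "Reconnected to master" (JBAS010916).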
2: The Domain controller's log:
Remote connection failed: org.xnio.channels.ReadTimeoutException: Read timed out
07:33:08,694 ERROR [stderr] (main) java.io.IOException: JBAS012175: Channel closed
[Server:teiid-prod-server-three] 07:33:08,696 ERROR [stderr] (main) at org.jboss.as.server.mgmt.domain.HostControllerConnection.getChannel(HostControllerConnection.java:100)
[Server:teiid-prod-server-three] 07:33:08,697 ERROR [stderr] (main) at org.jboss.as.protocol.mgmt.ManagementChannelHandler.executeRequest(ManagementChannelHandler.java:115)
[Server:teiid-prod-server-three] 07:33:08,697 ERROR [stderr] (main) at org.jboss.as.protocol.mgmt.ManagementChannelHandler.executeRequest(ManagementChannelHandler.java:98)
[Server:teiid-prod-server-three] 07:33:08,704 ERROR [stderr] (main) at org.jboss.as.server.mgmt.domain.HostControllerConnection.reConnect(HostControllerConnection.java:168)
[Server:teiid-prod-server-three] 07:33:08,704 ERROR [stderr] (main) at org.jboss.as.server.mgmt.domain.HostControllerClient.reconnect(HostControllerClient.java:98)
[Server:teiid-prod-server-three] 07:33:08,708 ERROR [stderr] (main) at org.jboss.as.server.DomainServerMain.main(DomainServerMain.java:138)
[Server:teiid-prod-server-three] 07:33:08,708 ERROR [stderr] (main) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[Server:teiid-prod-server-three] 07:33:08,709 ERROR [stderr] (main) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
[Server:teiid-prod-server-three] 07:33:08,710 ERROR [stderr] (main) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[Server:teiid-prod-server-three] 07:33:08,713 ERROR [stderr] (main) at java.lang.reflect.Method.invoke(Method.java:601)
[Server:teiid-prod-server-three] 07:33:08,723 ERROR [stderr] (main) at org.jboss.modules.Module.run(Module.java:270)
[Server:teiid-prod-server-three] 07:33:08,724 ERROR [stderr] (main) at org.jboss.modules.Main.main(Main.java:411)
[Server:teiid-prod-server-three] 07:33:08,803 INFO [org.teiid.RUNTIME] (MSC service thread 1-1) TEIID50010 Translator "postgresql" removed
[Server:teiid-prod-server-three] 07:33:08,801 INFO [org.teiid.RUNTIME] (MSC service thread 1-4) TEIID50010 Translator "teiid" removed
[.... bunch more logs about shutting down server-three ....]
Configuration of the domain controller particular to jgroups:
<subsystem xmlns="urn:jboss:domain:jgroups:1.1" default-stack="tcp">
    <stack name="udp">
        <transport type="UDP" socket-binding="jgroups-udp"/>
        <protocol type="PING"/>
        <protocol type="MERGE3"/>
        <protocol type="FD_SOCK" socket-binding="jgroups-udp-fd"/>
        <protocol type="FD"/>
        <protocol type="VERIFY_SUSPECT"/>
        <protocol type="pbcast.NAKACK"/>
        <protocol type="UNICAST2"/>
        <protocol type="pbcast.STABLE"/>
        <protocol type="pbcast.GMS"/>
        <protocol type="UFC"/>
        <protocol type="MFC"/>
        <protocol type="FRAG2"/>
        <protocol type="RSVP"/>
    </stack>
    <stack name="tcp">
        <transport type="TCP" socket-binding="jgroups-tcp"/>
        <protocol type="TCPPING">
            <property name="initial_hosts">
                cdtssoa111p[40000],cdtssoa110p[40000]
            </property>
            <property name="num_initial_members">2</property>
            <property name="port_range">1</property>
            <property name="timeout">5000</property>
            <property name="break_on_coord_rsp">true</property>
            <property name="level">debug</property>
        </protocol>
        <protocol type="MERGE2"/>
        <protocol type="FD_SOCK" socket-binding="jgroups-tcp-fd"/>
        <protocol type="FD"/>
        <protocol type="VERIFY_SUSPECT"/>
        <protocol type="BARRIER"/>
        <protocol type="pbcast.NAKACK"/>
        <protocol type="UNICAST2"/>
        <protocol type="pbcast.STABLE"/>
        <protocol type="pbcast.GMS"/>
        <protocol type="UFC"/>
        <protocol type="MFC"/>
        <protocol type="FRAG2"/>
        <protocol type="RSVP"/>
    </stack>
</subsystem>
...
<socket-binding-group name="ha-sockets" default-interface="public">
    <socket-binding name="ajp" port="8009"/>
    <socket-binding name="http" port="8080"/>
    <socket-binding name="https" port="8443"/>
    <socket-binding name="jgroups-mping" port="0" multicast-address="${jboss.default.multicast.address:230.0.0.4}" multicast-port="45700"/>
    <socket-binding name="jgroups-tcp" port="40000"/>
    <socket-binding name="jgroups-tcp-fd" port="57600"/>
    <socket-binding name="jgroups-udp" port="55200" multicast-address="${jboss.default.multicast.address:230.0.0.4}" multicast-port="45688"/>
    <socket-binding name="jgroups-udp-fd" port="54200"/>
    <socket-binding name="modcluster" port="0" multicast-address="224.0.1.105" multicast-port="23364"/>
    <socket-binding name="remoting" port="4447"/>
    <socket-binding name="txn-recovery-environment" port="4712"/>
    <socket-binding name="txn-status-manager" port="4713"/>
    <socket-binding name="teiid-jdbc" port="31000"/>
    <socket-binding name="teiid-odbc" port="35432"/>
    <outbound-socket-binding name="mail-smtp">
        <remote-destination host="localhost" port="25"/>
    </outbound-socket-binding>
</socket-binding-group>
What I suspect is that the XNIO read timeouts (ReadTimeoutException) and connection failures between the slave HC and the DC are somehow causing "teiid-prod-server-three" to be shut down, but I'm unclear as to why. Both the slave HC and the DC are started with:
-Djboss.host.server.connection.timeout=90000 and
-Djboss.host.domain.connection.timeout=90000
Is there something in the jgroups configuration that is incorrect?
Thanks,
John