Hello again,
I ran some further tests concerning fault tolerance, and I have a few questions about them.
I simulated a loss of connection between my two nodes by stopping the network interface on node "legolaslg", which was started first.
A client was already connected.
After a few seconds, I saw the following messages on the other node, "gimlig":
09:15:45,997 WARN [FD] ping_dest is null: members=[legolaslg:7800 (additional data: 18 bytes), gimlig:7800 (additional data: 18 bytes)], pingable_mbrs=[gimlig:7800 (addi
tional data: 18 bytes)], local_addr=gimlig:7800 (additional data: 18 bytes)
09:15:46,505 INFO [DefaultPartition] Suspected member: legolaslg:7800 (additional data: 18 bytes)
09:15:46,508 INFO [DefaultPartition] New cluster view for partition DefaultPartition (id: 2, delta: -1) : [172.21.158.20:1099]
09:15:46,508 INFO [DefaultPartition] I am (172.21.158.20:1099) received membershipChanged event:
09:15:46,508 INFO [DefaultPartition] Dead members: 1 ([172.21.158.37:1099])
09:15:46,509 INFO [DefaultPartition] New Members : 0 ([])
09:15:46,509 INFO [DefaultPartition] All Members : 1 ([172.21.158.20:1099])
09:15:47,285 INFO [TomcatDeployer] deploy, ctxPath=/jbossmq-httpil, warUrl=file:/opt/jboss-3.2.7/server/all/deploy-hasingleton/jms/jbossmq-httpil.sar/jbossmq-httpil.war/
09:15:47,744 INFO [A] Bound to JNDI name: queue/A
09:15:47,747 INFO [B] Bound to JNDI name: queue/B
09:15:47,749 INFO [C] Bound to JNDI name: queue/C
09:15:47,752 INFO [D] Bound to JNDI name: queue/D
09:15:47,754 INFO [ex] Bound to JNDI name: queue/ex
09:15:47,783 INFO [testTopic] Bound to JNDI name: topic/testTopic
09:15:47,786 INFO [securedTopic] Bound to JNDI name: topic/securedTopic
09:15:47,788 INFO [testDurableTopic] Bound to JNDI name: topic/testDurableTopic
09:15:47,791 INFO [testQueue] Bound to JNDI name: queue/testQueue
09:15:47,852 INFO [OILServerILService] JBossMQ OIL service available at : /0.0.0.0:8090
09:15:47,919 INFO [UILServerILService] JBossMQ UIL service available at : /0.0.0.0:8093
09:15:47,971 INFO [DLQ] Bound to JNDI name: queue/DLQ
But nothing appeared on the other node!
Fifteen minutes later, I saw the following messages on node "gimlig":
09:31:10,977 INFO [ConnectionTable] exception is java.net.SocketException: No route to host
09:31:13,289 INFO [ConnectionTable] addr=legolaslg:7800, connections are connections (1):
key: gimlig:7800: <gimlig:7800 --> gimlig:38551> (851 secs old)
In the meantime, ten minutes after I had shut down the interface, I saw the following messages on node "legolaslg":
09:27:52,775 WARN [FD] ping_dest is null: members=[legolaslg:7800 (additional data: 18 bytes), gimlig:7800 (additional data: 18 bytes)], pingable_mbrs=[legolaslg:7800 (a
dditional data: 18 bytes)], local_addr=legolaslg:7800 (additional data: 18 bytes)
09:27:55,155 INFO [DefaultPartition] Suspected member: gimlig:7800 (additional data: 18 bytes)
09:27:55,284 WARN [FD] ping_dest is null: members=[legolaslg:7800 (additional data: 18 bytes), gimlig:7800 (additional data: 18 bytes)], pingable_mbrs=[legolaslg:7800 (a
dditional data: 18 bytes)], local_addr=legolaslg:7800 (additional data: 18 bytes)
09:27:57,794 WARN [FD] ping_dest is null: members=[legolaslg:7800 (additional data: 18 bytes), gimlig:7800 (additional data: 18 bytes)], pingable_mbrs=[legolaslg:7800 (a
dditional data: 18 bytes)], local_addr=legolaslg:7800 (additional data: 18 bytes)
09:28:00,304 WARN [FD] ping_dest is null: members=[legolaslg:7800 (additional data: 18 bytes), gimlig:7800 (additional data: 18 bytes)], pingable_mbrs=[legolaslg:7800 (a
dditional data: 18 bytes)], local_addr=legolaslg:7800 (additional data: 18 bytes)
09:28:02,972 WARN [FD] ping_dest is null: members=[legolaslg:7800 (additional data: 18 bytes), gimlig:7800 (additional data: 18 bytes)], pingable_mbrs=[legolaslg:7800 (a
dditional data: 18 bytes)], local_addr=legolaslg:7800 (additional data: 18 bytes)
09:28:05,474 WARN [FD] ping_dest is null: members=[legolaslg:7800 (additional data: 18 bytes), gimlig:7800 (additional data: 18 bytes)], pingable_mbrs=[legolaslg:7800 (a
dditional data: 18 bytes)], local_addr=legolaslg:7800 (additional data: 18 bytes)
09:28:07,985 WARN [FD] ping_dest is null: members=[legolaslg:7800 (additional data: 18 bytes), gimlig:7800 (additional data: 18 bytes)], pingable_mbrs=[legolaslg:7800 (a
dditional data: 18 bytes)], local_addr=legolaslg:7800 (additional data: 18 bytes)
09:28:10,494 WARN [FD] ping_dest is null: members=[legolaslg:7800 (additional data: 18 bytes), gimlig:7800 (additional data: 18 bytes)], pingable_mbrs=[legolaslg:7800 (a
dditional data: 18 bytes)], local_addr=legolaslg:7800 (additional data: 18 bytes)
09:28:13,004 WARN [FD] ping_dest is null: members=[legolaslg:7800 (additional data: 18 bytes), gimlig:7800 (additional data: 18 bytes)], pingable_mbrs=[legolaslg:7800 (a
dditional data: 18 bytes)], local_addr=legolaslg:7800 (additional data: 18 bytes)
09:28:15,514 WARN [FD] ping_dest is null: members=[legolaslg:7800 (additional data: 18 bytes), gimlig:7800 (additional data: 18 bytes)], pingable_mbrs=[legolaslg:7800 (a
dditional data: 18 bytes)], local_addr=legolaslg:7800 (additional data: 18 bytes)
09:28:18,024 WARN [FD] ping_dest is null: members=[legolaslg:7800 (additional data: 18 bytes), gimlig:7800 (additional data: 18 bytes)], pingable_mbrs=[legolaslg:7800 (a
dditional data: 18 bytes)], local_addr=legolaslg:7800 (additional data: 18 bytes)
09:28:20,534 WARN [FD] ping_dest is null: members=[legolaslg:7800 (additional data: 18 bytes), gimlig:7800 (additional data: 18 bytes)], pingable_mbrs=[legolaslg:7800 (a
dditional data: 18 bytes)], local_addr=legolaslg:7800 (additional data: 18 bytes)
09:28:23,044 WARN [FD] ping_dest is null: members=[legolaslg:7800 (additional data: 18 bytes), gimlig:7800 (additional data: 18 bytes)], pingable_mbrs=[legolaslg:7800 (a
dditional data: 18 bytes)], local_addr=legolaslg:7800 (additional data: 18 bytes)
09:28:25,554 WARN [FD] ping_dest is null: members=[legolaslg:7800 (additional data: 18 bytes), gimlig:7800 (additional data: 18 bytes)], pingable_mbrs=[legolaslg:7800 (a
dditional data: 18 bytes)], local_addr=legolaslg:7800 (additional data: 18 bytes)
The last message is then repeated every few seconds.
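For reference, these "ping_dest is null" warnings come from the JGroups FD (failure detection) protocol: with the interface down, pingable_mbrs contains only the local node, so FD has nobody left to ping. How quickly a dead member is suspected is controlled by the FD attributes in the DefaultPartition's JGroups stack in cluster-service.xml. A minimal sketch, with purely illustrative attribute values (not the shipped defaults of any particular release):

<!-- Hypothetical excerpt from the JGroups protocol stack in
     cluster-service.xml. FD pings the next member in the ring;
     after max_tries missed replies (each waited "timeout" ms)
     the member is SUSPECTed. -->
<FD timeout="10000" max_tries="5" shun="true"
    up_thread="true" down_thread="true"/>
<!-- VERIFY_SUSPECT double-checks a suspicion before the member
     is excluded from the view, reducing false positives. -->
<VERIFY_SUSPECT timeout="3000"
    up_thread="true" down_thread="true"/>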
A few minutes later, I tried to use the client, and the following messages appeared in the logs (the database connection also goes through that private network):
09:31:11,844 ERROR [PreparedStatementCache] Failed closing cached statement
java.sql.SQLException: Io exception: Connection timed out
at oracle.jdbc.dbaccess.DBError.throwSqlException(DBError.java:168)
at oracle.jdbc.dbaccess.DBError.throwSqlException(DBError.java:210)
at oracle.jdbc.dbaccess.DBError.throwSqlException(DBError.java:323)
at oracle.jdbc.driver.OracleStatement.close(OracleStatement.java:604)
at oracle.jdbc.driver.OraclePreparedStatement.privateClose(OraclePreparedStatement.java:290)
at oracle.jdbc.driver.OraclePreparedStatement.close(OraclePreparedStatement.java:235)
at org.jboss.resource.adapter.jdbc.CachedPreparedStatement.agedOut(CachedPreparedStatement.java:312)
at org.jboss.resource.adapter.jdbc.PreparedStatementCache.ageOut(PreparedStatementCache.java:42)
at org.jboss.util.LRUCachePolicy.flush(LRUCachePolicy.java:183)
at org.jboss.resource.adapter.jdbc.BaseWrapperManagedConnection.destroy(BaseWrapperManagedConnection.java:268)
at org.jboss.resource.connectionmanager.InternalManagedConnectionPool.doDestroy(InternalManagedConnectionPool.java:539)
at org.jboss.resource.connectionmanager.InternalManagedConnectionPool.removeTimedOut(InternalManagedConnectionPool.java:415)
at org.jboss.resource.connectionmanager.IdleRemover$1.run(IdleRemover.java:70)
at java.lang.Thread.run(Thread.java:595)
09:31:11,847 WARN [JBossManagedConnectionPool] Exception destroying ManagedConnection org.jboss.resource.connectionmanager.TxConnectionManager$TxConnectionEventListener@
19f332b[state=DESTROYED mc=org.jboss.resource.adapter.jdbc.local.LocalManagedConnection@1ee4c69 handles=0 lastUse=1108627196592 permit=false trackByTx=false mcp=org.jboss
.resource.connectionmanager.JBossManagedConnectionPool$OnePool@1e193f2 context=org.jboss.resource.connectionmanager.InternalManagedConnectionPool@1da990d]
org.jboss.resource.JBossResourceException: SQLException; - nested throwable: (java.sql.SQLException: Io exception: Broken pipe)
at org.jboss.resource.adapter.jdbc.BaseWrapperManagedConnection.checkException(BaseWrapperManagedConnection.java:572)
at org.jboss.resource.adapter.jdbc.BaseWrapperManagedConnection.destroy(BaseWrapperManagedConnection.java:276)
at org.jboss.resource.connectionmanager.InternalManagedConnectionPool.doDestroy(InternalManagedConnectionPool.java:539)
at org.jboss.resource.connectionmanager.InternalManagedConnectionPool.removeTimedOut(InternalManagedConnectionPool.java:415)
at org.jboss.resource.connectionmanager.IdleRemover$1.run(IdleRemover.java:70)
at java.lang.Thread.run(Thread.java:595)
Caused by: java.sql.SQLException: Io exception: Broken pipe
at oracle.jdbc.dbaccess.DBError.throwSqlException(DBError.java:168)
at oracle.jdbc.dbaccess.DBError.throwSqlException(DBError.java:210)
at oracle.jdbc.dbaccess.DBError.throwSqlException(DBError.java:323)
at oracle.jdbc.driver.OracleConnection.close(OracleConnection.java:919)
at org.jboss.resource.adapter.jdbc.BaseWrapperManagedConnection.destroy(BaseWrapperManagedConnection.java:272)
... 4 more
09:31:14,244 WARN [FD] ping_dest is null: members=[legolaslg:7800 (additional data: 18 bytes), gimlig:7800 (additional data: 18 bytes)], pingable_mbrs=[legolaslg:7800 (a
dditional data: 18 bytes)], local_addr=legolaslg:7800 (additional data: 18 bytes)
As a result, the client was lost and I had to kill it.
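The "Broken pipe" stack traces suggest the pool only discovered the dead Oracle connections when the idle remover tried to close them. One thing we may try (a sketch only; the element names follow the JBoss 3.2 *-ds.xml DTD, the values are illustrative) is to make the datasource validate connections before handing them out, in the oracle-ds.xml:

<local-tx-datasource>
  <jndi-name>DefaultDS</jndi-name>
  <!-- connection-url, driver-class, user-name, password as before -->
  <!-- Run a cheap query before handing a pooled connection to the
       application, so a dead connection is discarded instead of
       failing later with "Broken pipe". -->
  <check-valid-connection-sql>SELECT 1 FROM DUAL</check-valid-connection-sql>
  <!-- Flag fatal Oracle errors so a broken connection is destroyed
       rather than returned to the pool. -->
  <exception-sorter-class-name>org.jboss.resource.adapter.jdbc.vendor.OracleExceptionSorter</exception-sorter-class-name>
  <!-- Retire idle connections before the network drops them. -->
  <idle-timeout-minutes>5</idle-timeout-minutes>
</local-tx-datasource>

This would not fix the cluster hang itself, but it should at least stop stale connections from being handed to clients.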
The ping messages were still being written to the log every few seconds.
I tried to restart the client, but I couldn't connect: it hung forever waiting for a connection.
On the other node, the network was fine and the database was reachable through a standard SQL connection.
In the end, I had to stop both nodes completely and restart them before any client could establish a connection.
I was able to stop JBoss cleanly on node "gimlig", but I had to "kill -9" the Java process on node "legolaslg".
Here is the end of the log on node "legolaslg":
09:47:28,000 INFO [TxConnectionManager] Unbound connection factory for resource adapter for ConnectionManager 'jboss.jca:service=LocalTxCM,name=DefaultDS from JNDI name
'java:/DefaultDS'
09:47:28,000 INFO [TxConnectionManager] Unbound connection factory for resource adapter for ConnectionManager 'jboss.jca:service=TxCM,name=JmsXA from JNDI name 'java:/Jm
sXA'
09:47:28,001 INFO [TxConnectionManager] Unbound connection factory for resource adapter for ConnectionManager 'jboss.jca:service=LocalTxCM,name=jdbc/LusciDS from JNDI na
me 'java:/jdbc/LusciDS'
09:47:28,018 INFO [TreeCache] stopService(): closing the channel
09:47:28,314 WARN [FD] ping_dest is null: members=[legolaslg:7800 (additional data: 18 bytes), gimlig:7800 (additional data: 18 bytes)], pingable_mbrs=[legolaslg:7800 (a
dditional data: 18 bytes)], local_addr=legolaslg:7800 (additional data: 18 bytes)
09:47:28,329 INFO [TreeCache] stopService(): stopping the dispatcher
09:47:28,346 INFO [MailService] Mail service 'java:/Mail' removed from JNDI
09:47:28,352 INFO [orb] prepare ORB for shutdown...
09:47:28,352 INFO [orb] ORB going down...
09:47:28,353 INFO [orb] ORB shutdown complete
09:47:28,353 INFO [orb] ORB run, exit
09:47:28,364 INFO [TomcatDeployer] undeploy, ctxPath=/jbossmq-httpil, warUrl=file:/opt/jboss-3.2.7/server/all/deploy-hasingleton/jms/jbossmq-httpil.sar/jbossmq-httpil.wa
r/
09:47:28,429 INFO [DefaultPartition] Closing partition DefaultPartition
09:47:35,454 ERROR [TCP] down_handler thread for TCP was interrupted (in order to be terminated), but is is still alive
09:47:35,455 INFO [DefaultPartition] Partition DefaultPartition closed.
/opt/jboss/bin/run.sh: line 181: 13685 Killed "$JAVA" $JAVA_OPTS -Djava.endorsed.dirs="$JBOSS_ENDORSED_DIRS" -classpath "$JBOSS_CLASSPATH" org.jboss.Main
"$@"
I restarted node "gimlig" first, then node "legolaslg", which still had no connection over the private network.
On gimlig, I saw the following entries:
09:53:05,277 INFO [TreeCache] viewAccepted(): new members: [gimli:34044, legolasl:33513]
09:53:05,288 INFO [TreeCache] locking the tree to obtain transient state
09:53:05,292 INFO [TreeCache] returning the transient state (217 bytes)
On legolasl, I saw these entries:
09:52:45,354 INFO [DefaultPartition] Initializing
09:52:45,425 INFO [ConnectionTable] server socket created on legolaslg:7800
09:52:45,426 INFO [STDOUT]
-------------------------------------------------------
GMS: address is legolaslg:7800 (additional data: 18 bytes)
-------------------------------------------------------
09:52:45,438 INFO [ConnectionTable] accepted connection, client_sock=Socket[addr=/172.21.158.37,port=43991,localport=7800]
09:52:45,443 INFO [ConnectionTable] input_cookie is bela
09:52:45,445 INFO [ConnectionTable] connection was created to legolaslg:7800
09:52:45,445 INFO [ConnectionTable] created socket to legolaslg:7800
09:52:45,448 INFO [ConnectionTable] connection was created to legolaslg:7800 (additional data: 18 bytes)
09:52:48,450 INFO [DefaultPartition] Number of cluster members: 1
09:52:48,450 INFO [DefaultPartition] Other members: 0
09:52:48,450 INFO [DefaultPartition] Fetching state (will wait for 60000 milliseconds):
09:52:48,485 INFO [HANamingService] Listening on /0.0.0.0:1100
09:52:48,492 INFO [DetachedHANamingService$AutomaticDiscovery] Listening on /0.0.0.0:1102, group=230.0.0.4, HA-JNDI address=172.16.128.26:1100
When I re-enabled the interface on node legolaslg, the two nodes successfully joined "DefaultPartition". However, it seems that nothing was instantiated on node legolasl, as all classes were processed by the other node.
These tests were made mainly to check whether we could, without much risk, split the cluster across two different rooms of our production site. It seems we have to be careful...
It seems we may have to use a GOSSIP server, which would act as the "policeman" of the cluster.
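As far as I understand it (so this is only a sketch, with example hostnames and ports), the GOSSIP server is a small standalone JGroups process, started with something like:

java -cp jgroups.jar org.jgroups.stack.GossipRouter -port 12001

and the nodes then discover membership through the router instead of probing peers directly, with a TCPGOSSIP element in the protocol stack:

<!-- Hypothetical fragment: TCPGOSSIP asks the external GossipRouter(s)
     for the current membership. "router1"/"router2" and port 12001
     are examples only; one router per room could break the tie. -->
<TCPGOSSIP initial_hosts="router1[12001],router2[12001]"
           timeout="3000" num_initial_members="2"
           up_thread="true" down_thread="true"/>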
What do you think about that?
Is there a way to define a master node that stays alive while the other shuts down when isolated?
Thanks in advance for your help.