Hello again,
I ran some further tests concerning fault tolerance, and I have a few questions about them.
I simulated a loss of connection between my two nodes by stopping the network interface on node "legolaslg", which was started first.
A client was already connected.
After a few seconds, I saw the following messages on the other node, "gimlig":
09:15:45,997 WARN [FD] ping_dest is null: members=[legolaslg:7800 (additional data: 18 bytes), gimlig:7800 (additional data: 18 bytes)], pingable_mbrs=[gimlig:7800 (addi
tional data: 18 bytes)], local_addr=gimlig:7800 (additional data: 18 bytes)
09:15:46,505 INFO [DefaultPartition] Suspected member: legolaslg:7800 (additional data: 18 bytes)
09:15:46,508 INFO [DefaultPartition] New cluster view for partition DefaultPartition (id: 2, delta: -1) : [172.21.158.20:1099]
09:15:46,508 INFO [DefaultPartition] I am (172.21.158.20:1099) received membershipChanged event:
09:15:46,508 INFO [DefaultPartition] Dead members: 1 ([172.21.158.37:1099])
09:15:46,509 INFO [DefaultPartition] New Members : 0 ([])
09:15:46,509 INFO [DefaultPartition] All Members : 1 ([172.21.158.20:1099])
09:15:47,285 INFO [TomcatDeployer] deploy, ctxPath=/jbossmq-httpil, warUrl=file:/opt/jboss-3.2.7/server/all/deploy-hasingleton/jms/jbossmq-httpil.sar/jbossmq-httpil.war/
09:15:47,744 INFO [A] Bound to JNDI name: queue/A
09:15:47,747 INFO [B] Bound to JNDI name: queue/B
09:15:47,749 INFO [C] Bound to JNDI name: queue/C
09:15:47,752 INFO [D] Bound to JNDI name: queue/D
09:15:47,754 INFO [ex] Bound to JNDI name: queue/ex
09:15:47,783 INFO [testTopic] Bound to JNDI name: topic/testTopic
09:15:47,786 INFO [securedTopic] Bound to JNDI name: topic/securedTopic
09:15:47,788 INFO [testDurableTopic] Bound to JNDI name: topic/testDurableTopic
09:15:47,791 INFO [testQueue] Bound to JNDI name: queue/testQueue
09:15:47,852 INFO [OILServerILService] JBossMQ OIL service available at : /0.0.0.0:8090
09:15:47,919 INFO [UILServerILService] JBossMQ UIL service available at : /0.0.0.0:8093
09:15:47,971 INFO [DLQ] Bound to JNDI name: queue/DLQ
But nothing appeared on the other node!
Fifteen minutes later, I saw the following messages on node "gimlig":
09:31:10,977 INFO [ConnectionTable] exception is java.net.SocketException: No route to host
09:31:13,289 INFO [ConnectionTable] addr=legolaslg:7800, connections are connections (1):
key: gimlig:7800: <gimlig:7800 --> gimlig:38551> (851 secs old)
In the meantime, ten minutes after I had shut down the interface, I saw the following messages on node "legolaslg":
09:27:52,775 WARN [FD] ping_dest is null: members=[legolaslg:7800 (additional data: 18 bytes), gimlig:7800 (additional data: 18 bytes)], pingable_mbrs=[legolaslg:7800 (a
dditional data: 18 bytes)], local_addr=legolaslg:7800 (additional data: 18 bytes)
09:27:55,155 INFO [DefaultPartition] Suspected member: gimlig:7800 (additional data: 18 bytes)
09:27:55,284 WARN [FD] ping_dest is null: members=[legolaslg:7800 (additional data: 18 bytes), gimlig:7800 (additional data: 18 bytes)], pingable_mbrs=[legolaslg:7800 (a
dditional data: 18 bytes)], local_addr=legolaslg:7800 (additional data: 18 bytes)
09:27:57,794 WARN [FD] ping_dest is null: members=[legolaslg:7800 (additional data: 18 bytes), gimlig:7800 (additional data: 18 bytes)], pingable_mbrs=[legolaslg:7800 (a
dditional data: 18 bytes)], local_addr=legolaslg:7800 (additional data: 18 bytes)
09:28:00,304 WARN [FD] ping_dest is null: members=[legolaslg:7800 (additional data: 18 bytes), gimlig:7800 (additional data: 18 bytes)], pingable_mbrs=[legolaslg:7800 (a
dditional data: 18 bytes)], local_addr=legolaslg:7800 (additional data: 18 bytes)
09:28:02,972 WARN [FD] ping_dest is null: members=[legolaslg:7800 (additional data: 18 bytes), gimlig:7800 (additional data: 18 bytes)], pingable_mbrs=[legolaslg:7800 (a
dditional data: 18 bytes)], local_addr=legolaslg:7800 (additional data: 18 bytes)
09:28:05,474 WARN [FD] ping_dest is null: members=[legolaslg:7800 (additional data: 18 bytes), gimlig:7800 (additional data: 18 bytes)], pingable_mbrs=[legolaslg:7800 (a
dditional data: 18 bytes)], local_addr=legolaslg:7800 (additional data: 18 bytes)
09:28:07,985 WARN [FD] ping_dest is null: members=[legolaslg:7800 (additional data: 18 bytes), gimlig:7800 (additional data: 18 bytes)], pingable_mbrs=[legolaslg:7800 (a
dditional data: 18 bytes)], local_addr=legolaslg:7800 (additional data: 18 bytes)
09:28:10,494 WARN [FD] ping_dest is null: members=[legolaslg:7800 (additional data: 18 bytes), gimlig:7800 (additional data: 18 bytes)], pingable_mbrs=[legolaslg:7800 (a
dditional data: 18 bytes)], local_addr=legolaslg:7800 (additional data: 18 bytes)
09:28:13,004 WARN [FD] ping_dest is null: members=[legolaslg:7800 (additional data: 18 bytes), gimlig:7800 (additional data: 18 bytes)], pingable_mbrs=[legolaslg:7800 (a
dditional data: 18 bytes)], local_addr=legolaslg:7800 (additional data: 18 bytes)
09:28:15,514 WARN [FD] ping_dest is null: members=[legolaslg:7800 (additional data: 18 bytes), gimlig:7800 (additional data: 18 bytes)], pingable_mbrs=[legolaslg:7800 (a
dditional data: 18 bytes)], local_addr=legolaslg:7800 (additional data: 18 bytes)
09:28:18,024 WARN [FD] ping_dest is null: members=[legolaslg:7800 (additional data: 18 bytes), gimlig:7800 (additional data: 18 bytes)], pingable_mbrs=[legolaslg:7800 (a
dditional data: 18 bytes)], local_addr=legolaslg:7800 (additional data: 18 bytes)
09:28:20,534 WARN [FD] ping_dest is null: members=[legolaslg:7800 (additional data: 18 bytes), gimlig:7800 (additional data: 18 bytes)], pingable_mbrs=[legolaslg:7800 (a
dditional data: 18 bytes)], local_addr=legolaslg:7800 (additional data: 18 bytes)
09:28:23,044 WARN [FD] ping_dest is null: members=[legolaslg:7800 (additional data: 18 bytes), gimlig:7800 (additional data: 18 bytes)], pingable_mbrs=[legolaslg:7800 (a
dditional data: 18 bytes)], local_addr=legolaslg:7800 (additional data: 18 bytes)
09:28:25,554 WARN [FD] ping_dest is null: members=[legolaslg:7800 (additional data: 18 bytes), gimlig:7800 (additional data: 18 bytes)], pingable_mbrs=[legolaslg:7800 (a
dditional data: 18 bytes)], local_addr=legolaslg:7800 (additional data: 18 bytes)
The last message is then repeated every few seconds.
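For reference, these "ping_dest is null" warnings come from the JGroups FD (failure detection) protocol: with the interface down, pingable_mbrs contains only the local node, so FD has nobody left to ping. How quickly a dead member is suspected is controlled by the FD attributes in the DefaultPartition's JGroups stack in cluster-service.xml. A minimal sketch, with purely illustrative attribute values (not the shipped defaults of any particular release):

<!-- Hypothetical excerpt from the JGroups protocol stack in
     cluster-service.xml. FD pings the next member in the ring;
     after max_tries missed replies (each waited "timeout" ms)
     the member is SUSPECTed. -->
<FD timeout="10000" max_tries="5" shun="true"
    up_thread="true" down_thread="true"/>
<!-- VERIFY_SUSPECT double-checks a suspicion before the member
     is excluded from the view, reducing false positives. -->
<VERIFY_SUSPECT timeout="3000"
    up_thread="true" down_thread="true"/>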
A few minutes later, I tried to use the client, and the following messages appeared in the logs (the database connection also goes through that private network):
09:31:11,844 ERROR [PreparedStatementCache] Failed closing cached statement
java.sql.SQLException: Io exception: Connection timed out
at oracle.jdbc.dbaccess.DBError.throwSqlException(DBError.java:168)
at oracle.jdbc.dbaccess.DBError.throwSqlException(DBError.java:210)
at oracle.jdbc.dbaccess.DBError.throwSqlException(DBError.java:323)
at oracle.jdbc.driver.OracleStatement.close(OracleStatement.java:604)
at oracle.jdbc.driver.OraclePreparedStatement.privateClose(OraclePreparedStatement.java:290)
at oracle.jdbc.driver.OraclePreparedStatement.close(OraclePreparedStatement.java:235)
at org.jboss.resource.adapter.jdbc.CachedPreparedStatement.agedOut(CachedPreparedStatement.java:312)
at org.jboss.resource.adapter.jdbc.PreparedStatementCache.ageOut(PreparedStatementCache.java:42)
at org.jboss.util.LRUCachePolicy.flush(LRUCachePolicy.java:183)
at org.jboss.resource.adapter.jdbc.BaseWrapperManagedConnection.destroy(BaseWrapperManagedConnection.java:268)
at org.jboss.resource.connectionmanager.InternalManagedConnectionPool.doDestroy(InternalManagedConnectionPool.java:539)
at org.jboss.resource.connectionmanager.InternalManagedConnectionPool.removeTimedOut(InternalManagedConnectionPool.java:415)
at org.jboss.resource.connectionmanager.IdleRemover$1.run(IdleRemover.java:70)
at java.lang.Thread.run(Thread.java:595)
09:31:11,847 WARN [JBossManagedConnectionPool] Exception destroying ManagedConnection org.jboss.resource.connectionmanager.TxConnectionManager$TxConnectionEventListener@
19f332b[state=DESTROYED mc=org.jboss.resource.adapter.jdbc.local.LocalManagedConnection@1ee4c69 handles=0 lastUse=1108627196592 permit=false trackByTx=false mcp=org.jboss
.resource.connectionmanager.JBossManagedConnectionPool$OnePool@1e193f2 context=org.jboss.resource.connectionmanager.InternalManagedConnectionPool@1da990d]
org.jboss.resource.JBossResourceException: SQLException; - nested throwable: (java.sql.SQLException: Io exception: Broken pipe)
at org.jboss.resource.adapter.jdbc.BaseWrapperManagedConnection.checkException(BaseWrapperManagedConnection.java:572)
at org.jboss.resource.adapter.jdbc.BaseWrapperManagedConnection.destroy(BaseWrapperManagedConnection.java:276)
at org.jboss.resource.connectionmanager.InternalManagedConnectionPool.doDestroy(InternalManagedConnectionPool.java:539)
at org.jboss.resource.connectionmanager.InternalManagedConnectionPool.removeTimedOut(InternalManagedConnectionPool.java:415)
at org.jboss.resource.connectionmanager.IdleRemover$1.run(IdleRemover.java:70)
at java.lang.Thread.run(Thread.java:595)
Caused by: java.sql.SQLException: Io exception: Broken pipe
at oracle.jdbc.dbaccess.DBError.throwSqlException(DBError.java:168)
at oracle.jdbc.dbaccess.DBError.throwSqlException(DBError.java:210)
at oracle.jdbc.dbaccess.DBError.throwSqlException(DBError.java:323)
at oracle.jdbc.driver.OracleConnection.close(OracleConnection.java:919)
at org.jboss.resource.adapter.jdbc.BaseWrapperManagedConnection.destroy(BaseWrapperManagedConnection.java:272)
... 4 more
09:31:14,244 WARN [FD] ping_dest is null: members=[legolaslg:7800 (additional data: 18 bytes), gimlig:7800 (additional data: 18 bytes)], pingable_mbrs=[legolaslg:7800 (a
dditional data: 18 bytes)], local_addr=legolaslg:7800 (additional data: 18 bytes)
As a result, the client was lost and I had to kill it.
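The "Broken pipe" stack traces suggest the pool only discovered the dead Oracle connections when the idle remover tried to close them. One thing we may try (a sketch only; the element names follow the JBoss 3.2 *-ds.xml DTD, the values are illustrative) is to make the datasource validate connections before handing them out, in the oracle-ds.xml:

<local-tx-datasource>
  <jndi-name>DefaultDS</jndi-name>
  <!-- connection-url, driver-class, user-name, password as before -->
  <!-- Run a cheap query before handing a pooled connection to the
       application, so a dead connection is discarded instead of
       failing later with "Broken pipe". -->
  <check-valid-connection-sql>SELECT 1 FROM DUAL</check-valid-connection-sql>
  <!-- Flag fatal Oracle errors so a broken connection is destroyed
       rather than returned to the pool. -->
  <exception-sorter-class-name>org.jboss.resource.adapter.jdbc.vendor.OracleExceptionSorter</exception-sorter-class-name>
  <!-- Retire idle connections before the network drops them. -->
  <idle-timeout-minutes>5</idle-timeout-minutes>
</local-tx-datasource>

This would not fix the cluster hang itself, but it should at least stop stale connections from being handed to clients.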
The ping messages were still being written to the log every few seconds.
I tried to restart the client, but I couldn't connect: it hung forever waiting for a connection.
On the other node, the network was fine and the database was reachable through a standard SQL connection.
In the end, I had to stop both nodes completely and restart them before any client could establish a connection.
I was able to stop JBoss cleanly on node "gimlig", but I had to "kill -9" the Java process on node "legolaslg".
Here is the end of the log on node "legolaslg":
09:47:28,000 INFO [TxConnectionManager] Unbound connection factory for resource adapter for ConnectionManager 'jboss.jca:service=LocalTxCM,name=DefaultDS from JNDI name
'java:/DefaultDS'
09:47:28,000 INFO [TxConnectionManager] Unbound connection factory for resource adapter for ConnectionManager 'jboss.jca:service=TxCM,name=JmsXA from JNDI name 'java:/Jm
sXA'
09:47:28,001 INFO [TxConnectionManager] Unbound connection factory for resource adapter for ConnectionManager 'jboss.jca:service=LocalTxCM,name=jdbc/LusciDS from JNDI na
me 'java:/jdbc/LusciDS'
09:47:28,018 INFO [TreeCache] stopService(): closing the channel
09:47:28,314 WARN [FD] ping_dest is null: members=[legolaslg:7800 (additional data: 18 bytes), gimlig:7800 (additional data: 18 bytes)], pingable_mbrs=[legolaslg:7800 (a
dditional data: 18 bytes)], local_addr=legolaslg:7800 (additional data: 18 bytes)
09:47:28,329 INFO [TreeCache] stopService(): stopping the dispatcher
09:47:28,346 INFO [MailService] Mail service 'java:/Mail' removed from JNDI
09:47:28,352 INFO [orb] prepare ORB for shutdown...
09:47:28,352 INFO [orb] ORB going down...
09:47:28,353 INFO [orb] ORB shutdown complete
09:47:28,353 INFO [orb] ORB run, exit
09:47:28,364 INFO [TomcatDeployer] undeploy, ctxPath=/jbossmq-httpil, warUrl=file:/opt/jboss-3.2.7/server/all/deploy-hasingleton/jms/jbossmq-httpil.sar/jbossmq-httpil.wa
r/
09:47:28,429 INFO [DefaultPartition] Closing partition DefaultPartition
09:47:35,454 ERROR [TCP] down_handler thread for TCP was interrupted (in order to be terminated), but is is still alive
09:47:35,455 INFO [DefaultPartition] Partition DefaultPartition closed.
/opt/jboss/bin/run.sh: line 181: 13685 Killed "$JAVA" $JAVA_OPTS -Djava.endorsed.dirs="$JBOSS_ENDORSED_DIRS" -classpath "$JBOSS_CLASSPATH" org.jboss.Main
"$@"
I restarted node "gimlig" first, then node "legolaslg", which still had no connection over the private network.
On gimlig, I saw the following entries:
09:53:05,277 INFO [TreeCache] viewAccepted(): new members: [gimli:34044, legolasl:33513]
09:53:05,288 INFO [TreeCache] locking the tree to obtain transient state
09:53:05,292 INFO [TreeCache] returning the transient state (217 bytes)
On legolasl, I saw these entries:
09:52:45,354 INFO [DefaultPartition] Initializing
09:52:45,425 INFO [ConnectionTable] server socket created on legolaslg:7800
09:52:45,426 INFO [STDOUT]
-------------------------------------------------------
GMS: address is legolaslg:7800 (additional data: 18 bytes)
-------------------------------------------------------
09:52:45,438 INFO [ConnectionTable] accepted connection, client_sock=Socket[addr=/172.21.158.37,port=43991,localport=7800]
09:52:45,443 INFO [ConnectionTable] input_cookie is bela
09:52:45,445 INFO [ConnectionTable] connection was created to legolaslg:7800
09:52:45,445 INFO [ConnectionTable] created socket to legolaslg:7800
09:52:45,448 INFO [ConnectionTable] connection was created to legolaslg:7800 (additional data: 18 bytes)
09:52:48,450 INFO [DefaultPartition] Number of cluster members: 1
09:52:48,450 INFO [DefaultPartition] Other members: 0
09:52:48,450 INFO [DefaultPartition] Fetching state (will wait for 60000 milliseconds):
09:52:48,485 INFO [HANamingService] Listening on /0.0.0.0:1100
09:52:48,492 INFO [DetachedHANamingService$AutomaticDiscovery] Listening on /0.0.0.0:1102, group=230.0.0.4, HA-JNDI address=172.16.128.26:1100
When I re-enabled the interface on node legolaslg, the two nodes successfully joined "DefaultPartition". However, it seems that nothing was instantiated on node legolasl, as all classes were processed by the other node.
These tests were made mainly to check whether we could, without much risk, split the cluster across two different rooms of our production site. It seems we have to be careful...
It seems we may have to use a GOSSIP server, which would act as the "policeman" of the cluster.
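As far as I understand it (so this is only a sketch, with example hostnames and ports), the GOSSIP server is a small standalone JGroups process, started with something like:

java -cp jgroups.jar org.jgroups.stack.GossipRouter -port 12001

and the nodes then discover membership through the router instead of probing peers directly, with a TCPGOSSIP element in the protocol stack:

<!-- Hypothetical fragment: TCPGOSSIP asks the external GossipRouter(s)
     for the current membership. "router1"/"router2" and port 12001
     are examples only; one router per room could break the tie. -->
<TCPGOSSIP initial_hosts="router1[12001],router2[12001]"
           timeout="3000" num_initial_members="2"
           up_thread="true" down_thread="true"/>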
What do you think about that?
Is there a way to define a master node that stays alive while the other shuts down when isolated?
Thanks in advance for your help.