Cluster-nodes don't find each other
apatispelikan Aug 3, 2012 8:40 AMHello,
I use JBoss AS7 in domain-mode on two physical nodes. We run an application for several customers and each customer has it's own server-group (with it's own two server-processes - one per physical-node). We startet a cluster for our first customer and it works. Now we wanted to start the next server-group but those server-processes do not form a cluster - each works only on it own.
First of all UDP seems to work. I run several jgroups-test-applications (and they all worked fine). Example:
Node 2: /opt/jdk1.7/bin/java -cp /appl/domaincontroller/jboss-as-7.1.2.Final/modules/org/jgroups/main/jgroups-3.9.Final.jar -Djava.net.preferIPv4Stack=true org.jgroups.tests.McastSenderTest -bind_addr 10.9.0.12 -mcast_addr 228.1.2.23 -port 4711 -ttl 9
Node 1: /opt/jdk1.7/bin/java -c/appl/domaincontroller/jboss-as-7.1.2.Final/modules/org/jgroups/main/jgroups-3.0.9.Final.jar -Djava.net.preferIPv4Stack=true org.jgroups.tests.McastReceiverTest -bind_addr 10.9.0.11 -mcast_addr 228.1.2.23 -port 4711
--> works
Node 2: /opt/jdk1.7/bin/java -cp /appl/domaincontroller/jboss-as-7.1.2.Final/modules/org/jgroups/main/jgroups-3.9.Final.jar -Djava.net.preferIPv4Stack=true org.jgroups.tests.McastReceiverTest -bind_addr 10.9.0.12 -mcast_addr 228.1.2.23 -port 4711
Node 1: /opt/jdk1.7/bin/java -c/appl/domaincontroller/jboss-as-7.1.2.Final/modules/org/jgroups/main/jgroups-3.0.9.Final.jar -Djava.net.preferIPv4Stack=true org.jgroups.tests.McastSenderTest -bind_addr 10.9.0.11 -mcast_addr 228.1.2.23 -port 4711 -ttl 9
--> works
Node 1: /opt/jdk1.7/bin/java -cp /appl/domaincontroller/jboss-as-7.1.2.Final/modules/org/jgroups/main/jgroups-3.0.9.Final.jar -Djava.net.preferIPv4Stack=true org.jgroups.demos.Chat
Node 2: /opt/jdk1.7/bin/java -cp /appl/domaincontroller/jboss-as-7.1.2.Final/modules/org/jgroups/main/jgroups-3.0.9.Final.jar -Djava.net.preferIPv4Stack=true org.jgroups.demos.Chat
--> works
I separate customers cluster-traffic by using different UDP- and TCP-ports. I also separate business-traffic from cluster-traffic by using two physical network-interfaces per node. This is the relevant section of the file domain.xml:
<subsystem xmlns="urn:jboss:domain:jgroups:1.1" default-stack="tcp">
...
<socket-binding-groups>
<socket-binding-group name="full-ha-sockets" default-interface="public">
<socket-binding name="ajp" port="9000"/>
<socket-binding name="http" port="9001"/>
<socket-binding name="https" port="9002"/>
<socket-binding name="jacorb" interface="unsecure" port="9003"/>
<socket-binding name="jacorb-ssl" interface="unsecure" port="9004"/>
<socket-binding name="jgroups-diagnostics" interface="cluster" port="0" multicast-address="228.1.2.24" multicast-port="9005"/>
<socket-binding name="jgroups-mping" interface="cluster" port="0" multicast-address="228.1.2.23" multicast-port="${jboss.clustergroup.port}"/>
<socket-binding name="jgroups-tcp" interface="cluster" port="9007"/>
<socket-binding name="jgroups-tcp-fd" interface="cluster" port="9008"/>
<socket-binding name="messaging" interface="management" port="9012"/>
<socket-binding name="messaging-group" port="0" multicast-address="228.1.2.24" multicast-port="9013"/>
<socket-binding name="messaging-throughput" interface="management" port="9014"/>
<socket-binding name="osgi-http" interface="management" port="9015"/>
<socket-binding name="remoting" interface="management" port="9016"/>
<socket-binding name="txn-recovery-environment" interface="management" port="9017"/>
<socket-binding name="txn-status-manager" interface="management" port="9018"/>
<outbound-socket-binding name="mail-smtp">
<remote-destination host="smtp.apa.at" port="25"/>
</outbound-socket-binding>
</socket-binding-group>
</socket-binding-groups>
(I use tcp because using udp I wasn't not able to bind to my addresses - but tcp is ok because there are only two nodes, but I want to avoid to configure the initial nodes so mping should work).
All servers of a server-group have the same system property
customer A (port-offset=50): <property name="jboss.clustergroup.port" value="9060"/>
customer B (port-offset=100): <property name="jboss.clustergroup.port" value="9160"/>
to simulate port-offsets for multicast-ports.
After starting a server of customer A, "lsof" gives me
java 32112 nobody 354u IPv4 122475705 0t0 UDP 228.1.2.23:9060
java 32112 nobody 357u IPv4 122475706 0t0 TCP 10.9.0.11:9057 (LISTEN)
java 32112 nobody 358u IPv4 122475707 0t0 TCP 10.9.0.11:9058 (LISTEN)
No matter which customer it is - it looks always the same (only different ports). And for the very first server-group clustering works.
Starting any server for the second customer brings
16:06:21,850 FINE [org.jgroups.protocols.FRAG2] (ChannelService lifecycle - 1) received CONFIG event: {bind_addr=/10.9.0.12}
16:06:21,852 FINE [org.jgroups.protocols.MPING] (ChannelService lifecycle - 1) bind_addr=/10.9.0.12 mcast_addr=/228.1.2.23, mcast_port=9260
16:06:21,969 INFO [stdout] (ChannelService lifecycle - 1)
16:06:21,969 INFO [stdout] (ChannelService lifecycle - 1) -------------------------------------------------------------------
16:06:21,970 INFO [stdout] (ChannelService lifecycle - 1) GMS: address=slave:customerB02/hibernate, cluster=hibernate, physical address=10.9.0.12:9257
16:06:21,970 INFO [stdout] (ChannelService lifecycle - 1) -------------------------------------------------------------------
...
16:06:24,980 FINE [org.jgroups.protocols.pbcast.NAKACK] (ChannelService lifecycle - 1)
[setDigest()]
existing digest: []
new digest: slave:customerB02/hibernate: [0 (0)]
resulting digest: slave:customerB02/hibernate: [0 (0)]
16:06:24,980 FINE [org.jgroups.protocols.pbcast.GMS] (ChannelService lifecycle - 1) slave:customerB02/hibernate: view is [slave:customerB02/hibernate|0] [slave:customerB02/hibernate]
16:06:24,981 FINE [org.jgroups.protocols.pbcast.STABLE] (ChannelService lifecycle - 1) resuming message garbage collection
16:06:24,982 FINE [org.jgroups.protocols.FD_SOCK] (ChannelService lifecycle - 1) VIEW_CHANGE received: [slave:customerB02/hibernate]
16:06:24,984 FINE [org.jgroups.protocols.pbcast.STABLE] (ChannelService lifecycle - 1) [ergonomics] setting max_bytes to 400KB (1 members)
16:06:24,984 FINE [org.jgroups.protocols.pbcast.STABLE] (ChannelService lifecycle - 1) resuming message garbage collection
16:06:24,985 FINE [org.jgroups.protocols.pbcast.GMS] (ChannelService lifecycle - 1) created group (first member). My view is [slave:customerB02/hibernate|0], impl is org.jgroups.protocols.pbcast.CoordGmsImpl
...
16:06:26,289 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (MSC service thread 1-19) ISPN000078: Starting JGroups Channel
16:06:26,294 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (MSC service thread 1-19) ISPN000094: Received new cluster view: [slave:customerB02/hibernate|0] [slave:customerB02/hibernate]
16:06:26,295 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (MSC service thread 1-19) ISPN000079: Cache local address is slave:customerB02/hibernate, physical addresses are [10.9.0.12:9257]
16:06:26,299 INFO [org.infinispan.factories.GlobalComponentRegistry] (MSC service thread 1-19) ISPN000128: Infinispan version: Infinispan 'Brahma' 5.1.4.FINAL
After the three-seconds-timeout MPING gives up and asumes that the process is the first in the cluster.
During my experiments I experienced that also the first server-group does not work sometimes: If I shutdown the server on the second node and start it again the cluster doesn't form again. After restarting the server on the first node the cluster is available again. This is the log of the first customer during this
Node 1 sees shutting down server-process on node 2:
16:31:04,688 FINE [org.jgroups.protocols.FD] (Timer-3,<ADDR>) sending are-you-alive msg to slave:customerA02/hibernate (own address=master:customerA01/hibernate)
16:31:08,140 FINE [org.jgroups.protocols.pbcast.GMS] (Incoming-16,null) master:customerA01/hibernate: view is [master:customerA01/hibernate|2] [master:customerA01/hibernate]
16:31:08,141 FINE [org.jgroups.protocols.pbcast.STABLE] (Incoming-16,null) resuming message garbage collection
16:31:08,141 FINE [org.jgroups.protocols.pbcast.NAKACK] (Incoming-16,null) removed slave:customerA02/hibernate from xmit_table (not member anymore)
16:31:08,142 FINE [org.jgroups.protocols.FD_SOCK] (Incoming-16,null) VIEW_CHANGE received: [master:customerA01/hibernate]
16:31:08,142 FINE [org.jgroups.protocols.FD_SOCK] (FD_SOCK pinger,hibernate,master:customerA01/hibernate) peer slave:customerA02/hibernate closed socket gracefully
16:31:08,445 FINE [org.jgroups.protocols.pbcast.STABLE] (Incoming-16,null) [ergonomics] setting max_bytes to 400KB (1 members)
16:31:08,446 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-16,null) ISPN000094: Received new cluster view: [master:customerA01/hibernate|2] [master:customerA01/hibernate]
Starting server-process on node 2 logs this:
16:40:39,511 FINE [org.jgroups.protocols.FRAG2] (ChannelService lifecycle - 1) received CONFIG event: {bind_addr=/10.9.0.12}
16:40:39,513 FINE [org.jgroups.protocols.MPING] (ChannelService lifecycle - 1) bind_addr=/10.9.0.12 mcast_addr=/228.1.2.23, mcast_port=9060
16:40:39,560 INFO [stdout] (ChannelService lifecycle - 1)
16:40:39,561 INFO [stdout] (ChannelService lifecycle - 1) -------------------------------------------------------------------
16:40:39,561 INFO [stdout] (ChannelService lifecycle - 1) GMS: address=slave:customerA02/hibernate, cluster=hibernate, physical address=10.9.0.12:9057
16:40:39,561 INFO [stdout] (ChannelService lifecycle - 1) -------------------------------------------------------------------
...
16:40:42,572 FINE [org.jgroups.protocols.pbcast.NAKACK] (ChannelService lifecycle - 1)
[setDigest()]
existing digest: []
new digest: slave:customerA02/hibernate: [0 (0)]
resulting digest: slave:customerA02/hibernate: [0 (0)]
16:40:42,572 FINE [org.jgroups.protocols.pbcast.GMS] (ChannelService lifecycle - 1) slave:customerA02/hibernate: view is [slave:customerA02/hibernate|0] [slave:customerA02/hibernate]
16:40:42,573 FINE [org.jgroups.protocols.pbcast.STABLE] (ChannelService lifecycle - 1) resuming message garbage collection
16:40:42,574 FINE [org.jgroups.protocols.FD_SOCK] (ChannelService lifecycle - 1) VIEW_CHANGE received: [slave:customerA02/hibernate]
16:40:42,575 FINE [org.jgroups.protocols.pbcast.STABLE] (ChannelService lifecycle - 1) [ergonomics] setting max_bytes to 400KB (1 members)
16:40:42,576 FINE [org.jgroups.protocols.pbcast.STABLE] (ChannelService lifecycle - 1) resuming message garbage collection
16:40:42,577 FINE [org.jgroups.protocols.pbcast.GMS] (ChannelService lifecycle - 1) created group (first member). My view is [slave:customerA02/hibernate|0], impl is org.jgroups.protocols.pbcast.CoordGmsImpl
...
16:40:43,914 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (MSC service thread 1-8) ISPN000078: Starting JGroups Channel
16:40:43,918 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (MSC service thread 1-8) ISPN000094: Received new cluster view: [slave:customerA02/hibernate|0] [slave:customerA02/hibernate]
16:40:43,918 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (MSC service thread 1-8) ISPN000079: Cache local address is slave:customerA02/hibernate, physical addresses are [10.9.0.12:9057]
...
16:41:03,862 WARNING [org.jgroups.protocols.TCP] (ReceiverThread) null: no physical address for d9235711-056b-197b-ce36-0c79c9426483, dropping message
16:42:45,613 WARNING [org.jgroups.protocols.TCP] (ReceiverThread) null: no physical address for d9235711-056b-197b-ce36-0c79c9426483, dropping message
16:43:41,405 WARNING [org.jgroups.protocols.TCP] (ReceiverThread) null: no physical address for d9235711-056b-197b-ce36-0c79c9426483, dropping message
16:44:24,430 WARNING [org.jgroups.protocols.TCP] (ReceiverThread) null: no physical address for d9235711-056b-197b-ce36-0c79c9426483, dropping message
The server thinks it is alone but the server on node 1 which previously was part of the cluster is still running - but not found by the new process.
Now I stop and start the server-process on node 1 and it finds the cluster:
16:45:07,433 FINE [org.jgroups.protocols.FRAG2] (ChannelService lifecycle - 1) received CONFIG event: {bind_addr=/10.9.0.11}
16:45:07,434 FINE [org.jgroups.protocols.MPING] (ChannelService lifecycle - 1) bind_addr=/10.9.0.11 mcast_addr=/228.1.2.23, mcast_port=9060
...
16:45:07,527 INFO [stdout] (ChannelService lifecycle - 1)
16:45:07,527 INFO [stdout] (ChannelService lifecycle - 1) -------------------------------------------------------------------
16:45:07,527 INFO [stdout] (ChannelService lifecycle - 1) GMS: address=master:customerA01/hibernate, cluster=hibernate, physical address=10.9.0.11:9057
16:45:07,527 INFO [stdout] (ChannelService lifecycle - 1) -------------------------------------------------------------------
16:45:07,540 FINE [org.jgroups.protocols.pbcast.GMS] (ChannelService lifecycle - 1) election results: {slave:customerA02/hibernate=1}
16:45:07,541 FINE [org.jgroups.protocols.pbcast.GMS] (ChannelService lifecycle - 1) sending JOIN(master:customerA01/hibernate) to slave:customerA02/hibernate
16:45:07,619 FINE [org.jgroups.protocols.pbcast.NAKACK] (ChannelService lifecycle - 1)
[setDigest()]
existing digest: []
new digest: slave:customerA02/hibernate: [15 (15)], master:customerA01/hibernate: [0 (0)]
resulting digest: master:customerA01/hibernate: [0 (0)], slave:customerA02/hibernate: [15 (15)]
16:45:07,619 FINE [org.jgroups.protocols.pbcast.GMS] (ChannelService lifecycle - 1) master:customerA01/hibernate: view is [slave:customerA02/hibernate|1] [slave:customerA02/hibernate, master:customerA01/hibernate]
16:45:07,620 FINE [org.jgroups.protocols.FD_SOCK] (ChannelService lifecycle - 1) VIEW_CHANGE received: [slave:customerA02/hibernate, master:customerA01/hibernate]
16:45:07,622 FINE [org.jgroups.protocols.pbcast.STABLE] (ChannelService lifecycle - 1) [ergonomics] setting max_bytes to 800KB (2 members)
16:45:07,624 FINE [org.jgroups.protocols.FD_SOCK] (FD_SOCK pinger,hibernate,master:customerA01/hibernate) ping_dest is slave:customerA02/hibernate, pingable_mbrs=[slave:customerA02/hibernate, master:customerA01/hibernate]
...
16:45:10,058 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (MSC service thread 1-19) ISPN000078: Starting JGroups Channel
16:45:10,062 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (MSC service thread 1-19) ISPN000094: Received new cluster view: [slave:customerA02/hibernate|1] [slave:customerA02/hibernate, master:customerA01/hibernate]
16:45:10,062 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (MSC service thread 1-19) ISPN000079: Cache local address is master:customerA01/hibernate, physical addresses are [10.9.0.11:9057]
16:45:10,065 INFO [org.infinispan.factories.GlobalComponentRegistry] (MSC service thread 1-19) ISPN000128: Infinispan version: Infinispan 'Brahma' 5.1.4.FINAL
...
16:45:11,409 INFO [org.jboss.as] (Controller Boot Thread) JBAS015874: JBoss AS 7.1.2.Final "Steropes" started in 9031ms - Started 1021 of 1192 services (155 services are passive or on-demand)
16:45:13,622 FINE [org.jgroups.protocols.FD] (Timer-2,<ADDR>) sending are-you-alive msg to slave:customerA02/hibernate (own address=master:customerA01/hibernate)
This is was is logged on node 2 during starting the server-process on node 1:
16:45:07,598 FINE [org.jgroups.protocols.pbcast.GMS] (ViewHandler,hibernate,slave:customerA02/hibernate) new=[master:customerA01/hibernate], suspected=[], leaving=[], new view: [slave:customerA02/hibernate|1] [slave:customerA02/hibernate, master:customerA01/hibernate]
16:45:07,599 FINE [org.jgroups.protocols.pbcast.STABLE] (ViewHandler,hibernate,slave:customerA02/hibernate) suspending message garbage collection
16:45:07,600 FINE [org.jgroups.protocols.pbcast.STABLE] (ViewHandler,hibernate,slave:customerA02/hibernate) resume task started, max_suspend_time=33000
16:45:07,602 FINE [org.jgroups.protocols.pbcast.NAKACK] (Incoming-1,null)
[setDigest()]
existing digest: slave:customerA02/hibernate: [15 (15)]
new digest: slave:customerA02/hibernate: [14 (14)], master:customerA01/hibernate: [0 (0)]
resulting digest: master:customerA01/hibernate: [0 (0)], slave:customerA02/hibernate: [15 (15)]
16:45:07,603 FINE [org.jgroups.protocols.pbcast.GMS] (Incoming-1,null) slave:customerA02/hibernate: view is [slave:customerA02/hibernate|1] [slave:customerA02/hibernate, master:customerA01/hibernate]
16:45:07,604 FINE [org.jgroups.protocols.FD_SOCK] (Incoming-1,null) VIEW_CHANGE received: [slave:customerA02/hibernate, master:customerA01/hibernate]
16:45:07,607 FINE [org.jgroups.protocols.pbcast.STABLE] (Incoming-1,null) [ergonomics] setting max_bytes to 800KB (2 members)
16:45:07,606 FINE [org.jgroups.protocols.FD_SOCK] (FD_SOCK pinger,hibernate,slave:customerA02/hibernate) ping_dest is master:customerA01/hibernate, pingable_mbrs=[slave:customerA02/hibernate, master:customerA01/hibernate]
16:45:07,607 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-1,null) ISPN000094: Received new cluster view: [slave:customerA02/hibernate|1] [slave:customerA02/hibernate, master:customerA01/hibernate]
16:45:07,625 FINE [org.jgroups.protocols.pbcast.STABLE] (ViewHandler,hibernate,slave:customerA02/hibernate) resuming message garbage collection
16:45:13,607 FINE [org.jgroups.protocols.FD] (Timer-5,<ADDR>) sending are-you-alive msg to master:customerA01/hibernate (own address=slave:customerA02/hibernate)
I do not understand what's the reason for this. I tried to find a solution in the web, but no one seems to expirience the same problem. I don't know if this is a network-issue but I don't think so because the jgroups-tests work. Is there any reason why there could be only one cluster at the time? Or what else could bring the problem I see?
Regards,
Stephan
 
     
     
     
    