0 Replies Latest reply on Nov 28, 2003 6:27 AM by ute

    network overload/connection timeout

    ute

      hi,

      we are using jboss-3.2.2_RC4 and apache 2.0.47 on 2 linux machines (rh9) in the same subnet.

      we set up one cluster so that the two linux machines can see each other (udp mcast_addr=autodiscovery address; see also my previous message). additionally we installed mod_jk2 and set it up so that apache controls load balancing.
      during startup everything looks good: one cluster, the two machines see each other, replication takes place...
      after some time the following messages appear:
      2003-11-28 12:15:08,614 DEBUG [org.javagroups.MyPartition] [Fri Nov 28 12:15:08 EET 2003] [ERROR] FD.Monitor.run(): ping_dest is null
      2003-11-28 12:15:10,295 INFO [org.jboss.ha.framework.interfaces.HAPartition.lifecycle.MyPartition] Suspected member: SERVER1:32822 (additional data: 18 bytes)
      2003-11-28 12:15:10,300 INFO [org.jboss.ha.framework.interfaces.HAPartition.lifecycle.MyPartition] New cluster view (id: 2, delta: -1) : [xxx.xxx.x.101:1399]
      2003-11-28 12:15:10,301 INFO [MyPartition:ReplicantManager] Dead members: 1
      2003-11-28 12:15:10,301 DEBUG [MyPartition:ReplicantManager] trying to remove deadMember xxx.xxx.x.100:1199 for key jboss.j2ee:jndiName=VCCProductManagerEJB,service=EJB
      2003-11-28 12:15:10,301 DEBUG [MyPartition:ReplicantManager] xxx.xxx.x.100:1199 was removed
      2003-11-28 12:15:10,301 DEBUG [MyPartition:ReplicantManager] notifyKeyListeners

      but both instances are "alive"
      shortly afterwards they find each other again:

      2003-11-28 12:15:42,616 INFO [org.jboss.ha.framework.interfaces.HAPartition.MyPartition] New cluster view: 3 ([xxx.xxx.x.100:1199, xxx.xxx.x.101:1399] delta: 1)
      2003-11-28 12:15:42,617 INFO [MyPartition:ReplicantManager] Merging partitions...
      2003-11-28 12:15:42,617 INFO [MyPartition:ReplicantManager] Dead members: 0
      2003-11-28 12:15:42,617 INFO [MyPartition:ReplicantManager] Originating groups: [[SERVER1:32822 (additional data: 18 bytes)|2] [SERVER1:32822 (additional data: 18 bytes)], [SERVER2:32806 (additional data: 18 bytes)|2] [SERVER2:32806 (additional data: 18 bytes)]]
      2003-11-28 12:15:42,623 DEBUG [MyPartition:ReplicantManager] Sleeping for 50ms second just in case
      2003-11-28 12:15:42,652 DEBUG [org.javagroups.MyPartition] [Fri Nov 28 12:15:42 EET 2003] [ERROR] GMS.installView(): received view <= current view; discarding it ! (current vid: [SERVER1:32822 (additional data: 18 bytes)|3], new vid: [SERVER1:32822 (additional data: 18 bytes)|3])
      2003-11-28 12:15:42,687 DEBUG [MyPartition:ReplicantManager] Start MergeMembers in DRM
      2003-11-28 12:15:42,716 DEBUG [MyPartition:ReplicantManager] lookupLocalReplicants called (xxx.xxx.x.101:1399). Return: 18
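
      to put a number on how long the cluster stays split, the gap between the "Suspected member" line and the merged "New cluster view: 3" line can be computed from the log timestamps. the following is only a minimal sketch: the class name ViewGap and the method gapMillis are hypothetical names made up for illustration, and it assumes a modern JDK (java.time, Java 8 or later), not the JDK that jboss-3.2.2 itself runs on.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of JBoss or JavaGroups.
class ViewGap {
    // Timestamp layout at the start of each log line quoted above.
    private static final DateTimeFormatter LOG_TS =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    // Returns the number of milliseconds between two log timestamps.
    static long gapMillis(String from, String to) {
        LocalDateTime start = LocalDateTime.parse(from, LOG_TS);
        LocalDateTime end = LocalDateTime.parse(to, LOG_TS);
        return Duration.between(start, end).toMillis();
    }

    public static void main(String[] args) {
        // Timestamps copied from the "Suspected member" line and the
        // "New cluster view: 3" line in the excerpts above.
        long gap = gapMillis("2003-11-28 12:15:10,295", "2003-11-28 12:15:42,616");
        System.out.println("cluster was split for " + gap + " ms");
    }
}
```

      for the two excerpts above this gives a split of roughly 32 seconds, which may be a useful figure to compare against the failure detection and merge timeouts in the partition config.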

      when sniffing the network we could see lots of multicasts, which overload the network and eventually lead to the following error:

      2003-11-28 12:19:04,444 INFO [org.apache.jk.common.ChannelSocket] connection timeout reached

      the network is totally overloaded. after a few seconds the linux machines continue as if nothing had happened. the windows machines in the same network receive a "network unplugged" message.

      we checked several timeouts and configurations.


      does anybody have an idea?
      we already checked several docs and forums. any help is much appreciated!
      many thanks in advance,
      ute