2 Replies Latest reply on Aug 31, 2007 8:41 AM by Mariusz Krzemien

    How to configure JGroups on hosts with redundant network lin

    Mariusz Krzemien Newbie

      In our production environment, every host has duplicated network links to protect against a single link failure. Does anyone have an example or best practices for configuring JGroups to work properly in such an environment, so that JGroups keeps working despite a single link failure?

      We built a prototype, but it failed; details are below.

      Thank you in advance.
      Kind regards
      Mariusz

      Version: JBossCache 1.4.1 SP3, JGroups 2.4.1

      Environment: a LAN consisting of two hosts, each with two NICs (eth0, eth1). The hosts are connected directly (eth0-to-eth0, eth1-to-eth1) and configured as a single IPv4 subnet. JGroups was intended to communicate on both interfaces and to use multicast (see the configuration below).

      Test description:
      - both links are connected
      - one instance of JBossCache started on each node
      - replication working correctly
      - disconnected link eth1-to-eth1
      - replication working correctly
      - reconnected link eth1-to-eth1, disconnected link eth0-to-eth0
      - replication working correctly
      ! after a short time (around 5 sec), both instances report an exception (see below) and terminate because the exception is not caught

      I don't know whether simply catching the exception is enough. At a high level, it looks as though JGroups/JBossCache has a problem with this configuration.

      Configuration details:
      <UDP mcast_addr="228.8.8.8" mcast_port="45566"
      ip_ttl="64" ip_mcast="true"
      mcast_send_buf_size="150000" mcast_recv_buf_size="80000"
      ucast_send_buf_size="150000" ucast_recv_buf_size="80000"
      loopback="false"
      receive_on_all_interfaces="true"
      send_on_all_interfaces="true"
      receive_interfaces="eth0,eth1"
      send_interfaces="eth0,eth1"/>
      <PING timeout="2000" num_initial_members="3"
      up_thread="false" down_thread="false"/>
      <MERGE2 min_interval="10000" max_interval="20000"/>
      <!-- <FD shun="true" up_thread="true" own_thread="true" />-->
      <FD_SOCK/>
      <VERIFY_SUSPECT timeout="1500" up_thread="false" down_thread="false"/>
      <pbcast.NAKACK gc_lag="50" retransmit_timeout="600,1200,2400,4800"
      max_xmit_size="8192" up_thread="false" down_thread="false"/>
      <UNICAST timeout="600,1200,2400" window_size="100" min_threshold="10"
      down_thread="false"/>
      <pbcast.STABLE desired_avg_gossip="20000"
      up_thread="false" down_thread="false"/>
      <pbcast.GMS join_timeout="5000" join_retry_timeout="2000"
      shun="true" print_local_addr="true"/>
      <FC max_credits="2000000" down_thread="false" up_thread="false"
      min_threshold="0.20"/>
      <FRAG frag_size="8192" down_thread="false" up_thread="true"/>
      <pbcast.STATE_TRANSFER up_thread="true" down_thread="true"/>

      Logs with exception:
      [2007-08-30 15:20:29,796|DEBUG|main; |org.jgroups.blocks.GroupRequest(execute:195)]: call did not execute correctly, request is [GroupRequest:
      req_id=1188480009786
      caller=10.10.0.2:32781
      10.10.0.1:32781: sender=10.10.0.1:32781, retval=null, received=false, suspected=false

      request_msg: [dst: , src: 10.10.0.2:32781 (3 headers), size = 34 bytes]
      rsp_mode: GET_ALL
      done: false
      timeout: 20000
      expected_mbrs: 0
      ]
      [2007-08-30 15:20:29,796|DEBUG|main; |org.jgroups.blocks.RpcDispatcher(callRemoteMethods:193)]: responses: [sender=10.10.0.1:32781, retval=null, received=false, suspected=false]

      [2007-08-30 15:20:29,797|DEBUG|main; |org.jboss.cache.TreeCache(callRemoteMethods:4405)]: (10.10.0.2:32781): responses for method _replicate:
      [sender=10.10.0.1:32781, retval=null, received=false, suspected=false]

      [2007-08-30 15:20:29,798|DEBUG|main; |org.jboss.cache.interceptors.BaseRpcInterceptor(replicateCall:118)]: responses=[org.jboss.cache.ReplicationException: rsp=sender=10.10.0.1:32781, retval=null, received=false, suspected=false]
      [2007-08-30 15:20:29,800|DEBUG|main; |org.jboss.cache.interceptors.BaseRpcInterceptor(checkResponses:79)]: Received Throwable from remote node
      org.jboss.cache.ReplicationException: rsp=sender=10.10.0.1:32781, retval=null, received=false, suspected=false
      at org.jboss.cache.TreeCache.callRemoteMethods(TreeCache.java:4422)
      at org.jboss.cache.TreeCache.callRemoteMethods(TreeCache.java:4344)
      at org.jboss.cache.TreeCache.callRemoteMethods(TreeCache.java:4455)
      at org.jboss.cache.interceptors.BaseRpcInterceptor.replicateCall(BaseRpcInterceptor.java:110)
      at org.jboss.cache.interceptors.BaseRpcInterceptor.replicateCall(BaseRpcInterceptor.java:88)
      at org.jboss.cache.interceptors.ReplicationInterceptor.handleReplicatedMethod(ReplicationInterceptor.java:124)
      at org.jboss.cache.interceptors.ReplicationInterceptor.invoke(ReplicationInterceptor.java:88)
      at org.jboss.cache.interceptors.Interceptor.invoke(Interceptor.java:68)
      at org.jboss.cache.interceptors.TxInterceptor.handleNonTxMethod(TxInterceptor.java:365)
      at org.jboss.cache.interceptors.TxInterceptor.invoke(TxInterceptor.java:160)
      at org.jboss.cache.interceptors.Interceptor.invoke(Interceptor.java:68)
      at org.jboss.cache.interceptors.CacheMgmtInterceptor.invoke(CacheMgmtInterceptor.java:183)
      at org.jboss.cache.TreeCache.invokeMethod(TreeCache.java:5863)
      at org.jboss.cache.TreeCache.remove(TreeCache.java:3929)
      at org.jboss.cache.TreeCache.remove(TreeCache.java:3915)
      at test.jbcache.DistributedTree.remove(DistributedTree.java:41)
      at test.jbcache.DistributedTest.handleSession(DistributedTest.java:46)
      at test.jbcache.DistributedTest.main(DistributedTest.java:78)
      Caused by: org.jboss.cache.lock.TimeoutException: Response timed out: sender=10.10.0.1:32781, retval=null, received=false, suspected=false
      at org.jboss.cache.TreeCache.callRemoteMethods(TreeCache.java:4420)
      ... 17 more

        • 1. Re: How to configure JGroups on hosts with redundant network
          Bela Ban Master

          #1 You should uncomment FD, because FD_SOCK alone does not detect pull-the-plug scenarios.

          #2 Although JGroups can send and receive multicasts on multiple network interfaces, it still has a unicast socket that defines the member's address and is bound to a single interface (e.g. eth0). If you pull eth0, this address becomes unusable on some (but not all) operating systems.

          #3 As we currently don't have logical addresses (I might add them in a future release of JGroups), your best bet is to use IP bonding (Linux) or a similar technology on other operating systems.
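
          A minimal sketch of how #1 and #3 might translate into the stack posted above, assuming the two NICs have been bonded at the operating-system level into a single interface (hypothetically bond0) that carries this host's address, taken here to be 10.10.0.2. The bind_addr value and the FD timeout and max_tries values are illustrative assumptions, not tested settings; the remaining protocols would stay as in the original configuration.

          ```xml
          <!-- Assumption: eth0 and eth1 are bonded by the OS into one interface
               (hypothetically bond0) with address 10.10.0.2 on this host.
               JGroups then binds to that single address. -->
          <UDP mcast_addr="228.8.8.8" mcast_port="45566"
               ip_ttl="64" ip_mcast="true"
               mcast_send_buf_size="150000" mcast_recv_buf_size="80000"
               ucast_send_buf_size="150000" ucast_recv_buf_size="80000"
               loopback="false"
               bind_addr="10.10.0.2"/>
          <PING timeout="2000" num_initial_members="3"
                up_thread="false" down_thread="false"/>
          <MERGE2 min_interval="10000" max_interval="20000"/>
          <!-- FD enabled alongside FD_SOCK so that a silently dead link is
               noticed through missed heartbeats. The timeout and max_tries
               values are illustrative, not tuned. -->
          <FD timeout="2500" max_tries="5" shun="true"
              up_thread="true" down_thread="true"/>
          <FD_SOCK/>
          <VERIFY_SUSPECT timeout="1500" up_thread="false" down_thread="false"/>
          ```

          With a single bonded interface, the receive_on_all_interfaces, send_on_all_interfaces, receive_interfaces, and send_interfaces attributes should no longer be needed, so the sketch omits them. Link failover is then handled below JGroups, and the member's unicast address stays the same when one physical link goes down.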

          • 2. Re: How to configure JGroups on hosts with redundant network
            Mariusz Krzemien Newbie

            Following your suggestion, I uncommented the FD section in the configuration above. Unfortunately, the problem remained, although the output changed slightly.

            Details:
            - both links are connected
            - one instance of JBossCache started on each node
            - replication working correctly
            - disconnected link eth1-to-eth1
            - replication working correctly
            - reconnected link eth1-to-eth1, disconnected link eth0-to-eth0
            - replication working correctly
            - ! when add/remove operations are invoked on the TreeCache, one instance fails with the same exception as above. The other keeps working for a while (no exception), but then it reports again, this time with a new GMS address (same NIC, different port), and from that moment on there is no replication between the two instances

            Could you explain this output?

            Thank you for the quick response.
            Kind regards
            Mariusz