2 Replies Latest reply on Aug 31, 2007 7:39 AM by belaban

    How to configure JGroups on hosts with redundant network lin

    mkrzemien

      In our production environment all hosts have duplicated network links, intended to protect against a single link failure. Does anyone have an example or best practices for configuring JGroups to work properly in such an environment, so that it keeps working despite a single link failure?

      We did some prototyping, but it failed; details are below.

      Thank you in advance.
      Kind regards
      Mariusz

      Version: JBossCache 1.4.1 SP3, JGroups 2.4.1

      Environment: a LAN consisting of two hosts, each with two NICs (eth0, eth1). The hosts are connected directly (eth0-to-eth0, eth1-to-eth1) and configured as a single IPv4 subnet. JGroups was intended to communicate on both interfaces and to use multicast (see the configuration below).

      Test description:
      - both links are connected
      - one JBossCache instance started on each node
      - replication working correctly
      - disconnected link eth1-to-eth1
      - replication working correctly
      - reconnected link eth1-to-eth1, disconnected link eth0-to-eth0
      - replication working correctly
      ! after a short time (around 5 seconds) both instances report an exception (see below) for the call to the other node and terminate because the exception is not caught

      I don't know whether simply catching the exception is enough. Viewed from the application level, JGroups/JBossCache appears to have a problem with this configuration.
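For reference, the "just catch the exception" option could look roughly like the sketch below. It is only an illustration: `ReplicationRetry` and `callWithRetry` are hypothetical names, not part of JBossCache or JGroups, and the lambda syntax assumes Java 8 or later. The wrapped operation would be the cache call that currently throws (in the trace below, the `remove` call made from the test code). Retrying keeps the application alive but does nothing about the underlying link problem.

```java
import java.util.concurrent.Callable;

public class ReplicationRetry {

    /**
     * Runs op and retries it when it throws, up to maxAttempts times in total,
     * sleeping pauseMillis between attempts. Rethrows the last failure if every
     * attempt fails. Hypothetical helper, not a JBossCache API.
     */
    public static <T> T callWithRetry(Callable<T> op, int maxAttempts, long pauseMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                last = e; // remember the failure, e.g. a replication timeout
                if (attempt < maxAttempts) {
                    Thread.sleep(pauseMillis); // give the cluster time to recover
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        // Simulated operation: fails twice, then succeeds.
        final int[] calls = {0};
        String result = callWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("simulated replication timeout");
            }
            return "replicated";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Whether a retry is safe depends on the operation being repeatable; a remove or put that is applied twice is usually harmless, but that is an assumption about the application, not something the cache guarantees.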

      Configuration details:
      <UDP mcast_addr="228.8.8.8" mcast_port="45566"
      ip_ttl="64" ip_mcast="true"
      mcast_send_buf_size="150000" mcast_recv_buf_size="80000"
      ucast_send_buf_size="150000" ucast_recv_buf_size="80000"
      loopback="false"
      receive_on_all_interfaces="true"
      send_on_all_interfaces="true"
      receive_interfaces="eth0,eth1"
      send_interfaces="eth0,eth1"/>
      <PING timeout="2000" num_initial_members="3"
      up_thread="false" down_thread="false"/>
      <MERGE2 min_interval="10000" max_interval="20000"/>
      <!-- <FD shun="true" up_thread="true" own_thread="true" />-->
      <FD_SOCK/>
      <VERIFY_SUSPECT timeout="1500" up_thread="false" down_thread="false"/>
      <pbcast.NAKACK gc_lag="50" retransmit_timeout="600,1200,2400,4800"
      max_xmit_size="8192" up_thread="false" down_thread="false"/>
      <UNICAST timeout="600,1200,2400" window_size="100" min_threshold="10"
      down_thread="false"/>
      <pbcast.STABLE desired_avg_gossip="20000"
      up_thread="false" down_thread="false"/>
      <pbcast.GMS join_timeout="5000" join_retry_timeout="2000"
      shun="true" print_local_addr="true"/>
      <FC max_credits="2000000" down_thread="false" up_thread="false"
      min_threshold="0.20"/>
      <FRAG frag_size="8192" down_thread="false" up_thread="true"/>
      <pbcast.STATE_TRANSFER up_thread="true" down_thread="true"/>

      Logs with exception:
      [2007-08-30 15:20:29,796|DEBUG|main; |org.jgroups.blocks.GroupRequest(execute:195)]: call did not execute correctly, request is [GroupRequest:
      req_id=1188480009786
      caller=10.10.0.2:32781
      10.10.0.1:32781: sender=10.10.0.1:32781, retval=null, received=false, suspected=false

      request_msg: [dst: , src: 10.10.0.2:32781 (3 headers), size = 34 bytes]
      rsp_mode: GET_ALL
      done: false
      timeout: 20000
      expected_mbrs: 0
      ]
      [2007-08-30 15:20:29,796|DEBUG|main; |org.jgroups.blocks.RpcDispatcher(callRemoteMethods:193)]: responses: [sender=10.10.0.1:32781, retval=null, received=false, suspected=false]

      [2007-08-30 15:20:29,797|DEBUG|main; |org.jboss.cache.TreeCache(callRemoteMethods:4405)]: (10.10.0.2:32781): responses for method _replicate:
      [sender=10.10.0.1:32781, retval=null, received=false, suspected=false]

      [2007-08-30 15:20:29,798|DEBUG|main; |org.jboss.cache.interceptors.BaseRpcInterceptor(replicateCall:118)]: responses=[org.jboss.cache.ReplicationException: rsp=sender=10.10.0.1:32781, retval=null, received=false, suspected=false]
      [2007-08-30 15:20:29,800|DEBUG|main; |org.jboss.cache.interceptors.BaseRpcInterceptor(checkResponses:79)]: Received Throwable from remote node
      org.jboss.cache.ReplicationException: rsp=sender=10.10.0.1:32781, retval=null, received=false, suspected=false
      at org.jboss.cache.TreeCache.callRemoteMethods(TreeCache.java:4422)
      at org.jboss.cache.TreeCache.callRemoteMethods(TreeCache.java:4344)
      at org.jboss.cache.TreeCache.callRemoteMethods(TreeCache.java:4455)
      at org.jboss.cache.interceptors.BaseRpcInterceptor.replicateCall(BaseRpcInterceptor.java:110)
      at org.jboss.cache.interceptors.BaseRpcInterceptor.replicateCall(BaseRpcInterceptor.java:88)
      at org.jboss.cache.interceptors.ReplicationInterceptor.handleReplicatedMethod(ReplicationInterceptor.java:124)
      at org.jboss.cache.interceptors.ReplicationInterceptor.invoke(ReplicationInterceptor.java:88)
      at org.jboss.cache.interceptors.Interceptor.invoke(Interceptor.java:68)
      at org.jboss.cache.interceptors.TxInterceptor.handleNonTxMethod(TxInterceptor.java:365)
      at org.jboss.cache.interceptors.TxInterceptor.invoke(TxInterceptor.java:160)
      at org.jboss.cache.interceptors.Interceptor.invoke(Interceptor.java:68)
      at org.jboss.cache.interceptors.CacheMgmtInterceptor.invoke(CacheMgmtInterceptor.java:183)
      at org.jboss.cache.TreeCache.invokeMethod(TreeCache.java:5863)
      at org.jboss.cache.TreeCache.remove(TreeCache.java:3929)
      at org.jboss.cache.TreeCache.remove(TreeCache.java:3915)
      at test.jbcache.DistributedTree.remove(DistributedTree.java:41)
      at test.jbcache.DistributedTest.handleSession(DistributedTest.java:46)
      at test.jbcache.DistributedTest.main(DistributedTest.java:78)
      Caused by: org.jboss.cache.lock.TimeoutException: Response timed out: sender=10.10.0.1:32781, retval=null, received=false, suspected=false
      at org.jboss.cache.TreeCache.callRemoteMethods(TreeCache.java:4420)
      ... 17 more

        • 1. Re: How to configure JGroups on hosts with redundant network
          mkrzemien

          It is probably due to my configuration. For example, when an instance starts, the GMS address printed is always the eth0 address (a single NIC address), though I don't know whether the GMS part has anything to do with this problem.

          Another test that failed (same environment and configuration as in the previous post):
          - link eth0-to-eth0 disconnected before starting the JBossCache instances
          - both JBossCache instances started
          ! the instances work, but there is no replication

          However, if link eth1-to-eth1 is the one disconnected at the beginning, replication works fine.

          Obviously, this is not how it should work.

          Kind regards
          Mariusz

          • 2. Re: How to configure JGroups on hosts with redundant network
            belaban

            You can set a bind address using bind_addr="1.2.3.4" in the XML file, or override it with -Djgroups.bind_addr=1.2.3.4 (or -Dbind.address=1.2.3.4).
            Again, IP Bonding would probably solve your issue.
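To decide which address to pass as bind_addr, it can help to list what the JVM actually sees on the host. The following is a minimal sketch using only the standard java.net API; the class name `BindAddrCandidates` is made up for illustration. With IP bonding configured, the bonded interface should appear here as one interface with one address, and its name depends on the OS setup.

```java
import java.net.Inet4Address;
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;

public class BindAddrCandidates {

    /** Returns "interfaceName address" for every IPv4 address on an interface that is up and not loopback. */
    public static List<String> list() throws SocketException {
        List<String> out = new ArrayList<>();
        Enumeration<NetworkInterface> nics = NetworkInterface.getNetworkInterfaces();
        if (nics == null) {
            return out; // no interfaces reported
        }
        for (NetworkInterface nic : Collections.list(nics)) {
            if (!nic.isUp() || nic.isLoopback()) {
                continue; // skip interfaces that cannot carry cluster traffic
            }
            for (InetAddress addr : Collections.list(nic.getInetAddresses())) {
                if (addr instanceof Inet4Address) {
                    out.add(nic.getName() + " " + addr.getHostAddress());
                }
            }
        }
        return out;
    }

    public static void main(String[] args) throws SocketException {
        for (String candidate : list()) {
            System.out.println(candidate);
        }
    }
}
```

Any address printed this way can then be supplied through bind_addr in the XML or through -Djgroups.bind_addr on the command line, as described above.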