2 Replies Latest reply on Aug 31, 2007 7:39 AM by Bela Ban

    How to configure JGroups on hosts with redundant network lin

    Mariusz Krzemien Newbie

      In our production environment, all hosts have duplicated network links to protect against a single link failure. Does anyone have an example or best practices for configuring JGroups so that it keeps working in such an environment despite a single link failure?

      We built a prototype, but it failed; details are below.

      Thank you in advance.
      Kind regards
      Mariusz

      Version: JBossCache 1.4.1 SP3, JGroups 2.4.1

      Environment: a LAN consisting of two hosts, each with two NICs (eth0, eth1). The hosts are connected directly (eth0-to-eth0, eth1-to-eth1) and configured as a single IPv4 subnet. JGroups was intended to communicate on both interfaces and to use multicast (see the configuration details below).
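For reference, a small stand-alone check along these lines can confirm what the JVM itself sees on each host before JGroups is started. It is only a sketch using the standard `java.net.NetworkInterface` API; the class name `NicCheck` and its helper methods are made up for illustration, and the interface names `eth0`/`eth1` are taken from the environment described above.

```java
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;

// Hypothetical helper, not part of JGroups or JBossCache.
public class NicCheck {

    /** Returns the names of local interfaces that are up and support multicast. */
    static List<String> usableForMulticast() throws SocketException {
        List<String> names = new ArrayList<>();
        Enumeration<NetworkInterface> all = NetworkInterface.getNetworkInterfaces();
        if (all == null) {
            return names; // no interfaces reported by the OS
        }
        for (NetworkInterface nic : Collections.list(all)) {
            if (nic.isUp() && nic.supportsMulticast()) {
                names.add(nic.getName());
            }
        }
        return names;
    }

    /** Returns the wanted interface names that do not appear in the usable list. */
    static List<String> missing(List<String> wanted, List<String> usable) {
        List<String> result = new ArrayList<>(wanted);
        result.removeAll(usable);
        return result;
    }

    public static void main(String[] args) throws SocketException {
        List<String> usable = usableForMulticast();
        System.out.println("usable for multicast: " + usable);
        System.out.println("missing: " + missing(Arrays.asList("eth0", "eth1"), usable));
    }
}
```

Running it on each host once with both links connected and once with a link unplugged would show whether the disconnected NIC is still reported as up. That only describes the JVM's view of the interfaces; it says nothing about how JGroups chooses between them.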

      Test description:
      - both links are connected
      - one JBossCache instance started on each node
      - replication working correctly
      - disconnected link eth1-to-eth1
      - replication working correctly
      - reconnected link eth1-to-eth1, disconnected link eth0-to-eth0
      - replication working correctly
      ! after about 5 seconds, both instances report an exception (see below) to one another and terminate because the exception is not caught

      I don't know whether simply catching the exception is enough. At a high level, it looks as though JGroups/JBossCache has a problem with this configuration.
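To make "simply catching the exception" concrete, here is a minimal sketch of what that could look like in the test program. `RetrySketch` and `callWithRetry` are hypothetical names, not JBossCache or JGroups API, and the simulated failure in `main` only stands in for a cache call such as the `remove` seen in the stack trace. It catches the failure, waits, and retries a bounded number of times; it does not address why replication times out after the link switch.

```java
import java.util.concurrent.Callable;

// Hypothetical wrapper for illustration; not part of JBossCache or JGroups.
public class RetrySketch {

    /**
     * Runs op up to maxAttempts times, sleeping backoffMillis between failed
     * attempts. Rethrows the last failure if every attempt fails.
     */
    static <T> T callWithRetry(Callable<T> op, int maxAttempts, long backoffMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (InterruptedException e) {
                // Preserve the interrupt and stop retrying.
                Thread.currentThread().interrupt();
                throw e;
            } catch (Exception e) {
                last = e;
                System.err.println("attempt " + attempt + " failed: " + e);
                if (attempt < maxAttempts) {
                    Thread.sleep(backoffMillis);
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a replicated cache call: fails twice, then succeeds.
        int[] calls = {0};
        String result = callWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("simulated replication timeout");
            }
            return "ok";
        }, 5, 10);
        System.out.println(result + " after " + calls[0] + " calls");
    }
}
```

A wrapper like this would keep the test program alive through the failure, but if the two nodes never re-establish communication over the remaining link, every retry would time out in the same way.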

      Configuration details:
      <UDP mcast_addr="228.8.8.8" mcast_port="45566"
      ip_ttl="64" ip_mcast="true"
      mcast_send_buf_size="150000" mcast_recv_buf_size="80000"
      ucast_send_buf_size="150000" ucast_recv_buf_size="80000"
      loopback="false"
      receive_on_all_interfaces="true"
      send_on_all_interfaces="true"
      receive_interfaces="eth0,eth1"
      send_interfaces="eth0,eth1"/>
      <PING timeout="2000" num_initial_members="3"
      up_thread="false" down_thread="false"/>
      <MERGE2 min_interval="10000" max_interval="20000"/>
      <!-- <FD shun="true" up_thread="true" own_thread="true" />-->
      <FD_SOCK/>
      <VERIFY_SUSPECT timeout="1500" up_thread="false" down_thread="false"/>
      <pbcast.NAKACK gc_lag="50" retransmit_timeout="600,1200,2400,4800"
      max_xmit_size="8192" up_thread="false" down_thread="false"/>
      <UNICAST timeout="600,1200,2400" window_size="100" min_threshold="10"
      down_thread="false"/>
      <pbcast.STABLE desired_avg_gossip="20000"
      up_thread="false" down_thread="false"/>
      <pbcast.GMS join_timeout="5000" join_retry_timeout="2000"
      shun="true" print_local_addr="true"/>
      <FC max_credits="2000000" down_thread="false" up_thread="false"
      min_threshold="0.20"/>
      <FRAG frag_size="8192" down_thread="false" up_thread="true"/>
      <pbcast.STATE_TRANSFER up_thread="true" down_thread="true"/>

      Logs with exception:
      [2007-08-30 15:20:29,796|DEBUG|main; |org.jgroups.blocks.GroupRequest(execute:195)]: call did not execute correctly, request is [GroupRequest:
      req_id=1188480009786
      caller=10.10.0.2:32781
      10.10.0.1:32781: sender=10.10.0.1:32781, retval=null, received=false, suspected=false

      request_msg: [dst: , src: 10.10.0.2:32781 (3 headers), size = 34 bytes]
      rsp_mode: GET_ALL
      done: false
      timeout: 20000
      expected_mbrs: 0
      ]
      [2007-08-30 15:20:29,796|DEBUG|main; |org.jgroups.blocks.RpcDispatcher(callRemoteMethods:193)]: responses: [sender=10.10.0.1:32781, retval=null, received=false, suspected=false]

      [2007-08-30 15:20:29,797|DEBUG|main; |org.jboss.cache.TreeCache(callRemoteMethods:4405)]: (10.10.0.2:32781): responses for method _replicate:
      [sender=10.10.0.1:32781, retval=null, received=false, suspected=false]

      [2007-08-30 15:20:29,798|DEBUG|main; |org.jboss.cache.interceptors.BaseRpcInterceptor(replicateCall:118)]: responses=[org.jboss.cache.ReplicationException: rsp=sender=10.10.0.1:32781, retval=null, received=false, suspected=false]
      [2007-08-30 15:20:29,800|DEBUG|main; |org.jboss.cache.interceptors.BaseRpcInterceptor(checkResponses:79)]: Received Throwable from remote node
      org.jboss.cache.ReplicationException: rsp=sender=10.10.0.1:32781, retval=null, received=false, suspected=false
      at org.jboss.cache.TreeCache.callRemoteMethods(TreeCache.java:4422)
      at org.jboss.cache.TreeCache.callRemoteMethods(TreeCache.java:4344)
      at org.jboss.cache.TreeCache.callRemoteMethods(TreeCache.java:4455)
      at org.jboss.cache.interceptors.BaseRpcInterceptor.replicateCall(BaseRpcInterceptor.java:110)
      at org.jboss.cache.interceptors.BaseRpcInterceptor.replicateCall(BaseRpcInterceptor.java:88)
      at org.jboss.cache.interceptors.ReplicationInterceptor.handleReplicatedMethod(ReplicationInterceptor.java:124)
      at org.jboss.cache.interceptors.ReplicationInterceptor.invoke(ReplicationInterceptor.java:88)
      at org.jboss.cache.interceptors.Interceptor.invoke(Interceptor.java:68)
      at org.jboss.cache.interceptors.TxInterceptor.handleNonTxMethod(TxInterceptor.java:365)
      at org.jboss.cache.interceptors.TxInterceptor.invoke(TxInterceptor.java:160)
      at org.jboss.cache.interceptors.Interceptor.invoke(Interceptor.java:68)
      at org.jboss.cache.interceptors.CacheMgmtInterceptor.invoke(CacheMgmtInterceptor.java:183)
      at org.jboss.cache.TreeCache.invokeMethod(TreeCache.java:5863)
      at org.jboss.cache.TreeCache.remove(TreeCache.java:3929)
      at org.jboss.cache.TreeCache.remove(TreeCache.java:3915)
      at test.jbcache.DistributedTree.remove(DistributedTree.java:41)
      at test.jbcache.DistributedTest.handleSession(DistributedTest.java:46)
      at test.jbcache.DistributedTest.main(DistributedTest.java:78)
      Caused by: org.jboss.cache.lock.TimeoutException: Response timed out: sender=10.10.0.1:32781, retval=null, received=false, suspected=false
      at org.jboss.cache.TreeCache.callRemoteMethods(TreeCache.java:4420)
      ... 17 more