3 Replies Latest reply on Feb 15, 2006 10:50 AM by vikassingh

ReplicationException in simple situation

skipy Nov 7, 2005 3:33 AM

We get a ReplicationException in quite a simple situation. I wrote a synthetic test that illustrates the problem.

There are 2 servers in the cluster. One just listens; it performs no actions on the cache. The second one emulates our business logic.

package test;

import org.apache.log4j.PropertyConfigurator;
import org.jboss.cache.TreeCache;
import org.jboss.cache.Fqn;

public class Listener {

 public static void main(String[] args) throws Exception{
 PropertyConfigurator.configure("./conf/log4j.properties");
 final TreeCache cache = new TreeCache();
 new org.jboss.cache.PropertyConfigurator().configure(cache,"./conf/replSync-service.xml");
 cache.startService();
 Runtime.getRuntime().addShutdownHook(new Thread(){
 public void run() {
 cache.stopService();
 }
 });
 // waiting forever
 try{
 Object obj = new Object();
 synchronized(obj){
 obj.wait();
 }
 }catch(InterruptedException ex){
 }
 }
}

package test;

import org.apache.log4j.PropertyConfigurator;
import org.jboss.cache.TreeCache;
import org.jboss.cache.Fqn;

public class Worker {

 public static void main(String[] args) throws Exception{
 int i=0;
 PropertyConfigurator.configure("./conf/log4j.properties");
 TreeCache cache = new TreeCache();
 new org.jboss.cache.PropertyConfigurator().configure(cache,"./conf/replSync-service.xml");
 cache.startService();
 try{
 for(i=0; i<7000; i++){
 Fqn fqn1 = new Fqn(new Object[]{"a","b","c"+i+".tmp"});
 cache.put(fqn1,"key","value");
 Fqn fqn2 = new Fqn(new Object[]{"a","b","c"+i});
 cache.put(fqn2,"key","value");
 cache.remove(fqn2);
 cache.remove(fqn1);
 }
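 }catch(Exception ex){
 // report the failure and the iteration instead of silently swallowing it
 System.err.println("Failed at iteration " + i);
 ex.printStackTrace();
 }finally{
 // always leave the cluster cleanly, whether or not the loop failed
 cache.stopService();
 }
 System.exit(0);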
 }
}

Configuration is the following:

<?xml version="1.0" encoding="UTF-8"?>
<server>
 <classpath codebase="./lib" archives="jboss-cache.jar, jgroups.jar"/>
 <mbean code="org.jboss.cache.TreeCache"
 name="jboss.cache:service=TreeCache-DL-proto">
 <depends>jboss:service=Naming</depends>
 <depends>jboss:service=TransactionManager</depends>
 <attribute name="IsolationLevel">REPEATABLE_READ</attribute>
 <attribute name="CacheMode">REPL_SYNC</attribute>
 <attribute name="UseReplQueue">false</attribute>
 <attribute name="ReplQueueInterval">0</attribute>
 <attribute name="ReplQueueMaxElements">0</attribute>
 <attribute name="ClusterName">TreeCache-Cluster-DL-Proto</attribute>
 <attribute name="DeadlockDetection">true</attribute>
 <attribute name="ClusterConfig">
 <config>
 <!-- bind_addr is 192.168.20.91 on the other server -->
 <UDP mcast_addr="228.1.2.150" mcast_port="40001" bind_addr="192.168.20.90"
 ip_ttl="16" ip_mcast="true"
 mcast_send_buf_size="150000" mcast_recv_buf_size="80000"
 ucast_send_buf_size="150000" ucast_recv_buf_size="80000"
 loopback="false"/>
 <PING timeout="200" num_initial_members="3"
 up_thread="false" down_thread="false"/>
 <MERGE2 min_interval="10000" max_interval="20000"/>
 <FD_SOCK/>
 <VERIFY_SUSPECT timeout="100"
 up_thread="false" down_thread="false"/>
 <pbcast.NAKACK gc_lag="50" retransmit_timeout="600,1200,2400,4800"
 max_xmit_size="8192" up_thread="false" down_thread="false"/>
 <UNICAST timeout="600,1200,2400" window_size="100" min_threshold="10"
 down_thread="false"/>
 <pbcast.STABLE desired_avg_gossip="20000"
 up_thread="false" down_thread="false"/>
 <FRAG frag_size="8192"
 down_thread="false" up_thread="false"/>
 <pbcast.GMS join_timeout="5000" join_retry_timeout="2000"
 shun="true" print_local_addr="true"/>
 <pbcast.STATE_TRANSFER up_thread="true" down_thread="true"/>
 </config>
 </attribute>
 <attribute name="FetchStateOnStartup">true</attribute>
 <attribute name="InitialStateRetrievalTimeout">60000</attribute>
 <attribute name="SyncReplTimeout">30000</attribute>
 <attribute name="LockAcquisitionTimeout">20000</attribute>
 </mbean>
</server>

As you can see, there is no transaction manager and no eviction policy. Thus, the listener part of this test really does nothing with the cache.

So, the problem is the following. This code works perfectly in our development environments. But in the test environment (just another pair of servers with a different configuration) we get the following exception:

org.jboss.cache.ReplicationException: rsp=sender=192.168.20.91:39625, retval=null, received=false, suspected=false
 at org.jboss.cache.TreeCache.callRemoteMethods(TreeCache.java:3505)
 at org.jboss.cache.TreeCache.callRemoteMethods(TreeCache.java:3526)
 at org.jboss.cache.interceptors.ReplicationInterceptor.handleReplicatedMethod(ReplicationInterceptor.java:122)
 at org.jboss.cache.interceptors.ReplicationInterceptor.invoke(ReplicationInterceptor.java:87)
 at org.jboss.cache.TreeCache.invokeMethod(TreeCache.java:4339)
 at org.jboss.cache.TreeCache.put(TreeCache.java:3083)
 at test.Worker.main(Worker.java:30)

In the previous version of JBossCache there was a TimeoutException instead of a ReplicationException. The error can also appear while removing data from the cache.

This situation can be fixed by increasing the timeout to, e.g., 3 minutes (120 seconds is not enough: the test still fails in approximately 1 run out of 4). So a workaround exists, but I want to find the reason. It seems to me that the reason lies in the network configuration. I would just like to clarify the error message: rsp=sender=192.168.20.91:35840, retval=null, received=false, suspected=false. What does received=false mean? WHO didn't receive the message? This error was found on the 192.168.20.90 server, but the sender address is 192.168.20.91. What does this mean? Does it mean that JGroups on the listener machine received the message but didn't get an answer from the cache? Or does it mean that JGroups on the worker side didn't get a response from the listener side?

Thank you in advance!

Regards,
Eugene

1. Re: ReplicationException in simple situation

ben.wang Nov 7, 2005 6:58 PM (in response to skipy)

1. What version of JBoss Cache and JGroups?
2. How many machines in prod?
3. Are they all running on the same JGroups version?
4. Have you validated that mcast works?

-Ben
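
For question 4, one way to check multicast independently of the cache is a small sender/receiver pair run on the two servers. The following is only a minimal sketch, assuming the group 228.1.2.150 and port 40001 from the posted configuration; the class name McastCheck and its "send"/"receive" arguments are made up for illustration and are not part of JBoss Cache or JGroups.

```java
import java.net.DatagramPacket;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.MulticastSocket;
import java.net.SocketTimeoutException;
import java.net.UnknownHostException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: run "receive" on one server, then "send" on the other.
public class McastCheck {

    // Group and port copied from the UDP element of the posted configuration.
    static final String GROUP = "228.1.2.150";
    static final int PORT = 40001;

    /** Returns true if the literal address is in the multicast range. */
    static boolean isMulticastGroup(String addr) {
        try {
            return InetAddress.getByName(addr).isMulticastAddress();
        } catch (UnknownHostException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        String mode = args.length > 0 ? args[0] : "";
        if (!isMulticastGroup(GROUP)) {
            throw new IllegalArgumentException(GROUP + " is not a multicast address");
        }
        InetAddress group = InetAddress.getByName(GROUP);

        if (mode.equals("send")) {
            try (MulticastSocket socket = new MulticastSocket()) {
                socket.setTimeToLive(16); // same value as ip_ttl in the configuration
                byte[] data = "mcast-check".getBytes(StandardCharsets.UTF_8);
                socket.send(new DatagramPacket(data, data.length, group, PORT));
                System.out.println("Sent one packet to " + GROUP + ":" + PORT);
            }
        } else if (mode.equals("receive")) {
            try (MulticastSocket socket = new MulticastSocket(PORT)) {
                // null lets the socket use its default network interface
                socket.joinGroup(new InetSocketAddress(group, PORT), null);
                socket.setSoTimeout(10000); // give up after 10 seconds
                DatagramPacket packet = new DatagramPacket(new byte[256], 256);
                try {
                    socket.receive(packet);
                    System.out.println("Received packet from " + packet.getAddress());
                } catch (SocketTimeoutException e) {
                    System.out.println("Nothing received within 10 seconds");
                }
            }
        } else {
            System.out.println("usage: java McastCheck send|receive");
        }
    }
}
```

If the receiver prints nothing while the sender reports a sent packet, multicast between the two hosts is probably being dropped somewhere. On multi-homed machines the default interface may not be the one named in bind_addr, so a negative result alone is not conclusive.
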
2. Re: ReplicationException in simple situation

skipy Nov 8, 2005 9:08 AM (in response to skipy)

I'm sorry for the disturbance; the problem is solved. Of course, I could answer your questions, but the cause was not in JBossCache: it was an infrastructure problem. The network interface on one of the test servers was working in half-duplex mode. Now the application works excellently!

Thank you for your help! And for your library!

Regards,
Eugene
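
On the question from the first post about what received=false refers to: as far as I understand it, with synchronous replication the calling side keeps one response record per remote member and fills it in when that member's reply arrives. Under that reading, "sender" is the member whose reply was awaited (the listener, 192.168.20.91), and received=false means the worker did not get that reply before the timeout, while suspected=false means failure detection had not flagged the member. The sketch below only models that bookkeeping; the class RspSketch and its methods are hypothetical stand-ins, not the actual JGroups or JBoss Cache code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of per-member response records on the calling side.
public class RspSketch {

    /** One record per remote member that is expected to answer. */
    static final class Rsp {
        final String sender;   // member whose reply is awaited
        Object retval;         // stays null until a reply arrives
        boolean received;      // set to true once that member's reply is in
        boolean suspected;     // true only if failure detection flagged the member

        Rsp(String sender) {
            this.sender = sender;
        }

        @Override
        public String toString() {
            return "sender=" + sender + ", retval=" + retval
                    + ", received=" + received + ", suspected=" + suspected;
        }
    }

    /** Creates a pending record per member, then applies the replies that arrived in time. */
    static Map<String, Rsp> collect(String[] members, Map<String, Object> repliesInTime) {
        Map<String, Rsp> rsps = new LinkedHashMap<>();
        for (String member : members) {
            Rsp rsp = new Rsp(member);
            if (repliesInTime.containsKey(member)) {
                rsp.retval = repliesInTime.get(member);
                rsp.received = true;
            }
            rsps.put(member, rsp);
        }
        return rsps;
    }

    /** Returns the error text for the first member without a usable reply, or null. */
    static String firstMissing(Map<String, Rsp> rsps) {
        for (Rsp rsp : rsps.values()) {
            if (!rsp.received || rsp.suspected) {
                return "rsp=" + rsp;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // The worker waits for the listener, but no reply arrives before the timeout.
        String[] members = {"192.168.20.91:39625"};
        Map<String, Object> noReplies = new LinkedHashMap<>();
        System.out.println(firstMissing(collect(members, noReplies)));
        // prints: rsp=sender=192.168.20.91:39625, retval=null, received=false, suspected=false
    }
}
```

This would be consistent with the half-duplex finding: the listener was alive (so not suspected), but its replies did not reach the worker in time.
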
3. Re: ReplicationException in simple situation

vikassingh Feb 15, 2006 10:50 AM (in response to skipy)

I am experiencing the same problem on Oracle Application Server 10g-R2. We have a JBoss TreeCache implementation, and our J2EE application joins the tree cache, which is published by another process, mostly on a separate machine. We are using the TCP mode of communication.
I am getting the following exception:

org.jboss.cache.ReplicationException: rsp=sender=146.122.71.102:2833, retval=null, received=false, suspected=false
at org.jboss.cache.TreeCache.callRemoteMethods(TreeCache.java:3505)
at org.jboss.cache.TreeCache.callRemoteMethods(TreeCache.java:3526)
at org.jboss.cache.interceptors.ReplicationInterceptor.handleReplicatedMethod(ReplicationInterceptor.java:122)
at org.jboss.cache.interceptors.ReplicationInterceptor.invoke(ReplicationInterceptor.java:87)
at org.jboss.cache.TreeCache.invokeMethod(TreeCache.java:4339)
....