ReplicationException in simple situation
skipy Nov 7, 2005 3:33 AM
We get a ReplicationException in a quite simple situation. I wrote a synthetic test that illustrates the problem.
There are 2 servers in the cluster. One just listens and performs no actions on the cache. The second one emulates our business logic.
    package test;

    import org.apache.log4j.PropertyConfigurator;
    import org.jboss.cache.TreeCache;
    import org.jboss.cache.Fqn;

    public class Listener {
        public static void main(String[] args) throws Exception {
            PropertyConfigurator.configure("./conf/log4j.properties");
            final TreeCache cache = new TreeCache();
            new org.jboss.cache.PropertyConfigurator().configure(cache, "./conf/replSync-service.xml");
            cache.startService();
            Runtime.getRuntime().addShutdownHook(new Thread() {
                public void run() {
                    cache.stopService();
                }
            });
            // waiting forever
            try {
                Object obj = new Object();
                synchronized (obj) {
                    obj.wait();
                }
            } catch (InterruptedException ex) {
            }
        }
    }
    package test;

    import org.apache.log4j.PropertyConfigurator;
    import org.jboss.cache.TreeCache;
    import org.jboss.cache.Fqn;

    public class Worker {
        public static void main(String[] args) throws Exception {
            int i = 0;
            PropertyConfigurator.configure("./conf/log4j.properties");
            TreeCache cache = new TreeCache();
            new org.jboss.cache.PropertyConfigurator().configure(cache, "./conf/replSync-service.xml");
            cache.startService();
            try {
                for (i = 0; i < 7000; i++) {
                    Fqn fqn1 = new Fqn(new Object[]{"a", "b", "c" + i + ".tmp"});
                    cache.put(fqn1, "key", "value");
                    Fqn fqn2 = new Fqn(new Object[]{"a", "b", "c" + i});
                    cache.put(fqn2, "key", "value");
                    cache.remove(fqn2);
                    cache.remove(fqn1);
                }
            } catch (Exception ex) {
                cache.stopService();
            }
            System.exit(0);
        }
    }
The configuration is the following:
<?xml version="1.0" encoding="UTF-8"?>
<server>
  <classpath codebase="./lib" archives="jboss-cache.jar, jgroups.jar"/>
  <mbean code="org.jboss.cache.TreeCache" name="jboss.cache:service=TreeCache-DL-proto">
    <depends>jboss:service=Naming</depends>
    <depends>jboss:service=TransactionManager</depends>
    <attribute name="IsolationLevel">REPEATABLE_READ</attribute>
    <attribute name="CacheMode">REPL_SYNC</attribute>
    <attribute name="UseReplQueue">false</attribute>
    <attribute name="ReplQueueInterval">0</attribute>
    <attribute name="ReplQueueMaxElements">0</attribute>
    <attribute name="ClusterName">TreeCache-Cluster-DL-Proto</attribute>
    <attribute name="DeadlockDetection">true</attribute>
    <attribute name="ClusterConfig">
      <config>
        <!-- bind_addr is 192.168.20.90, or 192.168.20.91 for the other server -->
        <UDP mcast_addr="228.1.2.150" mcast_port="40001" bind_addr="192.168.20.90"
             ip_ttl="16" ip_mcast="true"
             mcast_send_buf_size="150000" mcast_recv_buf_size="80000"
             ucast_send_buf_size="150000" ucast_recv_buf_size="80000"
             loopback="false"/>
        <PING timeout="200" num_initial_members="3" up_thread="false" down_thread="false"/>
        <MERGE2 min_interval="10000" max_interval="20000"/>
        <FD_SOCK/>
        <VERIFY_SUSPECT timeout="100" up_thread="false" down_thread="false"/>
        <pbcast.NAKACK gc_lag="50" retransmit_timeout="600,1200,2400,4800" max_xmit_size="8192"
                       up_thread="false" down_thread="false"/>
        <UNICAST timeout="600,1200,2400" window_size="100" min_threshold="10" down_thread="false"/>
        <pbcast.STABLE desired_avg_gossip="20000" up_thread="false" down_thread="false"/>
        <FRAG frag_size="8192" down_thread="false" up_thread="false"/>
        <pbcast.GMS join_timeout="5000" join_retry_timeout="2000" shun="true" print_local_addr="true"/>
        <pbcast.STATE_TRANSFER up_thread="true" down_thread="true"/>
      </config>
    </attribute>
    <attribute name="FetchStateOnStartup">true</attribute>
    <attribute name="InitialStateRetrievalTimeout">60000</attribute>
    <attribute name="SyncReplTimeout">30000</attribute>
    <attribute name="LockAcquisitionTimeout">20000</attribute>
  </mbean>
</server>
As you can see, there is no transaction manager and no eviction policy. Thus, the listener part of this test really does nothing with the cache.
So, the problem is the following. This code works perfectly in our development environments. But in the test environment (just another pair of servers with a different configuration) we get the following exception:
org.jboss.cache.ReplicationException: rsp=sender=192.168.20.91:39625, retval=null, received=false, suspected=false
    at org.jboss.cache.TreeCache.callRemoteMethods(TreeCache.java:3505)
    at org.jboss.cache.TreeCache.callRemoteMethods(TreeCache.java:3526)
    at org.jboss.cache.interceptors.ReplicationInterceptor.handleReplicatedMethod(ReplicationInterceptor.java:122)
    at org.jboss.cache.interceptors.ReplicationInterceptor.invoke(ReplicationInterceptor.java:87)
    at org.jboss.cache.TreeCache.invokeMethod(TreeCache.java:4339)
    at org.jboss.cache.TreeCache.put(TreeCache.java:3083)
    at test.Worker.main(Worker.java:30)
In the previous version of JBossCache there was a TimeoutException instead of the ReplicationException. The error can also appear while removing data from the cache.
This situation can be fixed by increasing the timeout to, e.g., 3 minutes (120 seconds is not enough: with that value this test fails in approx. 1 of 4 runs). So a workaround exists. But I want to find the reason. It seems to me that the reason is in the network configuration.
I would just like to clarify the error message: rsp=sender=192.168.20.91:35840, retval=null, received=false, suspected=false. What does received=false mean? WHO didn't receive the message? This error was raised on the 192.168.20.90 server, but the sender address is 192.168.20.91. What does this mean? Does it mean that JGroups on the listener machine received the message but didn't get an answer from the cache? Or does it mean that JGroups on the worker side didn't get a response from the listener side?
Thank you in advance!
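To narrow this down before touching the network configuration, one option might be to time each cache call in the Worker loop and compare the slowest one with SyncReplTimeout. The sketch below is only an illustration: LatencyProbe, Op and timed are hypothetical names, not part of JBossCache or JGroups, and the Thread.sleep calls in main stand in for cache.put(...) and cache.remove(...). It uses lambdas, so it assumes a newer Java than the test above.

```java
public class LatencyProbe {
    /** Hypothetical callback type: one cache operation that may throw. */
    public interface Op {
        void run() throws Exception;
    }

    private long maxNanos = -1;      // -1 so that the first call always registers
    private String slowestLabel;     // label of the slowest call seen so far
    private int failures;            // number of calls that threw

    /** Runs op, records how long it took, and returns the elapsed milliseconds. */
    public long timed(String label, Op op) {
        long start = System.nanoTime();
        Exception failure = null;
        try {
            op.run();
        } catch (Exception ex) {
            failure = ex;
            failures++;
        }
        long elapsed = System.nanoTime() - start;
        if (elapsed > maxNanos) {
            maxNanos = elapsed;
            slowestLabel = label;
        }
        long millis = elapsed / 1_000_000;
        if (failure != null) {
            // Shows how long the call blocked before it gave up
            System.err.println(label + " failed after " + millis + " ms: " + failure);
        }
        return millis;
    }

    public String slowestLabel() { return slowestLabel; }

    public long maxMillis() { return maxNanos / 1_000_000; }

    public int failures() { return failures; }

    public static void main(String[] args) {
        LatencyProbe probe = new LatencyProbe();
        for (int i = 0; i < 3; i++) {
            final int n = i;
            // Stand-ins for cache.put(fqn, "key", "value") and cache.remove(fqn)
            probe.timed("put c" + n, () -> Thread.sleep(5L * (n + 1)));
            probe.timed("remove c" + n, () -> Thread.sleep(1));
        }
        System.out.println("slowest call: " + probe.slowestLabel()
                + " (" + probe.maxMillis() + " ms), failures: " + probe.failures());
    }
}
```

In the Worker loop each call would be wrapped the same way, e.g. probe.timed("put " + fqn1, () -> cache.put(fqn1, "key", "value")). If the slowest successful calls already come close to the timeout, that would point at slow delivery rather than a lost message.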
Regards,
Eugene