6 Replies Latest reply on May 13, 2005 9:53 AM by Gustavo Ambrozio

    Problem with cluster under same VM

    Gustavo Ambrozio Newbie

      I'm new to JBossCache and I can't see what I'm doing wrong.....

      Basically, I'm testing JBossCache to verify whether its behaviour can solve my problem. I wrote the following class (simplified):

      public class TreeCacheTest implements Runnable
      {
          private static boolean testEnds = false;
          private String id;

          public static void main(String[] args)
          {
              new Thread(new TreeCacheTest("1")).start();
              new Thread(new TreeCacheTest("2")).start();
              new Thread(new TreeCacheTest("3")).start();
          }

          public TreeCacheTest(String id)
          {
              this.id = id;
          }

          public void run()
          {
              Util.sleep((long) (Math.random()*5000));

              try
              {
                  TreeCache tree = new TreeCache();
                  tree.setFetchStateOnStartup(true);
                  tree.setClusterName("Test");
                  tree.setClusterProperties("state_transfer.xml");
                  tree.setCacheMode(TreeCache.REPL_SYNC);
                  tree.createService();
                  tree.startService();

                  long end = System.currentTimeMillis() + 10000 + (long) (Math.random()*5000);
                  while (end > System.currentTimeMillis())
                  {
                      long wait = (long) (Math.random()*1000);
                      Util.sleep(wait);
                      synchronized (TreeCacheTest.class)
                      {
                          if (testEnds) break;
                      }
                      tree.put("/a", id+"_"+wait, new Long(wait));
                  }

                  synchronized (TreeCacheTest.class)
                  {
                      testEnds = true;
                  }

                  Set set = tree.getKeys("/a");
                  System.out.println("ended "+id+" "+set.size());
              } catch (Exception e) {
                  e.printStackTrace();
              }
          }
      }
      


      This code starts 3 threads with different ids so I can track what everyone is doing. Each thread adds nodes to the cache for about 10 seconds. When the first one stops, all threads stop (using the static variable testEnds). So, when these threads stop, I think they should all have the same data, right? Well, that's not the case.... I get the following lines at the end of the test:

      ended 1 42
      ended 3 44
      ended 2 42

      Well, it varies from test to test, but they never end up with the same number of entries in their cache. What am I missing here?? Maybe it's my cluster configuration (state_transfer.xml):

      <config>
          <UDP mcast_recv_buf_size="64000" mcast_send_buf_size="32000"
               mcast_port="45566" ucast_recv_buf_size="64000"
               use_incoming_packet_handler="false" mcast_addr="228.8.8.8"
               use_outgoing_packet_handler="false" loopback="true"
               ucast_send_buf_size="32000" ip_ttl="32"/>
          <AUTOCONF />
          <PING timeout="2000" num_initial_members="3"/>
          <MERGE2 max_interval="10000" min_interval="5000"/>
          <FD_SOCK />
          <VERIFY_SUSPECT timeout="1500"/>
          <pbcast.NAKACK max_xmit_size="8192" gc_lag="50" retransmit_timeout="600,1200,2400,4000"/>
          <UNICAST timeout="1000,1500,2000"/>
          <pbcast.STABLE stability_delay="1000" desired_avg_gossip="20000" max_bytes="0"/>
          <FRAG frag_size="8192" down_thread="false" up_thread="false"/>
          <pbcast.GMS print_local_addr="true" join_timeout="3000" join_retry_timeout="2000" shun="true"/>
          <pbcast.STATE_TRANSFER />
      </config>
      


      Thanks for the help.

        • 1. Re: Problem with cluster under same VM
          Bela Ban Master

          Two things:

          #1 Starting all instances at the same time might get you into merge situations. Merging different states is not (yet) implemented in JBossCache.

          #2

          You are running the default protocol stack. This is *not* using virtual synchrony, so at the end we're *not* running a flush to flush all pending messages out of the system.
          If you take the XML below (which is JGroups/conf/vsync.xml), then you will have virtual synchrony, and the program behaves as expected.

          XML:

          <config>
              <UDP mcast_recv_buf_size="80000" mcast_send_buf_size="150000"
                   mcast_port="45566" use_incoming_packet_handler="false"
                   mcast_addr="228.8.8.8" use_outgoing_packet_handler="false" ip_ttl="32"/>
              <PING timeout="2000" num_initial_members="3"/>
              <FD />
              <STABLE />
              <VERIFY_SUSPECT timeout="1500"/>
              <NAKACK />
              <UNICAST timeout="1200,2000,2500"/>
              <FRAG frag_size="8192" down_thread="false" up_thread="false"/>
              <FLUSH />
              <GMS />
              <VIEW_ENFORCER />
              <STATE_TRANSFER />
              <QUEUE />
          </config>


          Output:
          -------------------------------------------------------
          GMS: address is 192.168.0.2:4739
          -------------------------------------------------------

          -------------------------------------------------------
          GMS: address is 192.168.0.2:4750
          -------------------------------------------------------

          -------------------------------------------------------
          GMS: address is 192.168.0.2:4752
          -------------------------------------------------------
          2: put(/a, 2_412, 412)
          2: put(/a, 2_76, 76)
          2: put(/a, 2_10, 10)
          2: put(/a, 2_522, 522)
          2: put(/a, 2_502, 502)
          2: put(/a, 2_346, 346)
          2: put(/a, 2_676, 676)
          view for 2 is [192.168.0.2:4739|1] [192.168.0.2:4739, 192.168.0.2:4752]
          view for 2 is [192.168.0.2:4739|2] [192.168.0.2:4739, 192.168.0.2:4752, 192.168.0.2:4750]
          2: put(/a, 2_365, 365)
          1: put(/a, 1_292, 292)
          1: put(/a, 1_60, 60)
          3: put(/a, 3_386, 386)
          1: put(/a, 1_272, 272)
          3: put(/a, 3_466, 466)
          3: put(/a, 3_254, 254)
          1: put(/a, 1_724, 724)
          3: put(/a, 3_288, 288)
          2: put(/a, 2_708, 708)
          1: put(/a, 1_521, 521)
          1: put(/a, 1_46, 46)
          1: put(/a, 1_35, 35)
          3: put(/a, 3_684, 684)
          2: put(/a, 2_603, 603)
          2: put(/a, 2_93, 93)
          3: put(/a, 3_371, 371)
          2: put(/a, 2_227, 227)
          1: put(/a, 1_865, 865)
          3: put(/a, 3_320, 320)
          1: put(/a, 1_70, 70)
          2: put(/a, 2_796, 796)
          2: put(/a, 2_274, 274)
          3: put(/a, 3_970, 970)
          1: put(/a, 1_927, 927)
          2: put(/a, 2_499, 499)
          3: put(/a, 3_398, 398)
          3: put(/a, 3_94, 94)
          1: put(/a, 1_488, 488)
          1: put(/a, 1_3, 3)
          1: put(/a, 1_20, 20)
          3: put(/a, 3_199, 199)
          3: put(/a, 3_72, 72)
          2: put(/a, 2_545, 545)
          1: put(/a, 1_444, 444)
          2: put(/a, 2_716, 716)
          3: put(/a, 3_813, 813)
          1: put(/a, 1_672, 672)
          3: put(/a, 3_264, 264)
          2: put(/a, 2_608, 608)
          1: put(/a, 1_824, 824)
          2: put(/a, 2_469, 469)
          3: put(/a, 3_999, 999)
          1: put(/a, 1_470, 470)
          1: put(/a, 1_87, 87)
          2: put(/a, 2_799, 799)
          2: put(/a, 2_22, 22)
          3: put(/a, 3_947, 947)
          2: put(/a, 2_327, 327)
          1: put(/a, 1_723, 723)
          2: put(/a, 2_136, 136)
          2: put(/a, 2_15, 15)
          3: put(/a, 3_313, 313)
          3: put(/a, 3_117, 117)
          1: put(/a, 1_480, 480)
          2: put(/a, 2_322, 322)
          2: put(/a, 2_350, 350)
          3: put(/a, 3_971, 971)
          1: put(/a, 1_853, 853)
          2: put(/a, 2_567, 567)
          2: put(/a, 2_68, 68)
          2: put(/a, 2_14, 14)
          1: put(/a, 1_328, 328)
          2: put(/a, 2_650, 650)
          3: put(/a, 3_930, 930)
          1: put(/a, 1_699, 699)
          2: put(/a, 2_545, 545)
          ended 2 73
          ended 1 73
          ended 3 73

          • 2. Re: Problem with cluster under same VM
            Bela Ban Master

            The XML fragment didn't show correctly. But if you download JGroups, you'll find vsync.xml in JGroups/conf.

            • 3. Re: Problem with cluster under same VM
              Gustavo Ambrozio Newbie

              Did you modify something else?? I'm running it on my machine and I still get inconsistent results:

              ended 2 26
              ended 1 29
              ended 3 23

              using vsync.xml:

              <config>
               <UDP mcast_recv_buf_size="80000" mcast_send_buf_size="150000"
               use_incoming_packet_handler="false"
               use_outgoing_packet_handler="false" ip_ttl="32"/>
               <PING timeout="2000" num_initial_members="3"/>
               <FD />
               <STABLE />
               <VERIFY_SUSPECT timeout="1500"/>
               <NAKACK />
               <UNICAST timeout="1200,2000,2500"/>
               <FRAG frag_size="8192" down_thread="false" up_thread="false"/>
               <FLUSH />
               <GMS />
               <VIEW_ENFORCER />
               <STATE_TRANSFER />
               <QUEUE />
              </config>
              


              Can it be because I'm running under Windows? Would a shared cache loader help, even though I don't need one? All I have to do is keep a lot of distributed agents in sync (everyone needs to know what the others are doing).

              Thanks again.

              • 4. Re: Problem with cluster under same VM
                Gustavo Ambrozio Newbie

                Some new tests: First, I modified the first sleep to:

                 //Util.sleep((long) (Math.random()*5000));
                 Util.sleep((long) (id*2000));
                


                id is now an integer, rather than a String. So now I have the first thread starting after 2 seconds, the second after 4 and the third after 6. I also added this to the end of run():

                 for (Iterator iter = set.iterator(); iter.hasNext();)
                 {
                     System.out.println("end "+iter.next()+" by "+id);
                 }
                


                What I get is:

                ended 1 24
                end 1_982 by 1
                end 1_584 by 1
                end 3_415 by 1
                end 1_762 by 1
                end 1_658 by 1
                end 1_395 by 1
                end 1_61 by 1
                end 3_86 by 1
                end 1_415 by 1
                end 1_783 by 1
                end 3_21 by 1
                end 3_938 by 1
                end 1_135 by 1
                end 1_299 by 1
                end 1_924 by 1
                end 3_566 by 1
                end 1_222 by 1
                end 3_251 by 1
                end 1_81 by 1
                end 3_197 by 1
                end 1_239 by 1
                end 3_63 by 1
                end 1_777 by 1
                end 1_324 by 1
                ended 2 20
                end 2_680 by 2
                end 2_895 by 2
                end 2_844 by 2
                end 2_884 by 2
                end 2_119 by 2
                end 2_923 by 2
                end 2_351 by 2
                end 2_863 by 2
                end 2_384 by 2
                end 2_696 by 2
                end 2_481 by 2
                end 2_187 by 2
                end 2_452 by 2
                end 2_935 by 2
                end 2_262 by 2
                end 2_136 by 2
                end 2_322 by 2
                end 2_918 by 2
                end 2_820 by 2
                end 2_416 by 2
                ended 3 24
                end 1_982 by 3
                end 3_415 by 3
                end 1_762 by 3
                end 1_658 by 3
                end 1_395 by 3
                end 1_61 by 3
                end 3_86 by 3
                end 1_783 by 3
                end 1_415 by 3
                end 3_21 by 3
                end 3_10 by 3
                end 3_938 by 3
                end 1_299 by 3
                end 1_135 by 3
                end 1_924 by 3
                end 1_222 by 3
                end 3_566 by 3
                end 3_251 by 3
                end 1_81 by 3
                end 3_197 by 3
                end 1_239 by 3
                end 3_63 by 3
                end 1_777 by 3
                end 1_324 by 3


                 Seems like threads 1 and 3 are in the same cluster, while thread 2 is all alone.....

                 I have log4j configured at DEBUG, so I get a lot of extra output. One line I find curious is:

                 creator of own vid (dcbnw225:1987) is different from creator of message's vid (dcbnw225:1984); discarding message !

                Maybe that's a hint???

                Thanks a lot

                • 5. Re: Problem with cluster under same VM
                  Bela Ban Master

                  Well, if the members are in different clusters, then you will always get different results.
                  I suggest you don't start all 3 threads concurrently. Have at least 1 member in the cluster, before you start the others.
                  Otherwise, they will need a merge, and the MERGE protocol was not in your stack.
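
                   A minimal sketch of this staggered startup in plain Java (the class name, the started list, and the omitted TreeCache setup are illustrative, not from the thread; a java.util.concurrent.CountDownLatch holds the later members back until the first one is up):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;

public class StaggeredStart implements Runnable
{
    // Released once the first member has finished starting.
    private static final CountDownLatch firstUp = new CountDownLatch(1);
    // Records the order in which members came up (for the demo only).
    static final List started = Collections.synchronizedList(new ArrayList());

    private final String id;
    private final boolean first;

    StaggeredStart(String id, boolean first)
    {
        this.id = id;
        this.first = first;
    }

    public void run()
    {
        try
        {
            if (!first)
            {
                firstUp.await(); // block until the coordinator is up
            }
            // ... create and start the TreeCache here, as in TreeCacheTest ...
            started.add(id);
            if (first)
            {
                firstUp.countDown(); // cluster now has one member; let the rest join
            }
        }
        catch (InterruptedException e)
        {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) throws InterruptedException
    {
        Thread t1 = new Thread(new StaggeredStart("1", true));
        Thread t2 = new Thread(new StaggeredStart("2", false));
        Thread t3 = new Thread(new StaggeredStart("3", false));
        t1.start();
        t2.start();
        t3.start();
        t1.join();
        t2.join();
        t3.join();
        System.out.println("start order: " + started);
    }
}
```

                   Member 1 always completes its startup before 2 or 3 run theirs; the other two may still race each other, which is harmless once one member already anchors the cluster.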

                  • 6. Re: Problem with cluster under same VM
                    Gustavo Ambrozio Newbie

                     

                    "bela@jboss.com" wrote:
                    Well, if the members are in different clusters, then you will always get different results.
                    I suggest you don't start all 3 threads concurrently. Have at least 1 member in the cluster, before you start the others.
                    Otherwise, they will need a merge, and the MERGE protocol was not in your stack.


                     In this last test, I ensured that the second thread would only start 2 seconds after the first one (I tried increasing this time without success). I'm adding the MERGE protocol now. Just to make sure I'm doing it right, I added this line:

                    <MERGE2 max_interval="10000" min_interval="5000"/>

                    right after the PING protocol. Is this right?
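
                     For reference, this is how the vsync stack posted earlier would look with that line in place (assembled from the fragments in this thread; not independently tested):

```xml
<config>
    <UDP mcast_recv_buf_size="80000" mcast_send_buf_size="150000"
         use_incoming_packet_handler="false"
         use_outgoing_packet_handler="false" ip_ttl="32"/>
    <PING timeout="2000" num_initial_members="3"/>
    <MERGE2 max_interval="10000" min_interval="5000"/>
    <FD />
    <STABLE />
    <VERIFY_SUSPECT timeout="1500"/>
    <NAKACK />
    <UNICAST timeout="1200,2000,2500"/>
    <FRAG frag_size="8192" down_thread="false" up_thread="false"/>
    <FLUSH />
    <GMS />
    <VIEW_ENFORCER />
    <STATE_TRANSFER />
    <QUEUE />
</config>
```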

                    Thanks again