5 Replies Latest reply on Jul 15, 2011 11:10 AM by Alex Heneveld

    Beginner Problem — Data Loss Starting Two Nodes in Order

    Alex Heneveld Newbie

      I'm having a problem with entries disappearing when a node joins a cluster.  It looks like it's rehashing and having some allergic reaction to the new node, but I can't fathom why.  Any assistance appreciated.

       

      The attached java file shows the problem:  it creates a CacheManager and Cache, sets a value, waits a bit, creates another CM+C, then tries to get the value ... only sometimes gets null instead -- from both caches.  (For me the problem seems to appear when the delay is ~2s ... but that's not scientific.)

       

      It is 5.0.0.CR7 (maven in eclipse, java6, Mac) with ClusteredDefault global conf, DIST_SYNC, 1 owner, no L1, all set programmatically.  As simple as I could make it ... too simple somehow I guess.

       

      I've also attached two logs, one from a run where it's okay and one from a run where it's not.  In both cases, when the second cache node tries to join there is a warning that it is being discarded; seven seconds later it rejoins and all is well.  The log message is:

       

        WARNING: almac-44920: not member of view [almac-50720|2] [almac-50720]; discarding it

       

      Maybe this node is accepted very briefly, a rehash happens, and then it is booted out, losing the data.  But why?  And how do I fix it?

       

      Many thanks,

      Alex

        • 1. Re: Beginner Problem — Data Loss Starting Two Nodes in Order
          Alex Heneveld Newbie

          Just noticed the attachments all got zipped automatically.  Here is the relevant code, and a more convenient single zip attached (hopefully not zipzipped...).

           

            public void bug() throws InterruptedException {
                EmbeddedCacheManager cm1 = newCM();
                Cache<String, String> c1 = cm1.getCache("x");
                c1.put("key", "1");
                Thread.sleep(1000);
                // second node joins the cluster here
                EmbeddedCacheManager cm2 = newCM();
                Cache<String, String> c2 = cm2.getCache("x");
                assert c1.get("key") != null : "value at cache 1 was lost";
                cm1.stop();
                cm2.stop();
            }
          
              public EmbeddedCacheManager newCM() {
                  GlobalConfiguration gc = GlobalConfiguration.getClusteredDefault();
                  Configuration c = new Configuration().fluent()
                      .mode(Configuration.CacheMode.DIST_SYNC)
                      .hash().numOwners(1)
                      .clustering().l1().disable()
                      .build();
                  return new DefaultCacheManager(gc, c);
              }
          
          • 2. Re: Beginner Problem — Data Loss Starting Two Nodes in Order
            Sanne Grinovero Master

            Hi Alex,

            it's likely that after starting the second CacheManager it hasn't yet finished joining the cluster and performing state transfer from the other nodes.

            When you call getCache() it won't block waiting for all state to be received, since the expected cluster size is unknown; it is therefore good practice either to wait a couple of seconds before testing its content, or to poll the member list size as we do in the testsuite.

            Have a look into the testsuite source:

            org.infinispan.test.MultipleCacheManagersTest

             

            especially methods createClusteredCaches and waitForClusterToForm.

             

            Also note that we distribute the testing jars too, so that people can reuse these utilities in their own tests.
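
            For reference, the polling Sanne describes boils down to a generic wait-until-condition loop.  Here is a minimal, self-contained sketch; the class and method names are illustrative, not Infinispan API, and the real helpers live in the testsuite classes mentioned above:

            ```java
            // Minimal sketch of the "poll the member list" pattern.
            // Names here are illustrative, not Infinispan API.
            class ClusterWait {

                /** Condition to poll, e.g. "cm.getMembers().size() == expectedSize". */
                interface Condition {
                    boolean isTrue();
                }

                /**
                 * Polls the condition every pollMs until it holds or timeoutMs elapses.
                 * Returns true if the condition became true within the timeout.
                 */
                static boolean waitFor(Condition condition, long timeoutMs, long pollMs)
                        throws InterruptedException {
                    long deadline = System.currentTimeMillis() + timeoutMs;
                    while (System.currentTimeMillis() < deadline) {
                        if (condition.isTrue()) {
                            return true;
                        }
                        Thread.sleep(pollMs);
                    }
                    return condition.isTrue(); // one last check at the deadline
                }
            }
            ```

            In the test above, one would poll a condition like "cm2.getMembers() reports two members" after creating the second CacheManager and before asserting on the cache contents.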

            • 3. Re: Beginner Problem — Data Loss Starting Two Nodes in Order
              Alex Heneveld Newbie

              Thanks Sanne but I don't see how that applies.

               

              In my world the first cache manager (CM1) has no way to know the cluster size or if/when other cache managers are joining.  He needs to be able to call cache.get(...) at any point in time and see consistent data.  (I could cheat in this example since they are in the same JVM, but that's not going to help IRL.)

               

              What is happening is that CM2 comes into existence, then does cm2.getCache("x"), and that seems to put the cache on CM1 into an inconsistent state, even before anyone touches the cache at CM2.

               

              Is there something else CM2 should do as a good citizen _before_ calling getCache() ?  I tried getMembers() and getStatus(), but they seem very boring (null, and initializing) until I have tried a getCache().  (Even cm2.start() followed by a 5s sleep doesn't change the status from initialized to running, or populate the members, which was unexpected.)

               

              Or possibly (given the 7s join-time interruption and other network warnings) something in my network is not compatible with the default Infinispan JGroups config?

               

              Cheers

              Alex

              • 4. Re: Beginner Problem — Data Loss Starting Two Nodes in Order
                Alex Heneveld Newbie

                A healthier environment now: with -Djgroups.bind_addr=127.0.0.1 the delay on joining and the worrisome warnings are gone, but the problem is still here.

                 

                    public void bug() throws InterruptedException {
                        EmbeddedCacheManager cm1 = newCM();
                        Cache<String, String> c1 = cm1.getCache("x");
                        c1.put("key", "value");
                        Thread.sleep(3000);
                        EmbeddedCacheManager cm2 = newCM();
                        System.out.println(c1.get("key"));  // always says "value"
                        Cache<String, String> c2 = cm2.getCache("x");
                        System.out.println(c1.get("key"));  // says null sometimes
                        assert c1.get("key") != null : "value at cache 1 was lost";
                        cm1.stop();
                        cm2.stop();
                    }
                
                    public EmbeddedCacheManager newCM() {
                        GlobalConfiguration gc = GlobalConfiguration.getClusteredDefault();
                        Configuration cfg = new Configuration().fluent()
                            .mode(Configuration.CacheMode.DIST_SYNC)
                            .hash().numOwners(1)
                            .clustering().l1().disable()
                            .build();
                        return new DefaultCacheManager(gc, cfg);
                    }        
                
                

                 

                Log file attached.