6 Replies Latest reply on Nov 9, 2011 9:48 AM by tfromm

5.1.x questions regarding error handling and behaviour

tfromm Nov 7, 2011 6:22 AM

Maybe you can help me find correct answers :-)

1) Node A is started and running. I start Node B. Node A creates 2 VIEW_CHANGED events. The first event with viewId=1 contains the correct values for the member lists I expect: newMembers=[A, B], oldMembers=[A]

The second event viewId=2 contains: newMembers=[A], oldMembers=[A, B]

This second event appears only the first time the second node is started in the cluster.

Is this a bug? Should I always ignore views with viewId=2?

2) I need a counter inside the cluster. I tried to implement it with a replicated cache backed by a cache store.

The code looks like this:

	rc.getAdvancedCache().getTransactionManager().begin();
	((CacheImpl) rc).lock("cnt");
	Integer i = (Integer) rc.get("cnt");
	if (i == null) {
		i = 1;
	} else {
		i++;
	}
	rc.put("cnt", i);
	rc.getAdvancedCache().getTransactionManager().commit();

Problem 2a) When I use a replicated cache without a store, it works very well. With a JDBC loader (Oracle in my case), the transaction succeeds only the first time. The second time, all operations on the cache hang after the lock and run into a replication timeout. All nodes use the same configuration and the JDBC store is shared; it also happens with just a single local cache. Different values for useLockStriping have no effect.

When I switch to a file loader, it works. Are there any known problems with JDBC here?

Problem 2b) When I use the file loader and run the code above on different nodes at the same time, I run into DeadlockDetectedExceptions and NotSupportedExceptions. Is there a best practice for handling these errors? Or should I just wait a few ms and try again?

1. Re: 5.1.x questions regarding error handling and behaviour

galder.zamarreno Nov 7, 2011 10:01 AM (in response to tfromm)

Re 1) Maybe NodeB dropped? We'd have to see some logs with TRACE on org.infinispan package to verify.

Re 2a) First of all, why are you casting to CacheImpl? You shouldn't need to do that. AdvancedCache, which you retrieve via cache.getAdvancedCache(), has the lock() API you're after.

Maybe you can try running with a single local node and a JDBC cache store, and try to get some thread dumps after modification and again, some TRACE logs... Please also post the config you're using.

Re 2b) Those deadlock exceptions might be due to modification of the same entry from different nodes. This problem is going away in Infinispan 5.1 because we'll only acquire locks on a single node in the cluster: http://community.jboss.org/wiki/SingleNodeLockingModel

I dunno what those NotSupportedExceptions are about. Post the stacktrace or log file.
2. Re: 5.1.x questions regarding error handling and behaviour

tfromm Nov 8, 2011 1:51 AM (in response to galder.zamarreno)
Sorry for not attaching this information :-)

The configuration for all nodes is attached. I use 5.1.0 Beta3

To 1)
The trace for problem 1 is attached as 1.log; the lines starting with "View changed" are where I write the events I receive in the listener to System.out.
As you can see, these events appear very early when I start the second node. No further events are fired.

To 2) I changed it to getAdvancedCache().lock(..)

To 2a) I've attached E01.java and E01.xml as the configuration. With this configuration, E01 hangs.
It works if you remove the transport element and/or use a local loader, e.g. H2 in-memory or file.
Setting <clustering mode="local"/> for the LOCALCS cache changes nothing either.

Note: I tested it today with MySQL and there it works. It seems to be an Oracle-related issue.

To 2b) Yes, this entry should be modified from different nodes. Then I'll wait for BETA 4 and hope for the best :-) At first I thought the eagerLockSingleNode option was what was meant by that.

1.log.zip 3.2 KB

infinispan.xml 10.7 KB

E01.java.zip 505 bytes

E01.xml 4.0 KB

E01.log.zip 1.5 KB
3. Re: 5.1.x questions regarding error handling and behaviour

galder.zamarreno Nov 8, 2011 3:39 AM (in response to tfromm)

Re 1) What other events are you expecting? A new view is installed with obelix-39463 and then this node leaves (I dunno why; you'd need TRACE logging on org.jgroups to find that out exactly). No other views are set.
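
The join-then-leave reading of the two events can be checked by diffing the member lists from the first post. This is only a sketch in plain Java: the class and method names (ViewDiffSketch, joined, left) are hypothetical, and the lists are plain strings standing in for whatever the listener receives as newMembers and oldMembers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ViewDiffSketch {

    /** Members present in newMembers but not in oldMembers. */
    public static <T> List<T> joined(List<T> oldMembers, List<T> newMembers) {
        List<T> result = new ArrayList<>(newMembers);
        result.removeAll(oldMembers);
        return result;
    }

    /** Members present in oldMembers but not in newMembers. */
    public static <T> List<T> left(List<T> oldMembers, List<T> newMembers) {
        List<T> result = new ArrayList<>(oldMembers);
        result.removeAll(newMembers);
        return result;
    }

    public static void main(String[] args) {
        // Event with viewId=1 from the first post: oldMembers=[A], newMembers=[A, B]
        System.out.println("viewId=1 joined: "
                + joined(Arrays.asList("A"), Arrays.asList("A", "B")));
        // Event with viewId=2 from the first post: oldMembers=[A, B], newMembers=[A]
        System.out.println("viewId=2 left: "
                + left(Arrays.asList("A", "B"), Arrays.asList("A")));
    }
}
```

Read this way, the viewId=2 event is a regular "member left" notification rather than a duplicate that should be ignored.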

Re 2a) As said earlier, to figure out what's wrong with Oracle, we need a log with TRACE on org.infinispan *and* thread dumps (i.e. kill -3 <pid>) when the system hangs.

Re 2b) The alternative at the moment, till single lock owner is in, is to retry operations/transactions.
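
A minimal sketch of such a retry loop, using only the JDK. The names RetrySketch, withRetry, maxAttempts and initialBackoffMillis are hypothetical, and the doubling backoff is an assumption rather than anything Infinispan prescribes. In real use the Callable would wrap the whole begin/lock/get/put/commit sequence and roll the transaction back before rethrowing; it would probably also retry only on the deadlock and timeout exceptions instead of on every Exception.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;

public class RetrySketch {

    /**
     * Runs the attempt up to maxAttempts times, sleeping between failures
     * and doubling the sleep each time. Rethrows the last failure.
     */
    public static <T> T withRetry(Callable<T> attempt, int maxAttempts,
                                  long initialBackoffMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long backoff = initialBackoffMillis;
        Exception last = null;
        for (int i = 1; i <= maxAttempts; i++) {
            try {
                return attempt.call();
            } catch (InterruptedException e) {
                // Do not retry when the thread was asked to stop.
                Thread.currentThread().interrupt();
                throw e;
            } catch (Exception e) {
                last = e;
                if (i < maxAttempts) {
                    Thread.sleep(backoff);
                    backoff *= 2;
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the counter transaction: fails twice, then succeeds.
        AtomicInteger calls = new AtomicInteger();
        int value = withRetry(() -> {
            if (calls.incrementAndGet() < 3) {
                throw new IllegalStateException("simulated deadlock");
            }
            return 1;
        }, 5, 10L);
        System.out.println("value=" + value + " after " + calls.get() + " attempts");
    }
}
```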
4. Re: 5.1.x questions regarding error handling and behaviour

tfromm Nov 8, 2011 5:11 AM (in response to galder.zamarreno)
1) Traces for infinispan + jgroups and the example source are attached as E03*.

In the trace I see that the new node joins and then leaves the cluster shortly afterwards. I get events for both.
At the time of the event with viewId=2 the cluster size is 1. A few seconds later the cluster size is 2, but I haven't received an event for that.
The startup of the 1st node finishes at
2011-11-08 11:01:32,585 [DEBUG] org.infinispan.CacheImpl - Started cache DISTCS on obelix-11061

2a) Thread dump attached: E01.dmp.zip

2b) Ok.

E01.dmp.zip 3.1 KB

E03.xml 1.4 KB

E03.java.zip 812 bytes

E03.log.zip 7.9 KB
5. Re: 5.1.x questions regarding error handling and behaviour

galder.zamarreno Nov 9, 2011 5:31 AM (in response to tfromm)

Re 1) The reason they split is that FD_SOCK cannot open a socket between the two nodes:
2011-11-08 11:02:02,191 [DEBUG] org.jgroups.protocols.FD_SOCK - could not create socket to obelix-2176

So, it thinks that the other node is down. You can either disable the firewall for these tests, or tie the socket to a particular port that's open on both nodes.

So, what happens afterwards is that they merge:
2011-11-08 11:02:09,272 [DEBUG] org.jgroups.protocols.pbcast.GMS - obelix-11061: view is MergeView::[obelix-2176|3] [obelix-2176, obelix-11061], subgroups=[[obelix-11061|1] [obelix-2176], [obelix-11061|2] [obelix-11061]]
To deal with merges, you need to handle @Merged too, so you could do:

      @Merged
      @ViewChanged
      public void handleViewChange(final ViewChangedEvent e) {
       ....

Re 2a) The thread dump seems to indicate that Infinispan is reading an entry from a binary stream connecting to the database. How big are the objects you're storing? Could you get several thread dumps? I.e. every 30 seconds or so.
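
If running kill -3 by hand every 30 seconds is awkward, the dumps can also be taken from inside the JVM with the standard java.lang.management API. This is a sketch under assumptions: the class name ThreadDumpSketch, the count of three dumps and the output format are all hypothetical, and the code would have to run in the same JVM as the hanging test (for example from a background thread started before the transaction).

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadDumpSketch {

    /** Returns the name, state and full stack trace of every live thread. */
    public static String dump() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        StringBuilder sb = new StringBuilder();
        // false, false: skip locked monitor/synchronizer details.
        for (ThreadInfo info : bean.dumpAllThreads(false, false)) {
            sb.append('"').append(info.getThreadName()).append("\" ")
              .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                sb.append("    at ").append(frame).append('\n');
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) throws InterruptedException {
        int dumps = 3;                 // assumed number of dumps
        long intervalMillis = 30_000L; // "every 30 seconds or so"
        for (int i = 1; i <= dumps; i++) {
            System.out.println("=== thread dump " + i + " ===");
            System.out.println(dump());
            if (i < dumps) {
                Thread.sleep(intervalMillis);
            }
        }
    }
}
```

Comparing consecutive dumps shows whether the thread reading from the database stream stays in the same frame or is slowly making progress.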
6. Re: 5.1.x questions regarding error handling and behaviour

tfromm Nov 9, 2011 9:48 AM (in response to galder.zamarreno)

1) Super, @Merged was the missing piece :-D Thx

2a) https://issues.jboss.org/browse/ISPN-1514