6 Replies Latest reply on Nov 9, 2011 9:48 AM by tfromm

    5.1.x questions regarding error handling and behaviour

    tfromm

      Maybe you can help me find the correct answers :-)

       

      1) Node A is started and running. I start Node B. Node A fires two VIEW_CHANGED events. The first event, with viewId=1, contains the member lists I expect: newMembers=[A, B], oldMembers=[A]

      The second event, with viewId=2, contains: newMembers=[A], oldMembers=[A, B]

      This second event appears only the first time the second node is started in the cluster.

      Is this a bug? Should I always ignore views with viewId=2?

       

      2) I need a counter inside the cluster. I tried to implement it with a replicated cache backed by a cache store.

      The code looks like this:

              rc.getAdvancedCache().getTransactionManager().begin();
              ((CacheImpl)rc).lock("cnt");
              Integer i = (Integer)rc.get("cnt");
              if(i == null){
                  i = 1;
              } else {
                  i++;
              }
              rc.put("cnt", i);
              rc.getAdvancedCache().getTransactionManager().commit();

       

      Problem 2a) With a replicated cache without a store, this works very well. With a JDBC loader (in my case Oracle), the transaction succeeds only the first time. The second time, all operations on the cache hang after the lock and run into a replication timeout. All nodes use the same configuration and the JDBC store is shared; it also happens when I have just a single local cache. Different values for useLockStriping have no effect.

      When I switch to a file loader, it works. Are there any known problems with JDBC here?

       

      Problem 2b) With the file loader, I run into DeadlockDetectedExceptions and NotSupportedExceptions when running the code above on different nodes at the same time. Is there a best practice for handling these errors? Or should I just wait a few ms and try again?

        • 1. Re: 5.1.x questions regarding error handling and behaviour
          galder.zamarreno

          Re 1) Maybe NodeB dropped? We'd have to see some logs with TRACE on org.infinispan package to verify.

           

          Re 2a) First of all, why are you casting to CacheImpl? You shouldn't need to do that. AdvancedCache, which you retrieve via cache.getAdvancedCache() has the lock() API you're after.

           

          Maybe you can try running with a single local node and a JDBC cache store, and try to get some thread dumps after modification and again, some TRACE logs... Please also post the config you're using.

           

          Re 2b) Those deadlock exceptions might be due to modification of the same entry from different nodes. This problem is going away in Infinispan 5.1 because we'll only acquire locks on a single node in the cluster: http://community.jboss.org/wiki/SingleNodeLockingModel

           

          I dunno what those NotSupportedExceptions are about. Post the stacktrace or log file.

          • 2. Re: 5.1.x questions regarding error handling and behaviour
            tfromm

            Sorry for not attaching this information :-)

             

            The configuration for all nodes is attached. I use 5.1.0 Beta3

             

            To 1)

            The trace for problem 1 is attached as 1.log; the lines starting with "View changed" are where I write the events received by the listener to System.out.

            As you can see, these events appear very early when I start the second node. No further events are fired.

             

             

            To 2) I changed it to getAdvancedCache().lock(..)

             

            To 2a) I've attached E01.java and E01.xml as the configuration. With this configuration, E01 hangs.

            It works if you remove the transport element and/or use a local loader, e.g. H2 in-memory or file.

            Setting <clustering mode="local"/> for the LOCALCS cache also changes nothing.

             

            Note: I tested it today with Mysql and there it works. It seems to be an oracle related issue.

             

            To 2b) Yes, this entry is meant to be modified from different nodes. Then I'll wait for BETA 4 and hope for the best :-) At first I thought the option eagerLockSingleNode was related to that.

            • 3. Re: 5.1.x questions regarding error handling and behaviour
              galder.zamarreno

              Re 1) What other events are you expecting? A new view is installed with obelix-39463 and then this node leaves (I dunno why, you'd need TRACE logging on org.jgroups to find that out exactly). No other views are installed.

               

              Re 2a) As I said earlier, to figure out what's wrong with Oracle, we need a log with TRACE on org.infinispan *and* thread dumps (i.e. kill -3 <pid>) taken while the system hangs.

               

              Re 2b) The alternative at the moment, till single lock owner is in, is to retry operations/transactions.
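
              A minimal sketch of such a retry loop, in plain Java so that it runs without Infinispan on the classpath. The names TxRetry and callWithRetry, the attempt limit, and the delays are hypothetical and not part of any Infinispan API. It also assumes that every exception is worth retrying; in the counter example the Callable would begin the transaction, lock and update "cnt", commit, and roll back before rethrowing on failure.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper: retries a unit of work with a growing, jittered delay.
final class TxRetry {

    /**
     * Calls work until it succeeds or maxAttempts is reached.
     * Assumption: any Exception other than InterruptedException is retryable.
     */
    static <T> T callWithRetry(Callable<T> work, int maxAttempts, long baseDelayMillis)
            throws Exception {
        if (maxAttempts < 1 || baseDelayMillis < 0) {
            throw new IllegalArgumentException("maxAttempts >= 1 and baseDelayMillis >= 0 required");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return work.call();
            } catch (InterruptedException e) {
                // Do not retry when the thread was asked to stop.
                Thread.currentThread().interrupt();
                throw e;
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    // Linear backoff plus random jitter so that competing
                    // nodes are less likely to collide again at the same moment.
                    long jitter = ThreadLocalRandom.current().nextLong(baseDelayMillis + 1);
                    Thread.sleep(baseDelayMillis * attempt + jitter);
                }
            }
        }
        // All attempts failed: surface the last failure to the caller.
        throw last;
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};
        // Simulated work: fails twice (as a deadlock might), then succeeds.
        int result = callWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("simulated deadlock");
            }
            return 42;
        }, 5, 10);
        System.out.println("result=" + result + " after " + calls[0] + " attempts");
    }
}
```

              The jitter matters when several nodes increment the same key: without it, nodes that failed together tend to retry together.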

              • 4. Re: 5.1.x questions regarding error handling and behaviour
                tfromm

                1) Traces for infinispan and jgroups plus the example source are attached as E03*

                In the trace I can see that the new node joins and leaves the cluster a short time later. I get events for these changes.

                At the time of the event with viewId=2, the cluster size is 1. A few seconds later the cluster size is 2, but I have not received an event for that.

                The startup of the 1st node finishes at

                2011-11-08 11:01:32,585 [DEBUG] org.infinispan.CacheImpl - Started cache DISTCS on obelix-11061

                 

                 

                2a) The thread dump is attached as E01.dmp.zip

                 

                2b) Ok.

                • 5. Re: 5.1.x questions regarding error handling and behaviour
                  galder.zamarreno

                  Re 1) The reason they split is because FD_SOCK cannot open a socket between the two nodes:

                  2011-11-08 11:02:02,191 [DEBUG] org.jgroups.protocols.FD_SOCK - could not create socket to obelix-2176

                   

                  So, it thinks that the other node is down. You can either disable the firewall for these tests, or bind the socket to a particular port that's open on both nodes.
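
                  To check whether a firewall is what blocks the connection, a plain TCP probe can help. The sketch below is only an illustration in standard Java, not part of JGroups; PortProbe and canConnect are hypothetical names, and the host and port to test would be whatever the FD_SOCK socket is bound to on the other node.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical diagnostic helper: tests whether a TCP connection can be opened.
final class PortProbe {

    /** Returns true if a TCP connection to address:port succeeds within the timeout. */
    static boolean canConnect(InetAddress address, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(address, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Refused, filtered by a firewall, or timed out.
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // Demo against a local listener on an ephemeral port; in a real test
        // the address and port of the other cluster node would be used instead.
        try (ServerSocket server = new ServerSocket(0, 50, loopback)) {
            int port = server.getLocalPort();
            System.out.println("listener reachable: " + canConnect(loopback, port, 1000));
        }
    }
}
```

                  Running the probe from each node against the other shows whether the port is blocked in one direction only.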

                   

                  So, what happens afterwards is that they merge:

                  2011-11-08 11:02:09,272 [DEBUG] org.jgroups.protocols.pbcast.GMS - obelix-11061: view is MergeView::[obelix-2176|3] [obelix-2176, obelix-11061], subgroups=[[obelix-11061|1] [obelix-2176], [obelix-11061|2] [obelix-11061]]                                                                                                                                                                                                       

                  To deal with merges, you need to handle @Merged too, so you could do:

                   

                        @Merged
                        @ViewChanged
                        public void handleViewChange(final ViewChangedEvent e) {
                         ....

                   

                  Re 2a) The thread dump seems to indicate that Infinispan is reading an entry from a binary stream connected to the database. How big are the objects you're storing? Could you get several thread dumps, e.g. one every 30 seconds or so?

                  • 6. Re: 5.1.x questions regarding error handling and behaviour
                    tfromm

                    1) Super, @Merged was the missing piece :-D Thx

                     

                    2a) https://issues.jboss.org/browse/ISPN-1514