some q's re: impact(s) of ISPN/JGRPS config ...| JBoss.org Content Archive (Read Only)

15. Re: some q's re: impact(s) of ISPN/JGRPS config ...

nileshbhagat Feb 11, 2014 6:11 PM (in response to mircea.markus)

Hi Mircea/Bela,

Please respond to the below questions at your earliest:

In fact, this was an error on our part as the event listener was registered on Master/Reducer nodes only and not on Mapper nodes. Modifying the implementation allowed us to register the listener for Mapper nodes as well & we are able to receive the ViewChanged events for mappers; upon which the cache is restarted for that node. But, we are observing a discrepancy as the Mapper nodes contain the full set of nodes while the Master node excludes the suspected mapper nodes. Please refer to the debug statements below & provide any resolution to this issue for us to resolve this quickly. We are close to the solution except for this anomaly as our Master node doesn't accept any further requests if any of the mappers are down (which is a fall-out of the suspected nodes being evicted form cluster even after their caches were successfully restarted)?

After ascertaining that the ViewChanged event is triggered for Mapper nodes as well; the behavior of Infinispan/JGroups for the cluster has been very unpredictable. For instance; all the nodes received view changed event when cluster is being started; but only 1 of the mappers received this event (which restarted the cache for that node) once SuspectException was encountered while other mappers did not receive view change at all. The root cause of SuspectException would be java.lang.OutOfMemoryError for mapper nodes.

Also, the Master node evicts all mapper nodes with it's own view of the cluster & doesn't include any re-started mapper nodes.

Essentially; SuspectException leaves the cluster dis-jointed with the view changed event not triggered for any or all mapper nodes and in addition the master has it's own cluster view. Resubmitting the request fails with the following error regardless of whether the cache on that node was re-started or not.

Please advise what's the best way to approach this as there is erratic behavior observed from ISPN/JGroups & we need a solution of suspected node(s) to join back to the same cluster view that it started with.

Node 201 cache not re-started (as view changed event not triggered upon SuspectException) -->

AggregationEngineException [excCode=AggEngDistrExec[Error Id=1002, Message=Can't process Aggregation Engine Request ], execId=0, nodeId=null, request=null, getMessage()=CommandException [commandDef=CommandGenericDef [supplyContextKey=0, snapshotContextKey=147584717001, resultContextKey=0, refContextKey=0, properties=null, calcRefObjects={RULES_CONTEXT=140579685001, SMGR_ASSUMPTIONS_STRESS=142455226001}, name=PrepareData], code=CommandPrepateData, code value=2016, getMessage()=java.lang.IllegalArgumentException: Target node 201-55337(201) is not a cluster member, members are [105-59547(105), 111-24339(111), 101-38424(101), 107-27577(107)]]

java.lang.IllegalArgumentException: Target node 201-55337(201) is not a cluster member, members are [105-59547(105), 111-24339(111), 101-38424(101), 107-27577(107)]

Target node 201-55337(201) is not a cluster member, members are [105-59547(105), 111-24339(111), 101-38424(101), 107-27577(107)]

]

Node 201 cache re-started but now contains a different node suffix -->

AggregationEngineException [excCode=AggEngDistrExec[Error Id=1002, Message=Can't process Aggregation Engine Request

], execId=0, nodeId=null, request=null, getMessage()=CommandException [commandDef=CommandGenericDef

[supplyContextKey=0, snapshotContextKey=147584717001, resultContextKey=0, refContextKey=0, properties=null,

calcRefObjects={RULES_CONTEXT=140579685001, SMGR_ASSUMPTIONS_STRESS=142455226001}, name=PrepareData],

code=CommandPrepateData, code value=2016, getMessage()=java.lang.IllegalArgumentException: Target node 201-7163

(201) is not a cluster member, members are [105-52520(105), 111-56485(111), 101-37987(101), 201-46672(201)]]
java.lang.IllegalArgumentException: Target node 201-7163(201) is not a cluster member, members are [105-52520(105),

111-56485(111), 101-37987(101), 201-46672(201)]
Target node 201-7163(201) is not a cluster member, members are [105-52520(105), 111-56485(111), 101-37987(101),

201-46672(201)]
]

Master node does not contain suspected nodes in it;'s view although the cache would have been re-started -->

1642 DEBUG [11:13:11,152] [Incoming-10,AggregationEngine,101-32714(101)][Node] - viewChanged event invoked
1643 DEBUG [11:13:11,152] [Incoming-10,AggregationEngine,101-32714(101)][Node] - viewChanged event old members

size : 4
1644 DEBUG [11:13:11,152] [Incoming-10,AggregationEngine,101-32714(101)][Node] - viewChanged event old members

list : [10 1, 105, 107, 111]
1645 DEBUG [11:13:11,152] [Incoming-10,AggregationEngine,101-32714(101)][Node] - viewChanged event new members

size : 3
1646 DEBUG [11:13:11,152] [Incoming-10,AggregationEngine,101-32714(101)][Node] - viewChanged event new members

list : [10        1, 105, 111]
   1647 DEBUG [11:13:11,152] [Incoming-10,AggregationEngine,101-32714(101)][Node] - downNodesSet size : 4
   1648 DEBUG [11:13:11,152] [Incoming-10,AggregationEngine,101-32714(101)][Node] - downNodesSet list : [202, 203,

107, 201] 1649 DEBUG [11:13:11,152] [Incoming-10,AggregationEngine,101-32714(101)][Node] - viewChanged event

completed

OOM error received by JGroups code & not by our code. Is this an error?

DEBUG [14:06:47,265] [Timer-3,AggregationEngine,201-14315(201)][FD_ALL] - haven't received a heartbeat from 101-

4785 9(101) for 64514 ms, adding it to suspect list
5459 DEBUG [14:07:40,130] [Timer-3,AggregationEngine,201-14315(201)][FD_ALL] - haven't received a heartbeat from

111-3391 1(111) for 64515 ms, adding it to suspect list
5460 DEBUG [14:07:48,426] [Timer-3,AggregationEngine,201-14315(201)][FD_ALL] - suspecting [101-47859(101), 107-

28395(107) , 111-33911(111), 105-31363(105)]
5461 ERROR [14:11:46,986] [Timer-4,AggregationEngine,201-14315(201)][TimeScheduler3] - failed executing task

NAKACK2: Ret        ransmitTask (interval=1000 ms)
   5462 java.lang.OutOfMemoryError: Java heap space
   5463 DEBUG [14:12:17,880] [Timer-2,AggregationEngine,201-14315(201)][FD_ALL] - haven't received a heartbeat from

107-2839 5(107) for 434946 ms, adding it to suspect list
5464 DEBUG [14:14:30,246] [Timer-3,AggregationEngine,201-14315(201)][FD_ALL] - haven't received a heartbeat from

105-3136 3(105) for 140651 ms, adding it to suspect list
5465 ERROR [14:14:36,792] [multicast receiver,AggregationEngine,201-14315(201)][UDP] - 201-14315(201): failed

handling in        coming message
   5466 java.lang.OutOfMemoryError: Java heap space
   5467 DEBUG [14:15:13,885] [VERIFY_SUSPECT.TimerThread,AggregationEngine,201-14315(201)][RequestCorrelator] -

suspect=105- 31363(105)
5468 DEBUG [14:14:36,792] [Timer-3,AggregationEngine,201-14315(201)][FD_ALL] - haven't received a heartbeat from

107-2839 5(107) for 567312 ms, adding it to suspect list
5469 ERROR [14:15:28,714] [multicast receiver,AggregationEngine,201-14315(201)][UDP] - 201-14315(201): failed

handling in coming message
5470 java.lang.OutOfMemoryError: Java heap space

2. ASYNC timeout stacktrace

DEBUG [11:57:12,125] [Incoming-7,AggregationEngine,101-48199(101)][LocalTopologyManagerImpl] - Error sending the

leave request for cache ___defaultcache to coordinator
4175 org.infinispan.util.concurrent.TimeoutException: Node 105-32211(105) timed out
4176 at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommand

(CommandAwareRpcDi spatcher.java:196)
4177 at org.infinispan.remoting.transport.jgroups.JGroupsTransport.invokeRemotely

(JGroupsTransport.java:531)
4178 at org.infinispan.topology.LocalTopologyManagerImpl.executeOnCoordinator

(LocalTopologyManagerImpl.java:296)
   4179         at org.infinispan.topology.LocalTopologyManagerImpl.leave(LocalTopologyManagerImpl.java:135)
   4180         at org.infinispan.statetransfer.StateTransferManagerImpl.stop(StateTransferManagerImpl.java:234)
   4181         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   4182         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   4183         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   4184         at java.lang.reflect.Method.invoke(Method.java:606)
   4185         at org.infinispan.util.ReflectionUtil.invokeAccessibly(ReflectionUtil.java:205)
   4186         at org.infinispan.factories.AbstractComponentRegistry$PrioritizedMethod.invoke

(AbstractComponentRegistry.jav a:889)

3. Exact APIs to retrieve the nodeName attribute -->

GlobalConfigurationBuilder.defaultClusteredBuilder().build().transport().nodeName();

cacheManager.getCacheManagerConfiguration().transport().nodeName();

16. Re: some q's re: impact(s) of ISPN/JGRPS config ...

nileshbhagat Feb 12, 2014 6:00 PM (in response to belaban)

Hi Bela / Mircea,

Please find below further evidence on Suspect Exception issue which is still open:

Even after modifying the implementation to register the listener for Mapper nodes as well & we are able to receive the ViewChanged events for mappers; the mapper nodes receive view changed event only when the cluster is being started (and NOT once they get suspected).

3874 DEBUG [16:06:58,068] [main][Node] - viewChanged event invoked

3875 DEBUG [16:06:58,068] [main][Node] - viewChanged event old members size : 0

3876 DEBUG [16:06:58,068] [main][Node] - viewChanged event old members list : []

3877 DEBUG [16:06:58,068] [main][Node] - viewChanged event new members size : 5

3878 DEBUG [16:06:58,068] [main][Node] - viewChanged event new members list : [101, 105, 107, 111, 201]

3879 DEBUG [16:06:58,068] [main][Node] - viewChanged event completed

The Master node receives SuspectException for 1 of the mapper nodes although others may have exhausted heap space as well.

WARN [15:46:19,272] [pool-2-thread-1][CommandExec] - callOnMaster ExecutionException, throws
1563 java.util.concurrent.ExecutionException: org.infinispan.remoting.transport.jgroups.SuspectException: Node

107-3348(107) was suspected
   1564         at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:262)
   1565         at java.util.concurrent.FutureTask.get(FutureTask.java:119)
   1566         at org.infinispan.distexec.DefaultExecutorService$DistributedTaskPart.innerGet

(DefaultExecutorService.java:9 39)
1567 at org.infinispan.distexec.DefaultExecutorService$DistributedTaskPart.get

After the mapper nodes exhibit java.lang.OutOfMemoryError; they do NOT receive any view changed event from the cluster. EmbeddedCacheManager.stop() & EmbeddedCacheManager.start() are invoked by catching the exception thrown which does work as per below but none of the suspected nodes receive any view changed event from the cluster.

1264 WARN [16:28:46,407] [remote-thread-0,201][Node] - Begin restarting the cache on Node: 201; as it will be evicted from the cluster once suspected.

1265 DEBUG [16:28:46,407] [remote-thread-0,201][DefaultCacheManager] - Stopping cache manager AggregationEngine on 201-42 312(201)

1266 DEBUG [16:28:46,408] [remote-thread-0,201][CacheImpl] - Stopping cache ___defaultcache on 201-42312(201)

1267 DEBUG [16:28:46,410] [remote-thread-0,201][TransactionTable] - Wait for on-going transactions to finish for 30 secon ds.

1268 DEBUG [16:28:46,411] [remote-thread-0,201][TransactionTable] - All transactions terminated

1269 DEBUG [16:28:46,411] [remote-thread-0,201][LocalTopologyManagerImpl] - Node 201-42312(201) leaving cache ___defaultc ache

...

1282 DEBUG [16:28:48,748] [remote-thread-0,201][DefaultCacheManager] - Started cache manager AggregationEngine on 201

1283 DEBUG [16:28:48,748] [transport-thread-19,201][AbstractJBossMarshaller] - Using JBoss Marshalling

1284 DEBUG [16:28:48,749] [transport-thread-19,201][LocalTopologyManagerImpl] - Ignoring rebalance request 7 for cache __ _defaultcache, the local cache manager is not running

1285 DEBUG [16:28:48,749] [transport-thread-17,201][LocalTopologyManagerImpl] - Ignoring consistent hash update 6 for cac he ___defaultcache, the local cache manager is not running

1286 WARN [16:28:48,749] [remote-thread-0,201][Node] - Completed restarting the cache on Node: 201; as it will be evicte d from the cluster once suspected.

Sometimes, the mappers do receive merge event but not always though ...

DEBUG [15:59:55,306] [Incoming-7,AggregationEngine,107-3348(107)][JGroupsTransport] - New view accepted:

MergeView:: [111-53500(111)|4] [111-53500(111), 105-46262(105), 101-9807(101), 107-3348(107)], subgroups=

[201-42312(201)|2] [107 -3348(107)], [111-53500(111)|3] [111-53500(111), 105-46262(105), 101-9807(101)]
2043 INFO [15:59:55,306] [Incoming-7,AggregationEngine,107-3348(107)][JGroupsTransport] - ISPN000093: Received

new, MERG ED cluster view: MergeView::[111-53500(111)|4] [111-53500(111), 105-46262(105), 101-9807(101),

107-3348(107)], subgr oups=[201-42312(201)|2] [107-3348(107)], [111-53500(111)|3] [111-53500(111), 105-

46262(105), 101-9807(101)]

DEBUG [15:59:55,308] [Incoming-7,AggregationEngine,107-3348(107)][Node] - Merge event invoked
2079 DEBUG [15:59:55,308] [Incoming-7,AggregationEngine,107-3348(107)][Node] - mergedView event merged members

size : 4
2080 DEBUG [15:59:55,308] [Incoming-7,AggregationEngine,107-3348(107)][Node] - mergedView event merged members

list : [101, 105, 107, 111]

Unfortunately, the Master node does NOT reflect any suspected nodes even after their cache have been re-started (with our down nodes list reflecting the member list returned by view changed event) . It only reflects itself & reducer nodes as per below ...

1555 DEBUG [15:46:19,271] [Incoming-6,AggregationEngine,101-9807(101)][Node] - viewChanged event invoked
1556 DEBUG [15:46:19,271] [Incoming-6,AggregationEngine,101-9807(101)][Node] - viewChanged event old members

size : 5
1557 DEBUG [15:46:19,271] [Incoming-6,AggregationEngine,101-9807(101)][Node] - viewChanged event old members

list : [101, 105, 107, 111, 201]
1558 DEBUG [15:46:19,271] [Incoming-6,AggregationEngine,101-9807(101)][Node] - viewChanged event new members

size : 3
1559 DEBUG [15:46:19,271] [Incoming-6,AggregationEngine,101-9807(101)][Node] - viewChanged event new members

list : [101, 105, 111]
1560 WARN [15:46:19,271] [Incoming-6,AggregationEngine,101-9807(101)][RequestListenerServiceImpl] - Cluster

down. down Nodes [201, 107]
1561 WARN [15:46:19,271] [Incoming-6,AggregationEngine,101-9807(101)][SnapshotStatusMonitorService] - Cluster

down. down Nodes [201, 107]

Even tried to invoke DefaultCacheManager.startCache() (after re-starting the cache) but that throws exception as well ...

ERROR [11:32:18,071] [remote-thread-0,107][InboundInvocationHandlerImpl] - Exception executing command
1352 java.lang.IllegalStateException: Cache container has been stopped and cannot be reused. Recreate the cache

container        .
   1353         at org.infinispan.manager.DefaultCacheManager.assertIsNotTerminated(DefaultCacheManager.java:935)
   1354         at org.infinispan.manager.DefaultCacheManager.getCache(DefaultCacheManager.java:535)
   1355         at org.infinispan.manager.DefaultCacheManager.getCache(DefaultCacheManager.java:517)
   1356         at org.infinispan.manager.DefaultCacheManager.startCache(DefaultCacheManager.java:884)
   1357         at com.jpmorgan.lri.cs.ae.distexec.Node.restartCache(Node.java:244)
   1358         at com.jpmorgan.lri.cs.ae.distexec.exec.command.CommandExec.call(CommandExec.java:268)
   1359         at com.jpmorgan.lri.cs.ae.distexec.exec.command.CommandExec.call(CommandExec.java:28)
   1360         at org.infinispan.commands.read.DistributedExecuteCommand.perform

(DistributedExecuteCommand.java:121)
1361 at org.infinispan.remoting.InboundInvocationHandlerImpl.handleInternal

(InboundInvocationHandlerImpl.java:122 )
1362 at org.infinispan.remoting.InboundInvocationHandlerImpl.access$000

(InboundInvocationHandlerImpl.java:68)
1363 at org.infinispan.remoting.InboundInvocationHandlerImpl$2.run

(InboundInvocationHandlerImpl.java:194)
   1364         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   1365         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   1366         at java.lang.Thread.run(Thread.java:724)

Some mapper nodes do contain following error once suspected ...

DEBUG [15:59:55,306] [remote-thread-0,107][LocalTopologyManagerImpl] - Error sending the leave request for cache

___        defaultcache to coordinator
   2045 org.infinispan.remoting.transport.jgroups.SuspectException: Node 201-42312(201) was suspected
   2046         at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommand

(CommandAwareRpcDi spatcher.java:194)
2047 at org.infinispan.remoting.transport.jgroups.JGroupsTransport.invokeRemotely

(JGroupsTransport.java:531)
2048 at org.infinispan.topology.LocalTopologyManagerImpl.executeOnCoordinator

(LocalTopologyManagerImpl.java:296)
   2049         at org.infinispan.topology.LocalTopologyManagerImpl.leave(LocalTopologyManagerImpl.java:135)
   2050         at org.infinispan.statetransfer.StateTransferManagerImpl.stop(StateTransferManagerImpl.java:234)
   2051         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   2052         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   2053         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   2054         at java.lang.reflect.Method.invoke(Method.java:606)
   2055         at org.infinispan.util.ReflectionUtil.invokeAccessibly(ReflectionUtil.java:205)
   2056         at org.infinispan.factories.AbstractComponentRegistry$PrioritizedMethod.invoke

(AbstractComponentRegistry.jav a:889)
2057 at org.infinispan.factories.AbstractComponentRegistry.internalStop

(AbstractComponentRegistry.java:694)
   2058         at org.infinispan.factories.AbstractComponentRegistry.stop(AbstractComponentRegistry.java:572)
   2059         at org.infinispan.factories.ComponentRegistry.stop(ComponentRegistry.java:246)
   2060         at org.infinispan.CacheImpl.stop(CacheImpl.java:713)
   2061         at org.infinispan.CacheImpl.stop(CacheImpl.java:708)
   2062         at org.infinispan.manager.DefaultCacheManager.stop(DefaultCacheManager.java:738)
   2063         at com.jpmorgan.lri.cs.ae.distexec.Node.restartCache(Node.java:240)
   2064         at com.jpmorgan.lri.cs.ae.distexec.exec.command.CommandExec.call(CommandExec.java:268)
   2065         at com.jpmorgan.lri.cs.ae.distexec.exec.command.CommandExec.call(CommandExec.java:28)
   2066         at org.infinispan.commands.read.DistributedExecuteCommand.perform

(DistributedExecuteCommand.java:121)
2067 at org.infinispan.remoting.InboundInvocationHandlerImpl.handleInternal

(InboundInvocationHandlerImpl.java:122 )
2068 at org.infinispan.remoting.InboundInvocationHandlerImpl.access$000

(InboundInvocationHandlerImpl.java:68)
2069 at org.infinispan.remoting.InboundInvocationHandlerImpl$2.run

(InboundInvocationHandlerImpl.java:194)
   2070         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   2071         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   2072         at java.lang.Thread.run(Thread.java:724)
   2073 Caused by: SuspectedException
   2074         at org.jgroups.blocks.MessageDispatcher.sendMessage(MessageDispatcher.java:406)
   2075         at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.processSingleCall

(CommandAwareRpcDisp atcher.java:375)
2076 at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommand

(CommandAwareRpcDi spatcher.java:189)

Resubmitting the request does not work as master node has already evicted the suspected nodes & hence throws following error for any of the suspected node picked-up by Infinispan:

AggregationEngineException [excCode=AggEngDistrExec[Error Id=1002, Message=Can't process Aggregation Engine Request

], execId=0, nodeId=null, request=null, getMessage()=CommandException [commandDef=CommandGenericDef

[supplyContextKey=0, snapshotContextKey=147584717001, resultContextKey=0, refContextKey=0, properties=null,

calcRefObjects={RULES_CONTEXT=140579685001, SMGR_ASSUMPTIONS_STRESS=142455226001}, name=PrepareData],

code=CommandPrepateData, code value=2016, getMessage()=java.lang.IllegalArgumentException: Target node 201-55381

(201) is not a cluster member, members are [111-51035(111), 105-19421(105), 101-62405(101)]]
java.lang.IllegalArgumentException: Target node 201-55381(201) is not a cluster member, members are [111-51035

(111), 105-19421(105), 101-62405(101)]
Target node 201-55381(201) is not a cluster member, members are [111-51035(111), 105-19421(105), 101-62405(101)]
]
at com.jpmorgan.lri.cs.ae.distexec.util.Util.createAggException(Util.java:108)
at com.jpmorgan.lri.cs.ae.listener.RequestListenerServiceImpl.execute(RequestListenerServiceImpl.java:176)
at com.jpmorgan.lri.cs.ae.listener.RequestListenerServiceImpl.handle(RequestListenerServiceImpl.java:306)

In summary; we have implemented & tested the cache restart option a few ways but that does not work at all; as the cluster remains dis-jointed forever & does NOT allow the mapper nodes to join back to the original cluster (which contains only the Master/Reducer nodes once any of the mappers have been suspected).

Please provide any other alternative solutions as we have exhausted (to the best of our knowledge) all solutions to gracefully restart the suspected nodes but in vain. If there are any oversights or other recommendations; do provide at your earliest.

Thanks.

17. Re: some q's re: impact(s) of ISPN/JGRPS config ...

nileshbhagat Feb 13, 2014 3:09 PM (in response to belaban)

Hi Bela/Mircea,

We are still awaiting your response to this stalemate we have been facing with the SuspectException issue as per the above findings.

Retried the scenarios few times today with same results as Suspected Nodes DO NOT join back to the parent cluster even after the cache has been successfully re-started; with the fact that the Master node only includes itself & Reducers in the cluster with all the Mapper nodes evicted from the cluster & reported as down nodes in the view changed event.

Let us know if there are any other remediation measures w.r.t. ISPN APIs that will help the suspected nodes join back to the cluster programmatically.

In the interim; I have modified the timeout for FD & SUSPECT protocols (see below) to be a higher value which does allow our long-running requests to get completed obviating the suspect exception scenario. Given that SuspectException is deemed to be benign & planned to be removed in ISPN 7.x; would this approach of increased timeout suffice?

<FD_ALL timeout="600000" interval="30000" />

<VERIFY_SUSPECT timeout="300000" num_msgs="5" />

<FD_SOCK />

Please reply at your earliest to the detailed findings/stacktraces observed on Feb 11th & Feb 12th posting as well as the conclusion drawn herein. Thanks.

18. Re: some q's re: impact(s) of ISPN/JGRPS config ...

mircea.markus Feb 13, 2014 7:39 PM (in response to nileshbhagat)

You to mention OutOfMemoryExceptions. I would strongly suggest investigating the cause of this and fixing it, before doing anything else, as the JVM would behave in an undeterministic manner once this errors are reported, and everything we discuss on this forum might not work. This thread is a bit convoluted, so I'll try to sum it up just to make sure we're on the same page: 1. cluster is formed of caches {A, B, C, D} 2. node D is suspected, it receives a viewChanged event 3. in this view change event you stop the cache and restart it 4. this restart process doesn't trigger another viewChanged event, as you'd expect Is this the case? If so, would you try an CacheManager.stop and then CacheManager.getCache(<>) and see how that behaves?

19. Re: some q's re: impact(s) of ISPN/JGRPS config ...

belaban Feb 14, 2014 2:29 AM (in response to nileshbhagat)

No, this config isn't good. The timeout in FD_ALL means that you'd have to wait for up to 10 minutes for a suspicion to happen if a node hangs (if it crashes, FD_SOCK will detect this immediately). Plus, VERIFY_SUSPECT will take another 5 minutes to verify this. Absolutely BAD ! I suggest set the timeout in FD_ALL to 60s and in VERIFY_SUSPECT to 5s.

20. Re: some q's re: impact(s) of ISPN/JGRPS config ...

cotton.ben Feb 14, 2014 8:31 AM (in response to belaban)

> Mircea wrote:

> [...] If so, would you try an CacheManager.stop and then CacheManager.getCache(<>) and see how that behaves?

Thanks ... working on this now.

> Bela wrote:

> No, this config isn't good. [...]

We are in a lose/lose at the moment. A good config results in our M-R request(s) not being able to complete (LOSE due to SUSPECT/timeout). A bad config (huge timeout) enables our M-R request(s) to complete (but we LOSE because our bad config has other consequences)

Thanks for sticking with us in this struggle. This is a good fight. Together, we are confident remedy will prevail.

21. Re: some q's re: impact(s) of ISPN/JGRPS config ...

nileshbhagat Feb 14, 2014 5:46 PM (in response to mircea.markus)

Hi Mircea,

To correct your understanding ...

1. cluster is formed of caches {A, B, C, D} with ALL nodes receiving viewChanged event

2. node D is suspected, it does NOT receive any viewChanged event & hence cannot restart the cache

3. In the catch block of OOM exception; we stop the cache and restart it

4. this restart process doesn't trigger another viewChanged event to any of the suspected, as you'd expect Is this the case? YES.

5. Master node evicts the suspected nodes permanently from the cluster & reports only {A, B, C} in this scenario (as per below)...

24304 DEBUG [11:40:53,465] [Incoming-10,AggregationEngine,101-29628(101)][Node] - viewChanged event invoked

24305 DEBUG [11:40:53,465] [Incoming-10,AggregationEngine,101-29628(101)][Node] - viewChanged event old members size : 4

24306 DEBUG [11:40:53,465] [Incoming-10,AggregationEngine,101-29628(101)][Node] - viewChanged event old members list : [10 1, 105, 111, 201]

24307 DEBUG [11:40:53,465] [Incoming-10,AggregationEngine,101-29628(101)][Node] - viewChanged event new members size : 3

24308 DEBUG [11:40:53,465] [Incoming-10,AggregationEngine,101-29628(101)][Node] - viewChanged event new members list : [10 1, 105, 111]

24309 DEBUG [11:40:53,465] [Incoming-10,AggregationEngine,101-29628(101)][Node] - downNodesSet size : 2

24310 DEBUG [11:40:53,465] [Incoming-10,AggregationEngine,101-29628(101)][Node] - downNodesSet list : [201, 107]

24311 DEBUG [11:40:53,465] [Incoming-10,AggregationEngine,101-29628(101)][Node] - viewChanged event completed

6. Resubmitting the request to the Master node results in "Target node not being part of cluster" exception (refer to the stack-traces in above post & below as well)...

AggregationEngineException [excCode=AggEngDistrExec[Error Id=1002, Message=Can't process Aggregation Engine Request ], execId=0, nodeId=null, request=null,

getMessage()=CommandException [commandDef=CommandGenericDef [supplyContextKey=0, snapshotContextKey=147584717001, resultContextKey=0, refContextKey=0,

properties=null, calcRefObjects={RULES_CONTEXT=140579685001, SMGR_ASSUMPTIONS_STRESS=142455226001}, name=PrepareData], code=CommandPrepateData, code

value=2016, getMessage()=java.lang.IllegalArgumentException: Target node 201-2234(201) is not a cluster member, members are [111-31021(111), 105-1425(105),

101-29628(101)]]
java.lang.IllegalArgumentException: Target node 201-2234(201) is not a cluster member, members are [111-31021(111), 105-1425(105), 101-29628(101)]
Target node 201-2234(201) is not a cluster member, members are [111-31021(111), 105-1425(105), 101-29628(101)]
]
at com.jpmorgan.lri.cs.ae.distexec.util.Util.createAggException(Util.java:108)
at com.jpmorgan.lri.cs.ae.listener.RequestListenerServiceImpl.execute(RequestListenerServiceImpl.java:176)
at com.jpmorgan.lri.cs.ae.listener.RequestListenerServiceImpl.handle(RequestListenerServiceImpl.java:306)

______________________________________________________________________________

If so, would you try an CacheManager.stop and then CacheManager.getCache(<>) and see how that behaves?

Yes, I did implement CacheManager.stop and then CacheManager.getCache() (without the stop() API) but it results in following exception. Tried this scenario a few times but resulted in the same exception thrown.

2104 ERROR [11:38:59,286] [remote-thread-0,201][InboundInvocationHandlerImpl] - Exception executing command

2105 java.lang.IllegalStateException: Cache container has been stopped and cannot be reused. Recreate the cache container .

2106 at org.infinispan.manager.DefaultCacheManager.assertIsNotTerminated(DefaultCacheManager.java:935)

2107 at org.infinispan.manager.DefaultCacheManager.getCache(DefaultCacheManager.java:535)

2108 at org.infinispan.manager.DefaultCacheManager.getCache(DefaultCacheManager.java:517)

2109 at com.jpmorgan.lri.cs.ae.distexec.Node.restartCache(Node.java:244)

2110 at com.jpmorgan.lri.cs.ae.distexec.exec.command.CommandExec.call(CommandExec.java:268)

2111 at com.jpmorgan.lri.cs.ae.distexec.exec.command.CommandExec.call(CommandExec.java:28)

2112 at org.infinispan.commands.read.DistributedExecuteCommand.perform(DistributedExecuteCommand.java:121)

2113 at org.infinispan.remoting.InboundInvocationHandlerImpl.handleInternal(InboundInvocationHandlerImpl.java:122 )

2114 at org.infinispan.remoting.InboundInvocationHandlerImpl.access$000(InboundInvocationHandlerImpl.java:68)

2115 at org.infinispan.remoting.InboundInvocationHandlerImpl$2.run(InboundInvocationHandlerImpl.java:194)

2116 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

2117 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

2118 at java.lang.Thread.run(Thread.java:724)

Then, tried invoking CacheManager.stop, CacheManager.start and then CacheManager.getCache but that also results in same exception ...

ERROR [14:56:20,204] [remote-thread-1,201][InboundInvocationHandlerImpl] - Exception executing command

4928 java.lang.IllegalStateException: Cache container has been stopped and cannot be reused. Recreate the cache container .

4929 at org.infinispan.manager.DefaultCacheManager.assertIsNotTerminated(DefaultCacheManager.java:935)

4930 at org.infinispan.manager.DefaultCacheManager.getCache(DefaultCacheManager.java:535)

4931 at org.infinispan.manager.DefaultCacheManager.getCache(DefaultCacheManager.java:517)

4932 at com.jpmorgan.lri.cs.ae.distexec.Node.restartCache(Node.java:244)

4933 at com.jpmorgan.lri.cs.ae.distexec.exec.command.CommandExec.call(CommandExec.java:268)

4934 at com.jpmorgan.lri.cs.ae.distexec.exec.command.CommandExec.call(CommandExec.java:28)

4935 at org.infinispan.commands.read.DistributedExecuteCommand.perform(DistributedExecuteCommand.java:121)

4936 at org.infinispan.remoting.InboundInvocationHandlerImpl.handleInternal(InboundInvocationHandlerImpl.java:122 )

4937 at org.infinispan.remoting.InboundInvocationHandlerImpl.access$000(InboundInvocationHandlerImpl.java:68)

4938 at org.infinispan.remoting.InboundInvocationHandlerImpl$2.run(InboundInvocationHandlerImpl.java:194)

4939 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

4940 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

4941 at java.lang.Thread.run(Thread.java:724)

The Master node reports one of the nodes as suspected as per below ...

AggregationEngineException [excCode=AggEngDistrExec[Error Id=1002, Message=Can't process Aggregation Engine Request ], execId=0, nodeId=null, request=null,

getMessage()=CommandException [commandDef=CommandGenericDef [supplyContextKey=0, snapshotContextKey=147584717001, resultContextKey=0, refContextKey=0,

properties=null, calcRefObjects={RULES_CONTEXT=140579685001, SMGR_ASSUMPTIONS_STRESS=142455226001}, name=PrepareData], code=CommandPrepateData, code

value=2016, getMessage()=org.infinispan.remoting.transport.jgroups.SuspectException: Node 107-53543(107) was suspected]
org.infinispan.remoting.transport.jgroups.SuspectException: Node 107-53543(107) was suspected
Node 107-53543(107) was suspected
null
]
at com.jpmorgan.lri.cs.ae.distexec.util.Util.createAggException(Util.java:108)
at com.jpmorgan.lri.cs.ae.listener.RequestListenerServiceImpl.execute(RequestListenerServiceImpl.java:176)
at com.jpmorgan.lri.cs.ae.listener.RequestListenerServiceImpl.handle(RequestListenerServiceImpl.java:306)

_____________________________________________________________

Given that none of the suggested APIs have worked to have suspected node(s) re-join the cluster; is there any tests you or the team can try to validate this scenario & identify which of the ISPN APIs would indeed work? It would be appreciated if the suggestions are tested to validate the findings (given that none of them have worked so far for us).

__________________________________________________________

Hi Bela,

The rationale for increased timeout for FD_ALL & SUSPECT was to ensure that the nodes do not get suspected by infinispan & allow our request to complete. Also, the SuspectException is planned to be deprecated in ISPN 7.x (based on some other posts).

We have tried all suggested API approaches but in vain. Hope that your team can validate the suggestions as that would save on the time & effort (given that this issue has been for a while).

22. Re: some q's re: impact(s) of ISPN/JGRPS config ...

nileshbhagat Feb 18, 2014 5:47 PM (in response to belaban)

Still awaiting response from the community on this SuspectException issue as did restart the cache (with & without CacheManager.getCache) with the same results as mapper nodes do not join back to the cluster while the master evicts the mappers permanently rejecting any subsequent requests.

23. Re: some q's re: impact(s) of ISPN/JGRPS config ...

nileshbhagat Feb 21, 2014 10:18 AM (in response to belaban)

Hi Bela/Mircea,

This SuspectException issue is still open & after testing the CacheManager stop/start/getCache APIs; the results are still negative with the suspected mapper node(s) being evicted from the cluster & never joins back with the master shunning the suspects forever.

We have obviated this issue by increasing the timeout values for FD_ALL & VERIFY_SUSPECT as the mapper nodes do become unresponsive with Full GC but they are still alive. Do confirm if SuspectException may be removed from ISPN 7.x as per https://issues.jboss.org/i#browse/ISPN-2577

Can you assert the following config values which were set after reviewing the JGroups/ISPN docs:

1) Should the ISPN hash.numSegments value be less than or equal to 10 * max_cluster_size (with cluster size assumed to be Master+Reducer+Mapper for all machines the cluster)?

2) What's the basis for ISPN L1 cache values l1.lifespan & l1.cleanupTaskFrequency? Would the default 10mins suffice?

3) Confirm the basis for min threads, max threads, queue max size, keep alive time & rejection policy for Thread Pool & OOB pools .(with the current settings pasted below)?

thread_pool.enabled="true"

thread_pool.min_threads="30"

thread_pool.max_threads="100"

thread_pool.keep_alive_time="5000"

thread_pool.queue_enabled="true"

thread_pool.queue_max_size="10000"

thread_pool.rejection_policy="Run"

oob_thread_pool.enabled="true"

oob_thread_pool.min_threads="30"

oob_thread_pool.max_threads="100"

oob_thread_pool.keep_alive_time="5000"

oob_thread_pool.queue_enabled="true"

oob_thread_pool.queue_max_size="100"

oob_thread_pool.rejection_policy="Run"

4) Please review the following values for UNICAST & NAKACK & recommend any changes?

<pbcast.NAKACK2 xmit_interval="1000"

max_rebroadcast_timeout="2000"

use_mcast_xmit="true"

use_mcast_xmit_req="true"

discard_delivered_msgs="true"/>

<UNICAST3 xmit_interval="1000"

max_retransmit_time="2000"

conn_expiry_timeout="0" />

5) What's the basis for the different timeout values in GMS & confirm any changes to below settings?

<pbcast.GMS print_local_addr="true"

join_timeout="10000"

leave_timeout="10000"

merge_timeout="5000"

view_bundling="true"

max_bundling_time="500"

view_ack_collection_timeout="2000"

resume_task_timeout="15000"/>

6) Any suggestions on maximizing the cluster throughput with the mapper nodes being able to send results back to the Master/Reducer with minimal latency?

Hope that above can be confirmed at the earliest.

Thanks,

Nilesh.

24. Re: some q's re: impact(s) of ISPN/JGRPS config ...

cotton.ben Feb 25, 2014 11:29 AM (in response to nileshbhagat)

Hi ISPN, Bela, Mircea,

Just to recap this matter:

1. When our ISPN 5.3 deployment uses the recommended JGRPS configs, our DistributedExectuorService dispatch onto our DIST_SYNC grid cannot meet its SLA ... it fails due to SUSPECT_EXCEPTION

2. We remedy this by using a non-recommended config, providing HUGE timeouts. This gratuitously accommodates our DES dispatch completing, but we are now deployed with a non-preferred config.

3. In this exact forum thread, remedy providing both (1) a preferred config and (2) the ability for us to consistently realize our DES dispatch succeeding w/o SUSPECT_EXCEPTION has been ... well .. it has been elusive.

4. In ISPN 7 the SUSPECT_EXCEPTION goes away, at the time we upgrade we should then on ISPN 7 fully expect our DES dispatch to work perfectly with a preferred JGRPs config

5. If we cannot wait until we upgrade to ISPN 7, and want both a preferred config and the expectation that our DES dispath will not endure a SUSPECT_EXCEPTION, we should take advantage of the fact that we are now customers and open a ISPN 5.3 bug ticket via our RedHat subscribed customer account. And request a 5.3 pathc

Could you guys kindly confirm that, from your vantage point, all of 1-5 are true?

Thanks,

Ben

25. Re: some q's re: impact(s) of ISPN/JGRPS config ...

mircea.markus Feb 26, 2014 8:28 AM (in response to nileshbhagat)

Yes, I did implement CacheManager.stop and then CacheManager.getCache() (without the stop() API) but it results in following exception. Tried this scenario a few times but resulted in the same exception thrown.

2104 ERROR [11:38:59,286] [remote-thread-0,201][InboundInvocationHandlerImpl] - Exception executing command

   2105 java.lang.IllegalStateException: Cache container has been stopped and cannot be reused. Recreate the cache container        .

   2106         at org.infinispan.manager.DefaultCacheManager.assertIsNotTerminated(DefaultCacheManager.java:935)

   2107         at org.infinispan.manager.DefaultCacheManager.getCache(DefaultCacheManager.java:535)

   2108         at org.infinispan.manager.DefaultCacheManager.getCache(DefaultCacheManager.java:517)

   2109         at com.jpmorgan.lri.cs.ae.distexec.Node.restartCache(Node.java:244)

   2110         at com.jpmorgan.lri.cs.ae.distexec.exec.command.CommandExec.call(CommandExec.java:268)

   2111         at com.jpmorgan.lri.cs.ae.distexec.exec.command.CommandExec.call(CommandExec.java:28)

   2112         at org.infinispan.commands.read.DistributedExecuteCommand.perform(DistributedExecuteCommand.java:121)

   2113         at org.infinispan.remoting.InboundInvocationHandlerImpl.handleInternal(InboundInvocationHandlerImpl.java:122        )

   2114         at org.infinispan.remoting.InboundInvocationHandlerImpl.access$000(InboundInvocationHandlerImpl.java:68)

   2115         at org.infinispan.remoting.InboundInvocationHandlerImpl$2.run(InboundInvocationHandlerImpl.java:194)

   2116         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

   2117         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

   2118         at java.lang.Thread.run(Thread.java:724)

right, you'd need to do CacheManager.stop() then CacheManager.start() then CacheManager.getCache

26. Re: some q's re: impact(s) of ISPN/JGRPS config ...

mircea.markus Feb 26, 2014 8:36 AM (in response to cotton.ben)

1. When our ISPN 5.3 deployment uses the recommended JGRPS configs, our DistributedExectuorService dispatch onto our DIST_SYNC grid cannot meet its SLA ... it fails due to SUSPECT_EXCEPTION

2. We remedy this by using a non-recommended config, providing HUGE timeouts. This gratuitously accommodates our DES dispatch completing, but we are now deployed with a non-preferred config.

3. In this exact forum thread, remedy providing both (1) a preferred config and (2) the ability for us to consistently realize our DES dispatch succeeding w/o SUSPECT_EXCEPTION has been ... well .. it has been elusive.

4. In ISPN 7 the SUSPECT_EXCEPTION goes away, at the time we upgrade we should then on ISPN 7 fully expect our DES dispatch to work perfectly witha preferred JGRPs config

5. If we cannot wait until we upgrade to ISPN 7, and want both a preferred config and the expectation that our DES dispath will not endure a SUSPECT_EXCEPTION, we should take advantage of the fact that we are now customers and open a ISPN 5.3 bug ticket via our RedHat subscribed customer account. And request a 5.3 pathc

Ben, I'm afraid there's no straight answer to the suspect exception questions. Suspect exceptions happen because a node is seen as not being present in the cluster by other members. This might be cause by a variety of factors, e.g. node crashing, NIC crashing, network flooded, node being overwhelmed (long GC run, OOM). They way I'd tackle this would be to see why the node gets suspended. In your case sounds like either an OOM or a long GC pause, but might be network as well. Monitoring the system (failed node in particular) and the logs from the JGroups failure detection protocols you configured (FD_ALL, VERIFY_SUSPECT, FD_SOCK) should hint you on that. Might be a case that the system is overwhelmed, I imagine the computations you're doing there are not trivial.

27. Re: some q's re: impact(s) of ISPN/JGRPS config ...

nileshbhagat Feb 26, 2014 10:33 AM (in response to mircea.markus)

It's surprising that my post was overlooked as I had already tried exactly that & even that failed with the same exception as per below ...

Then, tried invoking CacheManager.stop, CacheManager.start and then CacheManager.getCache but that also results in same exception ...