-
15. Re: some q's re: impact(s) of ISPN/JGRPS config ...
nileshbhagat Feb 11, 2014 6:11 PM (in response to mircea.markus)
Hi Mircea/Bela,
Please respond to the questions below at your earliest convenience:
- In fact, this was an error on our part: the event listener was registered on the Master/Reducer nodes only and not on the Mapper nodes. Modifying the implementation allowed us to register the listener on the Mapper nodes as well, and we are now able to receive the ViewChanged events for the mappers, upon which the cache is restarted for that node. However, we are observing a discrepancy: the Mapper nodes see the full set of nodes while the Master node excludes the suspected mapper nodes. Please refer to the debug statements below and advise how to resolve this quickly. We are close to a solution except for this anomaly, as our Master node doesn't accept any further requests if any of the mappers are down (a fall-out of the suspected nodes being evicted from the cluster even after their caches were successfully restarted).
After ascertaining that the ViewChanged event is triggered for the Mapper nodes as well, the behavior of Infinispan/JGroups for the cluster has been very unpredictable. For instance, all the nodes received the view changed event when the cluster was being started, but only one of the mappers received this event (which restarted the cache for that node) once the SuspectException was encountered, while the other mappers did not receive a view change at all. The root cause of the SuspectException would be java.lang.OutOfMemoryError on the mapper nodes.
Also, the Master node evicts all mapper nodes with its own view of the cluster & doesn't include any re-started mapper nodes.
Essentially, the SuspectException leaves the cluster disjointed, with the view changed event not triggered for any or all mapper nodes, and in addition the master has its own cluster view. Resubmitting the request fails with the following error regardless of whether the cache on that node was re-started or not.
Please advise on the best way to approach this, as we are observing erratic behavior from ISPN/JGroups & we need a way for the suspected node(s) to join back to the same cluster view they started with.
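For context, the view-change listener registered on each node is roughly of this shape (a minimal sketch using the standard Infinispan cache-manager listener annotations; the class name is illustrative, not our actual code):

```java
import org.infinispan.manager.EmbeddedCacheManager;
import org.infinispan.notifications.Listener;
import org.infinispan.notifications.cachemanagerlistener.annotation.ViewChanged;
import org.infinispan.notifications.cachemanagerlistener.event.ViewChangedEvent;

@Listener
public class ViewChangeLogger {

    @ViewChanged
    public void onViewChanged(ViewChangedEvent event) {
        // Old vs. new membership, as seen by THIS node only.
        System.out.println("viewChanged event old members : " + event.getOldMembers());
        System.out.println("viewChanged event new members : " + event.getNewMembers());
    }
}

// Registration must happen on EVERY node (mappers included), e.g.:
//   EmbeddedCacheManager cacheManager = ...;
//   cacheManager.addListener(new ViewChangeLogger());
```

The key point from the first bullet above is that addListener must be called on the EmbeddedCacheManager of every node, not only on the Master/Reducers.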
Node 201 cache not re-started (as view changed event not triggered upon SuspectException) -->
AggregationEngineException [excCode=AggEngDistrExec[Error Id=1002, Message=Can't process Aggregation Engine Request ], execId=0, nodeId=null, request=null, getMessage()=CommandException [commandDef=CommandGenericDef [supplyContextKey=0, snapshotContextKey=147584717001, resultContextKey=0, refContextKey=0, properties=null, calcRefObjects={RULES_CONTEXT=140579685001, SMGR_ASSUMPTIONS_STRESS=142455226001}, name=PrepareData], code=CommandPrepateData, code value=2016, getMessage()=java.lang.IllegalArgumentException: Target node 201-55337(201) is not a cluster member, members are [105-59547(105), 111-24339(111), 101-38424(101), 107-27577(107)]]
java.lang.IllegalArgumentException: Target node 201-55337(201) is not a cluster member, members are [105-59547(105), 111-24339(111), 101-38424(101), 107-27577(107)]
Target node 201-55337(201) is not a cluster member, members are [105-59547(105), 111-24339(111), 101-38424(101), 107-27577(107)]
]
Node 201 cache re-started but now contains a different node suffix -->
AggregationEngineException [excCode=AggEngDistrExec[Error Id=1002, Message=Can't process Aggregation Engine Request
], execId=0, nodeId=null, request=null, getMessage()=CommandException [commandDef=CommandGenericDef
[supplyContextKey=0, snapshotContextKey=147584717001, resultContextKey=0, refContextKey=0, properties=null,
calcRefObjects={RULES_CONTEXT=140579685001, SMGR_ASSUMPTIONS_STRESS=142455226001}, name=PrepareData],
code=CommandPrepateData, code value=2016, getMessage()=java.lang.IllegalArgumentException: Target node 201-7163
(201) is not a cluster member, members are [105-52520(105), 111-56485(111), 101-37987(101), 201-46672(201)]]
java.lang.IllegalArgumentException: Target node 201-7163(201) is not a cluster member, members are [105-52520(105),111-56485(111), 101-37987(101), 201-46672(201)]
Target node 201-7163(201) is not a cluster member, members are [105-52520(105), 111-56485(111), 101-37987(101),201-46672(201)]
]
Master node does not contain suspected nodes in its view although the cache would have been re-started -->
DEBUG [11:13:11,152] [Incoming-10,AggregationEngine,101-32714(101)][Node] - viewChanged event invoked
DEBUG [11:13:11,152] [Incoming-10,AggregationEngine,101-32714(101)][Node] - viewChanged event old members size : 4
DEBUG [11:13:11,152] [Incoming-10,AggregationEngine,101-32714(101)][Node] - viewChanged event old members list : [101, 105, 107, 111]
DEBUG [11:13:11,152] [Incoming-10,AggregationEngine,101-32714(101)][Node] - viewChanged event new members size : 3
DEBUG [11:13:11,152] [Incoming-10,AggregationEngine,101-32714(101)][Node] - viewChanged event new members list : [101, 105, 111]
DEBUG [11:13:11,152] [Incoming-10,AggregationEngine,101-32714(101)][Node] - downNodesSet size : 4
DEBUG [11:13:11,152] [Incoming-10,AggregationEngine,101-32714(101)][Node] - downNodesSet list : [202, 203, 107, 201]
DEBUG [11:13:11,152] [Incoming-10,AggregationEngine,101-32714(101)][Node] - viewChanged event completed
OOM error received by JGroups code & not by our code. Is this an error?
DEBUG [14:06:47,265] [Timer-3,AggregationEngine,201-14315(201)][FD_ALL] - haven't received a heartbeat from 101-47859(101) for 64514 ms, adding it to suspect list
DEBUG [14:07:40,130] [Timer-3,AggregationEngine,201-14315(201)][FD_ALL] - haven't received a heartbeat from 111-33911(111) for 64515 ms, adding it to suspect list
DEBUG [14:07:48,426] [Timer-3,AggregationEngine,201-14315(201)][FD_ALL] - suspecting [101-47859(101), 107-28395(107), 111-33911(111), 105-31363(105)]
ERROR [14:11:46,986] [Timer-4,AggregationEngine,201-14315(201)][TimeScheduler3] - failed executing task NAKACK2: RetransmitTask (interval=1000 ms)
java.lang.OutOfMemoryError: Java heap space
DEBUG [14:12:17,880] [Timer-2,AggregationEngine,201-14315(201)][FD_ALL] - haven't received a heartbeat from 107-28395(107) for 434946 ms, adding it to suspect list
DEBUG [14:14:30,246] [Timer-3,AggregationEngine,201-14315(201)][FD_ALL] - haven't received a heartbeat from 105-31363(105) for 140651 ms, adding it to suspect list
ERROR [14:14:36,792] [multicast receiver,AggregationEngine,201-14315(201)][UDP] - 201-14315(201): failed handling incoming message
java.lang.OutOfMemoryError: Java heap space
DEBUG [14:15:13,885] [VERIFY_SUSPECT.TimerThread,AggregationEngine,201-14315(201)][RequestCorrelator] - suspect=105-31363(105)
DEBUG [14:14:36,792] [Timer-3,AggregationEngine,201-14315(201)][FD_ALL] - haven't received a heartbeat from 107-28395(107) for 567312 ms, adding it to suspect list
ERROR [14:15:28,714] [multicast receiver,AggregationEngine,201-14315(201)][UDP] - 201-14315(201): failed handling incoming message
java.lang.OutOfMemoryError: Java heap space
2. ASYNC timeout stacktrace
DEBUG [11:57:12,125] [Incoming-7,AggregationEngine,101-48199(101)][LocalTopologyManagerImpl] - Error sending the leave request for cache ___defaultcache to coordinator
org.infinispan.util.concurrent.TimeoutException: Node 105-32211(105) timed out
at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommand(CommandAwareRpcDispatcher.java:196)
at org.infinispan.remoting.transport.jgroups.JGroupsTransport.invokeRemotely(JGroupsTransport.java:531)
at org.infinispan.topology.LocalTopologyManagerImpl.executeOnCoordinator(LocalTopologyManagerImpl.java:296)
at org.infinispan.topology.LocalTopologyManagerImpl.leave(LocalTopologyManagerImpl.java:135)
at org.infinispan.statetransfer.StateTransferManagerImpl.stop(StateTransferManagerImpl.java:234)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.infinispan.util.ReflectionUtil.invokeAccessibly(ReflectionUtil.java:205)
at org.infinispan.factories.AbstractComponentRegistry$PrioritizedMethod.invoke(AbstractComponentRegistry.java:889)
3. Exact APIs to retrieve the nodeName attribute -->
GlobalConfigurationBuilder.defaultClusteredBuilder().build().transport().nodeName();
cacheManager.getCacheManagerConfiguration().transport().nodeName();
-
16. Re: some q's re: impact(s) of ISPN/JGRPS config ...
nileshbhagat Feb 12, 2014 6:00 PM (in response to belaban)
Hi Bela / Mircea,
Please find below further evidence on Suspect Exception issue which is still open:
Even after modifying the implementation to register the listener on the Mapper nodes as well (so that we are able to receive ViewChanged events for the mappers), the mapper nodes receive the view changed event only when the cluster is being started (and NOT once they get suspected).
DEBUG [16:06:58,068] [main][Node] - viewChanged event invoked
DEBUG [16:06:58,068] [main][Node] - viewChanged event old members size : 0
DEBUG [16:06:58,068] [main][Node] - viewChanged event old members list : []
DEBUG [16:06:58,068] [main][Node] - viewChanged event new members size : 5
DEBUG [16:06:58,068] [main][Node] - viewChanged event new members list : [101, 105, 107, 111, 201]
DEBUG [16:06:58,068] [main][Node] - viewChanged event completed
The Master node receives SuspectException for 1 of the mapper nodes although others may have exhausted heap space as well.
WARN [15:46:19,272] [pool-2-thread-1][CommandExec] - callOnMaster ExecutionException, throws
java.util.concurrent.ExecutionException: org.infinispan.remoting.transport.jgroups.SuspectException: Node 107-3348(107) was suspected
at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:262)
at java.util.concurrent.FutureTask.get(FutureTask.java:119)
at org.infinispan.distexec.DefaultExecutorService$DistributedTaskPart.innerGet(DefaultExecutorService.java:939)
at org.infinispan.distexec.DefaultExecutorService$DistributedTaskPart.get
After the mapper nodes exhibit java.lang.OutOfMemoryError, they do NOT receive any view changed event from the cluster. EmbeddedCacheManager.stop() & EmbeddedCacheManager.start() are invoked by catching the thrown exception; this does work as per below, but none of the suspected nodes receive any view changed event from the cluster.
WARN [16:28:46,407] [remote-thread-0,201][Node] - Begin restarting the cache on Node: 201; as it will be evicted from the cluster once suspected.
DEBUG [16:28:46,407] [remote-thread-0,201][DefaultCacheManager] - Stopping cache manager AggregationEngine on 201-42312(201)
DEBUG [16:28:46,408] [remote-thread-0,201][CacheImpl] - Stopping cache ___defaultcache on 201-42312(201)
DEBUG [16:28:46,410] [remote-thread-0,201][TransactionTable] - Wait for on-going transactions to finish for 30 seconds.
DEBUG [16:28:46,411] [remote-thread-0,201][TransactionTable] - All transactions terminated
DEBUG [16:28:46,411] [remote-thread-0,201][LocalTopologyManagerImpl] - Node 201-42312(201) leaving cache ___defaultcache
...
DEBUG [16:28:48,748] [remote-thread-0,201][DefaultCacheManager] - Started cache manager AggregationEngine on 201
DEBUG [16:28:48,748] [transport-thread-19,201][AbstractJBossMarshaller] - Using JBoss Marshalling
DEBUG [16:28:48,749] [transport-thread-19,201][LocalTopologyManagerImpl] - Ignoring rebalance request 7 for cache ___defaultcache, the local cache manager is not running
DEBUG [16:28:48,749] [transport-thread-17,201][LocalTopologyManagerImpl] - Ignoring consistent hash update 6 for cache ___defaultcache, the local cache manager is not running
WARN [16:28:48,749] [remote-thread-0,201][Node] - Completed restarting the cache on Node: 201; as it will be evicted from the cluster once suspected.
Sometimes the mappers do receive a merge event, but not always ...
DEBUG [15:59:55,306] [Incoming-7,AggregationEngine,107-3348(107)][JGroupsTransport] - New view accepted: MergeView::[111-53500(111)|4] [111-53500(111), 105-46262(105), 101-9807(101), 107-3348(107)], subgroups=[201-42312(201)|2] [107-3348(107)], [111-53500(111)|3] [111-53500(111), 105-46262(105), 101-9807(101)]
INFO [15:59:55,306] [Incoming-7,AggregationEngine,107-3348(107)][JGroupsTransport] - ISPN000093: Received new, MERGED cluster view: MergeView::[111-53500(111)|4] [111-53500(111), 105-46262(105), 101-9807(101), 107-3348(107)], subgroups=[201-42312(201)|2] [107-3348(107)], [111-53500(111)|3] [111-53500(111), 105-46262(105), 101-9807(101)]
DEBUG [15:59:55,308] [Incoming-7,AggregationEngine,107-3348(107)][Node] - Merge event invoked
DEBUG [15:59:55,308] [Incoming-7,AggregationEngine,107-3348(107)][Node] - mergedView event merged members size : 4
DEBUG [15:59:55,308] [Incoming-7,AggregationEngine,107-3348(107)][Node] - mergedView event merged members list : [101, 105, 107, 111]
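For reference, these merge notifications can be received via Infinispan's @Merged annotation on a cache-manager listener (a minimal sketch; the listener class name is illustrative, not our actual code):

```java
import org.infinispan.manager.EmbeddedCacheManager;
import org.infinispan.notifications.Listener;
import org.infinispan.notifications.cachemanagerlistener.annotation.Merged;
import org.infinispan.notifications.cachemanagerlistener.event.MergeEvent;

@Listener
public class MergeLogger {

    @Merged
    public void onMerge(MergeEvent event) {
        // The merged membership plus the subgroups that were joined together.
        System.out.println("mergedView event merged members : " + event.getNewMembers());
        System.out.println("mergedView event subgroups      : " + event.getSubgroupsMerged());
    }
}

// Registered once per node, like the view-change listener:
//   cacheManager.addListener(new MergeLogger());
```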
Unfortunately, the Master node does NOT reflect any suspected nodes even after their caches have been re-started (with our down nodes list reflecting the member list returned by the view changed event). It only reflects itself & the reducer nodes, as per below ...
DEBUG [15:46:19,271] [Incoming-6,AggregationEngine,101-9807(101)][Node] - viewChanged event invoked
DEBUG [15:46:19,271] [Incoming-6,AggregationEngine,101-9807(101)][Node] - viewChanged event old members size : 5
DEBUG [15:46:19,271] [Incoming-6,AggregationEngine,101-9807(101)][Node] - viewChanged event old members list : [101, 105, 107, 111, 201]
DEBUG [15:46:19,271] [Incoming-6,AggregationEngine,101-9807(101)][Node] - viewChanged event new members size : 3
DEBUG [15:46:19,271] [Incoming-6,AggregationEngine,101-9807(101)][Node] - viewChanged event new members list : [101, 105, 111]
WARN [15:46:19,271] [Incoming-6,AggregationEngine,101-9807(101)][RequestListenerServiceImpl] - Cluster down. down Nodes [201, 107]
WARN [15:46:19,271] [Incoming-6,AggregationEngine,101-9807(101)][SnapshotStatusMonitorService] - Cluster down. down Nodes [201, 107]
Even tried to invoke DefaultCacheManager.startCache() (after re-starting the cache), but that throws an exception as well ...
ERROR [11:32:18,071] [remote-thread-0,107][InboundInvocationHandlerImpl] - Exception executing command
java.lang.IllegalStateException: Cache container has been stopped and cannot be reused. Recreate the cache container.
at org.infinispan.manager.DefaultCacheManager.assertIsNotTerminated(DefaultCacheManager.java:935)
at org.infinispan.manager.DefaultCacheManager.getCache(DefaultCacheManager.java:535)
at org.infinispan.manager.DefaultCacheManager.getCache(DefaultCacheManager.java:517)
at org.infinispan.manager.DefaultCacheManager.startCache(DefaultCacheManager.java:884)
at com.jpmorgan.lri.cs.ae.distexec.Node.restartCache(Node.java:244)
at com.jpmorgan.lri.cs.ae.distexec.exec.command.CommandExec.call(CommandExec.java:268)
at com.jpmorgan.lri.cs.ae.distexec.exec.command.CommandExec.call(CommandExec.java:28)
at org.infinispan.commands.read.DistributedExecuteCommand.perform(DistributedExecuteCommand.java:121)
at org.infinispan.remoting.InboundInvocationHandlerImpl.handleInternal(InboundInvocationHandlerImpl.java:122)
at org.infinispan.remoting.InboundInvocationHandlerImpl.access$000(InboundInvocationHandlerImpl.java:68)
at org.infinispan.remoting.InboundInvocationHandlerImpl$2.run(InboundInvocationHandlerImpl.java:194)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
Some mapper nodes do contain the following error once suspected ...
DEBUG [15:59:55,306] [remote-thread-0,107][LocalTopologyManagerImpl] - Error sending the leave request for cache ___defaultcache to coordinator
org.infinispan.remoting.transport.jgroups.SuspectException: Node 201-42312(201) was suspected
at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommand(CommandAwareRpcDispatcher.java:194)
at org.infinispan.remoting.transport.jgroups.JGroupsTransport.invokeRemotely(JGroupsTransport.java:531)
at org.infinispan.topology.LocalTopologyManagerImpl.executeOnCoordinator(LocalTopologyManagerImpl.java:296)
at org.infinispan.topology.LocalTopologyManagerImpl.leave(LocalTopologyManagerImpl.java:135)
at org.infinispan.statetransfer.StateTransferManagerImpl.stop(StateTransferManagerImpl.java:234)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.infinispan.util.ReflectionUtil.invokeAccessibly(ReflectionUtil.java:205)
at org.infinispan.factories.AbstractComponentRegistry$PrioritizedMethod.invoke(AbstractComponentRegistry.java:889)
at org.infinispan.factories.AbstractComponentRegistry.internalStop(AbstractComponentRegistry.java:694)
at org.infinispan.factories.AbstractComponentRegistry.stop(AbstractComponentRegistry.java:572)
at org.infinispan.factories.ComponentRegistry.stop(ComponentRegistry.java:246)
at org.infinispan.CacheImpl.stop(CacheImpl.java:713)
at org.infinispan.CacheImpl.stop(CacheImpl.java:708)
at org.infinispan.manager.DefaultCacheManager.stop(DefaultCacheManager.java:738)
at com.jpmorgan.lri.cs.ae.distexec.Node.restartCache(Node.java:240)
at com.jpmorgan.lri.cs.ae.distexec.exec.command.CommandExec.call(CommandExec.java:268)
at com.jpmorgan.lri.cs.ae.distexec.exec.command.CommandExec.call(CommandExec.java:28)
at org.infinispan.commands.read.DistributedExecuteCommand.perform(DistributedExecuteCommand.java:121)
at org.infinispan.remoting.InboundInvocationHandlerImpl.handleInternal(InboundInvocationHandlerImpl.java:122)
at org.infinispan.remoting.InboundInvocationHandlerImpl.access$000(InboundInvocationHandlerImpl.java:68)
at org.infinispan.remoting.InboundInvocationHandlerImpl$2.run(InboundInvocationHandlerImpl.java:194)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
Caused by: SuspectedException
at org.jgroups.blocks.MessageDispatcher.sendMessage(MessageDispatcher.java:406)
at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.processSingleCall(CommandAwareRpcDispatcher.java:375)
at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommand(CommandAwareRpcDispatcher.java:189)
Resubmitting the request does not work, as the master node has already evicted the suspected nodes & hence throws the following error for any suspected node picked up by Infinispan:
AggregationEngineException [excCode=AggEngDistrExec[Error Id=1002, Message=Can't process Aggregation Engine Request
], execId=0, nodeId=null, request=null, getMessage()=CommandException [commandDef=CommandGenericDef
[supplyContextKey=0, snapshotContextKey=147584717001, resultContextKey=0, refContextKey=0, properties=null,
calcRefObjects={RULES_CONTEXT=140579685001, SMGR_ASSUMPTIONS_STRESS=142455226001}, name=PrepareData],
code=CommandPrepateData, code value=2016, getMessage()=java.lang.IllegalArgumentException: Target node 201-55381
(201) is not a cluster member, members are [111-51035(111), 105-19421(105), 101-62405(101)]]
java.lang.IllegalArgumentException: Target node 201-55381(201) is not a cluster member, members are [111-51035(111), 105-19421(105), 101-62405(101)]
Target node 201-55381(201) is not a cluster member, members are [111-51035(111), 105-19421(105), 101-62405(101)]
]
at com.jpmorgan.lri.cs.ae.distexec.util.Util.createAggException(Util.java:108)
at com.jpmorgan.lri.cs.ae.listener.RequestListenerServiceImpl.execute(RequestListenerServiceImpl.java:176)
at com.jpmorgan.lri.cs.ae.listener.RequestListenerServiceImpl.handle(RequestListenerServiceImpl.java:306)
In summary, we have implemented & tested the cache restart option a few ways, but it does not work at all: the cluster remains disjointed forever & does NOT allow the mapper nodes to join back to the original cluster (which contains only the Master/Reducer nodes once any of the mappers have been suspected).
Please suggest any alternative solutions, as we have exhausted (to the best of our knowledge) all options to gracefully restart the suspected nodes, to no avail. If there are any oversights or other recommendations, please share them at your earliest convenience.
Thanks.
-
17. Re: some q's re: impact(s) of ISPN/JGRPS config ...
nileshbhagat Feb 13, 2014 3:09 PM (in response to belaban)
Hi Bela/Mircea,
We are still awaiting your response to this stalemate we have been facing with the SuspectException issue as per the above findings.
Retried the scenarios a few times today with the same results: suspected nodes DO NOT join back to the parent cluster even after the cache has been successfully re-started. The Master node includes only itself & the Reducers in the cluster, with all the Mapper nodes evicted and reported as down nodes in the view changed event.
Let us know if there are any other remediation measures w.r.t. ISPN APIs that will help the suspected nodes join back to the cluster programmatically.
In the interim, I have modified the timeouts for the FD & SUSPECT protocols (see below) to higher values, which does allow our long-running requests to complete, obviating the suspect exception scenario. Given that SuspectException is deemed benign & planned to be removed in ISPN 7.x, would this approach of increased timeouts suffice?
<FD_ALL timeout="600000" interval="30000" />
<VERIFY_SUSPECT timeout="300000" num_msgs="5" />
<FD_SOCK />
Please reply at your earliest convenience to the detailed findings/stacktraces in the Feb 11th & Feb 12th postings, as well as the conclusion drawn herein. Thanks.
-
18. Re: some q's re: impact(s) of ISPN/JGRPS config ...
mircea.markus Feb 13, 2014 7:39 PM (in response to nileshbhagat)
You mention OutOfMemoryErrors. I would strongly suggest investigating the cause of these and fixing it before doing anything else, as the JVM behaves in a nondeterministic manner once these errors are reported, and everything we discuss on this forum might not work. This thread is a bit convoluted, so I'll try to sum it up just to make sure we're on the same page:
1. cluster is formed of caches {A, B, C, D}
2. node D is suspected; it receives a viewChanged event
3. in this view change event you stop the cache and restart it
4. this restart process doesn't trigger another viewChanged event, as you'd expect
Is this the case? If so, would you try a CacheManager.stop and then CacheManager.getCache(<>) and see how that behaves?
-
19. Re: some q's re: impact(s) of ISPN/JGRPS config ...
belaban Feb 14, 2014 2:29 AM (in response to nileshbhagat)
No, this config isn't good. The timeout in FD_ALL means that you'd have to wait up to 10 minutes for a suspicion to happen if a node hangs (if it crashes, FD_SOCK will detect this immediately). Plus, VERIFY_SUSPECT will take another 5 minutes to verify this. Absolutely BAD! I suggest setting the timeout in FD_ALL to 60s and in VERIFY_SUSPECT to 5s.
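Bela's suggested timeouts, expressed in the same XML config format used earlier in the thread (the FD_ALL interval shown here is an assumption, left at the JGroups default of 8s rather than the 30s from the problematic config):

```xml
<FD_SOCK />
<FD_ALL timeout="60000" interval="8000" />
<VERIFY_SUSPECT timeout="5000" />
```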
-
20. Re: some q's re: impact(s) of ISPN/JGRPS config ...
cotton.ben Feb 14, 2014 8:31 AM (in response to belaban)
> Mircea wrote:
> [...] If so, would you try an CacheManager.stop and then CacheManager.getCache(<>) and see how that behaves?
Thanks ... working on this now.
> Bela wrote:
> No, this config isn't good. [...]
We are in a lose/lose at the moment. A good config results in our M-R request(s) not being able to complete (LOSE due to SUSPECT/timeout). A bad config (huge timeout) enables our M-R request(s) to complete (but we LOSE because our bad config has other consequences).
Thanks for sticking with us in this struggle. This is a good fight. Together, we are confident remedy will prevail.
-
21. Re: some q's re: impact(s) of ISPN/JGRPS config ...
nileshbhagat Feb 14, 2014 5:46 PM (in response to mircea.markus)
Hi Mircea,
To correct your understanding ...
1. cluster is formed of caches {A, B, C, D} with ALL nodes receiving viewChanged event
2. node D is suspected, it does NOT receive any viewChanged event & hence cannot restart the cache
3. In the catch block of the OOM error, we stop the cache and restart it
4. this restart process doesn't trigger another viewChanged event on any of the suspected nodes ("Is this the case?" YES)
5. Master node evicts the suspected nodes permanently from the cluster & reports only {A, B, C} in this scenario (as per below)...
DEBUG [11:40:53,465] [Incoming-10,AggregationEngine,101-29628(101)][Node] - viewChanged event invoked
DEBUG [11:40:53,465] [Incoming-10,AggregationEngine,101-29628(101)][Node] - viewChanged event old members size : 4
DEBUG [11:40:53,465] [Incoming-10,AggregationEngine,101-29628(101)][Node] - viewChanged event old members list : [101, 105, 111, 201]
DEBUG [11:40:53,465] [Incoming-10,AggregationEngine,101-29628(101)][Node] - viewChanged event new members size : 3
DEBUG [11:40:53,465] [Incoming-10,AggregationEngine,101-29628(101)][Node] - viewChanged event new members list : [101, 105, 111]
DEBUG [11:40:53,465] [Incoming-10,AggregationEngine,101-29628(101)][Node] - downNodesSet size : 2
DEBUG [11:40:53,465] [Incoming-10,AggregationEngine,101-29628(101)][Node] - downNodesSet list : [201, 107]
DEBUG [11:40:53,465] [Incoming-10,AggregationEngine,101-29628(101)][Node] - viewChanged event completed
6. Resubmitting the request to the Master node results in a "Target node is not a cluster member" exception (refer to the stack-traces in the above post & below as well)...
AggregationEngineException [excCode=AggEngDistrExec[Error Id=1002, Message=Can't process Aggregation Engine Request ], execId=0, nodeId=null, request=null,
getMessage()=CommandException [commandDef=CommandGenericDef [supplyContextKey=0, snapshotContextKey=147584717001, resultContextKey=0, refContextKey=0,
properties=null, calcRefObjects={RULES_CONTEXT=140579685001, SMGR_ASSUMPTIONS_STRESS=142455226001}, name=PrepareData], code=CommandPrepateData, code
value=2016, getMessage()=java.lang.IllegalArgumentException: Target node 201-2234(201) is not a cluster member, members are [111-31021(111), 105-1425(105),
101-29628(101)]]
java.lang.IllegalArgumentException: Target node 201-2234(201) is not a cluster member, members are [111-31021(111), 105-1425(105), 101-29628(101)]
Target node 201-2234(201) is not a cluster member, members are [111-31021(111), 105-1425(105), 101-29628(101)]
]
at com.jpmorgan.lri.cs.ae.distexec.util.Util.createAggException(Util.java:108)
at com.jpmorgan.lri.cs.ae.listener.RequestListenerServiceImpl.execute(RequestListenerServiceImpl.java:176)
at com.jpmorgan.lri.cs.ae.listener.RequestListenerServiceImpl.handle(RequestListenerServiceImpl.java:306)
______________________________________________________________________________
> If so, would you try a CacheManager.stop and then CacheManager.getCache(<>) and see how that behaves?
Yes, I did implement CacheManager.stop and then CacheManager.getCache() (without the start() API), but it results in the following exception. Tried this scenario a few times; it resulted in the same exception each time.
ERROR [11:38:59,286] [remote-thread-0,201][InboundInvocationHandlerImpl] - Exception executing command
java.lang.IllegalStateException: Cache container has been stopped and cannot be reused. Recreate the cache container.
at org.infinispan.manager.DefaultCacheManager.assertIsNotTerminated(DefaultCacheManager.java:935)
at org.infinispan.manager.DefaultCacheManager.getCache(DefaultCacheManager.java:535)
at org.infinispan.manager.DefaultCacheManager.getCache(DefaultCacheManager.java:517)
at com.jpmorgan.lri.cs.ae.distexec.Node.restartCache(Node.java:244)
at com.jpmorgan.lri.cs.ae.distexec.exec.command.CommandExec.call(CommandExec.java:268)
at com.jpmorgan.lri.cs.ae.distexec.exec.command.CommandExec.call(CommandExec.java:28)
at org.infinispan.commands.read.DistributedExecuteCommand.perform(DistributedExecuteCommand.java:121)
at org.infinispan.remoting.InboundInvocationHandlerImpl.handleInternal(InboundInvocationHandlerImpl.java:122)
at org.infinispan.remoting.InboundInvocationHandlerImpl.access$000(InboundInvocationHandlerImpl.java:68)
at org.infinispan.remoting.InboundInvocationHandlerImpl$2.run(InboundInvocationHandlerImpl.java:194)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
Then tried invoking CacheManager.stop, CacheManager.start and then CacheManager.getCache, but that also results in the same exception ...
ERROR [14:56:20,204] [remote-thread-1,201][InboundInvocationHandlerImpl] - Exception executing command
java.lang.IllegalStateException: Cache container has been stopped and cannot be reused. Recreate the cache container.
at org.infinispan.manager.DefaultCacheManager.assertIsNotTerminated(DefaultCacheManager.java:935)
at org.infinispan.manager.DefaultCacheManager.getCache(DefaultCacheManager.java:535)
at org.infinispan.manager.DefaultCacheManager.getCache(DefaultCacheManager.java:517)
at com.jpmorgan.lri.cs.ae.distexec.Node.restartCache(Node.java:244)
at com.jpmorgan.lri.cs.ae.distexec.exec.command.CommandExec.call(CommandExec.java:268)
at com.jpmorgan.lri.cs.ae.distexec.exec.command.CommandExec.call(CommandExec.java:28)
at org.infinispan.commands.read.DistributedExecuteCommand.perform(DistributedExecuteCommand.java:121)
at org.infinispan.remoting.InboundInvocationHandlerImpl.handleInternal(InboundInvocationHandlerImpl.java:122)
at org.infinispan.remoting.InboundInvocationHandlerImpl.access$000(InboundInvocationHandlerImpl.java:68)
at org.infinispan.remoting.InboundInvocationHandlerImpl$2.run(InboundInvocationHandlerImpl.java:194)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
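The "Recreate the cache container" message indicates that a stopped DefaultCacheManager instance can never be restarted. A possible workaround (a sketch only, not tested against this scenario; the field and method names are illustrative, not our actual code) is to build a brand-new DefaultCacheManager instead of reusing the stopped one:

```java
import org.infinispan.Cache;
import org.infinispan.configuration.cache.Configuration;
import org.infinispan.configuration.global.GlobalConfiguration;
import org.infinispan.manager.DefaultCacheManager;
import org.infinispan.manager.EmbeddedCacheManager;

public class NodeRestart {

    private EmbeddedCacheManager cacheManager;

    /** Replace the stopped container with a fresh one instead of reusing it. */
    public Cache<Object, Object> recreateContainer() {
        // Capture the configuration before stopping the old container.
        GlobalConfiguration gc = cacheManager.getCacheManagerConfiguration();
        Configuration defaultCfg = cacheManager.getDefaultCacheConfiguration();
        cacheManager.stop();                                  // terminate the old container
        cacheManager = new DefaultCacheManager(gc, defaultCfg); // fresh container, fresh JGroups channel
        return cacheManager.getCache();                       // rejoins the cluster as a new member
    }
}
```

Note the caveat: the recreated node joins with a new JGroups address (a new node suffix), so any application-level routing keyed on the old address would still need to be refreshed.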
The Master node reports one of the nodes as suspected as per below ...
AggregationEngineException [excCode=AggEngDistrExec[Error Id=1002, Message=Can't process Aggregation Engine Request ], execId=0, nodeId=null, request=null,
getMessage()=CommandException [commandDef=CommandGenericDef [supplyContextKey=0, snapshotContextKey=147584717001, resultContextKey=0, refContextKey=0,
properties=null, calcRefObjects={RULES_CONTEXT=140579685001, SMGR_ASSUMPTIONS_STRESS=142455226001}, name=PrepareData], code=CommandPrepateData, code
value=2016, getMessage()=org.infinispan.remoting.transport.jgroups.SuspectException: Node 107-53543(107) was suspected]
org.infinispan.remoting.transport.jgroups.SuspectException: Node 107-53543(107) was suspected
Node 107-53543(107) was suspected
null
]
at com.jpmorgan.lri.cs.ae.distexec.util.Util.createAggException(Util.java:108)
at com.jpmorgan.lri.cs.ae.listener.RequestListenerServiceImpl.execute(RequestListenerServiceImpl.java:176)
at com.jpmorgan.lri.cs.ae.listener.RequestListenerServiceImpl.handle(RequestListenerServiceImpl.java:306)
_____________________________________________________________
Given that none of the suggested APIs have enabled the suspected node(s) to rejoin the cluster, could you or the team run tests to validate this scenario and identify which of the ISPN APIs would actually work? We would appreciate it if the suggestions were tested before being proposed, given that none of them have worked for us so far.
__________________________________________________________
Hi Bela,
The rationale for the increased timeouts for FD_ALL & VERIFY_SUSPECT was to ensure that the nodes do not get suspected by Infinispan, allowing our request to complete. Also, the SuspectException is planned to be deprecated in ISPN 7.x (based on some other posts).
We have tried all the suggested API approaches, but in vain. We hope that your team can validate the suggestions, as that would save time & effort (given that this issue has been open for a while).
-
22. Re: some q's re: impact(s) of ISPN/JGRPS config ...
nileshbhagat Feb 18, 2014 5:47 PM (in response to belaban)Still awaiting a response from the community on this SuspectException issue. We did restart the cache (with & without CacheManager.getCache), with the same results: the mapper nodes do not rejoin the cluster, while the master evicts the mappers permanently and rejects any subsequent requests.
-
23. Re: some q's re: impact(s) of ISPN/JGRPS config ...
nileshbhagat Feb 21, 2014 10:18 AM (in response to belaban)Hi Bela/Mircea,
This SuspectException issue is still open. After testing the CacheManager stop/start/getCache APIs, the results are still negative: the suspected mapper node(s) are evicted from the cluster and never rejoin, with the master shunning the suspects forever.
We have worked around this issue by increasing the timeout values for FD_ALL & VERIFY_SUSPECT, as the mapper nodes do become unresponsive during Full GC but are still alive. Do confirm whether SuspectException may be removed in ISPN 7.x as per https://issues.jboss.org/browse/ISPN-2577
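For reference, the timeout increase described above would look roughly like the fragment below in the JGroups stack XML. The values are illustrative only (chosen to out-wait a multi-second Full GC), not the actual settings from this deployment:

```xml
<!-- Illustrative values only, not the settings used in this deployment. -->
<FD_ALL interval="15000"
        timeout="120000"/>          <!-- heartbeat every 15s; suspect only after 120s of silence -->
<VERIFY_SUSPECT timeout="30000"/>   <!-- re-check a suspicion for 30s before the view excludes the node -->
```

The trade-off is that a genuinely dead node will also take that much longer to be excluded from the view.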
Can you confirm the following config values, which were set after reviewing the JGroups/ISPN docs:
1) Should the ISPN hash.numSegments value be less than or equal to 10 * max_cluster_size (with cluster size assumed to be Master+Reducer+Mapper for all machines in the cluster)?
2) What's the basis for the ISPN L1 cache values l1.lifespan & l1.cleanupTaskFrequency? Would the default 10 mins suffice?
3) What's the basis for the min threads, max threads, queue max size, keep alive time & rejection policy for the regular and OOB thread pools (with the current settings pasted below)?
thread_pool.enabled="true"
thread_pool.min_threads="30"
thread_pool.max_threads="100"
thread_pool.keep_alive_time="5000"
thread_pool.queue_enabled="true"
thread_pool.queue_max_size="10000"
thread_pool.rejection_policy="Run"
oob_thread_pool.enabled="true"
oob_thread_pool.min_threads="30"
oob_thread_pool.max_threads="100"
oob_thread_pool.keep_alive_time="5000"
oob_thread_pool.queue_enabled="true"
oob_thread_pool.queue_max_size="100"
oob_thread_pool.rejection_policy="Run"
4) Please review the following values for UNICAST3 & NAKACK2 and recommend any changes:
<pbcast.NAKACK2 xmit_interval="1000"
max_rebroadcast_timeout="2000"
use_mcast_xmit="true"
use_mcast_xmit_req="true"
discard_delivered_msgs="true"/>
<UNICAST3 xmit_interval="1000"
max_retransmit_time="2000"
conn_expiry_timeout="0" />
5) What's the basis for the different timeout values in GMS? Please confirm or suggest changes to the settings below:
<pbcast.GMS print_local_addr="true"
join_timeout="10000"
leave_timeout="10000"
merge_timeout="5000"
view_bundling="true"
max_bundling_time="500"
view_ack_collection_timeout="2000"
resume_task_timeout="15000"/>
6) Any suggestions on maximizing cluster throughput so that the mapper nodes can send results back to the Master/Reducer with minimal latency?
Hope the above can be confirmed at your earliest convenience.
Thanks,
Nilesh.
-
24. Re: some q's re: impact(s) of ISPN/JGRPS config ...
cotton.ben Feb 25, 2014 11:29 AM (in response to nileshbhagat)Hi ISPN, Bela, Mircea,
Just to recap this matter:
1. When our ISPN 5.3 deployment uses the recommended JGRPS configs, our DistributedExecutorService dispatch onto our DIST_SYNC grid cannot meet its SLA ... it fails due to SUSPECT_EXCEPTION
2. We remedy this by using a non-recommended config, providing HUGE timeouts. This gratuitously accommodates our DES dispatch completing, but we are now deployed with a non-preferred config.
3. In this exact forum thread, a remedy providing both (1) a preferred config and (2) the ability for us to consistently realize our DES dispatch succeeding w/o SUSPECT_EXCEPTION has been ... well ... it has been elusive.
4. In ISPN 7 the SUSPECT_EXCEPTION goes away; once we upgrade, we should then on ISPN 7 fully expect our DES dispatch to work perfectly with a preferred JGRPS config
5. If we cannot wait until we upgrade to ISPN 7, and want both a preferred config and the expectation that our DES dispatch will not endure a SUSPECT_EXCEPTION, we should take advantage of the fact that we are now customers and open an ISPN 5.3 bug ticket via our Red Hat subscribed customer account. And request a 5.3 patch.
Could you guys kindly confirm that, from your vantage point, all of 1-5 are true?
Thanks,
Ben
-
25. Re: some q's re: impact(s) of ISPN/JGRPS config ...
mircea.markus Feb 26, 2014 8:28 AM (in response to nileshbhagat)Yes, I did implement CacheManager.stop and then CacheManager.getCache() (without the start() API), but it results in the following exception. Tried this scenario a few times, but it resulted in the same exception each time.
ERROR [11:38:59,286] [remote-thread-0,201][InboundInvocationHandlerImpl] - Exception executing command
java.lang.IllegalStateException: Cache container has been stopped and cannot be reused. Recreate the cache container.
at org.infinispan.manager.DefaultCacheManager.assertIsNotTerminated(DefaultCacheManager.java:935)
at org.infinispan.manager.DefaultCacheManager.getCache(DefaultCacheManager.java:535)
at org.infinispan.manager.DefaultCacheManager.getCache(DefaultCacheManager.java:517)
at com.jpmorgan.lri.cs.ae.distexec.Node.restartCache(Node.java:244)
at com.jpmorgan.lri.cs.ae.distexec.exec.command.CommandExec.call(CommandExec.java:268)
at com.jpmorgan.lri.cs.ae.distexec.exec.command.CommandExec.call(CommandExec.java:28)
at org.infinispan.commands.read.DistributedExecuteCommand.perform(DistributedExecuteCommand.java:121)
at org.infinispan.remoting.InboundInvocationHandlerImpl.handleInternal(InboundInvocationHandlerImpl.java:122)
at org.infinispan.remoting.InboundInvocationHandlerImpl.access$000(InboundInvocationHandlerImpl.java:68)
at org.infinispan.remoting.InboundInvocationHandlerImpl$2.run(InboundInvocationHandlerImpl.java:194)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
Right, you'd need to do CacheManager.stop(), then CacheManager.start(), then CacheManager.getCache().
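The IllegalStateException message above ("Recreate the cache container") is the key detail: once stop() has been called, a DefaultCacheManager in this version is terminated for good, so a restart path has to build a new container instance rather than reuse the old one. The plain-Java model below (it does not use the Infinispan API; Container is a stand-in for the cache manager) sketches that lifecycle under that assumption:

```java
import java.util.concurrent.atomic.AtomicReference;

// Plain-Java model of the lifecycle in the stack trace above -- NOT the
// Infinispan API. Container stands in for DefaultCacheManager: once stop()
// is called the instance is terminated for good, so the only way to
// "restart" is to build a brand-new container.
class Container {
    private boolean terminated = false;

    void stop() { terminated = true; }

    String getCache(String name) {
        if (terminated) {
            throw new IllegalStateException(
                "Cache container has been stopped and cannot be reused. Recreate the cache container.");
        }
        return "cache:" + name;
    }
}

public class RestartSketch {
    // Holds whichever container is current, so callers can swap in a fresh one.
    static final AtomicReference<Container> current =
        new AtomicReference<Container>(new Container());

    public static String restartCache(String name) {
        current.get().stop();               // terminate the old container
        Container fresh = new Container();  // recreate rather than restart
        current.set(fresh);
        return fresh.getCache(name);        // succeeds on the new instance
    }

    public static void main(String[] args) {
        Container c = current.get();
        c.stop();
        try {
            c.getCache("distCache");        // reusing a stopped container fails
        } catch (IllegalStateException e) {
            System.out.println("reuse failed: " + e.getMessage());
        }
        System.out.println(restartCache("distCache")); // prints cache:distCache
    }
}
```

In the real restartCache(Node.java:244) the equivalent move would be constructing a fresh DefaultCacheManager from the original configuration, instead of calling getCache() on the stopped instance.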
-
26. Re: some q's re: impact(s) of ISPN/JGRPS config ...
mircea.markus Feb 26, 2014 8:36 AM (in response to cotton.ben)1. When our ISPN 5.3 deployment uses the recommended JGRPS configs, our DistributedExecutorService dispatch onto our DIST_SYNC grid cannot meet its SLA ... it fails due to SUSPECT_EXCEPTION
2. We remedy this by using a non-recommended config, providing HUGE timeouts. This gratuitously accommodates our DES dispatch completing, but we are now deployed with a non-preferred config.
3. In this exact forum thread, a remedy providing both (1) a preferred config and (2) the ability for us to consistently realize our DES dispatch succeeding w/o SUSPECT_EXCEPTION has been ... well ... it has been elusive.
4. In ISPN 7 the SUSPECT_EXCEPTION goes away; once we upgrade, we should then on ISPN 7 fully expect our DES dispatch to work perfectly with a preferred JGRPS config
5. If we cannot wait until we upgrade to ISPN 7, and want both a preferred config and the expectation that our DES dispatch will not endure a SUSPECT_EXCEPTION, we should take advantage of the fact that we are now customers and open an ISPN 5.3 bug ticket via our Red Hat subscribed customer account. And request a 5.3 patch.
Ben, I'm afraid there's no straight answer to the suspect exception questions. Suspect exceptions happen because a node is seen as not being present in the cluster by other members. This might be caused by a variety of factors, e.g. the node crashing, a NIC crashing, the network being flooded, or the node being overwhelmed (long GC run, OOM). The way I'd tackle this would be to see why the node gets suspected. In your case it sounds like either an OOM or a long GC pause, but it might be the network as well. Monitoring the system (the failed node in particular) and the logs from the JGroups failure detection protocols you configured (FD_ALL, VERIFY_SUSPECT, FD_SOCK) should give you hints on that. It might be that the system is overwhelmed; I imagine the computations you're doing there are not trivial.
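As a concrete starting point for that monitoring, the fragment below shows illustrative GC-logging options for a HotSpot 7-era JVM (the era matching the Thread.java:724 frames in the traces above); the log path and the JAVA_OPTS convention are assumptions, not this deployment's setup. With these in place, stop-the-world pauses longer than the FD_ALL timeout can be correlated against suspect events in the JGroups logs:

```shell
# Illustrative HotSpot 7 GC-logging flags for the mapper JVMs; the log path
# and the JAVA_OPTS variable name are assumptions, not this deployment's.
JAVA_OPTS="$JAVA_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
    -XX:+PrintGCApplicationStoppedTime -Xloggc:/var/log/mapper-gc.log"
```

-XX:+PrintGCApplicationStoppedTime is the useful one here: it reports total application-stopped time, which is what the failure detectors actually see.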
-
27. Re: some q's re: impact(s) of ISPN/JGRPS config ...
nileshbhagat Feb 26, 2014 10:33 AM (in response to mircea.markus)It's surprising that my post was overlooked, as I had already tried exactly that, and even that failed with the same exception, as per below ...
Then tried invoking CacheManager.stop, CacheManager.start, and then CacheManager.getCache, but that also results in the same exception ...
ERROR [14:56:20,204] [remote-thread-1,201][InboundInvocationHandlerImpl] - Exception executing command
java.lang.IllegalStateException: Cache container has been stopped and cannot be reused. Recreate the cache container.
at org.infinispan.manager.DefaultCacheManager.assertIsNotTerminated(DefaultCacheManager.java:935)
at org.infinispan.manager.DefaultCacheManager.getCache(DefaultCacheManager.java:535)
at org.infinispan.manager.DefaultCacheManager.getCache(DefaultCacheManager.java:517)
at com.jpmorgan.lri.cs.ae.distexec.Node.restartCache(Node.java:244)
at com.jpmorgan.lri.cs.ae.distexec.exec.command.CommandExec.call(CommandExec.java:268)
at com.jpmorgan.lri.cs.ae.distexec.exec.command.CommandExec.call(CommandExec.java:28)
at org.infinispan.commands.read.DistributedExecuteCommand.perform(DistributedExecuteCommand.java:121)
at org.infinispan.remoting.InboundInvocationHandlerImpl.handleInternal(InboundInvocationHandlerImpl.java:122)
at org.infinispan.remoting.InboundInvocationHandlerImpl.access$000(InboundInvocationHandlerImpl.java:68)
at org.infinispan.remoting.InboundInvocationHandlerImpl$2.run(InboundInvocationHandlerImpl.java:194)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
-
28. Re: some q's re: impact(s) of ISPN/JGRPS config ...
cotton.ben Feb 26, 2014 11:15 AM (in response to mircea.markus)> I imagine the computations you're doing there are not trivial.
No, they are not trivial ... real-time liquidity risk can only be rendered on our Linux HPC supercomputer (240 CPUs, 3 TB RAM, UDP/IP/InfiniBand interconnect).
> The way I'd tackle this would be to see why the node gets suspected. In your case it sounds like either an OOM or a long GC pause, but it might be the network as well.
Yep, I see what you mean, but finding the cause does not mean we can then find a remedy. We need a solution for long GC pauses causing our ISPN/JGRPS stack to respond with "that node is SUSPECT, let's throw an Exception" as the only possible outcome. You guys have got to change it so that there are more manageable outcomes in that situation than throwing SUSPECT_EXCEPTION. When you throw *that* Exception, it immediately prevents our DistributedExecutorService/NotifyingFuture (DES/NF is *the* beauty of Infinispan, BTW) processing from Reducing/Fusing all our Mapper Nodes' rendered "risk essential" operand set(s) ... and w/o soundly/completely rendered operand set(s), our SLA obligation FAILS.
You guys need to fix that. It doesn't matter what we find as the cause of SUSPECT_EXCEPTION if the only outcome remains "throw it". We do care why you think a Mapper node has left the cluster, but what we need most are more manageable outcomes (not just throwing SUSPECT_EXCEPTION) so that our DES/NF service has a chance to survive.
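One client-side shape for such a "more manageable outcome", until ISPN 7, is bounded resubmission of the failed task. The sketch below is plain Java: SuspectLikeException is a stand-in for org.infinispan.remoting.transport.jgroups.SuspectException, and in a real deployment the catch clause would name that class and the resubmission would go through DistributedExecutorService:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;

// Stand-in for org.infinispan.remoting.transport.jgroups.SuspectException;
// a real deployment would catch that class instead.
class SuspectLikeException extends RuntimeException {
    SuspectLikeException(String msg) { super(msg); }
}

public class RetryOnSuspect {
    // Resubmits the task up to maxAttempts times when a suspect-style failure
    // occurs; any other exception propagates immediately. Assumes maxAttempts >= 1.
    public static <T> T submitWithRetry(Callable<T> task, int maxAttempts) throws Exception {
        SuspectLikeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();   // in real code: resubmit via DistributedExecutorService
            } catch (SuspectLikeException e) {
                last = e;             // transient view change: try again
            }
        }
        throw last;                   // retries exhausted: surface the failure
    }

    public static void main(String[] args) throws Exception {
        final AtomicInteger calls = new AtomicInteger();
        // Fails twice (simulating two suspected-node errors), then succeeds.
        String result = submitWithRetry(new Callable<String>() {
            public String call() {
                if (calls.incrementAndGet() < 3) {
                    throw new SuspectLikeException("node suspected");
                }
                return "reduced-result";
            }
        }, 5);
        System.out.println(result + " after " + calls.get() + " attempts"); // reduced-result after 3 attempts
    }
}
```

This only helps when the suspected mapper's work can be redone or rerouted to a surviving node; it does not address the underlying GC pauses, or the master permanently excluding restarted mappers from its view.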
-
29. Re: some q's re: impact(s) of ISPN/JGRPS config ...
mircea.markus Feb 26, 2014 11:38 AM (in response to cotton.ben)You guys need to fix that. It doesn't matter what we find as the cause of SUSPECT_EXCEPTION if the only outcome remains "throw it". We do care why you think a Mapper node has left the cluster, but what we need most are more manageable outcomes (not just throwing SUSPECT_EXCEPTION) so that our DES/NF service has a chance to survive.
So the good news is this[1]. The bad news is that you'll only have it in ISPN 7.0 :-)
[1] [ISPN-2577] Silence SuspectExceptions - JBoss Issue Tracker