      • 15. Re: some q's re: impact(s) of ISPN/JGRPS config ...
        nileshbhagat


        Hi Mircea/Bela,

         

        Please respond to the below questions at your earliest:

         

        1. In fact, this was an error on our part: the event listener was registered on the Master/Reducer nodes only and not on the Mapper nodes. Modifying the implementation allowed us to register the listener for the Mapper nodes as well, and we now receive the ViewChanged events for mappers, upon which the cache is restarted for that node. But we are observing a discrepancy: the Mapper nodes see the full set of nodes while the Master node excludes the suspected mapper nodes. Please refer to the debug statements below & suggest a resolution so we can close this quickly. We are close to a solution except for this anomaly, as our Master node doesn't accept any further requests if any of the mappers are down (a fall-out of the suspected nodes being evicted from the cluster even after their caches were successfully restarted).

         

                 After ascertaining that the ViewChanged event is triggered for Mapper nodes as well, the behavior of Infinispan/JGroups for the cluster has been very unpredictable. For instance, all the nodes received the view changed event when the cluster was being started, but only 1 of the mappers received this event (which restarted the cache for that node) once a SuspectException was encountered, while the other mappers did not receive a view change at all. The root cause of the SuspectException is java.lang.OutOfMemoryError on the mapper nodes.

                  Also, the Master node evicts all mapper nodes per its own view of the cluster & doesn't include any re-started mapper nodes.

             Essentially, a SuspectException leaves the cluster disjointed, with the view changed event not triggered for any or all mapper nodes; in addition the master has its own cluster view. Resubmitting the request fails with the following error regardless of whether the cache on that node was re-started or not.

         

        Please advise on the best way to approach this, as we are observing erratic behavior from ISPN/JGroups & we need a way for suspected node(s) to join back into the same cluster view that they started with.
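
        For readers following along, this is roughly what the listener registration being discussed looks like (a minimal sketch assuming the Infinispan 5.x notification API; the class name and printlns are illustrative, not our actual code):

        import org.infinispan.manager.EmbeddedCacheManager;
        import org.infinispan.notifications.Listener;
        import org.infinispan.notifications.cachemanagerlistener.annotation.ViewChanged;
        import org.infinispan.notifications.cachemanagerlistener.event.ViewChangedEvent;

        @Listener
        public class ClusterViewListener {
            @ViewChanged
            public void onViewChanged(ViewChangedEvent event) {
                // Diff old vs. new membership to detect evicted (suspected) nodes.
                System.out.println("old members: " + event.getOldMembers());
                System.out.println("new members: " + event.getNewMembers());
            }
        }

        // The listener must be registered on EVERY node (Master, Reducer AND Mapper):
        // cacheManager.addListener(new ClusterViewListener());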

         

        Node 201 cache not re-started (as view changed event not triggered upon SuspectException) -->

        AggregationEngineException [excCode=AggEngDistrExec[Error Id=1002, Message=Can't process Aggregation Engine Request ], execId=0, nodeId=null, request=null, getMessage()=CommandException [commandDef=CommandGenericDef [supplyContextKey=0, snapshotContextKey=147584717001, resultContextKey=0, refContextKey=0, properties=null, calcRefObjects={RULES_CONTEXT=140579685001, SMGR_ASSUMPTIONS_STRESS=142455226001}, name=PrepareData], code=CommandPrepateData, code value=2016, getMessage()=java.lang.IllegalArgumentException: Target node 201-55337(201) is not a cluster member, members are [105-59547(105), 111-24339(111), 101-38424(101), 107-27577(107)]]

        java.lang.IllegalArgumentException: Target node 201-55337(201) is not a cluster member, members are [105-59547(105), 111-24339(111), 101-38424(101), 107-27577(107)]

        Target node 201-55337(201) is not a cluster member, members are [105-59547(105), 111-24339(111), 101-38424(101), 107-27577(107)]

        ]

         

        Node 201 cache re-started but now contains a different node suffix -->

        AggregationEngineException [excCode=AggEngDistrExec[Error Id=1002, Message=Can't process Aggregation Engine Request ], execId=0, nodeId=null, request=null, getMessage()=CommandException [commandDef=CommandGenericDef [supplyContextKey=0, snapshotContextKey=147584717001, resultContextKey=0, refContextKey=0, properties=null, calcRefObjects={RULES_CONTEXT=140579685001, SMGR_ASSUMPTIONS_STRESS=142455226001}, name=PrepareData], code=CommandPrepateData, code value=2016, getMessage()=java.lang.IllegalArgumentException: Target node 201-7163(201) is not a cluster member, members are [105-52520(105), 111-56485(111), 101-37987(101), 201-46672(201)]]
        java.lang.IllegalArgumentException: Target node 201-7163(201) is not a cluster member, members are [105-52520(105), 111-56485(111), 101-37987(101), 201-46672(201)]
        Target node 201-7163(201) is not a cluster member, members are [105-52520(105), 111-56485(111), 101-37987(101), 201-46672(201)]
        ]

         

        Master node does not contain suspected nodes in its view although the cache would have been re-started -->

        DEBUG [11:13:11,152] [Incoming-10,AggregationEngine,101-32714(101)][Node] - viewChanged event invoked
        DEBUG [11:13:11,152] [Incoming-10,AggregationEngine,101-32714(101)][Node] - viewChanged event old members size : 4
        DEBUG [11:13:11,152] [Incoming-10,AggregationEngine,101-32714(101)][Node] - viewChanged event old members list : [101, 105, 107, 111]
        DEBUG [11:13:11,152] [Incoming-10,AggregationEngine,101-32714(101)][Node] - viewChanged event new members size : 3
        DEBUG [11:13:11,152] [Incoming-10,AggregationEngine,101-32714(101)][Node] - viewChanged event new members list : [101, 105, 111]
        DEBUG [11:13:11,152] [Incoming-10,AggregationEngine,101-32714(101)][Node] - downNodesSet size : 4
        DEBUG [11:13:11,152] [Incoming-10,AggregationEngine,101-32714(101)][Node] - downNodesSet list : [202, 203, 107, 201]
        DEBUG [11:13:11,152] [Incoming-10,AggregationEngine,101-32714(101)][Node] - viewChanged event completed

         

        OOM error received by JGroups code & not by our code. Is this an error?

        DEBUG [14:06:47,265] [Timer-3,AggregationEngine,201-14315(201)][FD_ALL] - haven't received a heartbeat from 101-47859(101) for 64514 ms, adding it to suspect list
        DEBUG [14:07:40,130] [Timer-3,AggregationEngine,201-14315(201)][FD_ALL] - haven't received a heartbeat from 111-33911(111) for 64515 ms, adding it to suspect list
        DEBUG [14:07:48,426] [Timer-3,AggregationEngine,201-14315(201)][FD_ALL] - suspecting [101-47859(101), 107-28395(107), 111-33911(111), 105-31363(105)]
        ERROR [14:11:46,986] [Timer-4,AggregationEngine,201-14315(201)][TimeScheduler3] - failed executing task NAKACK2: RetransmitTask (interval=1000 ms)
        java.lang.OutOfMemoryError: Java heap space
        DEBUG [14:12:17,880] [Timer-2,AggregationEngine,201-14315(201)][FD_ALL] - haven't received a heartbeat from 107-28395(107) for 434946 ms, adding it to suspect list
        DEBUG [14:14:30,246] [Timer-3,AggregationEngine,201-14315(201)][FD_ALL] - haven't received a heartbeat from 105-31363(105) for 140651 ms, adding it to suspect list
        ERROR [14:14:36,792] [multicast receiver,AggregationEngine,201-14315(201)][UDP] - 201-14315(201): failed handling incoming message
        java.lang.OutOfMemoryError: Java heap space
        DEBUG [14:15:13,885] [VERIFY_SUSPECT.TimerThread,AggregationEngine,201-14315(201)][RequestCorrelator] - suspect=105-31363(105)
        DEBUG [14:14:36,792] [Timer-3,AggregationEngine,201-14315(201)][FD_ALL] - haven't received a heartbeat from 107-28395(107) for 567312 ms, adding it to suspect list
        ERROR [14:15:28,714] [multicast receiver,AggregationEngine,201-14315(201)][UDP] - 201-14315(201): failed handling incoming message
        java.lang.OutOfMemoryError: Java heap space

         

         

        2. ASYNC timeout stack trace:

         

        DEBUG [11:57:12,125] [Incoming-7,AggregationEngine,101-48199(101)][LocalTopologyManagerImpl] - Error sending the leave request for cache ___defaultcache to coordinator
        org.infinispan.util.concurrent.TimeoutException: Node 105-32211(105) timed out
                at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommand(CommandAwareRpcDispatcher.java:196)
                at org.infinispan.remoting.transport.jgroups.JGroupsTransport.invokeRemotely(JGroupsTransport.java:531)
                at org.infinispan.topology.LocalTopologyManagerImpl.executeOnCoordinator(LocalTopologyManagerImpl.java:296)
                at org.infinispan.topology.LocalTopologyManagerImpl.leave(LocalTopologyManagerImpl.java:135)
                at org.infinispan.statetransfer.StateTransferManagerImpl.stop(StateTransferManagerImpl.java:234)
                at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
                at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
                at java.lang.reflect.Method.invoke(Method.java:606)
                at org.infinispan.util.ReflectionUtil.invokeAccessibly(ReflectionUtil.java:205)
                at org.infinispan.factories.AbstractComponentRegistry$PrioritizedMethod.invoke(AbstractComponentRegistry.java:889)

         

         

        3.  Exact APIs to retrieve the nodeName attribute -->

         

        GlobalConfigurationBuilder.defaultClusteredBuilder().build().transport().nodeName();

         

        cacheManager.getCacheManagerConfiguration().transport().nodeName();
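
        A short usage sketch (hedged: variable names are illustrative, and note that the first form above builds a brand-new default GlobalConfiguration, so it reads the default (typically unset) node name rather than the running node's; the second form reads the live value):

        import org.infinispan.configuration.global.GlobalConfigurationBuilder;
        import org.infinispan.manager.DefaultCacheManager;
        import org.infinispan.manager.EmbeddedCacheManager;

        // Set the logical node name up front, then read it back from the live manager.
        EmbeddedCacheManager cacheManager = new DefaultCacheManager(
                GlobalConfigurationBuilder.defaultClusteredBuilder()
                        .transport().nodeName("mapper-201")   // "mapper-201" is illustrative
                        .build());
        String nodeName = cacheManager.getCacheManagerConfiguration().transport().nodeName();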

         

         

        • 16. Re: some q's re: impact(s) of ISPN/JGRPS config ...
          nileshbhagat

          Hi Bela / Mircea,

           

          Please find below further evidence on the SuspectException issue, which is still open:

           

          Even after modifying the implementation to register the listener for the Mapper nodes as well (so we do receive ViewChanged events for mappers), the mapper nodes receive the view changed event only when the cluster is being started (and NOT once they get suspected).

          DEBUG [16:06:58,068] [main][Node] - viewChanged event invoked
          DEBUG [16:06:58,068] [main][Node] - viewChanged event old members size : 0
          DEBUG [16:06:58,068] [main][Node] - viewChanged event old members list : []
          DEBUG [16:06:58,068] [main][Node] - viewChanged event new members size : 5
          DEBUG [16:06:58,068] [main][Node] - viewChanged event new members list : [101, 105, 107, 111, 201]
          DEBUG [16:06:58,068] [main][Node] - viewChanged event completed

           

          The Master node receives SuspectException for 1 of the mapper nodes although others may have exhausted heap space as well.

          WARN  [15:46:19,272] [pool-2-thread-1][CommandExec] - callOnMaster ExecutionException, throws
          java.util.concurrent.ExecutionException: org.infinispan.remoting.transport.jgroups.SuspectException: Node 107-3348(107) was suspected
                  at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:262)
                  at java.util.concurrent.FutureTask.get(FutureTask.java:119)
                  at org.infinispan.distexec.DefaultExecutorService$DistributedTaskPart.innerGet(DefaultExecutorService.java:939)
                  at org.infinispan.distexec.DefaultExecutorService$DistributedTaskPart.get

           

           

          After a mapper node exhibits java.lang.OutOfMemoryError, it does NOT receive any view changed event from the cluster. We catch the thrown exception and invoke EmbeddedCacheManager.stop() & EmbeddedCacheManager.start(), which does complete as per the log below, yet none of the suspected nodes subsequently receive any view changed event from the cluster.
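
          The restart amounts to roughly the following (a sketch only; the names are illustrative. As the later replies show, re-using the same DefaultCacheManager fails on ISPN 5.3 with "Cache container has been stopped and cannot be reused", whose message itself suggests recreating the container, i.e. constructing a fresh DefaultCacheManager, as the direction to try; that inference comes from the error message, not from anything verified in this thread):

          import org.infinispan.manager.EmbeddedCacheManager;

          // Sketch of the restart-on-OOM approach described above (illustrative names).
          public final class CacheRestart {
              public static void restart(EmbeddedCacheManager cacheManager) {
                  cacheManager.stop();      // leaves the cluster & stops all caches
                  cacheManager.start();     // attempted re-join (insufficient on ISPN 5.3)
                  cacheManager.getCache();  // this is where IllegalStateException surfaces
              }
          }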

          WARN  [16:28:46,407] [remote-thread-0,201][Node] - Begin restarting the cache on Node: 201; as it will be evicted from the cluster once suspected.
          DEBUG [16:28:46,407] [remote-thread-0,201][DefaultCacheManager] - Stopping cache manager AggregationEngine on 201-42312(201)
          DEBUG [16:28:46,408] [remote-thread-0,201][CacheImpl] - Stopping cache ___defaultcache on 201-42312(201)
          DEBUG [16:28:46,410] [remote-thread-0,201][TransactionTable] - Wait for on-going transactions to finish for 30 seconds.
          DEBUG [16:28:46,411] [remote-thread-0,201][TransactionTable] - All transactions terminated
          DEBUG [16:28:46,411] [remote-thread-0,201][LocalTopologyManagerImpl] - Node 201-42312(201) leaving cache ___defaultcache
          ...
          DEBUG [16:28:48,748] [remote-thread-0,201][DefaultCacheManager] - Started cache manager AggregationEngine on 201
          DEBUG [16:28:48,748] [transport-thread-19,201][AbstractJBossMarshaller] - Using JBoss Marshalling
          DEBUG [16:28:48,749] [transport-thread-19,201][LocalTopologyManagerImpl] - Ignoring rebalance request 7 for cache ___defaultcache, the local cache manager is not running
          DEBUG [16:28:48,749] [transport-thread-17,201][LocalTopologyManagerImpl] - Ignoring consistent hash update 6 for cache ___defaultcache, the local cache manager is not running
          WARN  [16:28:48,749] [remote-thread-0,201][Node] - Completed restarting the cache on Node: 201; as it will be evicted from the cluster once suspected.

           

          Sometimes the mappers do receive a merge event, though not always ...

          DEBUG [15:59:55,306] [Incoming-7,AggregationEngine,107-3348(107)][JGroupsTransport] - New view accepted: MergeView::[111-53500(111)|4] [111-53500(111), 105-46262(105), 101-9807(101), 107-3348(107)], subgroups=[201-42312(201)|2] [107-3348(107)], [111-53500(111)|3] [111-53500(111), 105-46262(105), 101-9807(101)]
          INFO  [15:59:55,306] [Incoming-7,AggregationEngine,107-3348(107)][JGroupsTransport] - ISPN000093: Received new, MERGED cluster view: MergeView::[111-53500(111)|4] [111-53500(111), 105-46262(105), 101-9807(101), 107-3348(107)], subgroups=[201-42312(201)|2] [107-3348(107)], [111-53500(111)|3] [111-53500(111), 105-46262(105), 101-9807(101)]
          DEBUG [15:59:55,308] [Incoming-7,AggregationEngine,107-3348(107)][Node] - Merge event invoked
          DEBUG [15:59:55,308] [Incoming-7,AggregationEngine,107-3348(107)][Node] - mergedView event merged members size : 4
          DEBUG [15:59:55,308] [Incoming-7,AggregationEngine,107-3348(107)][Node] - mergedView event merged members list : [101, 105, 107, 111]

           

          Unfortunately, the Master node does NOT reflect any suspected nodes even after their caches have been re-started (with our down nodes list reflecting the member list returned by the view changed event). It only reflects itself & the reducer nodes, as per below ...

          DEBUG [15:46:19,271] [Incoming-6,AggregationEngine,101-9807(101)][Node] - viewChanged event invoked
          DEBUG [15:46:19,271] [Incoming-6,AggregationEngine,101-9807(101)][Node] - viewChanged event old members size : 5
          DEBUG [15:46:19,271] [Incoming-6,AggregationEngine,101-9807(101)][Node] - viewChanged event old members list : [101, 105, 107, 111, 201]
          DEBUG [15:46:19,271] [Incoming-6,AggregationEngine,101-9807(101)][Node] - viewChanged event new members size : 3
          DEBUG [15:46:19,271] [Incoming-6,AggregationEngine,101-9807(101)][Node] - viewChanged event new members list : [101, 105, 111]
          WARN  [15:46:19,271] [Incoming-6,AggregationEngine,101-9807(101)][RequestListenerServiceImpl] - Cluster down. down Nodes  [201, 107]
          WARN  [15:46:19,271] [Incoming-6,AggregationEngine,101-9807(101)][SnapshotStatusMonitorService] - Cluster down. down Nodes  [201, 107]

           

          We even tried to invoke DefaultCacheManager.startCache() (after re-starting the cache), but that throws an exception as well ...

           

          ERROR [11:32:18,071] [remote-thread-0,107][InboundInvocationHandlerImpl] - Exception executing command
          java.lang.IllegalStateException: Cache container has been stopped and cannot be reused. Recreate the cache container.
                  at org.infinispan.manager.DefaultCacheManager.assertIsNotTerminated(DefaultCacheManager.java:935)
                  at org.infinispan.manager.DefaultCacheManager.getCache(DefaultCacheManager.java:535)
                  at org.infinispan.manager.DefaultCacheManager.getCache(DefaultCacheManager.java:517)
                  at org.infinispan.manager.DefaultCacheManager.startCache(DefaultCacheManager.java:884)
                  at com.jpmorgan.lri.cs.ae.distexec.Node.restartCache(Node.java:244)
                  at com.jpmorgan.lri.cs.ae.distexec.exec.command.CommandExec.call(CommandExec.java:268)
                  at com.jpmorgan.lri.cs.ae.distexec.exec.command.CommandExec.call(CommandExec.java:28)
                  at org.infinispan.commands.read.DistributedExecuteCommand.perform(DistributedExecuteCommand.java:121)
                  at org.infinispan.remoting.InboundInvocationHandlerImpl.handleInternal(InboundInvocationHandlerImpl.java:122)
                  at org.infinispan.remoting.InboundInvocationHandlerImpl.access$000(InboundInvocationHandlerImpl.java:68)
                  at org.infinispan.remoting.InboundInvocationHandlerImpl$2.run(InboundInvocationHandlerImpl.java:194)
                  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
                  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
                  at java.lang.Thread.run(Thread.java:724)

           

          Some mapper nodes do log the following error once suspected ...

          DEBUG [15:59:55,306] [remote-thread-0,107][LocalTopologyManagerImpl] - Error sending the leave request for cache ___defaultcache to coordinator
          org.infinispan.remoting.transport.jgroups.SuspectException: Node 201-42312(201) was suspected
                  at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommand(CommandAwareRpcDispatcher.java:194)
                  at org.infinispan.remoting.transport.jgroups.JGroupsTransport.invokeRemotely(JGroupsTransport.java:531)
                  at org.infinispan.topology.LocalTopologyManagerImpl.executeOnCoordinator(LocalTopologyManagerImpl.java:296)
                  at org.infinispan.topology.LocalTopologyManagerImpl.leave(LocalTopologyManagerImpl.java:135)
                  at org.infinispan.statetransfer.StateTransferManagerImpl.stop(StateTransferManagerImpl.java:234)
                  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
                  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
                  at java.lang.reflect.Method.invoke(Method.java:606)
                  at org.infinispan.util.ReflectionUtil.invokeAccessibly(ReflectionUtil.java:205)
                  at org.infinispan.factories.AbstractComponentRegistry$PrioritizedMethod.invoke(AbstractComponentRegistry.java:889)
                  at org.infinispan.factories.AbstractComponentRegistry.internalStop(AbstractComponentRegistry.java:694)
                  at org.infinispan.factories.AbstractComponentRegistry.stop(AbstractComponentRegistry.java:572)
                  at org.infinispan.factories.ComponentRegistry.stop(ComponentRegistry.java:246)
                  at org.infinispan.CacheImpl.stop(CacheImpl.java:713)
                  at org.infinispan.CacheImpl.stop(CacheImpl.java:708)
                  at org.infinispan.manager.DefaultCacheManager.stop(DefaultCacheManager.java:738)
                  at com.jpmorgan.lri.cs.ae.distexec.Node.restartCache(Node.java:240)
                  at com.jpmorgan.lri.cs.ae.distexec.exec.command.CommandExec.call(CommandExec.java:268)
                  at com.jpmorgan.lri.cs.ae.distexec.exec.command.CommandExec.call(CommandExec.java:28)
                  at org.infinispan.commands.read.DistributedExecuteCommand.perform(DistributedExecuteCommand.java:121)
                  at org.infinispan.remoting.InboundInvocationHandlerImpl.handleInternal(InboundInvocationHandlerImpl.java:122)
                  at org.infinispan.remoting.InboundInvocationHandlerImpl.access$000(InboundInvocationHandlerImpl.java:68)
                  at org.infinispan.remoting.InboundInvocationHandlerImpl$2.run(InboundInvocationHandlerImpl.java:194)
                  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
                  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
                  at java.lang.Thread.run(Thread.java:724)
          Caused by: SuspectedException
                  at org.jgroups.blocks.MessageDispatcher.sendMessage(MessageDispatcher.java:406)
                  at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.processSingleCall(CommandAwareRpcDispatcher.java:375)
                  at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommand(CommandAwareRpcDispatcher.java:189)

           

          Resubmitting the request does not work, as the master node has already evicted the suspected nodes & hence throws the following error for any suspected node picked up by Infinispan:

          AggregationEngineException [excCode=AggEngDistrExec[Error Id=1002, Message=Can't process Aggregation Engine Request ], execId=0, nodeId=null, request=null, getMessage()=CommandException [commandDef=CommandGenericDef [supplyContextKey=0, snapshotContextKey=147584717001, resultContextKey=0, refContextKey=0, properties=null, calcRefObjects={RULES_CONTEXT=140579685001, SMGR_ASSUMPTIONS_STRESS=142455226001}, name=PrepareData], code=CommandPrepateData, code value=2016, getMessage()=java.lang.IllegalArgumentException: Target node 201-55381(201) is not a cluster member, members are [111-51035(111), 105-19421(105), 101-62405(101)]]
          java.lang.IllegalArgumentException: Target node 201-55381(201) is not a cluster member, members are [111-51035(111), 105-19421(105), 101-62405(101)]
          Target node 201-55381(201) is not a cluster member, members are [111-51035(111), 105-19421(105), 101-62405(101)]
          ]
          at com.jpmorgan.lri.cs.ae.distexec.util.Util.createAggException(Util.java:108)
          at com.jpmorgan.lri.cs.ae.listener.RequestListenerServiceImpl.execute(RequestListenerServiceImpl.java:176)
          at com.jpmorgan.lri.cs.ae.listener.RequestListenerServiceImpl.handle(RequestListenerServiceImpl.java:306)

           

           

          In summary, we have implemented & tested the cache restart option a few ways but it does not work at all: the cluster remains disjointed forever & does NOT allow the mapper nodes to join back to the original cluster (which contains only the Master/Reducer nodes once any of the mappers have been suspected).

          Please provide any alternative solutions, as we have exhausted (to the best of our knowledge) all approaches to gracefully restart the suspected nodes, but in vain. If there are any oversights or other recommendations, do share them at your earliest.

           

          Thanks.

          • 17. Re: some q's re: impact(s) of ISPN/JGRPS config ...
            nileshbhagat

            Hi Bela/Mircea,

             

            We are still awaiting your response on the stalemate we have been facing with the SuspectException issue, as per the above findings.

             

            Retried the scenarios a few times today with the same results: suspected nodes DO NOT join back to the parent cluster even after the cache has been successfully re-started, and the Master node includes only itself & the Reducers in the cluster, with all the Mapper nodes evicted & reported as down nodes in the view changed event.

             

            Let us know if there are any other remediation measures w.r.t. ISPN APIs that will help the suspected nodes join back to the cluster programmatically.

             

            In the interim, I have modified the timeouts for the FD & SUSPECT protocols (see below) to higher values, which does allow our long-running requests to complete, obviating the suspect exception scenario. Given that SuspectException is deemed to be benign & planned to be removed in ISPN 7.x, would this approach of increased timeouts suffice?

             

            <FD_ALL timeout="600000" interval="30000" />

            <VERIFY_SUSPECT timeout="300000" num_msgs="5" />

             

            <FD_SOCK />

             


            Please reply at your earliest to the detailed findings/stack traces in the Feb 11th & Feb 12th postings, as well as the conclusion drawn herein. Thanks.

            • 18. Re: some q's re: impact(s) of ISPN/JGRPS config ...
              mircea.markus

              You mention OutOfMemoryErrors. I would strongly suggest investigating the cause of these and fixing it before doing anything else, as the JVM behaves in a nondeterministic manner once such errors are reported, and everything we discuss on this forum might not work. This thread is a bit convoluted, so I'll try to sum it up just to make sure we're on the same page:

              1. the cluster is formed of caches {A, B, C, D}
              2. node D is suspected; it receives a viewChanged event
              3. in this view changed event you stop the cache and restart it
              4. this restart process doesn't trigger another viewChanged event, as you'd expect

              Is this the case? If so, would you try a CacheManager.stop and then CacheManager.getCache(<>) and see how that behaves?

              • 19. Re: some q's re: impact(s) of ISPN/JGRPS config ...
                belaban

                No, this config isn't good. The timeout in FD_ALL means that you'd have to wait for up to 10 minutes for a suspicion to happen if a node hangs (if it crashes, FD_SOCK will detect this immediately). Plus, VERIFY_SUSPECT will take another 5 minutes to verify this. Absolutely BAD! I suggest setting the timeout in FD_ALL to 60s and in VERIFY_SUSPECT to 5s.
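
                In XML, that suggestion would look something like the following (a sketch; only the two timeout values come from the advice above, and any other attributes stay at your existing or default values):

                <FD_SOCK />
                <FD_ALL timeout="60000" />         <!-- 60s instead of 600000 -->
                <VERIFY_SUSPECT timeout="5000" />  <!-- 5s instead of 300000 -->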

                • 20. Re: some q's re: impact(s) of ISPN/JGRPS config ...
                  cotton.ben

                  >  Mircea wrote:

                  >  [...]  If so, would you try an CacheManager.stop and then CacheManager.getCache(<>) and see how that behaves?

                   

                  Thanks ... working on this now. 

                   

                  > Bela wrote:

                  > No, this config isn't good. [...]

                   

                  We are in a lose/lose at the moment. A good config results in our M-R request(s) not being able to complete (LOSE due to SUSPECT/timeout). A bad config (huge timeout) enables our M-R request(s) to complete (but we LOSE because our bad config has other consequences).

                   

                  Thanks for sticking with us in this struggle.  This is a good fight.  Together, we are confident remedy will prevail.

                  • 21. Re: some q's re: impact(s) of ISPN/JGRPS config ...
                    nileshbhagat

                    Hi Mircea,

                     

                    To correct your understanding ...

                     

                    1. cluster is formed of caches {A, B, C, D} with ALL nodes receiving viewChanged event

                    2. node D is suspected, it does NOT receive any viewChanged event & hence cannot restart the cache

                    3. In the catch block of the OOM exception, we stop the cache and restart it

                    4. this restart process doesn't trigger another viewChanged event to any of the suspected nodes, as you'd expect. Is this the case? YES.

                    5. Master node evicts the suspected nodes permanently from the cluster & reports only {A, B, C} in this scenario (as per below)...

                    DEBUG [11:40:53,465] [Incoming-10,AggregationEngine,101-29628(101)][Node] - viewChanged event invoked
                    DEBUG [11:40:53,465] [Incoming-10,AggregationEngine,101-29628(101)][Node] - viewChanged event old members size : 4
                    DEBUG [11:40:53,465] [Incoming-10,AggregationEngine,101-29628(101)][Node] - viewChanged event old members list : [101, 105, 111, 201]
                    DEBUG [11:40:53,465] [Incoming-10,AggregationEngine,101-29628(101)][Node] - viewChanged event new members size : 3
                    DEBUG [11:40:53,465] [Incoming-10,AggregationEngine,101-29628(101)][Node] - viewChanged event new members list : [101, 105, 111]
                    DEBUG [11:40:53,465] [Incoming-10,AggregationEngine,101-29628(101)][Node] - downNodesSet size : 2
                    DEBUG [11:40:53,465] [Incoming-10,AggregationEngine,101-29628(101)][Node] - downNodesSet list : [201, 107]
                    DEBUG [11:40:53,465] [Incoming-10,AggregationEngine,101-29628(101)][Node] - viewChanged event completed

                    6. Resubmitting the request to the Master node results in a "Target node is not a cluster member" exception (refer to the stack traces in the above post & below as well)...

                    AggregationEngineException [excCode=AggEngDistrExec[Error Id=1002, Message=Can't process Aggregation Engine Request ], execId=0, nodeId=null, request=null, getMessage()=CommandException [commandDef=CommandGenericDef [supplyContextKey=0, snapshotContextKey=147584717001, resultContextKey=0, refContextKey=0, properties=null, calcRefObjects={RULES_CONTEXT=140579685001, SMGR_ASSUMPTIONS_STRESS=142455226001}, name=PrepareData], code=CommandPrepateData, code value=2016, getMessage()=java.lang.IllegalArgumentException: Target node 201-2234(201) is not a cluster member, members are [111-31021(111), 105-1425(105), 101-29628(101)]]
                    java.lang.IllegalArgumentException: Target node 201-2234(201) is not a cluster member, members are [111-31021(111), 105-1425(105), 101-29628(101)]
                    Target node 201-2234(201) is not a cluster member, members are [111-31021(111), 105-1425(105), 101-29628(101)]
                    ]
                    at com.jpmorgan.lri.cs.ae.distexec.util.Util.createAggException(Util.java:108)
                    at com.jpmorgan.lri.cs.ae.listener.RequestListenerServiceImpl.execute(RequestListenerServiceImpl.java:176)
                    at com.jpmorgan.lri.cs.ae.listener.RequestListenerServiceImpl.handle(RequestListenerServiceImpl.java:306)

                     

                    ______________________________________________________________________________

                     

                     

                    > If so, would you try a CacheManager.stop and then CacheManager.getCache(<>) and see how that behaves?

                    Yes, I did implement CacheManager.stop and then CacheManager.getCache() (without the start() API), but it results in the following exception. Tried this scenario a few times, but it resulted in the same exception each time.

                    ERROR [11:38:59,286] [remote-thread-0,201][InboundInvocationHandlerImpl] - Exception executing command
                    java.lang.IllegalStateException: Cache container has been stopped and cannot be reused. Recreate the cache container.
                            at org.infinispan.manager.DefaultCacheManager.assertIsNotTerminated(DefaultCacheManager.java:935)
                            at org.infinispan.manager.DefaultCacheManager.getCache(DefaultCacheManager.java:535)
                            at org.infinispan.manager.DefaultCacheManager.getCache(DefaultCacheManager.java:517)
                            at com.jpmorgan.lri.cs.ae.distexec.Node.restartCache(Node.java:244)
                            at com.jpmorgan.lri.cs.ae.distexec.exec.command.CommandExec.call(CommandExec.java:268)
                            at com.jpmorgan.lri.cs.ae.distexec.exec.command.CommandExec.call(CommandExec.java:28)
                            at org.infinispan.commands.read.DistributedExecuteCommand.perform(DistributedExecuteCommand.java:121)
                            at org.infinispan.remoting.InboundInvocationHandlerImpl.handleInternal(InboundInvocationHandlerImpl.java:122)
                            at org.infinispan.remoting.InboundInvocationHandlerImpl.access$000(InboundInvocationHandlerImpl.java:68)
                            at org.infinispan.remoting.InboundInvocationHandlerImpl$2.run(InboundInvocationHandlerImpl.java:194)
                            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
                            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
                            at java.lang.Thread.run(Thread.java:724)

                     

                    Then, tried invoking CacheManager.stop, CacheManager.start and then CacheManager.getCache, but that also results in the same exception ...

                     

                    ERROR [14:56:20,204] [remote-thread-1,201][InboundInvocationHandlerImpl] - Exception executing command
                    java.lang.IllegalStateException: Cache container has been stopped and cannot be reused. Recreate the cache container.
                            at org.infinispan.manager.DefaultCacheManager.assertIsNotTerminated(DefaultCacheManager.java:935)
                            at org.infinispan.manager.DefaultCacheManager.getCache(DefaultCacheManager.java:535)
                            at org.infinispan.manager.DefaultCacheManager.getCache(DefaultCacheManager.java:517)
                            at com.jpmorgan.lri.cs.ae.distexec.Node.restartCache(Node.java:244)
                            at com.jpmorgan.lri.cs.ae.distexec.exec.command.CommandExec.call(CommandExec.java:268)
                            at com.jpmorgan.lri.cs.ae.distexec.exec.command.CommandExec.call(CommandExec.java:28)
                            at org.infinispan.commands.read.DistributedExecuteCommand.perform(DistributedExecuteCommand.java:121)
                            at org.infinispan.remoting.InboundInvocationHandlerImpl.handleInternal(InboundInvocationHandlerImpl.java:122)
                            at org.infinispan.remoting.InboundInvocationHandlerImpl.access$000(InboundInvocationHandlerImpl.java:68)
                            at org.infinispan.remoting.InboundInvocationHandlerImpl$2.run(InboundInvocationHandlerImpl.java:194)
                            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
                            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
                            at java.lang.Thread.run(Thread.java:724)

                     

                     

                    The Master node reports one of the nodes as suspected as per below ...

                     

                    AggregationEngineException [excCode=AggEngDistrExec[Error Id=1002, Message=Can't process Aggregation Engine Request ], execId=0, nodeId=null, request=null, getMessage()=CommandException [commandDef=CommandGenericDef [supplyContextKey=0, snapshotContextKey=147584717001, resultContextKey=0, refContextKey=0, properties=null, calcRefObjects={RULES_CONTEXT=140579685001, SMGR_ASSUMPTIONS_STRESS=142455226001}, name=PrepareData], code=CommandPrepateData, code value=2016, getMessage()=org.infinispan.remoting.transport.jgroups.SuspectException: Node 107-53543(107) was suspected]
                    org.infinispan.remoting.transport.jgroups.SuspectException: Node 107-53543(107) was suspected
                    Node 107-53543(107) was suspected
                    null
                    ]
                    at com.jpmorgan.lri.cs.ae.distexec.util.Util.createAggException(Util.java:108)
                    at com.jpmorgan.lri.cs.ae.listener.RequestListenerServiceImpl.execute(RequestListenerServiceImpl.java:176)
                    at com.jpmorgan.lri.cs.ae.listener.RequestListenerServiceImpl.handle(RequestListenerServiceImpl.java:306)

                    _____________________________________________________________

                     

                    Given that none of the suggested APIs have worked to have suspected node(s) re-join the cluster, are there any tests you or the team can run to validate this scenario & identify which of the ISPN APIs would indeed work? It would be appreciated if the suggestions were tested to validate the findings (given that none of them have worked so far for us).

                     

                    __________________________________________________________

                     

                    Hi Bela,

                     

                    The rationale for the increased timeouts for FD_ALL & VERIFY_SUSPECT was to ensure that the nodes do not get suspected by Infinispan, allowing our requests to complete. Also, SuspectException is planned to be deprecated in ISPN 7.x (based on some other posts).

                     

                    We have tried all the suggested API approaches, but in vain. Hope that your team can validate the suggestions, as that would save time & effort (given that this issue has been open for a while).

                    • 22. Re: some q's re: impact(s) of ISPN/JGRPS config ...
                      nileshbhagat

                      Still awaiting a response from the community on this SuspectException issue, as we did restart the cache (with & without CacheManager.getCache) with the same results: mapper nodes do not join back to the cluster, while the master evicts the mappers permanently and rejects any subsequent requests.

                      • 23. Re: some q's re: impact(s) of ISPN/JGRPS config ...
                        nileshbhagat

                        Hi Bela/Mircea,

                         

                        This SuspectException issue is still open. After testing the CacheManager stop/start/getCache APIs, the results are still negative: the suspected mapper node(s) are evicted from the cluster & never join back, with the master shunning the suspects forever.

                         

                        We have obviated this issue by increasing the timeout values for FD_ALL & VERIFY_SUSPECT, as the mapper nodes do become unresponsive during Full GC but are still alive. Do confirm whether SuspectException may be removed in ISPN 7.x as per https://issues.jboss.org/browse/ISPN-2577

                         

                        Can you validate the following config values, which were set after reviewing the JGroups/ISPN docs:

                         

                        1) Should the ISPN hash.numSegments value be less than or equal to 10 * max_cluster_size (with cluster size assumed to be Master+Reducer+Mapper for all machines in the cluster)?

                         

                        2) What's the basis for ISPN L1 cache values l1.lifespan & l1.cleanupTaskFrequency? Would the default 10mins suffice?

                         

                        3) Confirm the basis for min threads, max threads, queue max size, keep alive time & rejection policy for the regular & OOB thread pools (with the current settings pasted below).

                         

                        thread_pool.enabled="true"
                        thread_pool.min_threads="30"
                        thread_pool.max_threads="100"
                        thread_pool.keep_alive_time="5000"
                        thread_pool.queue_enabled="true"
                        thread_pool.queue_max_size="10000"
                        thread_pool.rejection_policy="Run"

                        oob_thread_pool.enabled="true"
                        oob_thread_pool.min_threads="30"
                        oob_thread_pool.max_threads="100"
                        oob_thread_pool.keep_alive_time="5000"
                        oob_thread_pool.queue_enabled="true"
                        oob_thread_pool.queue_max_size="100"
                        oob_thread_pool.rejection_policy="Run"

                         

                        4) Please review the following values for UNICAST & NAKACK & recommend any changes:

                         

                        <pbcast.NAKACK2 xmit_interval="1000"
                                        max_rebroadcast_timeout="2000"
                                        use_mcast_xmit="true"
                                        use_mcast_xmit_req="true"
                                        discard_delivered_msgs="true"/>

                        <UNICAST3 xmit_interval="1000"
                                  max_retransmit_time="2000"
                                  conn_expiry_timeout="0" />

                         

                        5) What's the basis for the different timeout values in GMS? Please confirm any changes to the settings below:

                         

                         

                        <pbcast.GMS print_local_addr="true"
                                    join_timeout="10000"
                                    leave_timeout="10000"
                                    merge_timeout="5000"
                                    view_bundling="true"
                                    max_bundling_time="500"
                                    view_ack_collection_timeout="2000"
                                    resume_task_timeout="15000"/>

                         

                        6) Any suggestions on maximizing the cluster throughput with the mapper nodes being able to send results back to the Master/Reducer with minimal latency?

                         

                        Hope that the above can be confirmed at your earliest.

                         

                        Thanks,

                        Nilesh.

                        • 24. Re: some q's re: impact(s) of ISPN/JGRPS config ...
                          cotton.ben

                          Hi ISPN, Bela, Mircea,

                           

                          Just to recap this matter:

                           

                          1.  When our ISPN 5.3 deployment uses the recommended JGRPS configs, our DistributedExecutorService dispatch onto our DIST_SYNC grid cannot meet its SLA ... it fails due to SUSPECT_EXCEPTION.

                          2.  We remedy this by using a non-recommended config, providing HUGE timeouts. This gratuitously accommodates our DES dispatch completing, but we are now deployed with a non-preferred config.

                          3.  In this exact forum thread, a remedy providing both (1) a preferred config and (2) the ability for us to consistently realize our DES dispatch succeeding w/o SUSPECT_EXCEPTION has been ... well ... it has been elusive.

                          4.  In ISPN 7 the SUSPECT_EXCEPTION goes away; at the time we upgrade we should then, on ISPN 7, fully expect our DES dispatch to work perfectly with a preferred JGRPS config.

                          5.  If we cannot wait until we upgrade to ISPN 7, and want both a preferred config and the expectation that our DES dispatch will not endure a SUSPECT_EXCEPTION, we should take advantage of the fact that we are now customers and open an ISPN 5.3 bug ticket via our RedHat subscribed customer account. And request a 5.3 patch.

                           

                          Could you guys kindly confirm that, from your vantage point,  all of 1-5 are true?

                           

                          Thanks,

                          Ben

                          • 25. Re: some q's re: impact(s) of ISPN/JGRPS config ...
                            mircea.markus

                            Yes, I did implement CacheManager.stop and then CacheManager.getCache() (without the start() API), but it results in the following exception. Tried this scenario a few times, but it resulted in the same exception each time.

                            ERROR [11:38:59,286] [remote-thread-0,201][InboundInvocationHandlerImpl] - Exception executing command
                            java.lang.IllegalStateException: Cache container has been stopped and cannot be reused. Recreate the cache container.
                                    at org.infinispan.manager.DefaultCacheManager.assertIsNotTerminated(DefaultCacheManager.java:935)
                                    at org.infinispan.manager.DefaultCacheManager.getCache(DefaultCacheManager.java:535)
                                    at org.infinispan.manager.DefaultCacheManager.getCache(DefaultCacheManager.java:517)
                                    at com.jpmorgan.lri.cs.ae.distexec.Node.restartCache(Node.java:244)
                                    at com.jpmorgan.lri.cs.ae.distexec.exec.command.CommandExec.call(CommandExec.java:268)
                                    at com.jpmorgan.lri.cs.ae.distexec.exec.command.CommandExec.call(CommandExec.java:28)
                                    at org.infinispan.commands.read.DistributedExecuteCommand.perform(DistributedExecuteCommand.java:121)
                                    at org.infinispan.remoting.InboundInvocationHandlerImpl.handleInternal(InboundInvocationHandlerImpl.java:122)
                                    at org.infinispan.remoting.InboundInvocationHandlerImpl.access$000(InboundInvocationHandlerImpl.java:68)
                                    at org.infinispan.remoting.InboundInvocationHandlerImpl$2.run(InboundInvocationHandlerImpl.java:194)
                                    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
                                    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
                                    at java.lang.Thread.run(Thread.java:724)

                            Right, you'd need to do CacheManager.stop(), then CacheManager.start(), then CacheManager.getCache().

                            • 26. Re: some q's re: impact(s) of ISPN/JGRPS config ...
                              mircea.markus

                              1.  When our ISPN 5.3 deployment uses the recommended JGRPS configs, our DistributedExecutorService dispatch onto our DIST_SYNC grid cannot meet its SLA ... it fails due to SUSPECT_EXCEPTION.

                              2.  We remedy this by using a non-recommended config, providing HUGE timeouts. This gratuitously accommodates our DES dispatch completing, but we are now deployed with a non-preferred config.

                              3.  In this exact forum thread, a remedy providing both (1) a preferred config and (2) the ability for us to consistently realize our DES dispatch succeeding w/o SUSPECT_EXCEPTION has been ... well ... it has been elusive.

                              4.  In ISPN 7 the SUSPECT_EXCEPTION goes away; at the time we upgrade we should then, on ISPN 7, fully expect our DES dispatch to work perfectly with a preferred JGRPS config.

                              5.  If we cannot wait until we upgrade to ISPN 7, and want both a preferred config and the expectation that our DES dispatch will not endure a SUSPECT_EXCEPTION, we should take advantage of the fact that we are now customers and open an ISPN 5.3 bug ticket via our RedHat subscribed customer account. And request a 5.3 patch.

                              Ben, I'm afraid there's no straight answer to the suspect exception questions. Suspect exceptions happen because a node is seen as not being present in the cluster by the other members. This might be caused by a variety of factors, e.g. the node crashing, a NIC crashing, the network being flooded, or the node being overwhelmed (long GC run, OOM). The way I'd tackle this would be to see why the node gets suspected. In your case it sounds like either an OOM or a long GC pause, but it might be the network as well. Monitoring the system (the failed node in particular) and the logs from the JGroups failure detection protocols you configured (FD_ALL, VERIFY_SUSPECT, FD_SOCK) should give you a hint. It might be the case that the system is overwhelmed; I imagine the computations you're doing there are not trivial.
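
                              For instance, raising the log level of the failure-detection protocols would surface the suspicion decisions as they are made (a sketch assuming a log4j 1.x properties file, which the log format in this thread suggests; adjust to whatever logging backend you actually use):

                              # log4j.properties sketch: trace the JGroups failure-detection protocols
                              log4j.logger.org.jgroups.protocols.FD_ALL=TRACE
                              log4j.logger.org.jgroups.protocols.FD_SOCK=TRACE
                              log4j.logger.org.jgroups.protocols.VERIFY_SUSPECT=TRACE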

                              • 27. Re: some q's re: impact(s) of ISPN/JGRPS config ...
                                nileshbhagat

                                It's surprising that my post was overlooked as I had already tried exactly that & even that failed with the same exception as per below ...

                                 

                                 

                                Then, tried invoking CacheManager.stop, CacheManager.start and then CacheManager.getCache, but that also results in the same exception ...


                                ERROR [14:56:20,204] [remote-thread-1,201][InboundInvocationHandlerImpl] - Exception executing command
                                java.lang.IllegalStateException: Cache container has been stopped and cannot be reused. Recreate the cache container.
                                        at org.infinispan.manager.DefaultCacheManager.assertIsNotTerminated(DefaultCacheManager.java:935)
                                        at org.infinispan.manager.DefaultCacheManager.getCache(DefaultCacheManager.java:535)
                                        at org.infinispan.manager.DefaultCacheManager.getCache(DefaultCacheManager.java:517)
                                        at com.jpmorgan.lri.cs.ae.distexec.Node.restartCache(Node.java:244)
                                        at com.jpmorgan.lri.cs.ae.distexec.exec.command.CommandExec.call(CommandExec.java:268)
                                        at com.jpmorgan.lri.cs.ae.distexec.exec.command.CommandExec.call(CommandExec.java:28)
                                        at org.infinispan.commands.read.DistributedExecuteCommand.perform(DistributedExecuteCommand.java:121)
                                        at org.infinispan.remoting.InboundInvocationHandlerImpl.handleInternal(InboundInvocationHandlerImpl.java:122)
                                        at org.infinispan.remoting.InboundInvocationHandlerImpl.access$000(InboundInvocationHandlerImpl.java:68)
                                        at org.infinispan.remoting.InboundInvocationHandlerImpl$2.run(InboundInvocationHandlerImpl.java:194)
                                        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
                                        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
                                        at java.lang.Thread.run(Thread.java:724)

                                • 28. Re: some q's re: impact(s) of ISPN/JGRPS config ...
                                  cotton.ben

                                  > I imagine the computations you're doing there are not trivial.

                                   

                                  No, they are not trivial ... real time liquidity risk can only be rendered on our Linux HPC supercomputer (240 CPUs, 3TB RAM, UDP/IP/Infiniband interconnect).

                                   

                                  > The way I'd tackle this would be to see why the node gets suspected. In your case it sounds like either an OOM or a long GC pause, but it might be the network as well.

                                   

                                  Yep, I see what you mean - but finding the cause does not mean we can then find a remedy. We have to get a solution to long GC pauses triggering our ISPN/JGRPS stack into a "that node is SUSPECT, let's throw an Exception" response as the only possible outcome. You guys have got to change it so that there are more manageable outcomes in that situation than throwing SUSPECT_EXCEPTION. When you throw *that* Exception it immediately betrays our DistributedExecutorService/NotifyingFuture (DES/NF is *the* beauty of Infinispan, BTW) processing from Reducing/Fusing all our Mapper Nodes' rendered "risk essential" operand set(s) ... and w/o soundly/completely rendered operand set(s) ... our SLA obligation FAILS.

                                   

                                  You guys need to fix that. It doesn't matter what we find as the cause of SUSPECT_EXCEPTION if the only outcome remains "throw it". We do care why you think a Mapper node has left the cluster, but what we need most are more manageable outcomes (not just throwing SUSPECT_EXCEPTION) so that our DES/NF service has a chance to survive.
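
                                  Until then, the only application-level mitigation we can see is to catch the wrapped SuspectException and resubmit (a hedged sketch; DefaultExecutorService and SuspectException are real ISPN 5.x classes, everything else here is illustrative, and as this thread shows, the resubmit only helps if the executor can target a member still in the coordinator's view):

                                  import java.util.concurrent.Callable;
                                  import java.util.concurrent.ExecutionException;
                                  import java.util.concurrent.Future;
                                  import org.infinispan.Cache;
                                  import org.infinispan.distexec.DefaultExecutorService;
                                  import org.infinispan.distexec.DistributedExecutorService;
                                  import org.infinispan.remoting.transport.jgroups.SuspectException;

                                  public final class RetryingDispatch {
                                      // task must be Serializable for distribution; maxAttempts bounds the retries.
                                      public static <T> T submitWithRetry(Cache<?, ?> cache, Callable<T> task,
                                                                          int maxAttempts) throws Exception {
                                          DistributedExecutorService des = new DefaultExecutorService(cache);
                                          for (int attempt = 1; ; attempt++) {
                                              Future<T> future = des.submit(task);
                                              try {
                                                  return future.get();
                                              } catch (ExecutionException e) {
                                                  // A suspected member surfaces as a wrapped SuspectException.
                                                  if (e.getCause() instanceof SuspectException && attempt < maxAttempts) {
                                                      continue; // resubmit, hoping ISPN picks a member still in the view
                                                  }
                                                  throw e;
                                              }
                                          }
                                      }
                                  }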

                                  • 29. Re: some q's re: impact(s) of ISPN/JGRPS config ...
                                    mircea.markus

                                    You guys need to fix that. It doesn't matter what we find as the cause of SUSPECT_EXCEPTION if the only outcome remains "throw it". We do care why you think a Mapper node has left the cluster, but what we need most are more manageable outcomes (not just throwing SUSPECT_EXCEPTION) so that our DES/NF service has a chance to survive.

                                    So the good news is this[1]. The bad news is that you'll only have it in ISPN 7.0 :-)

                                     

                                    [1] [ISPN-2577] Silence SuspectExceptions - JBoss Issue Tracker (https://issues.jboss.org/browse/ISPN-2577)