33 Replies, latest reply on Mar 1, 2014 4:04 AM by belaban

    some q's re: impact(s) of ISPN/JGRPS config ...

    cotton.ben



      Please consider reviewing our updated JGroups & Infinispan XML config files (attached below), as they are the basis for the following questions:

       

      Why do the unicast/multicast buffer sizes for the UDP transport only consider the rmem_max setting on Unix? Could they consider the udp_mem settings instead, which seem more relevant?


      Why does the org.infinispan.remoting.transport.TopologyAwareAddress API not provide an accessor/setter for the nodeName attribute?

       

      Is it true that ASYNC communication is not supported in distributed mode and that it will fall back to SYNC communication with a default timeout of 15 secs?

       

      Once the lists of old & new members are identified using the org.infinispan.notifications.cachemanagerlistener.event.ViewChangedEvent API when a node leaves the cluster upon org.infinispan.remoting.transport.jgroups.SuspectException, is there any API that will allow the failed node to be joined back to the cluster?

       

      What's the default value for lifespan & cleanupTaskFrequency of the L1 element? As per the schema, the default is 60000 (1 min), while the documentation indicates 10 mins.

       

      Why is the RadarGun-measured throughput lower with the newer protocol settings in JGroups 3.3 (compared to the higher throughput in earlier JGroups versions)?


      ---- jgroups-new.xml ----

       

       

       

       

      <!--

        Fast configuration for local mode, i.e. all members reside on the same host. Setting ip_ttl to 0 means that

        no multicast packet will make it outside the local host.

        Therefore, this configuration will NOT work for clustering members residing on different hosts!

       

       

        Author: Bela Ban

        Version: $Id: fast-local.xml,v 1.9 2009/12/18 14:50:00 belaban Exp $

      -->

       

       

      <config xmlns="urn:org:jgroups"

              xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

              xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/JGroups-3.3.xsd">

             

          <!-- TRANSPORT -->   

          <UDP mcast_addr="239.1.1.1"

               mcast_port="${jgroups.udp.mcast_port:45111}"

               tos="8"

               ucast_recv_buf_size="4M"

               ucast_send_buf_size="640K"

               mcast_recv_buf_size="4M"

               mcast_send_buf_size="640K"

               loopback="true"

               max_bundle_size="64k"

               max_bundle_timeout="30"

               bundler_type="new"

               ip_ttl="${jgroups.udp.ip_ttl:1}"

               ip_mcast="true"

               enable_diagnostics="true"

               thread_naming_pattern="clp"

       

       

               timer_type="new3"

               timer.min_threads="4"

               timer.max_threads="10"

               timer.keep_alive_time="3000"

               timer.queue_max_size="1000"

               timer.rejection_policy="discard"

       

       

               thread_pool.enabled="true"

               thread_pool.min_threads="40"

               thread_pool.max_threads="100"

               thread_pool.keep_alive_time="5000"

               thread_pool.queue_enabled="true"

               thread_pool.queue_max_size="10000"

               thread_pool.rejection_policy="discard"

       

       

               oob_thread_pool.enabled="true"

               oob_thread_pool.min_threads="40"

               oob_thread_pool.max_threads="100"

               oob_thread_pool.keep_alive_time="5000"

               oob_thread_pool.queue_enabled="true"

               oob_thread_pool.queue_max_size="100"

               oob_thread_pool.rejection_policy="discard"/>

       

       

        <!-- MEMBER DISCOVERY -->

          <PING timeout="10000"

                  num_initial_members="10"

                  break_on_coord_rsp="true"/>

         

          <!-- MERGE AFTER NETWORK PARTITION -->       

          <MERGE3 max_interval="30000"

                  min_interval="10000"

                  max_participants_in_merge="0"/>

         

          <!-- FAILURE DETECTION -->       

          <!-- <FD_SOCK /> -->

          <!-- <FD_PING /> -->

          <!-- <FD_ALL /> -->

          <FD timeout="60000" max_tries="10" />

          <VERIFY_SUSPECT timeout="60000" num_msgs="5" />

         

         

          <!-- MESSAGE TRANSMISSION -->

          <!-- <pbcast.NAKACK use_mcast_xmit="false"

                         retransmit_timeout="100,300,600,1200"

                         discard_delivered_msgs="true"/> -->

          <pbcast.NAKACK2 xmit_interval="1000"

           max_rebroadcast_timeout="2000"

           use_mcast_xmit="true"

           use_mcast_xmit_req="true"

           discard_delivered_msgs="true"/>

         

          <!-- <UNICAST2 timeout="300,600,1200"

                    conn_expiry_timeout="0"/> -->

          <UNICAST3 xmit_interval="1000"

           max_retransmit_time="2000"

           conn_expiry_timeout="0" />

          

        <!-- <RSVP resend_interval="2000" timeout="10000"/> -->

         

          <!-- MESSAGE STABILITY -->

          <!-- <BARRIER /> -->

          <!-- <pbcast.STATE_TRANSFER /> -->

               

          <!-- <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"

                         max_bytes="1000000"/> -->

          <pbcast.STABLE stability_delay="1000" desired_avg_gossip="60000"

                         max_bytes="5M"/>

          <pbcast.FLUSH  timeout="1000" />

         

         

          <!-- GROUP MEMBERSHIP -->                  

          <!-- <pbcast.GMS print_local_addr="true" join_timeout="60000" leave_timeout="60000"

                      max_bundling_time="20000"

                      view_bundling="true"/> -->

          <pbcast.GMS print_local_addr="true"

           join_timeout="10000"

           leave_timeout="10000"

           merge_timeout="5000"

           view_bundling="true"

           max_bundling_time="500"

           view_ack_collection_timeout="2000"/>

         

          <!-- FLOW CONTROL -->           

          <!-- <FC max_credits="2M"

              min_threshold="0.10"/> -->

          <MFC max_credits="2M"

          min_threshold="0.25"/>

          <UFC max_credits="2M"

          min_threshold="0.25"/>

             

         

          <!-- FRAGMENTATION -->   

          <!-- <FRAG2 frag_size="60000"  /> -->

          <FRAG2 frag_size="62K"  />

       

       

        

      </config>

       

       

       

       

      ------------------ infinispan-config.xml ---------------------------

       

       

      <?xml version="1.0" encoding="UTF-8"?>

        <!--

        ~ JBoss, Home of Professional Open Source ~ Copyright 2009 Red Hat

        Inc. and/or its affiliates and other ~ contributors as indicated by

        the @author tags. All rights reserved. ~ See the copyright.txt in the

        distribution for a full listing of ~ individual contributors. ~ ~ This

        is free software; you can redistribute it and/or modify it ~ under the

        terms of the GNU Lesser General Public License as ~ published by the

        Free Software Foundation; either version 2.1 of ~ the License, or (at

        your option) any later version. ~ ~ This software is distributed in

        the hope that it will be useful, ~ but WITHOUT ANY WARRANTY; without

        even the implied warranty of ~ MERCHANTABILITY or FITNESS FOR A

        PARTICULAR PURPOSE. See the GNU ~ Lesser General Public License for

        more details. ~ ~ You should have received a copy of the GNU Lesser

        General Public ~ License along with this software; if not, write to

        the Free ~ Software Foundation, Inc., 51 Franklin St, Fifth Floor,

        Boston, MA ~ 02110-1301 USA, or see the FSF site: http://www.fsf.org.

        -->

        <!--

        infinispan xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

        xsi:schemaLocation="urn:infinispan:config:5.3

        http://www.infinispan.org/schemas/infinispan-config-5.3.xsd"

        xmlns="urn:infinispan:config:5.3"

        -->

      <infinispan xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

        xsi:schemaLocation="urn:infinispan:config:5.3

        http://www.infinispan.org/schemas/infinispan-config-5.3.xsd"

        xmlns="urn:infinispan:config:5.3">

       

       

        <global>

        <transport siteId="${nodename}" machineId="m1" rackId="r1"

        nodeName="${nodename}" clusterName="AggregationEngine">

        <properties>

        <property name="configurationFile" value="./xml/jgroups-new.xml" />

        </properties>

        </transport>

       

       

        <asyncListenerExecutor>

        <properties>

        <property name="maxThreads" value="4" />

        </properties>

        </asyncListenerExecutor>

       

       

        <asyncTransportExecutor>

        <properties>

        <property name="maxThreads" value="100" />

        </properties>

        </asyncTransportExecutor>

       

        <globalJmxStatistics enabled="true" />

        </global>

       

       

        <default>

        <jmxStatistics enabled="true" />

       

        <clustering mode="dist">

        <!-- <l1 enabled="true" lifespan="60000" /> -->

        <l1 enabled="true" lifespan="60000" cleanupTaskFrequency="60000"/>

       

        <!-- <hash numOwners="2" /> -->

        <hash numOwners="2" numSegments="1000"/>

       

          <!-- SYNC mode times out after the specified interval & hence a greater value is recommended -->

        <sync replTimeout="3600000" />

       

        <!-- ASYNC mode almost always times out in 16 secs & hence is not used -->

        <!-- <async /> -->

        </clustering>

        </default>

      </infinispan>

        • 1. Re: some q's re: impact(s) of ISPN/JGRPS config ...
          nileshbhagat

          Still awaiting a reply from the JGroups/ISPN community on this post. Can Bela Ban or any other resource confirm the XML settings & respond to the questions?

          • 2. Re: some q's re: impact(s) of ISPN/JGRPS config ...
            mircea.markus
            Why does the org.infinispan.remoting.transport.TopologyAwareAddress API not provide accessor/setter for nodeName attribute?

             

            The nodeName is not something that can be changed dynamically; it must be set before opening the underlying JGroups channel, hence there's no point in having a setter. If you want to read the value, a better way is to read it through the configuration directly: GlobalConfiguration.transport().nodeName(). TopologyAwareAddress is pretty internal stuff.


            Is it true that ASYNC communication is not supported in Distributed mode & it will fall back to SYNC communication with a default timeout of 15 secs?

            No. Actually, starting with ISPN 5.3 we only have distributed mode, replicated mode being implemented as a degenerate case of distribution with numOwners > clusterSize.


             

            Once the list of old & new members are identified using the org.infinispan.notifications.cachemanagerlistener.event.ViewChangedEvent API when a node leaves the cluster upon org.infinispan.remoting.transport.jgroups.SuspectException; is there any API that will allow the failed node to be joined-back to the cluster?

             

            I don't think so. What you can do, though, is implement a listener that will live on that node and restart it in order to trigger a clean join. I'd be curious to see dan.berindei's thoughts on this as well.
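To make the restart-on-eviction pattern concrete, here is a minimal, self-contained sketch. The `CacheHandle` interface and the plain `onViewChanged` method are hypothetical stand-ins so the example runs without Infinispan on the classpath; in the real API this would be a `@Listener`-annotated class with a `@ViewChanged` method receiving a `ViewChangedEvent`, and the handle would be `org.infinispan.Cache` with its `stop()`/`start()` lifecycle methods.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for org.infinispan.Cache's lifecycle methods.
interface CacheHandle {
    void stop();
    void start();
}

// Models the restart-on-eviction logic: a listener living on each node
// compares the new view with its own address and, if the node was dropped
// from the view (e.g. after a SuspectException), restarts the cache to
// force a clean rejoin.
class RejoinListener {
    private final String localAddress;
    private final CacheHandle cache;

    RejoinListener(String localAddress, CacheHandle cache) {
        this.localAddress = localAddress;
        this.cache = cache;
    }

    // oldMembers/newMembers correspond to ViewChangedEvent.getOldMembers()
    // and ViewChangedEvent.getNewMembers() in the real API.
    void onViewChanged(List<String> oldMembers, List<String> newMembers) {
        boolean wasMember = oldMembers.contains(localAddress);
        boolean stillMember = newMembers.contains(localAddress);
        if (wasMember && !stillMember) {
            cache.stop();   // tear down, then rejoin with a fresh state transfer
            cache.start();
        }
    }
}

public class RejoinDemo {
    public static void main(String[] args) {
        AtomicInteger restarts = new AtomicInteger();
        CacheHandle cache = new CacheHandle() {
            public void stop()  { /* would close the JGroups channel */ }
            public void start() { restarts.incrementAndGet(); }
        };
        RejoinListener listener = new RejoinListener("202", cache);
        // Node 202 is dropped from the view -> restart triggered.
        listener.onViewChanged(List.of("101", "201", "202"), List.of("101", "201"));
        // Node 202 still in the view -> nothing happens.
        listener.onViewChanged(List.of("101", "201", "202"), List.of("101", "201", "202"));
        System.out.println("restarts=" + restarts.get()); // prints restarts=1
    }
}
```

Caveat: as later replies in this thread observe, an evicted node may never receive the ViewChanged event at all, so in practice the restart may need to be driven by a JGroups-level hook or an external watchdog instead.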

             

             

            What's the default value for lifespan & cleanupTaskFrequency of the L1 element? As per the schema, the default is 60000 (1 min), while the documentation indicates 10 mins.

            Code won't lie :-) infinispan/core/src/main/java/org/infinispan/configuration/cache/L1ConfigurationBuilder.java at master · infinispan/infi…

            Care to create a JIRA for the documentation?

             

             

            Why is the RadarGun-measured throughput lower with the newer protocol settings in JGroups 3.3 (compared to the higher throughput in earlier JGroups versions)?

            In case you're using ISPN 5.3, there were some significant performance degradations in 5.3 that were fixed in 6.0: [ISPN-3534] Investigate performance regressions in Infinispan 6.0.0 - JBoss Issue Tracker

             

             

            Why do the unicast/multicast buffer sizes for the UDP transport only consider the rmem_max setting on Unix? Could they consider the udp_mem settings instead, which seem more relevant?

            belaban - any idea? ^^


            • 3. Re: some q's re: impact(s) of ISPN/JGRPS config ...
              belaban

              I didn't know about udp_mem / udp_rmem_min and udp_wmem_min. This seems definitely relevant in cases where you send a lot of UDP datagram (and multicast) packets. I suggest monitoring your UDP buffer stats with netstat -su and increasing the values if you see a lot of packet drops.

              • 4. Re: some q's re: impact(s) of ISPN/JGRPS config ...
                cotton.ben

                Thanks Bela and Mircea.

                Once the list of old & new members are identified using the org.infinispan.notifications.cachemanagerlistener.event.ViewChangedEvent API when a node leaves the cluster upon org.infinispan.remoting.transport.jgroups.SuspectException; is there any API that will allow the failed node to be joined-back to the cluster?

                 

                I don't think so. What you can do, though, is implement a listener that will live on that node and restart it in order to trigger a clean join. I'd be curious to see Dan Berindei's thoughts on this as well.

                 

                To be precise, the listener that lives on the disconnected node does the following (in this order):

                 

                1.  exists to listen for its disconnect from the cluster

                2.  upon notification of its disconnect it then proceeds to 3 and 4.

                3.  restarts to trigger a clean join (how does it do this? Ask the OS runtime to boot a new, identical JavaVM process in the exact same way the disconnected node's JavaVM process was booted?)

                4.  commits suicide

                Correct? Does Dan Berindei likely have a craftier approach?

                • 5. Re: some q's re: impact(s) of ISPN/JGRPS config ...
                  nileshbhagat

                  Hi Bela,

                   

                  I would appreciate your response to the following questions:

                   

                  1.  Given that JGroups only considers the rmem/wmem settings for UDP send/receive buffer sizes, can it be modified to refer to the udp_mem settings, which are configured at 27MB max on our machines? The rmem/wmem values are set to 4MB max & can't be modified on our machines. Refer to the warning when buffer sizes are exceeded:

                  7646 WARN  [12:14:04,114] [main][UDP] - [JGRP00015] the receive buffer of socket DatagramSocket was set to 40MB, but the OS only allocated 4.19MB. This might lead to performance problems. Please set your max receive buffer in the OS correctly (e.g. net.core.rmem_max on Linux)

                  7647 WARN  [12:14:04,115] [main][UDP] - [JGRP00015] the receive buffer of socket MulticastSocket was set to 40MB, but the OS only allocated 4.19MB. This might lead to performance problems. Please set your max receive buffer in the OS correctly (e.g. net.core.rmem_max on Linux)

                  2.  Can you suggest guidelines on the send & receive buffer sizes, and any ratio thereof?

                   

                  3.  Once the lists of old & new members are identified using the org.infinispan.notifications.cachemanagerlistener.event.ViewChangedEvent API when a node leaves the cluster upon org.infinispan.remoting.transport.jgroups.SuspectException, is there any API that will allow the failed node to rejoin the cluster?

                  As per the response from Mircea, restarting the node for a clean join may mean losing the cached data on the suspected node. But is there any other mechanism for a node to auto-rejoin by itself, such as the "shun" attribute of the FD protocol, OR a JChannel API that supports auto-joining (refer to the JBoss Server link https://docs.jboss.org/jbossclustering/cluster_guide/5.1/html/jgroups.chapt.html)?

                   

                  4.  Would you please review the pasted JGroups & Infinispan XML to confirm that the protocols used & their attribute values are in conformance with the respective 3.3 & 5.3 versions being used?

                   

                   

                  Thanks,

                  Nilesh Bhagat.

                  • 6. Re: some q's re: impact(s) of ISPN/JGRPS config ...
                    mircea.markus

                    For 3. and 4., a cache.stop()/cache.start() should do the trick.

                    • 7. Re: some q's re: impact(s) of ISPN/JGRPS config ...
                      belaban

                      Here are a few comments re the posted config:

                      • FLUSH: it's in the wrong place (it should be at the top of the stack), but the question is why you use it in the first place. Infinispan does *not* require it anymore
                      • timer.rejection_policy should be "abort", so a new thread is spawned instead of a timer task being discarded. The latter might have disastrous consequences
                      • Your thread pool min sizes are quite high (40). I suggest lowering them (especially as you have a high max size and a queue enabled)
                      • Which version of JGroups is this?
                      • Why is FD_SOCK commented out?
                      • I recommend FD_ALL (with a high timeout) with UDP rather than FD

                       

                      Re the udp_mem setting: apparently udp_mem_X is the new net.core.r/wmem_max, so if you change this, the send and receive buffers will be honored.
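As a sketch of the Linux sysctl knobs being discussed here (assuming Linux; the values are illustrative only, not recommendations, so check your distribution's defaults before changing anything):

```
# /etc/sysctl.conf -- illustrative values only
net.core.rmem_max = 41943040              # max receive buffer the kernel will grant a socket (40MB)
net.core.wmem_max = 41943040              # max send buffer
net.ipv4.udp_mem = 262144 327680 393216   # total UDP memory, in pages: min / pressure / max
net.ipv4.udp_rmem_min = 8192              # per-socket receive floor under memory pressure (bytes)
net.ipv4.udp_wmem_min = 8192              # per-socket send floor (bytes)
```

Apply with `sysctl -p`, then re-check whether the JGRP00015 warning still appears on the next start.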

                      • 8. Re: some q's re: impact(s) of ISPN/JGRPS config ...
                        belaban

                        Nilesh Bhagat wrote:

                        (questions 1-4 from reply 5 above, quoted verbatim)

                        1. JGroups cannot directly make use of net.core.rmem_max or udp_mem, but these values determine the buffer sizes JGroups actually obtains. In your case above, I suggest reducing the send/receive buffer sizes to 4MB. As a matter of fact, the default buffer sizes are a bit too big and you can set both to under 4MB. I see you've already done that in your config.
                        2. In TCP, buffers are set using the bandwidth-delay product. However, in JGroups we receive messages not from a single peer but potentially from multiple peers. So this depends on a few things, such as how many peers a node is receiving messages from, the avg size of those messages, the message arrival rate, etc. I suggest leaving the buffers as they are in your config, running your test program which mimics the real app, and watching the stats with netstat -su. If you see a lot of dropped packets, increase the buffer sizes.
                        3. No, shunning was deprecated. I suggest you try out Mircea's suggestion of using a listener and restarting the cache programmatically.
                        4. Done, see my other post.
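The OS capping behind the JGRP00015 warning can be observed with plain JDK sockets: request a large receive buffer and read back what the kernel actually granted, which is essentially the check JGroups performs when it logs that warning. A small self-contained probe (the class name is mine):

```java
import java.net.DatagramSocket;
import java.net.SocketException;

// Requests a 40MB receive buffer (the size from the warning above) and
// reads back the size the OS actually granted. If the granted size is
// smaller, net.core.rmem_max is the ceiling to raise on Linux.
public class RecvBufProbe {
    public static void main(String[] args) throws SocketException {
        int requested = 40 * 1024 * 1024; // 40MB
        try (DatagramSocket sock = new DatagramSocket()) {
            sock.setReceiveBufferSize(requested);
            int granted = sock.getReceiveBufferSize();
            System.out.printf("requested=%dMB granted=%.2fMB%n",
                    requested / (1024 * 1024), granted / (1024.0 * 1024.0));
            if (granted < requested) {
                System.out.println("OS capped the buffer; raise net.core.rmem_max");
            }
        }
    }
}
```

Running this on a box with the 4MB cap described above should report roughly the same 4.19MB figure that appears in the warning.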
                        • 9. Re: some q's re: impact(s) of ISPN/JGRPS config ...
                          nileshbhagat

                          Hi Bela & Mircea,


                          Thanks for your responses. Given that I can only post once per day, I am consolidating further questions under this reply & would kindly request you to respond accordingly.

                           

                          My questions are italicized for clarity (with last 3 being important ones).
                          _________________________________________________________________

                           

                          [screenshot of quoted reply]

                          GlobalConfiguration is a deprecated class, while its successor GlobalConfigurationBuilder only has a setter for nodeName, via the GlobalConfigurationBuilder.defaultClusteredBuilder().transport().nodeName() API. There are no getters for any of the site, node or cluster attributes & hence our usage of TopologyAwareAddress. Let me know of any oversight.

                           

                          _______________________________________________________________________

                          [screenshot of quoted reply]

                          It has been proven otherwise, as ASYNC always times out a node within 16 secs with org.infinispan.util.concurrent.TimeoutException: Node 201-8334(201) timed out. Without any sync tag specified, I do believe that it is timing out after the default of 15 secs.

                          We have been using SYNC with a timeout of 1 hr ... hence please suggest what the optimal value for the timeout should be. Also, how does this differ from FD detection?

                           

                          ____________________________________________________________________________

                           

                          [screenshot of quoted reply]

                           

                           

                          I have modified the JGroups config XML to incorporate the above changes. We use JGroups 3.3. Also modified thread_pool.rejection_policy to "Run" (same for OOB as well), and reduced the thread pool min sizes to 30.

                          Can somebody review the Infinispan config XML as well to ensure that the values are in conformance, especially the <asyncTransportExecutor>, <l1> & <hash> values?

                           

                          _________________________________________________________________________________

                           

                          [screenshot of quoted reply]

                           

                           

                          We did implement this solution by invoking stop() & start() of the cache under the ViewChanged event, but it has been observed that the suspected node(s) DO NOT receive any event (only those nodes that are still part of the cluster received it). Probably Infinispan does not send any events to the evicted node(s) which have left the cluster.

                          We also tried restarting cache under the Merged event but still the same results (evicted node(s) do not receive any event but only the current members do).

                          Also, we observed that quite a few nodes eventually get evicted from the cluster (once SuspectException is encountered for 1 or 2 nodes).

                          Hence, please suggest whether there is any event that is indeed sent to the evicted node(s) for us to restart the cache & rejoin the cluster. If not, is there any other mechanism for the suspected node(s) to rejoin the cluster (without restarting the entire cluster, which we have been doing all along)?

                           

                          _________________________________________________________________________

                           

                          [screenshot of quoted reply]

                           

                          Does this mean that we need to upgrade to ISPN 6.0 & have all the 5.3 performance issues addressed? We do have ISPN 5.3 running in Production & hence this question.


                          • 10. Re: some q's re: impact(s) of ISPN/JGRPS config ...
                            mircea.markus

                            GlobalConfiguration is a deprecated class while it's successor GlobalConfigurationBuilder only has a setter for nodeName with the GlobalConfigurationBuilder.defaultClusteredBuilder().transport().nodeName() API. There are no getters for any of site, node or cluster attributes & hence our usage of TopologyAwareAddress. Let me know of any oversight.

                            Not sure what your question here really is, but you can configure the site/rack/machine using globalConfiguration.transport().rack(..) etc.


                            It has been proven otherwise as ASYNC always times-out a node within 16secs with org.infinispan.util.concurrent.TimeoutException: Node 201-8334(201) timed out. Without any sync tag specified; I do believe that it is timing-out after the default of 15secs.

                            We have been using SYNC with a time-out of 1hr ... hence do suggest what should be the optimal value for the time-out? Also, how does this differ from FD detection?

                            What's the full stack trace? Async shouldn't throw TimeoutExceptions.


                            We did implement this solution by invoking stop() & start() of cache under the ViewChanged event but it has been observed that the suspected node(s) DO NOT receive any event (only those nodes that are still part of the cluster received it). Probably Infinispan does not send any events to the evicted node(s) which have left the cluster.

                            We also tried restarting cache under the Merged event but still the same results (evicted node(s) do not receive any event but only the current members do).

                            Not sure what you mean by "the suspected node(s) DO NOT receive any event". So the node that's restarted doesn't rejoin the cluster?

                             

                            Does this mean that we need to upgrade to ISPN 6.0 & have all the 5.3 performance issues addressed? We do have ISPN 5.3 running in Production & hence this question.

                            Yes.



                            • 11. Re: some q's re: impact(s) of ISPN/JGRPS config ...
                              cotton.ben
                              Not sure what you mean by "the suspected node(s) DO NOT receive any event". So the node that's restarted doesn't rejoin the cluster?

                              Unfortunately, it does not.

                              • 12. Re: some q's re: impact(s) of ISPN/JGRPS config ...
                                nileshbhagat

                                Hi Mircea,

                                 

                                Thanks for your responses. My further comments inline ...

                                 

                                [screenshot of quoted reply]

                                Yes, the intention is to obtain the nodeName attribute value which as per API would be GlobalConfigurationBuilder.defaultClusteredBuilder().build().transport().nodeName(); Let us know if this is incorrect.

                                 

                                 

                                [screenshot of quoted reply]

                                Yes, it does throw TimeoutException, but I don't have the full stack trace yet. I need to change the config XML to reproduce it again.

                                 

                                 

                                [screenshot of quoted reply]

                                In fact, this was an error on our part, as the event listener was registered on Master/Reducer nodes only and not on Mapper nodes. Modifying the implementation allowed us to register the listener for Mapper nodes as well, & we are able to receive the ViewChanged events for mappers, upon which the cache is restarted for that node. But we are observing a discrepancy: the Mapper nodes see the full set of nodes, while the Master node excludes the suspected mapper nodes. Please refer to the debug statements below & suggest a resolution so we can resolve this quickly. We are close to the solution except for this anomaly, as our Master node doesn't accept any further requests if any of the mappers are down (which is a fall-out of the suspected nodes being evicted from the cluster even after their caches were successfully restarted).

                                 

                                Mapper Node -->

                                7346 DEBUG [16:21:42,533] [Incoming-6,AggregationEngine,202-43959(202)][Node] - viewChanged event invoked

                                7347 DEBUG [16:21:42,533] [Incoming-6,AggregationEngine,202-43959(202)][Node] - viewChanged event old members size : 6

                                7348 DEBUG [16:21:42,533] [Incoming-6,AggregationEngine,202-43959(202)][Node] - viewChanged event old members list : [105, 107, 111, 201, 202, 203]

                                7349 DEBUG [16:21:42,533] [Incoming-6,AggregationEngine,202-43959(202)][Node] - viewChanged event new members size : 7

                                7350 DEBUG [16:21:42,533] [Incoming-6,AggregationEngine,202-43959(202)][Node] - viewChanged event new members list : [101, 105, 107, 111, 201, 202, 203]

                                 

                                Master Node -->

                                DEBUG [16:42:00,868] [Incoming-8,AggregationEngine,101-40588(101)][Node] - viewChanged event invoked

                                  14905 DEBUG [16:42:00,868] [Incoming-8,AggregationEngine,101-40588(101)][Node] - viewChanged event old members size : 7

                                  14906 DEBUG [16:42:00,868] [Incoming-8,AggregationEngine,101-40588(101)][Node] - viewChanged event old members list : [101, 105, 107, 111, 201, 202, 203]

                                  14907 DEBUG [16:42:00,868] [Incoming-8,AggregationEngine,101-40588(101)][Node] - viewChanged event new members size : 5

                                  14908 DEBUG [16:42:00,868] [Incoming-8,AggregationEngine,101-40588(101)][Node] - viewChanged event new members list : [101, 105, 107, 111, 201]

                                 

                                • 13. Re: some q's re: impact(s) of ISPN/JGRPS config ...
                                  dan.berindei

                                  Nilesh Bhagat wrote:

                                  Yes, the intention is to obtain the nodeName attribute value, which as per the API would be GlobalConfigurationBuilder.defaultClusteredBuilder().build().transport().nodeName(). Let us know if this is incorrect.

                                   

                                  To read the configured values, you need to use the new GlobalConfiguration in org.infinispan.configuration.global. You can get it either with GlobalConfigurationBuilder.build(), or with EmbeddedCacheManager.getCacheManagerConfiguration().

                                   
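For illustration, here is a minimal sketch of reading the configured node name back through the new API, assuming Infinispan 5.x's org.infinispan.configuration.global package; the node name "node-101" is a hypothetical value:

```java
import org.infinispan.configuration.global.GlobalConfiguration;
import org.infinispan.configuration.global.GlobalConfigurationBuilder;
import org.infinispan.manager.DefaultCacheManager;
import org.infinispan.manager.EmbeddedCacheManager;

public class NodeNameExample {
    public static void main(String[] args) {
        // Build a clustered configuration with an explicit node name.
        GlobalConfiguration global = GlobalConfigurationBuilder.defaultClusteredBuilder()
                .transport().nodeName("node-101")
                .build();

        EmbeddedCacheManager manager = new DefaultCacheManager(global);
        try {
            // Read the configured value back via the cache manager.
            String nodeName = manager.getCacheManagerConfiguration().transport().nodeName();
            System.out.println("nodeName = " + nodeName);
        } finally {
            manager.stop();
        }
    }
}
```

Either path (GlobalConfigurationBuilder.build() or EmbeddedCacheManager.getCacheManagerConfiguration()) should return the same GlobalConfiguration instance semantics; the cache-manager route is preferable once the manager is already running.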


                                  How do you split the cluster? It could be that the mapper node doesn't receive a new view because its FD can still receive messages from the previous node in the view...
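For context, failure detection in a JGroups stack is governed by the FD / FD_ALL and VERIFY_SUSPECT protocols; whether a node sees a new view depends on whether these suspect the departed member. A hypothetical fragment (the timeout values below are illustrative, not taken from the attached config):

```xml
<!-- Heartbeat-based failure detection: suspect a member after ~12s of silence -->
<FD_ALL timeout="12000" interval="3000"/>
<!-- Double-check a suspicion before the coordinator excludes the member -->
<VERIFY_SUSPECT timeout="1500"/>
```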

                                  • 14. Re: some q's re: impact(s) of ISPN/JGRPS config ...
                                    cotton.ben
                                    or 3. and 4. a cache.stop()/cache.start() should do the trick.
                                    Not sure what you mean by "the suspected node(s) DO NOT receive any event". So the node that's restarted doesn't rejoin the cluster?

                                    Unfortunately, it does not.


                                    I stand corrected.  The cache.stop()/cache.start() invocations (triggered from the ViewChanged event) sometimes do, but sometimes don't, result in the node re-joining the cluster.  The outcome is, frankly, erratic.
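The restart-on-view-change approach described above can be sketched as follows. This is a hedged illustration assuming Infinispan 5.3's notification API; the class name and the restart policy (restart only when the local node was dropped from the view) are our own, not from the thread:

```java
import org.infinispan.Cache;
import org.infinispan.notifications.Listener;
import org.infinispan.notifications.cachemanagerlistener.annotation.ViewChanged;
import org.infinispan.notifications.cachemanagerlistener.event.ViewChangedEvent;

@Listener
public class RejoinOnViewChange {

    private final Cache<?, ?> cache;

    public RejoinOnViewChange(Cache<?, ?> cache) {
        this.cache = cache;
    }

    @ViewChanged
    public void onViewChanged(ViewChangedEvent event) {
        // If this node was dropped from the new view, attempt to rejoin by
        // restarting the cache: stop() leaves the cluster, start() rejoins it.
        if (!event.getNewMembers().contains(event.getLocalAddress())) {
            cache.stop();
            cache.start();
        }
    }
}
```

The listener would be registered via cacheManager.addListener(new RejoinOnViewChange(cache)). Note that, as reported above, this rejoin is not guaranteed to succeed if failure detection still suspects the node.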


                                    How do you split the cluster? It could be that the mapper node doesn't receive a new view because its FD can still receive messages from the previous node in the view...



                                    Not sure what you mean by "split the cluster".  We deploy an ISPN 5.3 DIST_SYNC data grid.  We take the hello-world example of Node.java and coerce onto it a "logical view" in which each Node takes one role from the set {MASTER, REDUCER, MAPPER}.  From the ISPN 5.3 view, however, these are all just "Nodes".

                                     

                                    P.S.  Thanks for the highly interactive responses.  The support effort here has been stellar.  :-)
