33 Replies, latest reply on Mar 1, 2014 4:04 AM by belaban

    some q's re: impact(s) of ISPN/JGRPS config ...

    cotton.ben



      Please consider reviewing our updated JGroups & Infinispan XML config files (attached below), as they are the basis for the following questions:

       

      Why do the unicast/multicast buffer sizes for the UDP transport only consider the rmem_max setting on Unix? Could they consider the udp_mem settings instead, which seem more relevant?


      Why does the org.infinispan.remoting.transport.TopologyAwareAddress API not provide an accessor/setter for the nodeName attribute?

       

      Is it true that ASYNC communication is not supported in distributed mode and that it will fall back to SYNC communication with a default timeout of 15 secs?

       

      Once the lists of old & new members are identified using the org.infinispan.notifications.cachemanagerlistener.event.ViewChangedEvent API when a node leaves the cluster upon org.infinispan.remoting.transport.jgroups.SuspectException, is there any API that will allow the failed node to be joined back to the cluster?

       

      What's the default value for lifespan & cleanupTaskFrequency of the L1 element? As per the schema, the default is 60000 (1 min), while the documentation indicates 10 mins.

       

      Why is the RadarGun-measured throughput lower with the newer protocol settings in JGroups 3.3 (compared to the higher throughput in earlier JGroups versions)?


      ---- jgroups-new.xml ----

       

       

       

       

      <!--

        Fast configuration for local mode, i.e. all members reside on the same host. Setting ip_ttl to 0 means that

        no multicast packet will make it outside the local host.

        Therefore, this configuration will NOT work for clustering members residing on different hosts!

       

       

        Author: Bela Ban

        Version: $Id: fast-local.xml,v 1.9 2009/12/18 14:50:00 belaban Exp $

      -->

       

       

      <config xmlns="urn:org:jgroups"

              xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

              xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/JGroups-3.3.xsd">

             

          <!-- TRANSPORT -->   

          <UDP mcast_addr="239.1.1.1"

               mcast_port="${jgroups.udp.mcast_port:45111}"

               tos="8"

               ucast_recv_buf_size="4M"

               ucast_send_buf_size="640K"

               mcast_recv_buf_size="4M"

               mcast_send_buf_size="640K"

               loopback="true"

               max_bundle_size="64k"

               max_bundle_timeout="30"

               bundler_type="new"

               ip_ttl="${jgroups.udp.ip_ttl:1}"

               ip_mcast="true"

               enable_diagnostics="true"

               thread_naming_pattern="clp"

       

       

               timer_type="new3"

               timer.min_threads="4"

               timer.max_threads="10"

               timer.keep_alive_time="3000"

               timer.queue_max_size="1000"

               timer.rejection_policy="discard"

       

       

               thread_pool.enabled="true"

               thread_pool.min_threads="40"

               thread_pool.max_threads="100"

               thread_pool.keep_alive_time="5000"

               thread_pool.queue_enabled="true"

               thread_pool.queue_max_size="10000"

               thread_pool.rejection_policy="discard"

       

       

               oob_thread_pool.enabled="true"

               oob_thread_pool.min_threads="40"

               oob_thread_pool.max_threads="100"

               oob_thread_pool.keep_alive_time="5000"

               oob_thread_pool.queue_enabled="true"

               oob_thread_pool.queue_max_size="100"

               oob_thread_pool.rejection_policy="discard"/>

       

       

        <!-- MEMBER DISCOVERY -->

          <PING timeout="10000"

                  num_initial_members="10"

                  break_on_coord_rsp="true"/>

         

          <!-- MERGE AFTER NETWORK PARTITION -->       

          <MERGE3 max_interval="30000"

                  min_interval="10000"

                  max_participants_in_merge="0"/>

         

          <!-- FAILURE DETECTION -->       

          <!-- <FD_SOCK /> -->

          <!-- <FD_PING /> -->

          <!-- <FD_ALL /> -->

          <FD timeout="60000" max_tries="10" />

          <VERIFY_SUSPECT timeout="60000" num_msgs="5" />

         

         

          <!-- MESSAGE TRANSMISSION -->

          <!-- <pbcast.NAKACK use_mcast_xmit="false"

                         retransmit_timeout="100,300,600,1200"

                         discard_delivered_msgs="true"/> -->

          <pbcast.NAKACK2 xmit_interval="1000"

           max_rebroadcast_timeout="2000"

           use_mcast_xmit="true"

           use_mcast_xmit_req="true"

           discard_delivered_msgs="true"/>

         

          <!-- <UNICAST2 timeout="300,600,1200"

                    conn_expiry_timeout="0"/> -->

          <UNICAST3 xmit_interval="1000"

           max_retransmit_time="2000"

           conn_expiry_timeout="0" />

          

        <!-- <RSVP resend_interval="2000" timeout="10000"/> -->

         

          <!-- MESSAGE STABILITY -->

          <!-- <BARRIER /> -->

          <!-- <pbcast.STATE_TRANSFER /> -->

               

          <!-- <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"

                         max_bytes="1000000"/> -->

          <pbcast.STABLE stability_delay="1000" desired_avg_gossip="60000"

                         max_bytes="5M"/>

          <pbcast.FLUSH  timeout="1000" />

         

         

          <!-- GROUP MEMBERSHIP -->                  

          <!-- <pbcast.GMS print_local_addr="true" join_timeout="60000" leave_timeout="60000"

                      max_bundling_time="20000"

                      view_bundling="true"/> -->

          <pbcast.GMS print_local_addr="true"

           join_timeout="10000"

           leave_timeout="10000"

           merge_timeout="5000"

           view_bundling="true"

           max_bundling_time="500"

           view_ack_collection_timeout="2000"/>

         

          <!-- FLOW CONTROL -->           

          <!-- <FC max_credits="2M"

              min_threshold="0.10"/> -->

          <MFC max_credits="2M"

          min_threshold="0.25"/>

          <UFC max_credits="2M"

          min_threshold="0.25"/>

             

         

          <!-- FRAGMENTATION -->   

          <!-- <FRAG2 frag_size="60000"  /> -->

          <FRAG2 frag_size="62K"  />

       

       

        

      </config>

       

       

       

       

      ------------------ infinispan-config.xml ---------------------------

       

       

      <?xml version="1.0" encoding="UTF-8"?>

        <!--

        ~ JBoss, Home of Professional Open Source ~ Copyright 2009 Red Hat

        Inc. and/or its affiliates and other ~ contributors as indicated by

        the @author tags. All rights reserved. ~ See the copyright.txt in the

        distribution for a full listing of ~ individual contributors. ~ ~ This

        is free software; you can redistribute it and/or modify it ~ under the

        terms of the GNU Lesser General Public License as ~ published by the

        Free Software Foundation; either version 2.1 of ~ the License, or (at

        your option) any later version. ~ ~ This software is distributed in

        the hope that it will be useful, ~ but WITHOUT ANY WARRANTY; without

        even the implied warranty of ~ MERCHANTABILITY or FITNESS FOR A

        PARTICULAR PURPOSE. See the GNU ~ Lesser General Public License for

        more details. ~ ~ You should have received a copy of the GNU Lesser

        General Public ~ License along with this software; if not, write to

        the Free ~ Software Foundation, Inc., 51 Franklin St, Fifth Floor,

        Boston, MA ~ 02110-1301 USA, or see the FSF site: http://www.fsf.org.

        -->

        <!--

        infinispan xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

        xsi:schemaLocation="urn:infinispan:config:5.3

        http://www.infinispan.org/schemas/infinispan-config-5.3.xsd"

        xmlns="urn:infinispan:config:5.3"

        -->

      <infinispan xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

        xsi:schemaLocation="urn:infinispan:config:5.3

        http://www.infinispan.org/schemas/infinispan-config-5.3.xsd"

        xmlns="urn:infinispan:config:5.3">

       

       

        <global>

        <transport siteId="${nodename}" machineId="m1" rackId="r1"

        nodeName="${nodename}" clusterName="AggregationEngine">

        <properties>

        <property name="configurationFile" value="./xml/jgroups-new.xml" />

        </properties>

        </transport>

       

       

        <asyncListenerExecutor>

        <properties>

        <property name="maxThreads" value="4" />

        </properties>

        </asyncListenerExecutor>

       

       

        <asyncTransportExecutor>

        <properties>

        <property name="maxThreads" value="100" />

        </properties>

        </asyncTransportExecutor>

       

        <globalJmxStatistics enabled="true" />

        </global>

       

       

        <default>

        <jmxStatistics enabled="true" />

       

        <clustering mode="dist">

        <!-- <l1 enabled="true" lifespan="60000" /> -->

        <l1 enabled="true" lifespan="60000" cleanupTaskFrequency="60000"/>

       

        <!-- <hash numOwners="2" /> -->

        <hash numOwners="2" numSegments="1000"/>

       

          <!-- SYNC mode times out after the specified interval & hence a greater value is recommended -->

        <sync replTimeout="3600000" />

       

        <!-- ASYNC mode almost always times out in 16 secs & hence is not used -->

        <!-- <async /> -->

        </clustering>

        </default>

      </infinispan>

        • 1. Re: some q's re: impact(s) of ISPN/JGRPS config ...
          nileshbhagat

          Still awaiting a reply from the JGroups/ISPN community on this post. Can Bela Ban or any other resource confirm the XML settings & respond to the questions?

          • 2. Re: some q's re: impact(s) of ISPN/JGRPS config ...
            mircea.markus
            Why does the org.infinispan.remoting.transport.TopologyAwareAddress API not provide accessor/setter for nodeName attribute?

             

            The nodeName is not something that can be changed dynamically; it must be set before opening the underlying JGroups channel, hence there's no point in having a setter. If you want to read the value, a better way is to read it through the configuration directly: GlobalConfiguration.transport().nodeName(). TopologyAwareAddress is pretty internal stuff.


            Is it true that ASYNC communication is not supported in Distributed mode & it will fall back to SYNC communication with a default timeout of 15 secs?

            No. Actually, starting with ISPN 5.3 we only have distributed mode, replicated mode being implemented as a degenerate case of distribution with numOwners > clusterSize.


             

            Once the list of old & new members are identified using the org.infinispan.notifications.cachemanagerlistener.event.ViewChangedEvent API when a node leaves the cluster upon org.infinispan.remoting.transport.jgroups.SuspectException; is there any API that will allow the failed node to be joined-back to the cluster?

             

            I don't think so. What you can do, though, is implement a listener that will live on that node and restart it in order to trigger a clean join. I'd be curious to see dan.berindei's thoughts on this as well.
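To make the restart-on-eviction pattern concrete, here is a minimal, self-contained sketch. The `CacheHandle` interface and the plain `onViewChanged` method are hypothetical stand-ins so the example runs without Infinispan on the classpath; in the real API this would be a `@Listener`-annotated class with a `@ViewChanged` method receiving a `ViewChangedEvent`, and the handle would be `org.infinispan.Cache` with its `stop()`/`start()` lifecycle methods.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for org.infinispan.Cache's lifecycle methods.
interface CacheHandle {
    void stop();
    void start();
}

// Models the restart-on-eviction logic: a listener living on each node
// compares the new view with its own address and, if the node was dropped
// from the view (e.g. after a SuspectException), restarts the cache to
// force a clean rejoin.
class RejoinListener {
    private final String localAddress;
    private final CacheHandle cache;

    RejoinListener(String localAddress, CacheHandle cache) {
        this.localAddress = localAddress;
        this.cache = cache;
    }

    // oldMembers/newMembers correspond to ViewChangedEvent.getOldMembers()
    // and ViewChangedEvent.getNewMembers() in the real API.
    void onViewChanged(List<String> oldMembers, List<String> newMembers) {
        boolean wasMember = oldMembers.contains(localAddress);
        boolean stillMember = newMembers.contains(localAddress);
        if (wasMember && !stillMember) {
            cache.stop();   // tear down, then rejoin with a fresh state transfer
            cache.start();
        }
    }
}

public class RejoinDemo {
    public static void main(String[] args) {
        AtomicInteger restarts = new AtomicInteger();
        CacheHandle cache = new CacheHandle() {
            public void stop()  { /* would close the JGroups channel */ }
            public void start() { restarts.incrementAndGet(); }
        };
        RejoinListener listener = new RejoinListener("202", cache);
        // Node 202 is dropped from the view -> restart triggered.
        listener.onViewChanged(List.of("101", "201", "202"), List.of("101", "201"));
        // Node 202 still in the view -> nothing happens.
        listener.onViewChanged(List.of("101", "201", "202"), List.of("101", "201", "202"));
        System.out.println("restarts=" + restarts.get()); // prints restarts=1
    }
}
```

Caveat: as later replies in this thread observe, an evicted node may never receive the ViewChanged event at all, so in practice the restart may need to be driven by a JGroups-level hook or an external watchdog instead.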

             

             

            What's the default value for lifespan & cleanupTaskFrequency of the L1 element? As per the schema, the default is 60000 (1 min), while the documentation indicates 10 mins.

            Code won't lie :-) infinispan/core/src/main/java/org/infinispan/configuration/cache/L1ConfigurationBuilder.java at master · infinispan/infi…

            Care to create a JIRA for the documentation?

             

             

            Why is the RadarGun-measured throughput lower with the newer protocol settings in JGroups 3.3 (compared to the higher throughput in earlier JGroups versions)?

            In case you're using ISPN 5.3, there were some significant performance degradations in 5.3 that were fixed in 6.0: [ISPN-3534] Investigate performance regressions in Infinispan 6.0.0 - JBoss Issue Tracker

             

             

            Why do the unicast/multicast buffer sizes for the UDP transport only consider the rmem_max setting on Unix? Could they consider the udp_mem settings instead, which seem more relevant?

            belaban - any idea? ^^


            • 3. Re: some q's re: impact(s) of ISPN/JGRPS config ...
              belaban

              I didn't know about udp_mem / udp_rmem_min and udp_wmem_min. This seems definitely relevant in cases where you send a lot of UDP datagram (and multicast) packets. I suggest monitoring your UDP buffer stats with netstat -su and increasing the values if you see a lot of packet drops.

              • 4. Re: some q's re: impact(s) of ISPN/JGRPS config ...
                cotton.ben

                Thanks Bela and Mircea.

                Once the list of old & new members are identified using the org.infinispan.notifications.cachemanagerlistener.event.ViewChangedEvent API when a node leaves the cluster upon org.infinispan.remoting.transport.jgroups.SuspectException; is there any API that will allow the failed node to be joined-back to the cluster?

                 

                I don't think so. What you can do, though, is implement a listener that will live on that node and restart it in order to trigger a clean join. I'd be curious to see Dan Berindei's thoughts on this as well.

                 

                To be precise, the listener that lives on the disconnected node does the following (in this order):

                 

                1.  exists to listen for its disconnect from the cluster

                2.  upon notification of its disconnect it then proceeds to 3 and 4.

                3.  restarts to trigger a clean join (how does it do this? Ask the OS runtime to boot a new, identical JavaVM process in the exact same way the disconnected node's JavaVM process was booted?)

                4.  commits suicide

                Correct? Does Dan Berindei likely have a craftier approach?

                • 5. Re: some q's re: impact(s) of ISPN/JGRPS config ...
                  nileshbhagat

                  Hi Bela,

                   

                  I would appreciate your response to the following questions:

                   

                  1.  Given that JGroups only considers the rmem/wmem settings for UDP send/receive buffer sizes, can it be modified to refer to the udp_mem settings, which are configured at 27MB max on our machines? The rmem/wmem values are set to 4MB max & can't be modified on our machines. Refer to the warning when buffer sizes are exceeded:

                  7646 WARN  [12:14:04,114] [main][UDP] - [JGRP00015] the receive buffer of socket DatagramSocket was set to 40MB, but the OS only allocated 4.19MB. This might lead to performance problems. Please set your max receive buffer in the OS correctly (e.g. net.core.rmem_max on Linux)

                  7647 WARN  [12:14:04,115] [main][UDP] - [JGRP00015] the receive buffer of socket MulticastSocket was set to 40MB, but the OS only allocated 4.19MB. This might lead to performance problems. Please set your max receive buffer in the OS correctly (e.g. net.core.rmem_max on Linux)

                  2.  Can you suggest guidelines on the send & receive buffer sizes, and any ratio thereof?

                   

                  3.  Once the lists of old & new members are identified using the org.infinispan.notifications.cachemanagerlistener.event.ViewChangedEvent API when a node leaves the cluster upon org.infinispan.remoting.transport.jgroups.SuspectException, is there any API that will allow the failed node to rejoin the cluster?

                  As per the response from Mircea, restarting the node for a clean join may mean losing the cached data on the suspected node. But is there any other mechanism for a node to auto-rejoin by itself, such as the "shun" attribute of the FD protocol, OR a JChannel API that supports auto-joining (refer to the JBoss Server link https://docs.jboss.org/jbossclustering/cluster_guide/5.1/html/jgroups.chapt.html)?

                   

                  4.  Would you please review the pasted JGroups & Infinispan XML to confirm that the protocols used & their attribute values are in conformance with the respective 3.3 & 5.3 versions being used?

                   

                   

                  Thanks,

                  Nilesh Bhagat.

                  • 6. Re: some q's re: impact(s) of ISPN/JGRPS config ...
                    mircea.markus

                    For 3. and 4., a cache.stop()/cache.start() should do the trick.

                    • 7. Re: some q's re: impact(s) of ISPN/JGRPS config ...
                      belaban

                      Here are a few comments re the posted config:

                      • FLUSH: it's in the wrong place (it should be at the top of the stack), but the question is why you use it in the first place. Infinispan does *not* require it anymore
                      • timer.rejection_policy should be "abort", so a new thread is spawned instead of a timer task being discarded. The latter might have disastrous consequences
                      • Your thread pool min sizes are quite high (40). I suggest lowering them (especially as you have a high max size and a queue enabled)
                      • Which version of JGroups is this?
                      • Why is FD_SOCK commented out?
                      • I recommend FD_ALL (with a high timeout) with UDP rather than FD

                       

                      Re the udp_mem setting: apparently udp_mem_X is the new net.core.r/wmem_max, so if you change this, the send and receive buffers will be honored.
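As a sketch of the Linux sysctl knobs being discussed here (assuming Linux; the values are illustrative only, not recommendations, so check your distribution's defaults before changing anything):

```
# /etc/sysctl.conf -- illustrative values only
net.core.rmem_max = 41943040              # max receive buffer the kernel will grant a socket (40MB)
net.core.wmem_max = 41943040              # max send buffer
net.ipv4.udp_mem = 262144 327680 393216   # total UDP memory, in pages: min / pressure / max
net.ipv4.udp_rmem_min = 8192              # per-socket receive floor under memory pressure (bytes)
net.ipv4.udp_wmem_min = 8192              # per-socket send floor (bytes)
```

Apply with `sysctl -p`, then re-check whether the JGRP00015 warning still appears on the next start.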

                      • 8. Re: some q's re: impact(s) of ISPN/JGRPS config ...
                        belaban

                        Nilesh Bhagat wrote:

                        (questions 1-4 from reply 5 above, quoted verbatim)

                        1. JGroups cannot directly make use of net.core.rmem_max or udp_mem, but these values determine the buffer sizes JGroups actually obtains. In your case above, I suggest reducing the send/receive buffer sizes to 4MB. As a matter of fact, the default buffer sizes are a bit too big and you can set both to under 4MB. I see you've already done that in your config.
                        2. In TCP, buffers are set using the bandwidth-delay product. However, in JGroups we receive messages not from a single peer but potentially from multiple peers. So this depends on a few things, such as how many peers a node is receiving messages from, the avg size of those messages, the message arrival rate, etc. I suggest leaving the buffers as they are in your config, running your test program which mimics the real app, and watching the stats with netstat -su. If you see a lot of dropped packets, increase the buffer sizes.
                        3. No, shunning was deprecated. I suggest you try out Mircea's suggestion of using a listener and restarting the cache programmatically.
                        4. Done, see my other post.
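The OS capping behind the JGRP00015 warning can be observed with plain JDK sockets: request a large receive buffer and read back what the kernel actually granted, which is essentially the check JGroups performs when it logs that warning. A small self-contained probe (the class name is mine):

```java
import java.net.DatagramSocket;
import java.net.SocketException;

// Requests a 40MB receive buffer (the size from the warning above) and
// reads back the size the OS actually granted. If the granted size is
// smaller, net.core.rmem_max is the ceiling to raise on Linux.
public class RecvBufProbe {
    public static void main(String[] args) throws SocketException {
        int requested = 40 * 1024 * 1024; // 40MB
        try (DatagramSocket sock = new DatagramSocket()) {
            sock.setReceiveBufferSize(requested);
            int granted = sock.getReceiveBufferSize();
            System.out.printf("requested=%dMB granted=%.2fMB%n",
                    requested / (1024 * 1024), granted / (1024.0 * 1024.0));
            if (granted < requested) {
                System.out.println("OS capped the buffer; raise net.core.rmem_max");
            }
        }
    }
}
```

Running this on a box with the 4MB cap described above should report roughly the same 4.19MB figure that appears in the warning.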
                        • 9. Re: some q's re: impact(s) of ISPN/JGRPS config ...
                          nileshbhagat

                          Hi Bela & Mircea,


                          Thanks for your responses. Given that I can only post once per day, I am consolidating further questions under this reply & would kindly request you to respond accordingly.

                           

                          My questions are italicized for clarity (with last 3 being important ones).
                          _________________________________________________________________

                           

                          [screenshot of quoted reply]

                          GlobalConfiguration is a deprecated class, while its successor GlobalConfigurationBuilder only has a setter for nodeName, via the GlobalConfigurationBuilder.defaultClusteredBuilder().transport().nodeName() API. There are no getters for any of the site, node or cluster attributes & hence our usage of TopologyAwareAddress. Let me know of any oversight.

                           

                          _______________________________________________________________________

                          [screenshot of quoted reply]

                          It has been proven otherwise, as ASYNC always times out a node within 16 secs with org.infinispan.util.concurrent.TimeoutException: Node 201-8334(201) timed out. Without any sync tag specified, I do believe that it is timing out after the default of 15 secs.

                          We have been using SYNC with a timeout of 1 hr ... hence please suggest what the optimal value for the timeout should be. Also, how does this differ from FD detection?

                           

                          ____________________________________________________________________________

                           

                          [screenshot of quoted reply]

                           

                           

                          I have modified the JGroups config XML to incorporate the above changes. We use JGroups 3.3. Also modified thread_pool.rejection_policy to "Run" (same for OOB as well), and reduced the thread pool min sizes to 30.

                          Can somebody review the Infinispan config XML as well to ensure that the values are in conformance, especially the <asyncTransportExecutor>, <l1> & <hash> values?

                           

                          _________________________________________________________________________________

                           

                          [screenshot of quoted reply]

                           

                           

                          We did implement this solution by invoking stop() & start() of the cache under the ViewChanged event, but it has been observed that the suspected node(s) DO NOT receive any event (only those nodes that are still part of the cluster received it). Probably Infinispan does not send any events to the evicted node(s) which have left the cluster.

                          We also tried restarting cache under the Merged event but still the same results (evicted node(s) do not receive any event but only the current members do).

                          Also, we observed that quite a few nodes eventually get evicted from the cluster (once SuspectException is encountered for 1 or 2 nodes).

                          Hence, please suggest whether there is any event that is indeed sent to the evicted node(s) for us to restart the cache & rejoin the cluster. If not, is there any other mechanism for the suspected node(s) to rejoin the cluster (without restarting the entire cluster, which we have been doing all along)?

                           

                          _________________________________________________________________________

                           

                          [screenshot of quoted reply]

                           

                          Does this mean that we need to upgrade to ISPN 6.0 & have all the 5.3 performance issues addressed? We do have ISPN 5.3 running in Production & hence this question.


                          • 10. Re: some q's re: impact(s) of ISPN/JGRPS config ...
                            mircea.markus

                            GlobalConfiguration is a deprecated class while it's successor GlobalConfigurationBuilder only has a setter for nodeName with the GlobalConfigurationBuilder.defaultClusteredBuilder().transport().nodeName() API. There are no getters for any of site, node or cluster attributes & hence our usage of TopologyAwareAddress. Let me know of any oversight.

                            Not sure what your question here really is, but you can configure the site/rack/machine using globalConfiguration.transport().rack(..) etc.


                            It has been proven otherwise as ASYNC always times-out a node within 16secs with org.infinispan.util.concurrent.TimeoutException: Node 201-8334(201) timed out. Without any sync tag specified; I do believe that it is timing-out after the default of 15secs.

                            We have been using SYNC with a time-out of 1hr ... hence do suggest what should be the optimal value for the time-out? Also, how does this differ from FD detection?

                            What's the full stack trace? Async shouldn't throw TimeoutExceptions.


                            We did implement this solution by invoking stop() & start() of cache under the ViewChanged event but it has been observed that the suspected node(s) DO NOT receive any event (only those nodes that are still part of the cluster received it). Probably Infinispan does not send any events to the evicted node(s) which have left the cluster.

                            We also tried restarting cache under the Merged event but still the same results (evicted node(s) do not receive any event but only the current members do).

                            Not sure what you mean by "the suspected node(s) DO NOT receive any event". So the node that's restarted doesn't rejoin the cluster?

                             

                            Does this mean that we need to upgrade to ISPN 6.0 & have all the 5.3 performance issues addressed? We do have ISPN 5.3 running in Production & hence this question.

                            Yes.



                            • 11. Re: some q's re: impact(s) of ISPN/JGRPS config ...
                              cotton.ben
                              Not sure what you mean by "the suspected node(s) DO NOT receive any event". So the node that's restarted doesn't rejoin the cluster?

                              Unfortunately, it does not.

                              • 12. Re: some q's re: impact(s) of ISPN/JGRPS config ...
                                nileshbhagat

                                Hi Mircea,

                                 

                                Thanks for your responses. My further comments inline ...

                                 

                                [screenshot of quoted reply]

                                Yes, the intention is to obtain the nodeName attribute value which as per API would be GlobalConfigurationBuilder.defaultClusteredBuilder().build().transport().nodeName(); Let us know if this is incorrect.

                                 

                                 

                                [screenshot of quoted reply]

                                Yes, it does throw TimeoutException, but I don't have the full stack trace yet. I need to change the config XML to reproduce it again.

                                 

                                 

                                [screenshot of quoted reply]

                                In fact, this was an error on our part, as the event listener was registered on Master/Reducer nodes only and not on Mapper nodes. Modifying the implementation allowed us to register the listener for Mapper nodes as well, & we are able to receive the ViewChanged events for mappers, upon which the cache is restarted for that node. But we are observing a discrepancy: the Mapper nodes see the full set of nodes, while the Master node excludes the suspected mapper nodes. Please refer to the debug statements below & suggest a resolution so we can resolve this quickly. We are close to the solution except for this anomaly, as our Master node doesn't accept any further requests if any of the mappers are down (which is a fall-out of the suspected nodes being evicted from the cluster even after their caches were successfully restarted).

                                 

                                Mapper Node -->

                                7346 DEBUG [16:21:42,533] [Incoming-6,AggregationEngine,202-43959(202)][Node] - viewChanged event invoked

                                7347 DEBUG [16:21:42,533] [Incoming-6,AggregationEngine,202-43959(202)][Node] - viewChanged event old members size : 6

                                7348 DEBUG [16:21:42,533] [Incoming-6,AggregationEngine,202-43959(202)][Node] - viewChanged event old members list : [105, 107, 111, 201, 202, 203]

                                7349 DEBUG [16:21:42,533] [Incoming-6,AggregationEngine,202-43959(202)][Node] - viewChanged event new members size : 7

                                7350 DEBUG [16:21:42,533] [Incoming-6,AggregationEngine,202-43959(202)][Node] - viewChanged event new members list : [101, 105, 107, 111, 201, 202, 203]

                                 

                                Master Node -->

                                DEBUG [16:42:00,868] [Incoming-8,AggregationEngine,101-40588(101)][Node] - viewChanged event invoked

                                  14905 DEBUG [16:42:00,868] [Incoming-8,AggregationEngine,101-40588(101)][Node] - viewChanged event old members size : 7

                                  14906 DEBUG [16:42:00,868] [Incoming-8,AggregationEngine,101-40588(101)][Node] - viewChanged event old members list : [101, 105, 107, 111, 201, 202, 203]

                                  14907 DEBUG [16:42:00,868] [Incoming-8,AggregationEngine,101-40588(101)][Node] - viewChanged event new members size : 5

                                  14908 DEBUG [16:42:00,868] [Incoming-8,AggregationEngine,101-40588(101)][Node] - viewChanged event new members list : [101, 105, 107, 111, 201]

                                 

                                • 13. Re: some q's re: impact(s) of ISPN/JGRPS config ...
                                  dan.berindei

                                  Nilesh Bhagat wrote:

                                  Yes, the intention is to obtain the nodeName attribute value, which as per the API would be GlobalConfigurationBuilder.defaultClusteredBuilder().build().transport().nodeName(). Let us know if this is incorrect.

                                   

                                  To read the configured values, you need to use the new GlobalConfiguration in org.infinispan.configuration.global. You can get it either with GlobalConfigurationBuilder.build(), or with EmbeddedCacheManager.getCacheManagerConfiguration().

                                   
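For illustration, here is a minimal sketch of reading the configured node name back through the new API, assuming Infinispan 5.x's org.infinispan.configuration.global package; the node name "node-101" is a hypothetical value:

```java
import org.infinispan.configuration.global.GlobalConfiguration;
import org.infinispan.configuration.global.GlobalConfigurationBuilder;
import org.infinispan.manager.DefaultCacheManager;
import org.infinispan.manager.EmbeddedCacheManager;

public class NodeNameExample {
    public static void main(String[] args) {
        // Build a clustered configuration with an explicit node name.
        GlobalConfiguration global = GlobalConfigurationBuilder.defaultClusteredBuilder()
                .transport().nodeName("node-101")
                .build();

        EmbeddedCacheManager manager = new DefaultCacheManager(global);
        try {
            // Read the configured value back via the cache manager.
            String nodeName = manager.getCacheManagerConfiguration().transport().nodeName();
            System.out.println("nodeName = " + nodeName);
        } finally {
            manager.stop();
        }
    }
}
```

Either path (GlobalConfigurationBuilder.build() or EmbeddedCacheManager.getCacheManagerConfiguration()) should return the same GlobalConfiguration instance semantics; the cache-manager route is preferable once the manager is already running.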


                                  How do you split the cluster? It could be that the mapper node doesn't receive a new view because its FD can still receive messages from the previous node in the view...
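For context, failure detection in a JGroups stack is governed by the FD / FD_ALL and VERIFY_SUSPECT protocols; whether a node sees a new view depends on whether these suspect the departed member. A hypothetical fragment (the timeout values below are illustrative, not taken from the attached config):

```xml
<!-- Heartbeat-based failure detection: suspect a member after ~12s of silence -->
<FD_ALL timeout="12000" interval="3000"/>
<!-- Double-check a suspicion before the coordinator excludes the member -->
<VERIFY_SUSPECT timeout="1500"/>
```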

                                  • 14. Re: some q's re: impact(s) of ISPN/JGRPS config ...
                                    cotton.ben
                                    or 3. and 4. a cache.stop()/cache.start() should do the trick.
                                    Not sure what you mean by "the suspected node(s) DO NOT receive any event". So the node that's restarted doesn't rejoin the cluster?

                                    Unfortunately, it does not.


                                    I stand corrected.  The cache.stop()/cache.start() invocations (triggered from the ViewChanged event) sometimes do, but sometimes don't, result in the node re-joining the cluster.  The outcome is, frankly, erratic.
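The restart-on-view-change approach described above can be sketched as follows. This is a hedged illustration assuming Infinispan 5.3's notification API; the class name and the restart policy (restart only when the local node was dropped from the view) are our own, not from the thread:

```java
import org.infinispan.Cache;
import org.infinispan.notifications.Listener;
import org.infinispan.notifications.cachemanagerlistener.annotation.ViewChanged;
import org.infinispan.notifications.cachemanagerlistener.event.ViewChangedEvent;

@Listener
public class RejoinOnViewChange {

    private final Cache<?, ?> cache;

    public RejoinOnViewChange(Cache<?, ?> cache) {
        this.cache = cache;
    }

    @ViewChanged
    public void onViewChanged(ViewChangedEvent event) {
        // If this node was dropped from the new view, attempt to rejoin by
        // restarting the cache: stop() leaves the cluster, start() rejoins it.
        if (!event.getNewMembers().contains(event.getLocalAddress())) {
            cache.stop();
            cache.start();
        }
    }
}
```

The listener would be registered via cacheManager.addListener(new RejoinOnViewChange(cache)). Note that, as reported above, this rejoin is not guaranteed to succeed if failure detection still suspects the node.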


                                    How do you split the cluster? It could be that the mapper node doesn't receive a new view because its FD can still receive messages from the previous node in the view...



                                    Not sure what you mean by "split the cluster".  We deploy an ISPN 5.3 DIST_SYNC data grid.  We take the hello-world example of Node.java and coerce onto it a "logical view" in which each Node takes one role from the set {MASTER, REDUCER, MAPPER}.  From the ISPN 5.3 view, however, these are all just "Nodes".

                                     

                                    P.S.  Thanks for the highly interactive responses.  The support effort here has been stellar.  :-)
