11 Replies Latest reply on Jan 7, 2012 9:58 AM by belaban

    Problem with jgroupsSlave backend for Infinispan

    dungleonhart

      Hi,

       

      I got stuck when trying to configure jgroupsSlave as back-end worker for Infinispan (which is used for Hibernate Search in my project).

      I've installed my project on 2 nodes on Amazon EC2 (with all ports open); if I set both of them to use the jgroupsMaster back-end, everything is fine. Then I wanted to try jgroupsSlave on one node, because I think that as the number of nodes increases, the performance of the Master-Slave model would be better than the Peer-to-peer model. (Am I correct?) Unfortunately for me, it didn't work as expected:

           - When I update an object from the Slave node, the data is not indexed. I can NOT find the updated data on either node.

           - But when I update an object from the Master node, it works just as in the 2-Master-node case. I can search the updated data on both nodes.

      It seems JGroups messages cannot be sent properly from the Slave node to the Master node (just a guess).

       

      Please help me deal with this issue. Here are my Infinispan & JGroups configurations:

       

           1. Jars:

                * hibernate-search-3.4.1.Final.jar

                * hibernate-search-infinispan-3.4.1.Final.jar

                * infinispan-core-4.2.1.FINAL.jar

                * infinispan-lucene-directory-4.2.1.FINAL.jar

                * jgroups-2.11.1.Final.jar

       

           2. Hibernate:

                a. Master node:              

                     <bean id="sessionFactory" class="org.springframework.orm.hibernate3.annotation.AnnotationSessionFactoryBean">

                          <property name="hibernateProperties">

                               <props>

                                                                       <prop key="hibernate.dialect">org.hibernate.dialect.MySQLDialect</prop>

                                                                       <prop key="hibernate.search.default.directory_provider">infinispan</prop>

                                                                       <prop key="hibernate.search.infinispan.configuration_resourcename">hibernate-search-infinispan.xml</prop>

                                                                       <prop key="hibernate.search.worker.backend">jgroupsMaster</prop>

                                 <prop key="hibernate.search.worker.execution">async</prop>

                               </props>

                          </property>

                     </bean>

                b. Slave node               

                     <bean id="sessionFactory" class="org.springframework.orm.hibernate3.annotation.AnnotationSessionFactoryBean">

                          <property name="hibernateProperties">

                               <props>

                                 <prop key="hibernate.dialect">org.hibernate.dialect.MySQLDialect</prop>

                                 <prop key="hibernate.search.default.directory_provider">infinispan</prop>

                                 <prop key="hibernate.search.infinispan.configuration_resourcename">hibernate-search-infinispan.xml</prop>

                                 <prop key="hibernate.search.worker.backend">jgroupsSlave</prop>

                                 <prop key="hibernate.search.worker.execution">async</prop>

                               </props>

                          </property>

                     </bean>

       

           3. Configuration files: please find them in the attachments (hibernate-search-infinispan.xml & jdbc_ping.xml)

        • 1. Re: Problem with jgroupsSlave backend for Infinispan
          dungleonhart

          Hi,

           

          One more thing I suspect: when I put 2 nodes on my local machine (2 Tomcat servers), it works as expected. I wonder whether it might only work well in environments that allow multicast?

           

          Best Regards,

          • 2. Re: Problem with jgroupsSlave backend for Infinispan
            sannegrinovero

            Hello,

            I think you might be affected by https://hibernate.onjira.com/browse/HSEARCH-975

            You can find some workarounds on the Hibernate Search forums: https://forum.hibernate.org/viewtopic.php?f=9&t=1013648

             

            The short story: JGroups changed some method signatures, making the JGroups backend incompatible with the version required by Infinispan.

             

            I will see if I can fix HSEARCH-975, but I'm not sure I can unless I drop Java 5 compatibility.

            Is upgrading to the latest Hibernate Search version, 4.0.0.Final, not an option for you?

            • 3. Re: Problem with jgroupsSlave backend for Infinispan
              dungleonhart

              Hi Sanne,

               

              Upgrading to Hibernate Search 4.0.0.Final seems painful to me since my project is tied to Hibernate 3.

               

              From this thread https://forum.hibernate.org/viewtopic.php?f=9&t=1013648, it seems some problems remain even if I use the Master/Slave model? Anyway, I have to give it a try.

               

              Moreover, when I debugged the hibernate-search-3.4.1.Final and jgroups-2.11.1.Final code, I saw these lines:

                   1. org.hibernate.search.backend.impl.jgroups.JGroupsBackendQueueProcessor class:        

                      /* Creates and send message with lucene works to master.

                       * As long as message destination address is null, Lucene works will be received by all listeners that implements

                       * org.jgroups.MessageListener interface, multiple master nodes in cluster are allowed. */

                      try {

                          Message message = new Message( null, factory.getAddress(), ( Serializable ) filteredQueue );

                          factory.getChannel().send( message );

                          if ( trace ) {

                              log.trace( "Lucene works have been sent from slave {} to master node.", factory.getAddress() );

                          }

                      }

                      catch ( ChannelNotConnectedException e ) {

                          throw new SearchException(

                                  "Unable to send Lucene work. Channel is not connected to: "

                                          + factory.getClusterName()

                          );

                      }

                      catch ( ChannelClosedException e ) {

                          throw new SearchException( "Unable to send Lucene work. Attempt to send message on closed JGroups channel" );

                      }

               

                    2. in org.jgroups.protocols.pbcast.FLUSH class:

                         case Event.MSG:

                              Message msg = (Message) evt.getArg();

                              Address dest = msg.getDest();

                              if (dest == null || dest.isMulticastAddress()) {

                                  // mcasts

                                  FlushHeader fh = (FlushHeader) msg.getHeader(this.id);

                                  if (fh != null && fh.type == FlushHeader.FLUSH_BYPASS) {

                                      return down_prot.down(evt);

                                  } else {

                                      blockMessageDuringFlush();

                                  }

                              } else {

                                  // unicasts are irrelevant in virtual synchrony, let them through

                                  return down_prot.down(evt);

                              }

                              break;

                   ---------------------------

               

              The code above makes me think that the jgroupsSlave backend won't work properly on Amazon EC2 (which doesn't allow multicast).

              Does it make sense to conclude so?

               

              Best Regards,

              • 4. Re: Problem with jgroupsSlave backend for Infinispan
                dungleonhart

                Hi Sanne,

                 

                I've just upgraded to the latest Hibernate Search version as you recommended. Making the search feature in my project work properly is crucial for me, so I have to try every feasible approach.

                However, I'm getting a LockObtainFailedException again with just a few index operations (not under a load test). The scenario is:

                     - I start 2 local nodes (same configurations)

                     - Add some objects on node 1: everything is fine, and I can search them on both nodes.

                     - Add one object on node 2: the error below is thrown

                --------------------    

                2012-01-02 15:57:01,869 [Hibernate Search: Index updates queue processor for index tc.model.TestCase-1] ERROR org.hibernate.search.exception.impl.LogErrorHandler - HSEARCH000058: Exception occurred org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: org.infinispan.lucene.locking.BaseLuceneLock@21d1cd0d

                Primary Failure:

                    Entity tc.model.TestCase  Id 95  Work Type  org.hibernate.search.backend.AddLuceneWork

                org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: org.infinispan.lucene.locking.BaseLuceneLock@21d1cd0d

                    at org.apache.lucene.store.Lock.obtain(Lock.java:84)

                    at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1112)

                    at org.hibernate.search.backend.impl.lucene.IndexWriterHolder.createNewIndexWriter(IndexWriterHolder.java:125)

                    at org.hibernate.search.backend.impl.lucene.IndexWriterHolder.getIndexWriter(IndexWriterHolder.java:100)

                    at org.hibernate.search.backend.impl.lucene.AbstractWorkspaceImpl.getIndexWriter(AbstractWorkspaceImpl.java:114)

                    at org.hibernate.search.backend.impl.lucene.LuceneBackendQueueTask.applyUpdates(LuceneBackendQueueTask.java:101)

                    at org.hibernate.search.backend.impl.lucene.LuceneBackendQueueTask.run(LuceneBackendQueueTask.java:69)

                    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)

                    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)

                    at java.util.concurrent.FutureTask.run(FutureTask.java:138)

                    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)

                    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

                    at java.lang.Thread.run(Thread.java:662)

                -------------------

                     - Continue adding objects on node 1: still OK.

                     - Add on node 2: the error occurs again.

                 

                Please give me some advice on this weird behavior. I might have made some configuration mistakes.

                 

                1. Jars:

                     * hibernate-search-engine-4.0.0.Final.jar

                     * hibernate-search-orm-4.0.0.Final.jar

                     * hibernate-search-infinispan-4.0.0.Final.jar

                     * infinispan-core-5.1.0.CR1.jar

                     * infinispan-lucene-directory-5.0.1.FINAL.jar

                     * jgroups-3.0.1.Final.jar

                 

                2. Spring bean:

                     <bean id="sessionFactory" class="org.springframework.orm.hibernate4.LocalSessionFactoryBean">

                        <property name="dataSource" ref="dataSource" />

                        <property name="hibernateProperties">

                            <props>               

                                <prop key="hibernate.search.default.directory_provider">infinispan</prop>

                                <prop key="hibernate.search.infinispan.configuration_resourcename">hibernate-search-infinispan.xml</prop>

                                <prop key="hibernate.search.worker.backend">jgroupsMaster</prop>

                                <prop key="hibernate.search.worker.execution">async</prop>         

                                ....

                 

                3. Configuration files:

                     a. hibernate-search-infinispan.xml

                    

                <?xml version="1.0" encoding="UTF-8"?>

                <infinispan

                    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

                    xsi:schemaLocation="urn:infinispan:config:5.0 http://www.infinispan.org/schemas/infinispan-config-5.0.xsd"

                    xmlns="urn:infinispan:config:5.0">

                 

                    <!-- *************************** -->

                    <!-- System-wide global settings -->

                    <!-- *************************** -->

                 

                    <global>

                        <!-- Duplicate domains are allowed so that multiple deployments with default configuration

                            of Hibernate Search applications work - if possible it would be better to use JNDI to share

                            the CacheManager across applications -->

                        <globalJmxStatistics

                            enabled="true"

                            cacheManagerName="HibernateSearch"

                            allowDuplicateDomains="true" />

                 

                        <!-- If the transport is omitted, there is no way to create distributed or clustered

                            caches. There is no added cost to defining a transport but not creating a cache that uses one,

                            since the transport is created and initialized lazily. -->

                        <transport

                            clusterName="HibernateSearch-Infinispan-cluster"

                            distributedSyncTimeout="300000" >

                            <!-- Note that the JGroups transport uses sensible defaults if no configuration

                                property is defined. See the JGroupsTransport javadocs for more flags -->

                            <properties>

                                <property name="configurationFile" value="jdbc_ping.xml" />

                            </properties>

                        </transport>      

                        <!-- Used to register JVM shutdown hooks. hookBehavior: DEFAULT, REGISTER, DONT_REGISTER.

                            Hibernate Search takes care to stop the CacheManager so registering is not needed -->

                        <shutdown

                            hookBehavior="DONT_REGISTER" />

                    </global>

                 

                    <!-- *************************** -->

                    <!-- Default "template" settings -->

                    <!-- *************************** -->

                 

                    <default>

                 

                        <locking

                            lockAcquisitionTimeout="300000"

                            writeSkewCheck="false"

                            concurrencyLevel="5000"

                            useLockStriping="false" />

                 

                        <!-- Invocation batching is required for use with the Lucene Directory -->

                        <invocationBatching

                            enabled="true" />

                 

                        <!-- This element specifies that the cache is clustered. modes supported: distribution

                            (d), replication (r) or invalidation (i). Don't use invalidation to store Lucene indexes (as

                            with Hibernate Search DirectoryProvider). Replication is recommended for best performance of

                            Lucene indexes, but make sure you have enough memory to store the index in your heap.

                            Also distribution scales much better than replication on high number of nodes in the cluster. -->

                        <clustering

                            mode="replication">

                 

                            <!-- Prefer loading all data at startup than later -->

                            <stateRetrieval

                                timeout="300000"

                                logFlushTimeout="300000"

                                fetchInMemoryState="true"

                                alwaysProvideInMemoryState="true" />

                 

                            <!-- Network calls are synchronous by default -->

                            <sync

                                replTimeout="300000" />

                        </clustering>

                 

                        <jmxStatistics

                            enabled="true" />

                        <eviction

                            maxEntries="-1"

                            strategy="NONE" />

                        <expiration

                            maxIdle="-1" />

                 

                    </default>

                 

                    <!-- ******************************************************************************* -->

                    <!-- Individually configured "named" caches.                                         -->

                    <!--                                                                                 -->

                    <!-- While default configuration happens to be fine with similar settings across the -->

                    <!-- three caches, they should generally be different in a production environment.   -->

                    <!--                                                                                 -->

                    <!-- Current settings could easily lead to OutOfMemory exception as a CacheStore     -->

                    <!-- should be enabled, and maybe distribution is desired.                           -->

                    <!-- ******************************************************************************* -->

                 

                    <!-- *************************************** -->

                    <!--  Cache to store Lucene's file metadata  -->

                    <!-- *************************************** -->

                    <namedCache

                        name="LuceneIndexesMetadata">

                        <clustering

                            mode="replication">

                            <stateRetrieval

                                fetchInMemoryState="true"

                                logFlushTimeout="300000" />

                            <sync

                                replTimeout="300000" />

                        </clustering>

                    </namedCache>

                 

                    <!-- **************************** -->

                    <!--  Cache to store Lucene data  -->

                    <!-- **************************** -->

                    <namedCache

                        name="LuceneIndexesData">

                        <clustering

                            mode="replication">

                            <stateRetrieval

                                fetchInMemoryState="true"

                                logFlushTimeout="300000" />

                            <sync

                                replTimeout="300000" />

                        </clustering>

                    </namedCache>

                 

                    <!-- ***************************** -->

                    <!--  Cache to store Lucene locks  -->

                    <!-- ***************************** -->

                    <namedCache

                        name="LuceneIndexesLocking">

                        <clustering

                            mode="replication">

                            <stateRetrieval

                                fetchInMemoryState="true"

                                logFlushTimeout="300000" />

                            <sync

                                replTimeout="300000" />

                        </clustering>

                    </namedCache>

                 

                </infinispan>

                -----------------------------------

                 

                     b. jdbc_ping.xml


                <config xmlns="urn:org:jgroups" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

                    xsi:schemaLocation="urn:org:jgroups file:schema/JGroups-2.8.xsd">

                    <TCP bind_port="${jgroups.tcp.port:7800}" loopback="false" recv_buf_size="${tcp.recv_buf_size:20M}"

                        send_buf_size="${tcp.send_buf_size:640K}"

                        discard_incompatible_packets="true" max_bundle_size="64K"

                        max_bundle_timeout="30" enable_bundling="true" use_send_queues="true"

                        sock_conn_timeout="300" timer_type="new" timer.min_threads="4"

                        timer.max_threads="10" timer.keep_alive_time="3000"

                        timer.queue_max_size="500" thread_pool.enabled="true"

                        thread_pool.min_threads="1" thread_pool.max_threads="10"

                        thread_pool.keep_alive_time="5000" thread_pool.queue_enabled="false"

                        thread_pool.queue_max_size="100" thread_pool.rejection_policy="discard"

                 

                        oob_thread_pool.enabled="true" oob_thread_pool.min_threads="1"

                        oob_thread_pool.max_threads="8" oob_thread_pool.keep_alive_time="5000"

                        oob_thread_pool.queue_enabled="false" oob_thread_pool.queue_max_size="100"

                        oob_thread_pool.rejection_policy="discard" />

                 

                    <JDBC_PING connection_driver="com.mysql.jdbc.Driver"

                        connection_username="root" connection_password="root"

                        connection_url="jdbc:mysql://localhost/clientdb2" level="debug" />

                       

                    <MERGE2 min_interval="10000" max_interval="30000" />

                    <FD_SOCK />

                    <FD timeout="3000" max_tries="3" />

                    <VERIFY_SUSPECT timeout="1500" />

                    <BARRIER />

                    <pbcast.NAKACK use_mcast_xmit="false"

                        exponential_backoff="500" discard_delivered_msgs="true" />

                    <UNICAST />

                    <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"

                        max_bytes="4M" />

                    <pbcast.GMS print_local_addr="true" join_timeout="3000"

                 

                        view_bundling="true" />

                    <UFC max_credits="2M" min_threshold="0.4" />

                    <MFC max_credits="2M" min_threshold="0.4" />

                    <FRAG2 frag_size="60K" />

                    <pbcast.STATE_TRANSFER />

                </config>

                 

                    

                    

                • 5. Re: Problem with jgroupsSlave backend for Infinispan
                  sannegrinovero

                  Hi,

                  Excellent! Was it easy to update all the dependencies? I would expect so, but let me know if we need to clarify something in the docs; we don't want people to stay stuck on old versions.

                   

                  The locking error you're seeing now is because, since version 4.0, the property exclusive_index_use defaults to true. This option existed in previous versions too, and setting it to true was highly recommended, but we waited for the major release to change the default.

                   

                  So what happens is that the first node able to acquire the lock keeps it until the SearchFactory is shut down; exactly what you're experiencing. This is because, if you don't configure an alternative backend, Hibernate Search assumes (by default) that it is the only user of the index (hence the option name).

                  You have two options:

                  • disable it by setting exclusive_index_use=false on all indexes you need (a poor choice; see below)
                  • configure the master/slave backends

                   

                  Of course under load the first option will fail, and with some bad luck it might even fail under low load, so I wouldn't recommend it unless you have external locks.
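
                  If you did want to try that first option anyway, it would be roughly a one-line change to the Spring configuration from your first post; a minimal sketch (the "default" scope shown here can also be replaced by a specific index name):

                       <prop key="hibernate.search.default.exclusive_index_use">false</prop>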

                  The good news is that with the latest versions you're not affected by HSEARCH-975, so configuring a master/slave setup should be rather easy.

                  Consider that, since you are on Hibernate Search 4, you now have the option (if you need massive scalability) to have a different master node for each index; see the sketch below.
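
                  As a rough sketch of that idea, assuming the per-index property scoping of Hibernate Search 4 and the default index name, which is the fully qualified entity class name (e.g. tc.model.TestCase from your stack trace); adjust to your real index names:

                       <!-- this node acts as master for the TestCase index only -->
                       <prop key="hibernate.search.tc.model.TestCase.worker.backend">jgroupsMaster</prop>
                       <!-- and as slave for every other index -->
                       <prop key="hibernate.search.default.worker.backend">jgroupsSlave</prop>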

                   

                  Regarding the FLUSH code you highlighted above, I don't think that comment literally means multicast in the sense of a network packet type; I'm quite sure JGroups is able to handle FLUSH properly even on EC2. I'd assume the comment should be rephrased as "send it to everyone using the best means possible", which would be IP multicast in most cases but is more precisely defined by the rest of the configured protocols.

                  Anyway, I'll ask a JGroups expert to confirm. Thanks for looking into it!

                  • 6. Re: Problem with jgroupsSlave backend for Infinispan
                    sannegrinovero

                    Just noticed this:

                    infinispan-core-5.1.0.CR1.jar

                    jgroups-3.0.1.Final.jar

                     

                    Sorry, you're going too far with the updates now: Hibernate Search 4.0.0.Final depends on Infinispan 5.0.1.FINAL and JGroups 2.12.1.3.Final.

                    You will need Hibernate Search 4.1.x to be compatible with JGroups 3.x. Sorry for the confusion; I'd suggest always checking the Maven definitions, as they specify the versions we use for our tests. Minor component upgrades will usually be OK, but in this case these are major version numbers, which are allowed to change APIs (and actually do!).
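
                    For example, if you build with Maven, something along these lines should pull the tested versions in transitively instead of you overriding them (a sketch only, assuming the org.hibernate group id; drop any explicit infinispan-core / jgroups entries from your own pom):

                        <dependency>
                            <groupId>org.hibernate</groupId>
                            <artifactId>hibernate-search-infinispan</artifactId>
                            <version>4.0.0.Final</version>
                        </dependency>
                        <!-- no explicit Infinispan/JGroups dependencies: let this artifact bring in
                             Infinispan 5.0.1.FINAL and JGroups 2.12.1.3.Final transitively -->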

                    • 7. Re: Problem with jgroupsSlave backend for Infinispan
                      belaban

                      A multicast in JGroups means a message to *all* cluster members, and *not* an IP multicast, so your assumption above is incorrect.

                      Bela

                      • 8. Re: Problem with jgroupsSlave backend for Infinispan
                        dungleonhart

                        Hi Sanne,

                         

                        Thanks a lot for your answer.

                        Unfortunately, my lead won't allow me to upgrade to Hibernate 4, so I have to go back to version 3.4.1.

                        For the sake of our system's stability, I've decided to use the manual indexing strategy and run a scheduled task for indexing.

                         

                        Best Regards,

                        • 9. Re: Problem with jgroupsSlave backend for Infinispan
                          dungleonhart

                          Hi Bela,

                           

                          It's great to have your confirmation.

                          By the way, I'm also facing a big problem with JGroups:

                               - I ran a load test with 4 nodes and monitored their CPU usage.

                               - When a node comes under too heavy a load and reaches 100% CPU usage, it throws lots of warnings and stays pegged at 100% CPU for a while:

                                   

                          05 Jan 2012 06:42:27 WARN pbcast.NAKACK - (requester=ip-10-162-55-83-37000, local_addr=ip-10-162-54-111-4971) message ip-10-162-54-111-4971::9648 not found in retransmission table of ip-10-162-54-111-4971:

                          [9648 : 9651 (9651) (size=3, missing=0, highest stability=9648)]

                          05 Jan 2012 06:42:34 WARN pbcast.NAKACK - (requester=ip-10-162-55-83-37000, local_addr=ip-10-162-54-111-4971) message ip-10-162-54-111-4971::9648 not found in retransmission table of ip-10-162-54-111-4971:

                          [9648 : 9651 (9651) (size=3, missing=0, highest stability=9648)]

                          05 Jan 2012 06:42:39 WARN pbcast.NAKACK - (requester=ip-10-162-55-83-37000, local_addr=ip-10-162-54-111-4971) message ip-10-162-54-111-4971::9648 not found in retransmission table of ip-10-162-54-111-4971:

                          [9648 : 9651 (9651) (size=3, missing=0, highest stability=9648)]

                          05 Jan 2012 06:42:44 WARN pbcast.NAKACK - (requester=ip-10-162-55-83-37000, local_addr=ip-10-162-54-111-4971) message ip-10-162-54-111-4971::9648 not found in retransmission table of ip-10-162-54-111-4971:

                          [9648 : 9651 (9651) (size=3, missing=0, highest stability=9648)]

                          05 Jan 2012 06:42:49 WARN pbcast.NAKACK - (requester=ip-10-162-55-83-37000, local_addr=ip-10-162-54-111-4971) message ip-10-162-54-111-4971::9648 not found in retransmission table of ip-10-162-54-111-4971:

                          [9648 : 9651 (9651) (size=3, missing=0, highest stability=9648)]

                          05 Jan 2012 06:42:54 WARN pbcast.NAKACK - (requester=ip-10-162-55-83-37000, local_addr=ip-10-162-54-111-4971) message ip-10-162-54-111-4971::9648 not found in retransmission table of ip-10-162-54-111-4971:

                          [9648 : 9651 (9651) (size=3, missing=0, highest stability=9648)]

                          05 Jan 2012 06:42:59 WARN pbcast.NAKACK - (requester=ip-10-162-55-83-37000, local_addr=ip-10-162-54-111-4971) message ip-10-162-54-111-4971::9648 not found in retransmission table of ip-10-162-54-111-4971:

                          [9648 : 9651 (9651) (size=3, missing=0, highest stability=9648)]

                          --------------------------

                           

                          05 Jan 2012 06:40:22 WARN pbcast.NAKACK - ip-10-162-55-83-37000: dropped message from ip-10-146-17-132-18167 (not in table [ip-10-162-54-111-4971, ip-10-162-55-83-37000]), view=[ip-10-162-55-83-37000|27] [ip-10-162-55-83-37000, ip-10-162-54-111-4971]

                          05 Jan 2012 06:40:49 WARN pbcast.NAKACK - ip-10-162-55-83-37000: dropped message from ip-10-146-17-132-18167 (not in table [ip-10-162-54-111-4971, ip-10-162-55-83-37000]), view=[ip-10-162-55-83-37000|27] [ip-10-162-55-83-37000, ip-10-162-54-111-4971]

                          05 Jan 2012 06:41:02 WARN pbcast.NAKACK - ip-10-162-55-83-37000: dropped message from ip-10-146-17-132-18167 (not in table [ip-10-162-54-111-4971, ip-10-162-55-83-37000]), view=[ip-10-162-55-83-37000|27] [ip-10-162-55-83-37000, ip-10-162-54-111-4971]

                          05 Jan 2012 06:41:24 WARN protocols.TCP - ip-10-162-55-83-37000: no physical address for ip-10-156-134-94-9765, dropping message

                          05 Jan 2012 06:41:33 WARN pbcast.NAKACK - ip-10-162-55-83-37000: dropped message from ip-10-156-134-94-9765 (not in table [ip-10-162-54-111-4971, ip-10-162-55-83-37000]), view=[ip-10-162-55-83-37000|27] [ip-10-162-55-83-37000, ip-10-162-54-111-4971]

                          05 Jan 2012 06:41:33 WARN protocols.TCP - ip-10-162-55-83-37000: no physical address for ip-10-156-134-94-9765, dropping message

                          05 Jan 2012 06:41:39 WARN pbcast.GMS - ip-10-162-55-83-37000: did not get any merge responses from partition coordinators, merge is cancelled

                          05 Jan 2012 06:41:39 WARN protocols.TCP - ip-10-162-55-83-37000: no physical address for ip-10-156-134-94-9765, dropping message

                          05 Jan 2012 06:41:43 WARN pbcast.NAKACK - ip-10-162-55-83-37000: dropped message from ip-10-146-17-132-18167 (not in table [ip-10-162-54-111-4971, ip-10-162-55-83-37000]), view=[ip-10-162-55-83-37000|27] [ip-10-162-55-83-37000, ip-10-162-54-111-4971]

                          05 Jan 2012 06:41:46 WARN pbcast.NAKACK - ip-10-162-55-83-37000: dropped message from ip-10-146-17-132-18167 (not in table [ip-10-162-54-111-4971, ip-10-162-55-83-37000]), view=[ip-10-162-55-83-37000|27] [ip-10-162-55-83-37000, ip-10-162-54-111-4971]

                          05 Jan 2012 06:41:48 WARN protocols.TCP - ip-10-162-55-83-37000: no physical address for ip-10-156-134-94-9765, dropping message

                          --------------------

                           

                               - These errors prevent us from scaling out the number of nodes; the cluster only works stably with 2 or 3 nodes.

                           

                          Could you give me some advice on this problem?

                          Please find my configuration in the first post of this thread.

                           

                          Thanks a lot, and best regards,

                          • 10. Re: Problem with jgroupsSlave backend for Infinispan
                            belaban

                            Did you get a stack trace to see what's going on when the CPU is pegged at 100%? Do the retransmissions shown go on forever?

                            Is this reproducible?

                             

                            You can definitely run clusters bigger than 2-3 nodes :-) Can you update to the latest 2.12.x release of JGroups?

                            • 11. Re: Problem with jgroupsSlave backend for Infinispan
                              belaban

                              N.B.: if you have a system that runs into this problem again and it is not a production system, leave it in that state: there are JMX calls that can retrieve useful information from it!