11 Replies Latest reply on Jan 7, 2012 9:58 AM by belaban

    Problem with jgroupsSlave backend for Infinispan

    dungleonhart

      Hi,

       

      I got stuck when trying to configure jgroupsSlave as back-end worker for Infinispan (which is used for Hibernate Search in my project).

      I've installed my project on 2 nodes on Amazon EC2 (with all ports open); if I set both of them to use the jgroupsMaster back-end, everything is fine. Then I wanted to try jgroupsSlave on one node, because I think that as the number of nodes increases, the performance of the Master-Slave model would be better than the Peer-to-peer model. (Am I correct?) Unfortunately for me, it didn't work as expected:

           - When I update an object from the Slave node, the data is not indexed. I can NOT find the updated data on either node.

           - But when I update an object from the Master node, it works just as in the 2-Master-node case. I can search the updated data on both nodes.

      It seems JGroups messages cannot be sent properly from the Slave node to the Master node (just a guess).

       

      Please help me deal with this issue. Here are my Infinispan & JGroups configurations:

       

           1. Jars:

                * hibernate-search-3.4.1.Final.jar

                * hibernate-search-infinispan-3.4.1.Final.jar

                * infinispan-core-4.2.1.FINAL.jar

                * infinispan-lucene-directory-4.2.1.FINAL.jar

                * jgroups-2.11.1.Final.jar

       

           2. Hibernate:

                a. Master node:              

                     <bean id="sessionFactory" class="org.springframework.orm.hibernate3.annotation.AnnotationSessionFactoryBean">

                          <property name="hibernateProperties">

                               <props>

                                                                       <prop key="hibernate.dialect">org.hibernate.dialect.MySQLDialect</prop>

                                                                       <prop key="hibernate.search.default.directory_provider">infinispan</prop>

                                                                       <prop key="hibernate.search.infinispan.configuration_resourcename">hibernate-search-infinispan.xml</prop>

                                                                       <prop key="hibernate.search.worker.backend">jgroupsMaster</prop>

                                 <prop key="hibernate.search.worker.execution">async</prop>

                               </props>

                          </property>

                     </bean>

                b. Slave node               

                     <bean id="sessionFactory" class="org.springframework.orm.hibernate3.annotation.AnnotationSessionFactoryBean">

                          <property name="hibernateProperties">

                               <props>

                                 <prop key="hibernate.dialect">org.hibernate.dialect.MySQLDialect</prop>

                                 <prop key="hibernate.search.default.directory_provider">infinispan</prop>

                                 <prop key="hibernate.search.infinispan.configuration_resourcename">hibernate-search-infinispan.xml</prop>

                                 <prop key="hibernate.search.worker.backend">jgroupsSlave</prop>

                                 <prop key="hibernate.search.worker.execution">async</prop>

                               </props>

                          </property>

                     </bean>

       

           3. Configuration files: please find them in the attachments (hibernate-search-infinispan.xml & jdbc_ping.xml)

        • 1. Re: Problem with jgroupsSlave backend for Infinispan
          dungleonhart

          Hi,

           

          One more thing I suspect: when I put 2 nodes on my local machine (2 Tomcat servers), it works as expected. I wonder whether it might only work well in environments that allow multicast?

           

          Best Regards,

          • 2. Re: Problem with jgroupsSlave backend for Infinispan
            sannegrinovero

            Hello,

            I think you might be affected by https://hibernate.onjira.com/browse/HSEARCH-975

            You can find some workarounds on the Hibernate Search forums: https://forum.hibernate.org/viewtopic.php?f=9&t=1013648

             

            The short story: JGroups changed some method signatures, making the JGroups backend incompatible with the version required by Infinispan.

             

            I will see if I can fix HSEARCH-975, but I'm not sure I can unless I drop Java 5 compatibility.

            Is upgrading to the latest Hibernate Search version, 4.0.0.Final, not an option for you?

            • 3. Re: Problem with jgroupsSlave backend for Infinispan
              dungleonhart

              Hi Sanne,

               

              Upgrading to Hibernate Search 4.0.0.Final seems painful to me since my project is tied to Hibernate 3.

               

              From this thread https://forum.hibernate.org/viewtopic.php?f=9&t=1013648, it seems some problems remain even if I use the Master/Slave model? Anyway, I have to give it a try.

               

              Moreover, when I debugged the hibernate-search-3.4.1.Final and jgroups-2.11.1.Final code, I saw these lines:

                   1. org.hibernate.search.backend.impl.jgroups.JGroupsBackendQueueProcessor class:        

                      /* Creates and send message with lucene works to master.

                       * As long as message destination address is null, Lucene works will be received by all listeners that implements

                       * org.jgroups.MessageListener interface, multiple master nodes in cluster are allowed. */

                      try {

                          Message message = new Message( null, factory.getAddress(), ( Serializable ) filteredQueue );

                          factory.getChannel().send( message );

                          if ( trace ) {

                              log.trace( "Lucene works have been sent from slave {} to master node.", factory.getAddress() );

                          }

                      }

                      catch ( ChannelNotConnectedException e ) {

                          throw new SearchException(

                                  "Unable to send Lucene work. Channel is not connected to: "

                                          + factory.getClusterName()

                          );

                      }

                      catch ( ChannelClosedException e ) {

                          throw new SearchException( "Unable to send Lucene work. Attempt to send message on closed JGroups channel" );

                      }

               

                    2. in org.jgroups.protocols.pbcast.FLUSH class:

                         case Event.MSG:

                              Message msg = (Message) evt.getArg();

                              Address dest = msg.getDest();

                              if (dest == null || dest.isMulticastAddress()) {

                                  // mcasts

                                  FlushHeader fh = (FlushHeader) msg.getHeader(this.id);

                                  if (fh != null && fh.type == FlushHeader.FLUSH_BYPASS) {

                                      return down_prot.down(evt);

                                  } else {

                                      blockMessageDuringFlush();

                                  }

                              } else {

                                  // unicasts are irrelevant in virtual synchrony, let them through

                                  return down_prot.down(evt);

                              }

                              break;

                   ---------------------------

               

              The code above makes me think that the jgroupsSlave backend won't work properly on Amazon EC2 (which doesn't allow multicast).

              Does it make sense to conclude so?

               

              Best Regards,

              • 4. Re: Problem with jgroupsSlave backend for Infinispan
                dungleonhart

                Hi Sanne,

                 

                I've just upgraded to the latest Hibernate Search version as you recommended. Making the search feature in my project work properly is crucial for me, so I have to try every feasible approach.

                However, I'm getting a LockObtainFailedException again with just a few index operations (not under a load test). The scenario is:

                     - I start 2 local nodes (same configurations)

                     - Add some objects on node 1: everything is fine, and I can search them on both nodes.

                     - Add one object on node 2: the error below is thrown

                --------------------    

                2012-01-02 15:57:01,869 [Hibernate Search: Index updates queue processor for index tc.model.TestCase-1] ERROR org.hibernate.search.exception.impl.LogErrorHandler - HSEARCH000058: Exception occurred org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: org.infinispan.lucene.locking.BaseLuceneLock@21d1cd0d

                Primary Failure:

                    Entity tc.model.TestCase  Id 95  Work Type  org.hibernate.search.backend.AddLuceneWork

                org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: org.infinispan.lucene.locking.BaseLuceneLock@21d1cd0d

                    at org.apache.lucene.store.Lock.obtain(Lock.java:84)

                    at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1112)

                    at org.hibernate.search.backend.impl.lucene.IndexWriterHolder.createNewIndexWriter(IndexWriterHolder.java:125)

                    at org.hibernate.search.backend.impl.lucene.IndexWriterHolder.getIndexWriter(IndexWriterHolder.java:100)

                    at org.hibernate.search.backend.impl.lucene.AbstractWorkspaceImpl.getIndexWriter(AbstractWorkspaceImpl.java:114)

                    at org.hibernate.search.backend.impl.lucene.LuceneBackendQueueTask.applyUpdates(LuceneBackendQueueTask.java:101)

                    at org.hibernate.search.backend.impl.lucene.LuceneBackendQueueTask.run(LuceneBackendQueueTask.java:69)

                    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)

                    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)

                    at java.util.concurrent.FutureTask.run(FutureTask.java:138)

                    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)

                    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

                    at java.lang.Thread.run(Thread.java:662)

                -------------------

                     - Continue adding objects on node 1: still OK.

                     - Add on node 2: the error occurs again.

                 

                Please give me some advice on this weird behavior. I might have made some configuration mistakes.

                 

                1. Jars:

                     * hibernate-search-engine-4.0.0.Final.jar

                     * hibernate-search-orm-4.0.0.Final.jar

                     * hibernate-search-infinispan-4.0.0.Final.jar

                     * infinispan-core-5.1.0.CR1.jar

                     * infinispan-lucene-directory-5.0.1.FINAL.jar

                     * jgroups-3.0.1.Final.jar

                 

                2. Spring bean:

                     <bean id="sessionFactory" class="org.springframework.orm.hibernate4.LocalSessionFactoryBean">

                        <property name="dataSource" ref="dataSource" />

                        <property name="hibernateProperties">

                            <props>               

                                <prop key="hibernate.search.default.directory_provider">infinispan</prop>

                                <prop key="hibernate.search.infinispan.configuration_resourcename">hibernate-search-infinispan.xml</prop>

                                <prop key="hibernate.search.worker.backend">jgroupsMaster</prop>

                                <prop key="hibernate.search.worker.execution">async</prop>         

                                ....

                 

                3. Configuration files:

                     a. hibernate-search-infinispan.xml

                    

                <?xml version="1.0" encoding="UTF-8"?>

                <infinispan

                    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

                    xsi:schemaLocation="urn:infinispan:config:5.0 http://www.infinispan.org/schemas/infinispan-config-5.0.xsd"

                    xmlns="urn:infinispan:config:5.0">

                 

                    <!-- *************************** -->

                    <!-- System-wide global settings -->

                    <!-- *************************** -->

                 

                    <global>

                        <!-- Duplicate domains are allowed so that multiple deployments with default configuration

                            of Hibernate Search applications work - if possible it would be better to use JNDI to share

                            the CacheManager across applications -->

                        <globalJmxStatistics

                            enabled="true"

                            cacheManagerName="HibernateSearch"

                            allowDuplicateDomains="true" />

                 

                        <!-- If the transport is omitted, there is no way to create distributed or clustered

                            caches. There is no added cost to defining a transport but not creating a cache that uses one,

                            since the transport is created and initialized lazily. -->

                        <transport

                            clusterName="HibernateSearch-Infinispan-cluster"

                            distributedSyncTimeout="300000" >

                            <!-- Note that the JGroups transport uses sensible defaults if no configuration

                                property is defined. See the JGroupsTransport javadocs for more flags -->

                            <properties>

                                <property name="configurationFile" value="jdbc_ping.xml" />

                            </properties>

                        </transport>      

                        <!-- Used to register JVM shutdown hooks. hookBehavior: DEFAULT, REGISTER, DONT_REGISTER.

                            Hibernate Search takes care to stop the CacheManager so registering is not needed -->

                        <shutdown

                            hookBehavior="DONT_REGISTER" />

                    </global>

                 

                    <!-- *************************** -->

                    <!-- Default "template" settings -->

                    <!-- *************************** -->

                 

                    <default>

                 

                        <locking

                            lockAcquisitionTimeout="300000"

                            writeSkewCheck="false"

                            concurrencyLevel="5000"

                            useLockStriping="false" />

                 

                        <!-- Invocation batching is required for use with the Lucene Directory -->

                        <invocationBatching

                            enabled="true" />

                 

                        <!-- This element specifies that the cache is clustered. modes supported: distribution

                            (d), replication (r) or invalidation (i). Don't use invalidation to store Lucene indexes (as

                            with Hibernate Search DirectoryProvider). Replication is recommended for best performance of

                            Lucene indexes, but make sure you have enough memory to store the index in your heap.

                            Also distribution scales much better than replication on high number of nodes in the cluster. -->

                        <clustering

                            mode="replication">

                 

                            <!-- Prefer loading all data at startup than later -->

                            <stateRetrieval

                                timeout="300000"

                                logFlushTimeout="300000"

                                fetchInMemoryState="true"

                                alwaysProvideInMemoryState="true" />

                 

                            <!-- Network calls are synchronous by default -->

                            <sync

                                replTimeout="300000" />

                        </clustering>

                 

                        <jmxStatistics

                            enabled="true" />

                        <eviction

                            maxEntries="-1"

                            strategy="NONE" />

                        <expiration

                            maxIdle="-1" />

                 

                    </default>

                 

                    <!-- ******************************************************************************* -->

                    <!-- Individually configured "named" caches.                                         -->

                    <!--                                                                                 -->

                    <!-- While default configuration happens to be fine with similar settings across the -->

                    <!-- three caches, they should generally be different in a production environment.   -->

                    <!--                                                                                 -->

                    <!-- Current settings could easily lead to OutOfMemory exception as a CacheStore     -->

                    <!-- should be enabled, and maybe distribution is desired.                           -->

                    <!-- ******************************************************************************* -->

                 

                    <!-- *************************************** -->

                    <!--  Cache to store Lucene's file metadata  -->

                    <!-- *************************************** -->

                    <namedCache

                        name="LuceneIndexesMetadata">

                        <clustering

                            mode="replication">

                            <stateRetrieval

                                fetchInMemoryState="true"

                                logFlushTimeout="300000" />

                            <sync

                                replTimeout="300000" />

                        </clustering>

                    </namedCache>

                 

                    <!-- **************************** -->

                    <!--  Cache to store Lucene data  -->

                    <!-- **************************** -->

                    <namedCache

                        name="LuceneIndexesData">

                        <clustering

                            mode="replication">

                            <stateRetrieval

                                fetchInMemoryState="true"

                                logFlushTimeout="300000" />

                            <sync

                                replTimeout="300000" />

                        </clustering>

                    </namedCache>

                 

                    <!-- ***************************** -->

                    <!--  Cache to store Lucene locks  -->

                    <!-- ***************************** -->

                    <namedCache

                        name="LuceneIndexesLocking">

                        <clustering

                            mode="replication">

                            <stateRetrieval

                                fetchInMemoryState="true"

                                logFlushTimeout="300000" />

                            <sync

                                replTimeout="300000" />

                        </clustering>

                    </namedCache>

                 

                </infinispan>

                -----------------------------------

                 

                     b. jdbc_ping.xml


                <config xmlns="urn:org:jgroups" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

                    xsi:schemaLocation="urn:org:jgroups file:schema/JGroups-2.8.xsd">

                    <TCP bind_port="${jgroups.tcp.port:7800}" loopback="false" recv_buf_size="${tcp.recv_buf_size:20M}"

                        send_buf_size="${tcp.send_buf_size:640K}"

                        discard_incompatible_packets="true" max_bundle_size="64K"

                        max_bundle_timeout="30" enable_bundling="true" use_send_queues="true"

                        sock_conn_timeout="300" timer_type="new" timer.min_threads="4"

                        timer.max_threads="10" timer.keep_alive_time="3000"

                        timer.queue_max_size="500" thread_pool.enabled="true"

                        thread_pool.min_threads="1" thread_pool.max_threads="10"

                        thread_pool.keep_alive_time="5000" thread_pool.queue_enabled="false"

                        thread_pool.queue_max_size="100" thread_pool.rejection_policy="discard"

                 

                        oob_thread_pool.enabled="true" oob_thread_pool.min_threads="1"

                        oob_thread_pool.max_threads="8" oob_thread_pool.keep_alive_time="5000"

                        oob_thread_pool.queue_enabled="false" oob_thread_pool.queue_max_size="100"

                        oob_thread_pool.rejection_policy="discard" />

                 

                    <JDBC_PING connection_driver="com.mysql.jdbc.Driver"

                        connection_username="root" connection_password="root"

                        connection_url="jdbc:mysql://localhost/clientdb2" level="debug" />

                       

                    <MERGE2 min_interval="10000" max_interval="30000" />

                    <FD_SOCK />

                    <FD timeout="3000" max_tries="3" />

                    <VERIFY_SUSPECT timeout="1500" />

                    <BARRIER />

                    <pbcast.NAKACK use_mcast_xmit="false"

                        exponential_backoff="500" discard_delivered_msgs="true" />

                    <UNICAST />

                    <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"

                        max_bytes="4M" />

                    <pbcast.GMS print_local_addr="true" join_timeout="3000"

                 

                        view_bundling="true" />

                    <UFC max_credits="2M" min_threshold="0.4" />

                    <MFC max_credits="2M" min_threshold="0.4" />

                    <FRAG2 frag_size="60K" />

                    <pbcast.STATE_TRANSFER />

                </config>

                 

                    

                    

                • 5. Re: Problem with jgroupsSlave backend for Infinispan
                  sannegrinovero

                  Hi,

                  Excellent! Was it easy to update all the dependencies? I would expect so, but let me know if we need to clarify something in the docs; we don't want people to stay stuck on old versions.

                   

                  The locking error you're seeing now is because, since version 4.0, the property exclusive_index_use defaults to true. This option existed in previous versions too, and setting it to true was highly recommended, but we waited for the major release to change the default.

                   

                  So what happens is that the first node able to acquire the lock keeps it until the SearchFactory is shut down; exactly what you're experiencing. This is because, if you don't configure an alternative backend, Hibernate Search assumes (by default) that it is the only user of the index (hence the option name).

                  You have two options:

                  • disable it by setting exclusive_index_use=false on all indexes you need (a poor choice; see below)
                  • configure the master/slave backends

                   

                  Of course under load the first option will fail, and with some bad luck it might even fail under low load, so I wouldn't recommend it unless you have external locks.
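
                  If you did want to try that first option anyway, it would be roughly a one-line change to the Spring configuration from your first post; a minimal sketch (the "default" scope shown here can also be replaced by a specific index name):

                       <prop key="hibernate.search.default.exclusive_index_use">false</prop>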

                  The good news is that with the latest versions you're not affected by HSEARCH-975, so configuring a master/slave setup should be rather easy.

                  Consider that, since you are on Hibernate Search 4, you now have the option (if you need massive scalability) to have a different master node for each index; see the sketch below.
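
                  As a rough sketch of that idea, assuming the per-index property scoping of Hibernate Search 4 and the default index name, which is the fully qualified entity class name (e.g. tc.model.TestCase from your stack trace); adjust to your real index names:

                       <!-- this node acts as master for the TestCase index only -->
                       <prop key="hibernate.search.tc.model.TestCase.worker.backend">jgroupsMaster</prop>
                       <!-- and as slave for every other index -->
                       <prop key="hibernate.search.default.worker.backend">jgroupsSlave</prop>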

                   

                  Regarding the FLUSH code you highlighted above, I don't think that comment literally means multicast in the sense of a network packet type; I'm quite sure JGroups is able to handle FLUSH properly even on EC2. I'd assume the comment should be rephrased as "send it to everyone using the best means possible", which would be IP multicast in most cases but is more precisely defined by the rest of the configured protocols.

                  Anyway, I'll ask a JGroups expert to confirm. Thanks for looking into it!

                  • 6. Re: Problem with jgroupsSlave backend for Infinispan
                    sannegrinovero

                    Just noticed this:

                    infinispan-core-5.1.0.CR1.jar

                    jgroups-3.0.1.Final.jar

                     

                    Sorry, you're going too far with the updates now: Hibernate Search 4.0.0.Final depends on Infinispan 5.0.1.FINAL and JGroups 2.12.1.3.Final.

                    You will need Hibernate Search 4.1.x to be compatible with JGroups 3.x. Sorry for the confusion; I'd suggest always checking the Maven definitions, as they specify the versions we use for our tests. Minor component upgrades will usually be OK, but in this case these are major version numbers, which are allowed to change APIs (and actually do!).
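
                    For example, if you build with Maven, something along these lines should pull the tested versions in transitively instead of you overriding them (a sketch only, assuming the org.hibernate group id; drop any explicit infinispan-core / jgroups entries from your own pom):

                        <dependency>
                            <groupId>org.hibernate</groupId>
                            <artifactId>hibernate-search-infinispan</artifactId>
                            <version>4.0.0.Final</version>
                        </dependency>
                        <!-- no explicit Infinispan/JGroups dependencies: let this artifact bring in
                             Infinispan 5.0.1.FINAL and JGroups 2.12.1.3.Final transitively -->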

                    • 7. Re: Problem with jgroupsSlave backend for Infinispan
                      belaban

                      A multicast in JGroups means a message to *all* cluster members, and *not* an IP multicast, so your assumption above is incorrect.

                      Bela

                      • 8. Re: Problem with jgroupsSlave backend for Infinispan
                        dungleonhart

                        Hi Sanne,

                         

                        Thanks a lot for your answer.

                        Unfortunately, my lead won't allow me to upgrade to Hibernate 4, so I have to go back to version 3.4.1.

                        For the sake of our system's stability, I've decided to use the manual indexing strategy and run a scheduled task for indexing.

                         

                        Best Regards,

                        • 9. Re: Problem with jgroupsSlave backend for Infinispan
                          dungleonhart

                          Hi Bela,

                           

                          It's great to have your confirmation.

                          By the way, I'm also facing a big problem with JGroups:

                               - I ran a load test with 4 nodes and monitored their CPU usage.

                               - When a node comes under too heavy a load and reaches 100% CPU usage, it throws lots of warnings and stays pegged at 100% CPU for a while:

                                   

                          05 Jan 2012 06:42:27 WARN pbcast.NAKACK - (requester=ip-10-162-55-83-37000, local_addr=ip-10-162-54-111-4971) message ip-10-162-54-111-4971::9648 not found in retransmission table of ip-10-162-54-111-4971:

                          [9648 : 9651 (9651) (size=3, missing=0, highest stability=9648)]

                          05 Jan 2012 06:42:34 WARN pbcast.NAKACK - (requester=ip-10-162-55-83-37000, local_addr=ip-10-162-54-111-4971) message ip-10-162-54-111-4971::9648 not found in retransmission table of ip-10-162-54-111-4971:

                          [9648 : 9651 (9651) (size=3, missing=0, highest stability=9648)]

                          05 Jan 2012 06:42:39 WARN pbcast.NAKACK - (requester=ip-10-162-55-83-37000, local_addr=ip-10-162-54-111-4971) message ip-10-162-54-111-4971::9648 not found in retransmission table of ip-10-162-54-111-4971:

                          [9648 : 9651 (9651) (size=3, missing=0, highest stability=9648)]

                          05 Jan 2012 06:42:44 WARN pbcast.NAKACK - (requester=ip-10-162-55-83-37000, local_addr=ip-10-162-54-111-4971) message ip-10-162-54-111-4971::9648 not found in retransmission table of ip-10-162-54-111-4971:

                          [9648 : 9651 (9651) (size=3, missing=0, highest stability=9648)]

                          05 Jan 2012 06:42:49 WARN pbcast.NAKACK - (requester=ip-10-162-55-83-37000, local_addr=ip-10-162-54-111-4971) message ip-10-162-54-111-4971::9648 not found in retransmission table of ip-10-162-54-111-4971:

                          [9648 : 9651 (9651) (size=3, missing=0, highest stability=9648)]

                          05 Jan 2012 06:42:54 WARN pbcast.NAKACK - (requester=ip-10-162-55-83-37000, local_addr=ip-10-162-54-111-4971) message ip-10-162-54-111-4971::9648 not found in retransmission table of ip-10-162-54-111-4971:

                          [9648 : 9651 (9651) (size=3, missing=0, highest stability=9648)]

                          05 Jan 2012 06:42:59 WARN pbcast.NAKACK - (requester=ip-10-162-55-83-37000, local_addr=ip-10-162-54-111-4971) message ip-10-162-54-111-4971::9648 not found in retransmission table of ip-10-162-54-111-4971:

                          [9648 : 9651 (9651) (size=3, missing=0, highest stability=9648)]

                          --------------------------

                           

                          05 Jan 2012 06:40:22 WARN pbcast.NAKACK - ip-10-162-55-83-37000: dropped message from ip-10-146-17-132-18167 (not in table [ip-10-162-54-111-4971, ip-10-162-55-83-37000]), view=[ip-10-162-55-83-37000|27] [ip-10-162-55-83-37000, ip-10-162-54-111-4971]

                          05 Jan 2012 06:40:49 WARN pbcast.NAKACK - ip-10-162-55-83-37000: dropped message from ip-10-146-17-132-18167 (not in table [ip-10-162-54-111-4971, ip-10-162-55-83-37000]), view=[ip-10-162-55-83-37000|27] [ip-10-162-55-83-37000, ip-10-162-54-111-4971]

                          05 Jan 2012 06:41:02 WARN pbcast.NAKACK - ip-10-162-55-83-37000: dropped message from ip-10-146-17-132-18167 (not in table [ip-10-162-54-111-4971, ip-10-162-55-83-37000]), view=[ip-10-162-55-83-37000|27] [ip-10-162-55-83-37000, ip-10-162-54-111-4971]

                          05 Jan 2012 06:41:24 WARN protocols.TCP - ip-10-162-55-83-37000: no physical address for ip-10-156-134-94-9765, dropping message

                          05 Jan 2012 06:41:33 WARN pbcast.NAKACK - ip-10-162-55-83-37000: dropped message from ip-10-156-134-94-9765 (not in table [ip-10-162-54-111-4971, ip-10-162-55-83-37000]), view=[ip-10-162-55-83-37000|27] [ip-10-162-55-83-37000, ip-10-162-54-111-4971]

                          05 Jan 2012 06:41:33 WARN protocols.TCP - ip-10-162-55-83-37000: no physical address for ip-10-156-134-94-9765, dropping message

                          05 Jan 2012 06:41:39 WARN pbcast.GMS - ip-10-162-55-83-37000: did not get any merge responses from partition coordinators, merge is cancelled

                          05 Jan 2012 06:41:39 WARN protocols.TCP - ip-10-162-55-83-37000: no physical address for ip-10-156-134-94-9765, dropping message

                          05 Jan 2012 06:41:43 WARN pbcast.NAKACK - ip-10-162-55-83-37000: dropped message from ip-10-146-17-132-18167 (not in table [ip-10-162-54-111-4971, ip-10-162-55-83-37000]), view=[ip-10-162-55-83-37000|27] [ip-10-162-55-83-37000, ip-10-162-54-111-4971]

                          05 Jan 2012 06:41:46 WARN pbcast.NAKACK - ip-10-162-55-83-37000: dropped message from ip-10-146-17-132-18167 (not in table [ip-10-162-54-111-4971, ip-10-162-55-83-37000]), view=[ip-10-162-55-83-37000|27] [ip-10-162-55-83-37000, ip-10-162-54-111-4971]

                          05 Jan 2012 06:41:48 WARN protocols.TCP - ip-10-162-55-83-37000: no physical address for ip-10-156-134-94-9765, dropping message

                          --------------------

                           

                               - These errors prevent us from scaling out the number of nodes; the cluster only works stably with 2 or 3 nodes.

                           

                          Could you give me some advice on this problem?

                          Please find my configuration in the first post of this thread.

                           

                          Thanks a lot, and best regards,

                          • 10. Re: Problem with jgroupsSlave backend for Infinispan
                            belaban

                            Did you get a stack trace to see what's going on when the CPU is pegged at 100%? Do the retransmissions shown go on forever?

                            Is this reproducible?

                             

                            You can definitely run clusters bigger than 2-3 nodes :-) Can you update to the latest 2.12.x release of JGroups?

                            • 11. Re: Problem with jgroupsSlave backend for Infinispan
                              belaban

                              N.B.: if you have a system that runs into this problem again and it is not a production system, leave it in that state: there are JMX calls that can retrieve useful information from it!