24 Replies. Latest reply on Aug 17, 2016 7:45 AM by wdfink

Infinispan server: Cache entries are not replicated although cache is in mode replicated async

ez1234 Jun 19, 2016 4:46 AM

Hi,

We are using Infinispan Server version 8.2.1.Final.

Our setup is:

1. We have two cluster nodes, master (server one) and slave (server two), running in domain mode.

2. The cache is configured as a replicated async cache.

3. Our application uses the Hot Rod client in round-robin mode, writing to and reading from each node.

Sometimes the replication does not work.

That is, the master node has Record X and the slave node has Record Y.

I would expect both nodes to have both Record X & Record Y.
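
One way to confirm this kind of divergence is to dump the key set seen by each node and diff the two. The following is only a minimal sketch: it assumes the keys have already been fetched from each node separately (how to do that with the Hot Rod client is left out), and `ReplicaDiff` and `missingFrom` are hypothetical names, not part of Infinispan.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper: compares the key sets read from two cluster nodes. */
final class ReplicaDiff {

    /** Returns the keys present in source but absent from target, in sorted order. */
    static <K extends Comparable<K>> Set<K> missingFrom(Set<K> source, Set<K> target) {
        Set<K> missing = new TreeSet<>(source);
        missing.removeAll(target);
        return missing;
    }

    public static void main(String[] args) {
        // Stand-in data mirroring the report: each node holds only its own record.
        Set<String> serverOneKeys = new HashSet<>(Arrays.asList("recordX"));
        Set<String> serverTwoKeys = new HashSet<>(Arrays.asList("recordY"));

        System.out.println("Missing on server two: " + missingFrom(serverOneKeys, serverTwoKeys));
        System.out.println("Missing on server one: " + missingFrom(serverTwoKeys, serverOneKeys));
    }
}
```

With a healthy replicated cache both results should be empty once in-flight async writes have settled.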

This has occurred several times, and we do not have a specific scenario that reproduces it. Maybe it is due to many writes to the cache on both nodes; we do not know.

In both console logs we see the following exception. We do not know whether it is related, but it looks like it is:

========================================================================================

[Server:server-one] 19:33:35,304 WARN [org.jgroups.protocols.TCP] (TcpServer.Acceptor [7600],null) JGRP000006: failed accepting connection from peer: java.net.SocketException: BaseServer.TcpConnection.readPeerAddress(): cookie read by 192.168.118.51:7600 does not match own cookie; terminating connection
[Server:server-one]     at org.jgroups.blocks.cs.TcpConnection.readPeerAddress(TcpConnection.java:256)
[Server:server-one]     at org.jgroups.blocks.cs.TcpConnection.<init>(TcpConnection.java:54)
[Server:server-one]     at org.jgroups.blocks.cs.TcpServer$Acceptor.handleAccept(TcpServer.java:132)
[Server:server-one]     at org.jgroups.blocks.cs.TcpServer$Acceptor.run(TcpServer.java:117)
[Server:server-one]     at java.lang.Thread.run(Thread.java:745)

[Server:server-one] 09:05:35,687 INFO [org.jboss.as.clustering.infinispan] (ServerService Thread Pool -- 52) DGISPN0001: Started ___defaultcache cache from clustered container

[Server:server-two] 19:31:40,350 WARN [org.jgroups.protocols.TCP] (TcpServer.Acceptor [7600],null) JGRP000006: failed accepting connection from peer: java.net.SocketException: BaseServer.TcpConnection.readPeerAddress(): cookie read by 192.168.118.52:7600 does not match own cookie; terminating connection
[Server:server-two]     at org.jgroups.blocks.cs.TcpConnection.readPeerAddress(TcpConnection.java:256)
[Server:server-two]     at org.jgroups.blocks.cs.TcpConnection.<init>(TcpConnection.java:54)
[Server:server-two]     at org.jgroups.blocks.cs.TcpServer$Acceptor.handleAccept(TcpServer.java:132)
[Server:server-two]     at org.jgroups.blocks.cs.TcpServer$Acceptor.run(TcpServer.java:117)
[Server:server-two]     at java.lang.Thread.run(Thread.java:745)

========================================================================================

Once this state occurs, only a restart of the whole domain (master & slave) fixes the issue and replication starts working again.

Can you please advise on this issue? It is critical for us: if replication stops working, the high availability of our application will not work either.

Attached are the console logs of both the master and slave servers,
along with the domain and host configuration files.

Thanks,

Eli

1. Re: Infinispan server: Cache entries are not replicated although cache is in mode replicated async

ez1234 Jun 22, 2016 4:08 AM (in response to ez1234)

Any updates on this issue?
2. Re: Infinispan server: Cache entries are not replicated although cache is in mode replicated async

nadirx Jun 22, 2016 4:30 AM (in response to ez1234)

We have never seen this issue happen before. The error message looks like something is interfering with port 7600, although these spurious connections should be ignored. You mention high load. Is GC interfering with any of the nodes? Can you tell us what flags you're using for the JVMs?
3. Re: Infinispan server: Cache entries are not replicated although cache is in mode replicated async

ez1234 Jun 22, 2016 10:24 AM (in response to nadirx)

Hi,

These errors in the log did not appear while the system was under load.
What we did was insert ~2 million records into the cache under very high load.

After the insertion had finished, we started seeing this even when no load was applied to the system.
We have also seen this without any load on the system at all.

I used the default JVM flags and just increased the heap to 10GB. You can see this in the attached configurations and in the console log.

As far as we know, no other process was using this port on the machine.
4. Re: Infinispan server: Cache entries are not replicated although cache is in mode replicated async

ez1234 Jun 22, 2016 10:26 AM (in response to ez1234)

domain.conf java options we use:

   JAVA_OPTS="-Xms1g -Xmx1g -XX:MaxMetaspaceSize=256m -Djava.net.preferIPv4Stack=true"
   JAVA_OPTS="$JAVA_OPTS -Djboss.modules.system.pkgs=$JBOSS_MODULES_SYSTEM_PKGS -Djava.awt.headless=true"
   JAVA_OPTS="$JAVA_OPTS -Dorg.jboss.server.bootstrap.maxThreads=500"
5. Re: Infinispan server: Cache entries are not replicated although cache is in mode replicated async

nadirx Jun 22, 2016 10:41 AM (in response to ez1234)

You say 10GB, but the switches say 1GB. You should probably monitor the instances with a tool like visualvm/jconsole or enable GC logging (-verbose:gc) and see what's going on.
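
To see what heap a JVM actually got, and how much time it has spent in GC, the standard `java.lang.management` API can be queried. This is only a minimal sketch: the class name `JvmHeapReport` is made up for illustration, and it reports on the JVM it runs in, so for the Infinispan server processes the same values would have to be read through JConsole/VisualVM or remote JMX rather than by running this as-is.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Hypothetical helper: reports the effective max heap and accumulated GC time. */
final class JvmHeapReport {

    /** Maximum heap the running JVM will attempt to use, in MiB. */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    /** Total time spent in garbage collection so far, summed over all collectors, in ms. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 if the collector does not report it
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("Max heap (MiB): " + maxHeapMiB());
        System.out.println("Total GC time (ms): " + totalGcMillis());
    }
}
```

A max heap near 1024 MiB would mean the -Xmx1g switch is in effect; a value near 10240 MiB would mean the 10 GB setting is.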
6. Re: Infinispan server: Cache entries are not replicated although cache is in mode replicated async

ez1234 Jun 29, 2016 1:00 AM (in response to nadirx)

Hi Tristan,

1. I still do not understand how GC monitoring is related to the replication issue, or to the exception in the log:

[Server:server-one] 19:33:35,304 WARN [org.jgroups.protocols.TCP] (TcpServer.Acceptor [7600],null) JGRP000006: failed accepting connection from peer: java.net.SocketException: BaseServer.TcpConnection.readPeerAddress(): cookie read by 192.168.118.51:7600 does not match own cookie; terminating connection

This issue also happens with no load at all.

2. In the domain.xml file we have configured 10g in the server group. See below:

    <server-groups>
        <server-group name="cluster" profile="clustered">
            <jvm name="default">
                <heap size="10g" max-size="10g"/>
            </jvm>
            <socket-binding-group ref="clustered-sockets"/>
            <deployments>
                <deployment name="fn-ispn-custom-plugins.jar" runtime-name="fn-ispn-custom-plugins.jar">
                </deployment>
            </deployments>
        </server-group>
    </server-groups>

Please advise.
7. Re: Infinispan server: Cache entries are not replicated although cache is in mode replicated async

nadirx Jul 4, 2016 8:45 AM (in response to ez1234)

Eli Z,
I'm convinced something is trying to connect on port 7600 of your cluster. Have you simply tried changing the port?
8. Re: Infinispan server: Cache entries are not replicated although cache is in mode replicated async

ez1234 Jul 6, 2016 7:48 AM (in response to nadirx)

Hi,

We have not yet tried changing ports. We will.
But do you think this issue, if confirmed, could be the cause of the replication problem we have seen on the cluster?

Thanks,
Eli
9. Re: Infinispan server: Cache entries are not replicated although cache is in mode replicated async

ez1234 Jul 11, 2016 3:14 AM (in response to ez1234)

Hi Tristan

Do you think the port issue, if confirmed, could be the cause of the replication problem we have seen on the cluster?

Thanks,
Eli
10. Re: Infinispan server: Cache entries are not replicated although cache is in mode replicated async

wdfink Jul 11, 2016 4:44 AM (in response to ez1234)

Possibly. If something is sending messages to port 7600, Infinispan/JGroups should ignore them, but each incoming message still has to be checked. That burdens the instance (CPU and memory) and can cause issues.

For a stable system you should investigate this and configure your Infinispan or the other system correctly.
11. Re: Infinispan server: Cache entries are not replicated although cache is in mode replicated async

ez1234 Jul 11, 2016 6:10 AM (in response to wdfink)

Hi,

The issue is that even after the interference on the port stops, the Infinispan server does not recover from this condition, and new entries added on one node are not replicated to the other.

Can you advise on this issue?

Thanks,
Eli
12. Re: Infinispan server: Cache entries are not replicated although cache is in mode replicated async

wdfink Jul 11, 2016 8:16 AM (in response to ez1234)

If you start the servers from scratch, do you see this warning?
WARN [org.jgroups.protocols.TCP] (TcpServer.Acceptor [7600],null) JGRP000006: failed accepting connection from peer: java.net.SocketException: BaseServer.TcpConnection.readPeerAddress(): cookie read by 192.168.118.51:7600 does not match own cookie; terminating connection

As long as you have such JGroups warnings, I don't think you have a working cluster, and therefore no replication.
Both servers should start without warnings, and you should see that the caches are synchronized when you start the second instance. The 'coordinator' node will log messages that a new member was detected and that a rebalance for each cache was started and completed successfully.
13. Re: Infinispan server: Cache entries are not replicated although cache is in mode replicated async

ez1234 Jul 12, 2016 3:12 AM (in response to wdfink)

After restarting the servers we no longer see the warning messages, and the replication problem is gone.

But our concern is why this happened.
The servers initially started OK with no warnings. After a while we started seeing these warnings (you can see them in the "replication bug.zip" file attached to the original message), then replication stopped working and we had to restart the whole cluster.

So we are concerned this might happen again in a production environment.

What is your proposed solution for this issue?
14. Re: Infinispan server: Cache entries are not replicated although cache is in mode replicated async

ez1234 Jul 20, 2016 9:55 AM (in response to ez1234)
Hi,

Today we encountered the same error again, and this time it happened without any errors in the logs. The port 7600 "interference" errors did not occur.
I have attached the logs.

So this error is not related to some process trying to connect on the same port.

It looks as if replication does not work in some scenario.

What can be the cause of this bug?
Is there a workaround that does not require restarting the whole cluster or one of the nodes, that is, a way to make the nodes hold replicated, identical data again?
If we restart the whole cluster or just one of the nodes while the data is not replicated, we will lose data.

Please advise, as this raises concerns for our production environment.

Thanks

replication error logs.zip 54.0 KB
