
    How to shut down an embedded Infinispan node in a clustered environment without losing data?

    mgeorg

      Hello,

       

      How can I shut down an embedded Infinispan node in a cluster without losing data?

      The problem is as follows:

      We would like to use Infinispan 7.2.1 as a clustered and distributed key/value store

      for about 10 to 20 million keys in one cache; the value sizes range from 1 KB to 128 KB.

      At the moment, I'm testing in a cluster with three nodes and I would like to shut down

      one node and restart it after a while, without losing data. In the documentation

      there is no explanation of how to achieve this. When testing, I send data to the cluster

      with an external program, generating random keys with UUIDs as values. The data is sent to

      all nodes, and after it is saved in the Infinispan key/value store, it is sent back for verification.

      When shutting down the whole cluster, the data is saved on disk, so after that we can

      compare the saved data against the returned data to see whether data loss has occurred.

       

      Data loss does happen when shutting down only one node and stopping the other two nodes after

      a while; in one example run, 132 keys out of 4,882,752 in the disk store were missing.

      When shutting down, I got exceptions like "failed to process local variable for Key='xyz': could

      not lock key 'xyz' because of: Timed out waiting for topology 5" for about 90 keys, nearly 15

      seconds after stopping the node (I think this is the default value for 'remote-timeout' in

      'distributed-cache' configuration).
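
      For reference, that timeout can be set explicitly on the cache definition; a sketch against the 7.1 schema (15000 ms appears to be the default):

         <distributed-cache name="KeyValueProvider" owners="2" mode="SYNC" segments="60" remote-timeout="60000">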

       

       

      The cache configuration is as follows:

      <infinispan

              xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

              xsi:schemaLocation="urn:infinispan:config:7.1 http://www.infinispan.org/schemas/infinispan-config-7.1.xsd"

              xmlns="urn:infinispan:config:7.1">

       

         <!-- defining the network communication via JGroups -->

         <jgroups>

            <stack-file name="extern-file" path="../config/jgroups-udp.xml" />

         </jgroups>

       

         <cache-container default-cache="default" statistics="false">

            <transport stack="extern-file" lock-timeout="240000" cluster="RSS Test" machine="c1" rack="2"/>  

            <distributed-cache name="KeyValueProvider" owners="2" mode="SYNC" segments="60">

               <partition-handling enabled="true"/>

               <state-transfer enabled="true" timeout="240000" chunk-size="10240" await-initial-transfer="true"/>

               <locking isolation="REPEATABLE_READ" acquire-timeout="120000" concurrency-level="64"/>

               <transaction mode="NON_XA" locking="PESSIMISTIC" stop-timeout="60000" complete-timeout="120000" transaction-manager-lookup="org.infinispan.transaction.lookup.GenericTransactionManagerLookup"/>

               <store-as-binary keys="true" values="true"/>

            </distributed-cache>

         </cache-container>

      </infinispan>

       

       

      Code for cache declaration is:

       

        private final EmbeddedCacheManager                            mCacheManager;

        private final AdvancedCache<String,ExternalizableCacheObject> mCache;

       

      Cache creation is like this, without loading data:

       

        mCacheManager = new DefaultCacheManager(initializationFile);  // initializationFile contains the name of the cache configuration file

        Cache<String,ExternalizableCacheObject> cache = mCacheManager.getCache("KeyValueProvider");  // cache name from the XML configuration
        mCache = cache.getAdvancedCache().withFlags(Flag.SKIP_CACHE_LOAD);

       

      Cache stopping is:

       

        mCache.stop();

        mCacheManager.stop();

       

       

       

      Is there something missing in the code or the configuration?

       

      Thanks for any advice and answers,

      Michael

        • 1. Re: How to shut down an embedded Infinispan node in a clustered environment without losing data?
          wdfink

          If you stop the cache correctly you should see rebalancing on the remaining nodes and no messages that a node is lost. In this case you should not lose data.

          But you say you store the data to disc; how is that done, as you have no store defined?

          So you might show the relevant part of the log for each node when you shut it down.

          Also, what kind of application uses the Infinispan cache? Is it standalone or embedded in a server application?

          • 2. Re: How to shut down an embedded Infinispan node in a clustered environment without losing data?
            mgeorg

            Hi Wolf-Dieter,

            first of all, thanks for your answer.

            Wolf-Dieter Fink wrote:

             

            If you stop the cache correctly you should see rebalancing on the remaining nodes and no messages that a node is lost. In this case you should not lose data.

            I got no message that a node is lost; I got exceptions that locking from the other nodes for keys belonging to the stopped node timed out. And these keys are missing in the stored data.

            I also see no rebalancing; maybe the logging is not correctly configured.

            But you say you store the data to disc; how is that done, as you have no store defined?

            We have written our own storing and loading routine: every node stores only the primary keys it holds, grouped by segment id, and reads only the segments belonging to it (primary and backup, if defined).

            The data storage is located on a network disc to which every node has access. When we shut down one node, we first save the data of that node - knowing that this is only a snapshot - and then we call the

            stop methods on the cache and the cache manager. After 15 seconds I get the described exceptions on the remaining nodes: they try to lock keys and run into timeouts because the destination node has gone.

             

            When we shut down the whole cluster, all nodes write their primary keys to the network storage, and so the cached data is saved. The nodes are synchronized over a small multicast protocol. Roughly, the per-node save step looks like the sketch below.
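
            A simplified sketch of that save step (our real code is more involved; ConsistentHash and Address are the org.infinispan types, writeToNetworkDisc() is our own routine, and keySet() returns only the locally held keys in 7.x):

               ConsistentHash ch = mCache.getDistributionManager().getReadConsistentHash();
               Address self = mCache.getRpcManager().getAddress();
               for (String key : mCache.keySet()) {
                  if (self.equals(ch.locatePrimaryOwner(key))) {
                     int segment = ch.getSegment(key);                   // segment id used as the storage bucket
                     writeToNetworkDisc(segment, key, mCache.get(key));  // our own persistence routine
                  }
               }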

            So you might show the relevant part of the log for each node when you shut it down.

            How can we configure the logging system to see the output from Infinispan?

             

            Also, what kind of application uses the Infinispan cache? Is it standalone or embedded in a server application?

            We are using Infinispan as an embedded key/value cache in a server application. The same software runs on every node in the cluster, and all nodes together form one cache.

             

             

            But the question again is: is it enough to call the "stop()" methods on the cache and the cache manager to shut down a node gracefully, or must we do something more?

            The documentation is not clear at this point.

             

            Michael

            • 3. Re: How to shut down an embedded Infinispan node in a clustered environment without losing data?
              wdfink

              You should see the clustered cache start to rebalance as a node goes down; with the number of owners set to 2, all data needs to be copied to a new backup/primary.

              This should be seen as CLUSTER messages on the coordinator node.

               

              About your implementation I'm not sure; I would recommend implementing a customized cache store that integrates with ISPN via the ISPN persistence API. A rough skeleton is sketched below.
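
              Such a store could implement the persistence SPI; an untested sketch (class name and method bodies are placeholders, raw types for brevity):

                 import org.infinispan.marshall.core.MarshalledEntry;
                 import org.infinispan.persistence.spi.ExternalStore;
                 import org.infinispan.persistence.spi.InitializationContext;

                 public class NetworkDiscStore implements ExternalStore {

                    @Override public void init(InitializationContext ctx) { /* keep ctx, read store properties */ }
                    @Override public void start() { /* open the shared network disc */ }
                    @Override public void stop() { /* flush and close */ }

                    @Override public void write(MarshalledEntry entry) { /* persist key/value */ }
                    @Override public boolean delete(Object key) { /* remove from disc */ return false; }
                    @Override public MarshalledEntry load(Object key) { /* read from disc */ return null; }
                    @Override public boolean contains(Object key) { return load(key) != null; }
                 }

              It would then be registered in the cache configuration with a <persistence><store class="..."/> element, so ISPN calls it on every write and there is no separate snapshot step at shutdown.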

               

              So attaching the logs and explaining a bit more seems necessary to me for further help.

              • 4. Re: How to shut down an embedded Infinispan node in a clustered environment without losing data?
                mgeorg

                Hi Wolf-Dieter,

                 

                I now have enough log entries from Infinispan; I have attached the log files to this answer.

                 

                Some explanations:

                The cluster nodes were started at 12:39:55 and are named c0, c1 and c2. There is no data in the file

                storage, so nothing has to be read (message at 12:40:11/12 in all log files).

                 

                All three nodes are contacted by external programs which send key/value pairs to them (each node by 10 programs,

                see messages at 12:46:40 in all log files).

                 

                While the data is being sent, the node c2 is stopped; see the message at 12:48:13 in "c2.log".

                The data of this node is saved and the cache is stopped at 12:48:24.

                 

                At 12:48:39 exception logging starts on the nodes c0 and c1.

                 

                The one thing I'm wondering about is that between 12:48:24 and 12:48:39 there is no message that

                a node has gone or that the topology has changed in the log files "c0.log" / "c1.log". Therefore I think there

                is something missing in our code and calling the method "stop()" is not enough.

                 

                Hope this clarifies some questions.

                 

                Thanks,

                Michael

                • 5. Re: How to shut down an embedded Infinispan node in a clustered environment without losing data?
                  rvansa

                  It seems that the shutting down node does not send its leave request properly; in that case, the JGroups failure-detection protocols need some time to detect that the node has died and kick in only at 12:48:48 (message New view accepted: [c0-16821|2] (2) [c0-16821, c1-20448]). Note that TimeoutExceptions do not trigger failure detection - with your settings the requests fail after 15 seconds while JGroups is configured for ~25 seconds.
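
                  That timing comes from the failure-detection section of your jgroups-udp.xml, typically something like this (values here are only illustrative, check your own stack file):

                     <FD_SOCK/>
                     <FD_ALL timeout="25000" interval="5000"/>
                     <VERIFY_SUSPECT timeout="1500"/>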

                   

                  In order to investigate what went wrong (why the node did not terminate cleanly), logs for the org.jgroups packages on trace level are necessary. Thread stack dumps could also be helpful.
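
                  With log4j, for example, that would be a line like this (assuming a log4j.properties setup):

                     log4j.logger.org.jgroups=TRACE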

                   

                  Anyway, even after abrupt termination of one of the nodes, all the already loaded data should persist. If the put() operation did not throw any exception, the data will be there.

                  • 6. Re: How to shut down an embedded Infinispan node in a clustered environment without losing data?
                    mgeorg

                    Hi Radim,

                     

                    I have reproduced the test with the "org.jgroups=TRACE" logging option set; the amount of log data is quite large, so only 3 processes produce data this time.

                    You should search for "KeyValueProvider" to get the relevant places / messages. The node c2 was stopped at 11:20:15.

                    It seems that the shutting down node does not send its leave request properly; in that case, the JGroups failure-detection protocols need some time to detect that the node has died and kick in only at 12:48:48 (message New view accepted: [c0-16821|2] (2) [c0-16821, c1-20448]).

                    If this is true, is there a workaround we can use?

                    Anyway, even after abrupt termination of one of the nodes, all the already loaded data should persist. If the put() operation did not throw any exception, the data will be there.

                    I forgot to mention that we follow a strict scheme for getting the value of a key: lock(), get(), put(), unlock(). This sequence is always called, for every data key, even if the key is not yet in the cache. So the exceptions I found in the log file are produced by the lock() call; in code it is roughly the sketch below.
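
                    The sequence, simplified (tm is the javax.transaction.TransactionManager from the cache, update() stands in for our merge logic; with PESSIMISTIC locking the commit releases the lock, which is our "unlock"):

                       TransactionManager tm = mCache.getTransactionManager();
                       tm.begin();
                       try {
                          mCache.lock(key);                        // this is the call that produces the timeouts
                          ExternalizableCacheObject old = mCache.get(key);
                          mCache.put(key, update(old));
                          tm.commit();                             // releases the lock ("unlock")
                       } catch (Exception e) {
                          tm.rollback();
                          throw e;
                       }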

                     

                     

                    Okay, another try. This time I made two tests, one without data (log files c0_noData.log, c1_noData.log, c2_noData.log), where the node c2 is stopped at 10:04:36, and the other with data but only for a short period, where node c2 is stopped at 11:19:18 (search for

                    'received shutdown signal, terminating system...'). One thing I noticed is that in the test with data I could not find a message like 'Node c2-50274 leaving cache KeyValueProvider' (which you will find in the test without data).

                    Hope the logs will help.

                     

                    Thanks,

                    Michael

                    • 7. Re: How to shut down an embedded Infinispan node in a clustered environment without losing data?
                      mgeorg

                      Hi Radim,

                       

                      maybe you didn't notice that I added new logs to the last message, sized to fit within the maximum allowed upload. So please look at the logs again and give me a hint about what is going wrong.

                       

                      Thanks a lot,

                      Michael

                      • 8. Re: How to shut down an embedded Infinispan node in a clustered environment without losing data?
                        rvansa

                        Sorry, I've missed the notification. So it seems that c2 is waiting for the transactions to finish: look for 'Wait for on-going transactions to finish for 60 seconds.' - eventually it should log 'All transactions terminated' as in the no-data case. Here the node has not yet got as far as disconnecting from the JGroups channel. Btw., it would be better to log the thread name and class (logger name) as well; as it is, it's quite hard to track the progress of a given thread.
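
                        With log4j that could be a conversion pattern along these lines (FILE is just an assumed appender name):

                           log4j.appender.FILE.layout=org.apache.log4j.PatternLayout
                           log4j.appender.FILE.layout.ConversionPattern=%d %-5p [%c] (%t) %m%n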

                         

                        Some transactions are timing out waiting for topology 5, but I can't see exactly why. Could you try to reproduce (with data) with TRACE level on org.infinispan instead? (You can omit org.infinispan.marshall and org.infinispan.commons.marshall.) JGroups trace logs are no longer necessary either.

                        • 9. Re: How to shut down an embedded Infinispan node in a clustered environment without losing data?
                          mgeorg

                          Hi Radim,

                           

                          I've reproduced the test and appended the logs with TRACE level on org.infinispan. At line 70242 in log "c2_1.log" the termination of node c2 starts.

                           

                          Thanks,

                          Michael

                          • 10. Re: How to shut down an embedded Infinispan node in a clustered environment without losing data?
                            mgeorg

                            Hi Radim,

                             

                            do you need some more logs? Any idea yet what's going wrong?

                             

                            Thanks,

                            Michael

                            • 11. Re: How to shut down an embedded Infinispan node in a clustered environment without losing data?
                              rvansa

                              Sorry for the late reply, I don't know how I could miss the notification for the second time.

                               

                              Looking into the logs and then discussing with the developers, there's this issue: [ISPN-5507] Transactions committed immediately before cache stop can block shutdown - JBoss Issue Tracker

                              That causes a delay when stopping the node, and the exceptions on the other nodes. The stopping node plays dead and does not react to any commands, so the RPCs time out. Please comment on the JIRA if you want it fixed soon; user reports hitting an issue can convince developers to prioritize it.

                               

                              Anyway, about the data loss: can you point out some data that have been successfully stored in the cache but are not available after that?

                              • 12. Re: How to shut down an embedded Infinispan node in a clustered environment without losing data?
                                mgeorg

                                Hi Radim,

                                 

                                I have reproduced the test again, because I had deleted the data output from the last test, the one you have the logfiles from.

                                I have again attached the logfiles of the three nodes and also the output of our comparison tool. This tool compares the

                                cache data saved on disk against the response from each node. There you will find some keys which are not stored in the

                                cache but must have been processed because they are in the response file.

                                 

                                Thanks,

                                Michael

                                • 13. Re: How to shut down an embedded Infinispan node in a clustered environment without losing data?
                                  mgeorg

                                  Hi,

                                   

                                  maybe I have found a workaround for my problem: I have added one more line of code; before stopping the cache I stop the transport.

                                   

                                      mCache.getRpcManager().getTransport().stop();

                                      mCache.stop();

                                      mCacheManager.stop();

                                   

                                  With this change I can't detect any data loss in my test situation. I hope that this will stop only the transport for this cache and not the transport for the whole node.

                                  Can you confirm this solution as a valid and feasible workaround?

                                   

                                  Greets,

                                  Michael

                                  • 14. Re: How to shut down an embedded Infinispan node in a clustered environment without losing data?
                                    nadirx

                                    Hi Michael,

                                     

                                    the transport is global and shared between all caches in a container, so your call stops communication for the whole node. You could think about stopping rebalancing for the cache instead; a sketch follows below.
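
                                    As far as I know, in 7.x the rebalancing toggle is cluster-wide; it is exposed as the JMX attribute "rebalancingEnabled" and, programmatically, via an internal component (a sketch, not a guaranteed public API):

                                       import org.infinispan.topology.LocalTopologyManager;

                                       LocalTopologyManager ltm = mCacheManager.getGlobalComponentRegistry()
                                             .getComponent(LocalTopologyManager.class);
                                       ltm.setRebalancingEnabled(false);   // re-enable with true once the node is back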
