Join problems in DIST-SYNC cluster
gernot.bauer Sep 8, 2011 11:56 AMHi!
I am experiencing problems when trying to increase the size of a dist sync cluster. My configuration is as follows:
- Infinispan 4.2.1-FINAL
- DIST-SYNC
- Hotrod
- Running on EC2 using gossip
- One BerkeleyDB per node as cache store on elastic block storage
When I restarted a cluster with data in the cache store, I gradually increased the cluster size to 4 nodes. Upon starting the 4th node, I received the following exception (IP addresses are masked):
2011-09-07 09:33:41,051 ERROR [org.infinispan.remoting.rpc.RpcManagerImpl] (Rehasher-ip-0.0.0.4-30751-Worker-1) unexpected error while replicating org.infinispan.loaders.CacheLoaderException: error loading all entries at org.infinispan.loaders.bdbje.BdbjeCacheStore.convertToCacheLoaderException(BdbjeCacheStore.java:553) at org.infinispan.loaders.bdbje.BdbjeCacheStore.loadAllKeys(BdbjeCacheStore.java:461) at org.infinispan.loaders.decorators.AbstractDelegatingStore.loadAllKeys(AbstractDelegatingStore.java:100) at org.infinispan.commands.control.RehashControlCommand.pullStateForJoin(RehashControlCommand.java:200) at org.infinispan.commands.control.RehashControlCommand.perform(RehashControlCommand.java:146) at org.infinispan.remoting.InboundInvocationHandlerImpl.handleInternal(InboundInvocationHandlerImpl.java:142) at org.infinispan.remoting.InboundInvocationHandlerImpl.handleWithWaitForBlocks(InboundInvocationHandlerImpl.java:156) at org.infinispan.remoting.InboundInvocationHandlerImpl.handleWithRetry(InboundInvocationHandlerImpl.java:246) at org.infinispan.remoting.InboundInvocationHandlerImpl.handle(InboundInvocationHandlerImpl.java:129) at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.executeCommand(CommandAwareRpcDispatcher.java:159) at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.handle(CommandAwareRpcDispatcher.java:144) at org.jgroups.blocks.RequestCorrelator.handleRequest(RequestCorrelator.java:577) at org.jgroups.blocks.RequestCorrelator.receiveMessage(RequestCorrelator.java:488) at org.jgroups.blocks.RequestCorrelator.receive(RequestCorrelator.java:364) at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.up(MessageDispatcher.java:770) at org.jgroups.JChannel.up(JChannel.java:1484) at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:1074) at org.jgroups.protocols.pbcast.FLUSH.up(FLUSH.java:429) at org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER.up(STREAMING_STATE_TRANSFER.java:263) at org.jgroups.protocols.FRAG2.up(FRAG2.java:189) at org.jgroups.protocols.FlowControl.up(FlowControl.java:418) at org.jgroups.protocols.FlowControl.up(FlowControl.java:418) at org.jgroups.protocols.pbcast.GMS.up(GMS.java:891) at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:246) at org.jgroups.protocols.UNICAST.handleDataReceived(UNICAST.java:575) at org.jgroups.protocols.UNICAST.up(UNICAST.java:294) at org.jgroups.protocols.pbcast.NAKACK.up(NAKACK.java:703) at org.jgroups.protocols.VERIFY_SUSPECT.up(VERIFY_SUSPECT.java:133) at org.jgroups.protocols.FD.up(FD.java:275) at org.jgroups.protocols.FD_SOCK.up(FD_SOCK.java:275) at org.jgroups.protocols.MERGE2.up(MERGE2.java:209) at org.jgroups.protocols.Discovery.up(Discovery.java:291) at org.jgroups.protocols.TP.passMessageUp(TP.java:1102) at org.jgroups.protocols.TP$IncomingPacket.handleMyMessage(TP.java:1658) at org.jgroups.protocols.TP$IncomingPacket.run(TP.java:1640) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) Caused by: java.lang.IllegalStateException: This method can only be called after associating the current thread with a context at org.infinispan.context.InvocationContextContainerImpl.getInvocationContext(InvocationContextContainerImpl.java:148) at org.infinispan.eviction.EvictionManagerImpl.getInvocationContext(EvictionManagerImpl.java:164) at org.infinispan.eviction.EvictionManagerImpl.onEntryEviction(EvictionManagerImpl.java:146) at org.infinispan.eviction.EvictionManagerImpl.onEntryEviction(EvictionManagerImpl.java:140) at org.infinispan.container.DefaultDataContainer$DefaultEvictionListener.onEntryEviction(DefaultDataContainer.java:179) at org.infinispan.util.concurrent.BoundedConcurrentHashMap$Segment.notifyEvictionListener(BoundedConcurrentHashMap.java:1176) at org.infinispan.util.concurrent.BoundedConcurrentHashMap$Segment.get(BoundedConcurrentHashMap.java:871) at org.infinispan.util.concurrent.BoundedConcurrentHashMap.get(BoundedConcurrentHashMap.java:1429) at org.infinispan.container.DefaultDataContainer.peek(DefaultDataContainer.java:95) at org.infinispan.container.DefaultDataContainer.containsKey(DefaultDataContainer.java:130) at org.infinispan.util.ReadOnlyDataContainerBackedKeySet.contains(ReadOnlyDataContainerBackedKeySet.java:37) at org.infinispan.loaders.bdbje.BdbjeCacheStore.loadAllKeys(BdbjeCacheStore.java:458) ... 36 more 2011-09-07 09:33:41,064 ERROR [org.infinispan.distribution.JoinTask] (Rehasher-ip-0.0.0.4-30751) Caught exception! Aborting join. java.util.concurrent.ExecutionException: org.infinispan.CacheException: org.infinispan.loaders.CacheLoaderException: error loading all entries at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222) at java.util.concurrent.FutureTask.get(FutureTask.java:83) at org.infinispan.distribution.JoinTask.performRehash(JoinTask.java:111) at org.infinispan.distribution.RehashTask.call(RehashTask.java:75) at org.infinispan.distribution.RehashTask.call(RehashTask.java:41) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) Caused by: org.infinispan.CacheException: org.infinispan.loaders.CacheLoaderException: error loading all entries at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:115) at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:126) at org.infinispan.distribution.RehashTask$StateGrabber.call(RehashTask.java:139) at org.infinispan.distribution.RehashTask$StateGrabber.call(RehashTask.java:125) ... 5 more Caused by: org.infinispan.loaders.CacheLoaderException: error loading all entries at org.infinispan.loaders.bdbje.BdbjeCacheStore.convertToCacheLoaderException(BdbjeCacheStore.java:553) at org.infinispan.loaders.bdbje.BdbjeCacheStore.loadAllKeys(BdbjeCacheStore.java:461) at org.infinispan.loaders.decorators.AbstractDelegatingStore.loadAllKeys(AbstractDelegatingStore.java:100) at org.infinispan.commands.control.RehashControlCommand.pullStateForJoin(RehashControlCommand.java:200) at org.infinispan.commands.control.RehashControlCommand.perform(RehashControlCommand.java:146) at org.infinispan.remoting.InboundInvocationHandlerImpl.handleInternal(InboundInvocationHandlerImpl.java:142) at org.infinispan.remoting.InboundInvocationHandlerImpl.handleWithWaitForBlocks(InboundInvocationHandlerImpl.java:156) at org.infinispan.remoting.InboundInvocationHandlerImpl.handleWithRetry(InboundInvocationHandlerImpl.java:246) at org.infinispan.remoting.InboundInvocationHandlerImpl.handle(InboundInvocationHandlerImpl.java:129) at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.executeCommand(CommandAwareRpcDispatcher.java:159) at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.handle(CommandAwareRpcDispatcher.java:144) at org.jgroups.blocks.RequestCorrelator.handleRequest(RequestCorrelator.java:577) at org.jgroups.blocks.RequestCorrelator.receiveMessage(RequestCorrelator.java:488) at org.jgroups.blocks.RequestCorrelator.receive(RequestCorrelator.java:364) at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.up(MessageDispatcher.java:770) at org.jgroups.JChannel.up(JChannel.java:1484) at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:1074) at org.jgroups.protocols.pbcast.FLUSH.up(FLUSH.java:429) at org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER.up(STREAMING_STATE_TRANSFER.java:263) at org.jgroups.protocols.FRAG2.up(FRAG2.java:189) at org.jgroups.protocols.FlowControl.up(FlowControl.java:418) at org.jgroups.protocols.FlowControl.up(FlowControl.java:418) at org.jgroups.protocols.pbcast.GMS.up(GMS.java:891) at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:246) at org.jgroups.protocols.UNICAST.handleDataReceived(UNICAST.java:575) at org.jgroups.protocols.UNICAST.up(UNICAST.java:294) at org.jgroups.protocols.pbcast.NAKACK.up(NAKACK.java:703) at org.jgroups.protocols.VERIFY_SUSPECT.up(VERIFY_SUSPECT.java:133) at org.jgroups.protocols.FD.up(FD.java:275) at org.jgroups.protocols.FD_SOCK.up(FD_SOCK.java:275) at org.jgroups.protocols.MERGE2.up(MERGE2.java:209) at org.jgroups.protocols.Discovery.up(Discovery.java:291) at org.jgroups.protocols.TP.passMessageUp(TP.java:1102) at org.jgroups.protocols.TP$IncomingPacket.handleMyMessage(TP.java:1658) at org.jgroups.protocols.TP$IncomingPacket.run(TP.java:1640) ... 3 more Caused by: java.lang.IllegalStateException: This method can only be called after associating the current thread with a context at org.infinispan.context.InvocationContextContainerImpl.getInvocationContext(InvocationContextContainerImpl.java:148) at org.infinispan.eviction.EvictionManagerImpl.getInvocationContext(EvictionManagerImpl.java:164) at org.infinispan.eviction.EvictionManagerImpl.onEntryEviction(EvictionManagerImpl.java:146) at org.infinispan.eviction.EvictionManagerImpl.onEntryEviction(EvictionManagerImpl.java:140) at org.infinispan.container.DefaultDataContainer$DefaultEvictionListener.onEntryEviction(DefaultDataContainer.java:179) at org.infinispan.util.concurrent.BoundedConcurrentHashMap$Segment.notifyEvictionListener(BoundedConcurrentHashMap.java:1176) at org.infinispan.util.concurrent.BoundedConcurrentHashMap$Segment.get(BoundedConcurrentHashMap.java:871) at org.infinispan.util.concurrent.BoundedConcurrentHashMap.get(BoundedConcurrentHashMap.java:1429) at org.infinispan.container.DefaultDataContainer.peek(DefaultDataContainer.java:95) at org.infinispan.container.DefaultDataContainer.containsKey(DefaultDataContainer.java:130) at org.infinispan.util.ReadOnlyDataContainerBackedKeySet.contains(ReadOnlyDataContainerBackedKeySet.java:37) at org.infinispan.loaders.bdbje.BdbjeCacheStore.loadAllKeys(BdbjeCacheStore.java:458) ... 36 more 2011-09-07 09:33:41,067 INFO [org.infinispan.distribution.JoinTask] (Rehasher-ip-0.0.0.4-30751) ip-0.0.0.4-30751 aborted join rehash after 130 milliseconds!
When I stop this node and try to restart it again, I see the following log output, the last lines continuing forever (well, at least for a couple of minutes - after that, I stopped the node):
2011-09-07 09:37:07,970 DEBUG [org.infinispan.distribution.JoinTask] (Rehasher-ip-0-0-0-4-2899) Retrieved old consistent hash address list null 2011-09-07 09:37:08,369 DEBUG [org.infinispan.loaders.CacheLoaderManagerImpl] (InfinispanServer-Main) Preloaded 140 keys in 461 milliseconds 2011-09-07 09:37:09,053 DEBUG [org.infinispan.distribution.JoinTask] (Rehasher-ip-0-0-0-4-2899) Retrieved old consistent hash address list null 2011-09-07 09:37:10,115 DEBUG [org.infinispan.distribution.JoinTask] (Rehasher-ip-0-0-0-4-2899) Retrieved old consistent hash address list null 2011-09-07 09:37:11,217 DEBUG [org.infinispan.distribution.JoinTask] (Rehasher-ip-0-0-0-4-2899) Retrieved old consistent hash address list null 2011-09-07 09:37:12,330 DEBUG [org.infinispan.distribution.JoinTask] (Rehasher-ip-0-0-0-4-2899) Retrieved old consistent hash address list null 2011-09-07 09:37:13,422 DEBUG [org.infinispan.distribution.JoinTask] (Rehasher-ip-0-0-0-4-2899) Retrieved old consistent hash address list null 2011-09-07 09:37:15,144 DEBUG [org.infinispan.distribution.JoinTask] (Rehasher-ip-0-0-0-4-2899) Retrieved old consistent hash address list null 2011-09-07 09:37:16,536 DEBUG [org.infinispan.distribution.JoinTask] (Rehasher-ip-0-0-0-4-2899) Retrieved old consistent hash address list null 2011-09-07 09:37:17,359 DEBUG [org.infinispan.distribution.JoinTask] (Rehasher-ip-0-0-0-4-2899) Retrieved old consistent hash address list null 2011-09-07 09:37:18,321 DEBUG [org.infinispan.distribution.JoinTask] (Rehasher-ip-0-0-0-4-2899) Retrieved old consistent hash address list null 2011-09-07 09:37:19,853 DEBUG [org.infinispan.distribution.JoinTask] (Rehasher-ip-0-0-0-4-2899) Retrieved old consistent hash address list null 2011-09-07 09:37:21,295 DEBUG [org.infinispan.distribution.JoinTask] (Rehasher-ip-0-0-0-4-2899) Retrieved old consistent hash address list null 2011-09-07 09:37:22,437 DEBUG [org.infinispan.distribution.JoinTask] (Rehasher-ip-0-0-0-4-2899) Retrieved old consistent hash address list null 2011-09-07 09:37:24,220 DEBUG [org.infinispan.distribution.JoinTask] (Rehasher-ip-0-0-0-4-2899) Retrieved old consistent hash address list null
If I restart the node, I get the same messages over and over (there's no difference whether or not I delete the local cache store in between).
I've attached the Infinispan and JGroups config I use on my nodes.
Does anyone have an idea of what is happening here? I'm willing to debug the cluster if I have to, but I'm not even sure where to start looking. Any help is appreciated.
-
ec2InfinispanJGroups.xml 2.0 KB
-
ec2InfinispanConfig.xml 5.7 KB