Wildfly 18: IndexOutOfBoundsException on unknown fork stack id
yurine Dec 9, 2019 1:17 PM
I'm investigating a weird error related to JGroups while upgrading from Wildfly 11.0.0.Final to Wildfly 18.0.1.Final.
I have ReplicatedHashMap instances bound to ForkChannel instances. ForkChannel instances are created by application code.
The error happens in the following scenario:
1. Node slave-1 is running. It calls methods on ReplicatedHashMap and everything works fine.
2. Node slave-2 is starting. The application is not deployed yet, but the node has already joined the JGroups cluster, and the view changes to [slave1:slave-1|1] (2) [slave1:slave-1, slave2:slave-2]
3. slave-1 calls some method of ReplicatedHashMap (with RequestOptions.timeout set to 30 seconds). Two responses arrive almost immediately: one from slave-1, which is fine, and one from slave-2, which contains an IndexOutOfBoundsException (in Rsp.getException). This step can repeat several times before step 4.
4. The application is deployed on slave-2. It creates a ForkChannel and starts receiving remote calls for ReplicatedHashMap. From this moment everything works fine again; the IndexOutOfBoundsException no longer occurs.
The exception in step 3 is the following:
Caused by: java.lang.IndexOutOfBoundsException: pos=0, limit=0
at org.jgroups.util.ByteArrayDataInputStream.checkBounds(ByteArrayDataInputStream.java:279)
at org.jgroups.util.ByteArrayDataInputStream.<init>(ByteArrayDataInputStream.java:28)
at org.jgroups.blocks.RequestCorrelator.replyFromBuffer(RequestCorrelator.java:463)
at org.jgroups.blocks.RequestCorrelator.handleResponse(RequestCorrelator.java:409)
at org.jgroups.blocks.RequestCorrelator.dispatch(RequestCorrelator.java:364)
at org.jgroups.blocks.RequestCorrelator.receiveMessageBatch(RequestCorrelator.java:327)
at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.up(MessageDispatcher.java:589)
at org.jgroups.JChannel.up(JChannel.java:796)
at org.jgroups.fork.ForkProtocolStack.up(ForkProtocolStack.java:162)
at org.jgroups.stack.Protocol.up(Protocol.java:338)
at org.jgroups.protocols.FORK.up(FORK.java:169)
at org.jgroups.stack.Protocol.up(Protocol.java:338)
at org.jgroups.stack.Protocol.up(Protocol.java:338)
at org.jgroups.protocols.FRAG3.up(FRAG3.java:191)
at org.jgroups.protocols.FlowControl.up(FlowControl.java:411)
at org.jgroups.stack.Protocol.up(Protocol.java:338)
at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:294)
at org.jgroups.protocols.UNICAST3.deliverBatch(UNICAST3.java:1041)
at org.jgroups.protocols.UNICAST3.removeAndDeliver(UNICAST3.java:850)
at org.jgroups.protocols.UNICAST3.handleBatchReceived(UNICAST3.java:816)
at org.jgroups.protocols.UNICAST3.up(UNICAST3.java:487)
at org.jgroups.protocols.pbcast.NAKACK2.up(NAKACK2.java:687)
at org.jgroups.stack.Protocol.up(Protocol.java:338)
at org.jgroups.protocols.FD_ALL.up(FD_ALL.java:215)
at org.jgroups.stack.Protocol.up(Protocol.java:338)
at org.jgroups.stack.Protocol.up(Protocol.java:338)
at org.jgroups.stack.Protocol.up(Protocol.java:338)
at org.jgroups.protocols.TP.passBatchUp(TP.java:1301)
at org.jgroups.util.MaxOneThreadPerSender$BatchHandlerLoop.passBatchUp(MaxOneThreadPerSender.java:284)
at org.jgroups.util.SubmitToThreadPool$BatchHandler.run(SubmitToThreadPool.java:136)
at org.jgroups.util.MaxOneThreadPerSender$BatchHandlerLoop.run(MaxOneThreadPerSender.java:273)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at org.jboss.as.clustering.jgroups.ClassLoaderThreadFactory.lambda$newThread$0(ClassLoaderThreadFactory.java:52)
at java.lang.Thread.run(Thread.java:748)
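The checkBounds failure at the top of the trace is consistent with slave-1 receiving a zero-length reply buffer and failing on the very first read. A minimal stand-alone model of that arithmetic (my own sketch, not the real org.jgroups.util.ByteArrayDataInputStream):

```java
// Minimal model of a bounded byte-array reader, mirroring the
// pos/limit check that produces "pos=0, limit=0" on an empty reply.
public class BoundsModel {
    static void checkBounds(int pos, int limit, int bytes) {
        // Reading 'bytes' bytes at offset 'pos' must stay within 'limit'.
        if (bytes < 0 || pos + bytes > limit)
            throw new IndexOutOfBoundsException("pos=" + pos + ", limit=" + limit);
    }

    public static void main(String[] args) {
        byte[] emptyReply = new byte[0];          // zero-length response payload
        try {
            checkBounds(0, emptyReply.length, 1); // attempt to read the first byte
        } catch (IndexOutOfBoundsException e) {
            System.out.println(e.getMessage());   // prints: pos=0, limit=0
        }
    }
}
```

So the exception message itself suggests there was simply nothing in the reply to parse.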
In Wildfly 11 the same scenario differs at step 3: the call times out after 30 seconds, and the response then has Rsp.wasReceived returning false.
That behavior looks reasonable and logical to me. I log the event as a WARN, and it's easy to understand when reading the logs alongside the cluster view change caused by the starting node.
Unlike the original ReplicatedHashMap, I do process exceptions received in Rsp (I once had to attach a debugger to find an exception that was delivered in a Rsp from a message addressee but never showed up anywhere).
That's why my logic interprets this situation as an error, treats it as a failure, and aborts/rolls back the higher-level operations.
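To make the distinction concrete, this is roughly how I classify responses, modeled here with a tiny stand-in class rather than the real org.jgroups.util.Rsp (the class and method names below are illustrative, not JGroups API):

```java
// Stand-in for the relevant parts of org.jgroups.util.Rsp, to show
// how I classify responses on the caller side.
public class RspCheck {
    static final class Rsp {
        final boolean received;
        final Throwable exception;
        Rsp(boolean received, Throwable exception) { this.received = received; this.exception = exception; }
        boolean wasReceived() { return received; }
        Throwable getException() { return exception; }
    }

    /** Returns a log-style classification for one response. */
    static String classify(Rsp rsp) {
        if (!rsp.wasReceived())
            return "WARN: no reply (likely a node joining/leaving)"; // Wildfly 11 behavior after timeout
        if (rsp.getException() != null)
            return "ERROR: remote call failed, aborting";            // my handling of Rsp exceptions
        return "OK";
    }

    public static void main(String[] args) {
        System.out.println(classify(new Rsp(false, null)));
        System.out.println(classify(new Rsp(true, new IndexOutOfBoundsException("pos=0, limit=0"))));
    }
}
```

With this policy, the Wildfly 18 behavior turns a transient startup condition into a hard failure.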
At first I thought something had changed, or that I had been doing something wrong from the beginning, so I verified my configuration and code several times. I also searched the Internet but found nothing similar, which kept me looking for a mistake of my own.
It was confusing that a local error appears in Rsp. I expected that only exceptions from the remote side would be delivered in a Rsp object. It wasn't easy to tell where the exception actually happened: here on the caller side, or on the remote side.
Enabling TRACE for JGroups didn't help much, but it did show that slave-2 actually receives the message and sends some response, which slave-1 then fails to process. Unfortunately FORK doesn't write TRACE logs, so I couldn't see which layer actually generates the problematic response.
I was also confused that I don't see the following warning in either Wildfly 18 or Wildfly 11:
log.warn("%s: fork-stack for id=%s not found; discarding message", local_addr, forkStackId) // it's in org.jgroups.protocols.FORK
So I had to do some tough debugging and found that the problematic response is generated by Wildfly's own UnknownForkHandler in JChannelFactory (which also doesn't log anything, although logging there could save a lot of time).
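As I understand the code path, it comes down to a dispatch by fork-stack id that falls back to an "unknown fork" handler when the id isn't registered yet. The following is a schematic model of that race, not the actual Wildfly/JGroups classes (all names here are my own):

```java
import java.util.HashMap;
import java.util.Map;

// Schematic model of FORK-style dispatch: known fork-stack ids go to
// their registered stack; an unknown id falls through to a fallback
// handler. On slave-2, before deployment, the registry is empty, so
// every request for the application's fork stack takes the fallback.
public class ForkDispatchModel {
    interface Handler { byte[] handle(byte[] request); }

    final Map<String, Handler> forkStacks = new HashMap<>();
    final Handler unknownForkHandler;

    ForkDispatchModel(Handler unknownForkHandler) { this.unknownForkHandler = unknownForkHandler; }

    byte[] dispatch(String forkStackId, byte[] request) {
        Handler h = forkStacks.getOrDefault(forkStackId, unknownForkHandler);
        return h.handle(request);
    }

    public static void main(String[] args) {
        // A fallback that answers with an empty payload: the caller's
        // response parser then fails with pos=0, limit=0.
        ForkDispatchModel node = new ForkDispatchModel(req -> new byte[0]);
        byte[] reply = node.dispatch("my-fork-stack", new byte[] {1, 2, 3});
        System.out.println("reply length = " + reply.length); // 0: nothing to parse
    }
}
```

The key difference from Wildfly 11 appears to be that the fallback replies at all, instead of silently discarding the message and letting the caller time out.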
It doesn't look like the exception can be caused by my configuration or code. So far it looks more like a bug in the Wildfly JGroups extension, doesn't it?
I'd be happy to be wrong, and it would be helpful if someone could explain how I can fix this in my configuration/code.
I also wonder whether there is any workaround for this other than simply ignoring exceptions in Rsp.
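The least bad caller-side workaround I can think of, short of ignoring all Rsp exceptions, is to treat exactly this parse failure as if the reply had never arrived (i.e. restore the Wildfly 11 semantics for this one case). A sketch; matching on the exception type and message is fragile and entirely my own assumption:

```java
// Caller-side filter: treat the specific "pos=0, limit=0" parse failure
// as a missing reply (Wildfly 11 semantics) instead of a hard error.
// Matching on exception type + message text is brittle and is an
// assumption on my part, not any documented contract.
public class SuspectReplyFilter {
    static boolean isUnknownForkStackArtifact(Throwable t) {
        return t instanceof IndexOutOfBoundsException
                && t.getMessage() != null
                && t.getMessage().contains("limit=0");
    }

    public static void main(String[] args) {
        Throwable fromSlave2 = new IndexOutOfBoundsException("pos=0, limit=0");
        if (isUnknownForkStackArtifact(fromSlave2))
            System.out.println("WARN: treating reply as not received (node likely still starting)");
        else
            System.out.println("ERROR: genuine remote failure");
    }
}
```

This at least keeps genuine remote exceptions visible while downgrading the startup-race artifact to a warning.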
Thanks in advance.
/cc pferraro