Additional details based on further testing
Based on what I'm seeing in Spring's TRACE-level logging, I am definitely sending 1000 messages using JmsTemplate*. The logging also seems to show that the Spring JMS listener on one node receives exactly 1/3 of all of the requests, while the other two nodes each receive 1/6 of the requests. This matches what I see in my own application-level logging, as well as what eventually ends up persisted to the database. The remaining 1/3 of the messages seem to just go out into the void.
When I look at the JBoss JMX console for the JMS queue in question on each node, the "MessagesAdded" attribute records that all three nodes receive exactly 1/3 of the messages, but on two of the nodes there is a disconnect between that value and the number of messages I actually see in my logging and the values my application persists.
When I parse the available data, I see this pattern emerge, where A, B, and C represent the node that is processing each message:
A, B, A, C, A, B, A, C, A, B, A, etc...
Which node is A, which is B, and which is C seems to largely be the luck of the draw between cluster restarts.
*(I am aware that JmsTemplate is regarded as an antipattern. However, this is pre-existing code, and I am doing what I can to mitigate the problem by using a cached connection factory.)
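For context, the sending side is roughly this shape (the class name, queue name, and wiring below are illustrative, not the actual production code):

```java
// Illustrative sketch only -- the real producer wires these beans through Spring config.
import javax.jms.ConnectionFactory;

import org.springframework.jms.connection.CachingConnectionFactory;
import org.springframework.jms.core.JmsTemplate;

public class RequestSender {

    private final JmsTemplate jmsTemplate;

    public RequestSender(ConnectionFactory targetConnectionFactory) {
        // Wrap the provider's connection factory so JmsTemplate does not open and
        // tear down a connection/session/producer for every single send.
        CachingConnectionFactory cachingFactory =
                new CachingConnectionFactory(targetConnectionFactory);
        this.jmsTemplate = new JmsTemplate(cachingFactory);
        this.jmsTemplate.setDefaultDestinationName("testQueue"); // hypothetical queue name
    }

    public void sendAll(int count) {
        for (int i = 0; i < count; i++) {
            jmsTemplate.convertAndSend("message-" + i);
        }
    }
}
```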
Second update based on additional testing. I would be grateful to hear any suggestions.
Using a small test harness, I've reproduced behavior that is very close to what I'm seeing in the full system.
I have a standalone, non-clustered server (configuration attached), a class acting as a listener on a queue (attached), and a class that produces messages and sends them to the same queue (attached).
Here's what I'm seeing:
- Start standalone server.
- Launch listener.
- Launch producer and allow it to complete.
- Listener receives all 100 messages.
- Stop listener.
- Start listener again.
- Launch producer again.
- Listener will only receive every other message.
This is close to what I'm seeing on 2 of the 3 nodes in my cluster. Do I have a configuration mistake somewhere?
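For reference, the attached classes boil down to roughly the following (plain JMS over JNDI; the context factory, provider URL, and lookup names are placeholders for whatever the standalone server actually exposes):

```java
// Rough sketch of the test listener -- JNDI settings below are placeholders.
import java.util.Properties;

import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.Message;
import javax.jms.MessageListener;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.jms.TextMessage;
import javax.naming.Context;
import javax.naming.InitialContext;

public class TestListener implements MessageListener {

    public static void main(String[] args) throws Exception {
        Context ctx = createContext();
        ConnectionFactory cf = (ConnectionFactory) ctx.lookup("/ConnectionFactory"); // placeholder
        Queue queue = (Queue) ctx.lookup("/queue/testQueue");                        // placeholder

        Connection connection = cf.createConnection();
        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        session.createConsumer(queue).setMessageListener(new TestListener());
        connection.start();

        System.out.println("Listening... Ctrl+C to stop.");
        Thread.currentThread().join(); // block forever
    }

    static Context createContext() throws Exception {
        Properties env = new Properties();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "org.jnp.interfaces.NamingContextFactory"); // placeholder
        env.put(Context.PROVIDER_URL, "jnp://localhost:1099");                               // placeholder
        return new InitialContext(env);
    }

    @Override
    public void onMessage(Message message) {
        try {
            System.out.println("Received: " + ((TextMessage) message).getText());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

// Rough sketch of the test producer (its own source file) -- sends 100 messages and exits.
class TestProducer {

    public static void main(String[] args) throws Exception {
        Context ctx = TestListener.createContext();
        ConnectionFactory cf = (ConnectionFactory) ctx.lookup("/ConnectionFactory"); // placeholder
        Queue queue = (Queue) ctx.lookup("/queue/testQueue");                        // placeholder

        Connection connection = cf.createConnection();
        try {
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageProducer producer = session.createProducer(queue);
            for (int i = 0; i < 100; i++) {
                producer.send(session.createTextMessage("message-" + i));
            }
        } finally {
            connection.close();
        }
    }
}
```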
It seems your acceptor is only listening on localhost.
Set a real IP on the acceptor and connector, and make it the same as what you are using in the static cluster configuration.
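For example, something along these lines (host and port are placeholders; use the same values your static cluster connectors point at, and adjust to however your config declares connectors and acceptors):

```xml
<!-- hornetq-configuration.xml-style sketch; host/port values are placeholders -->
<connectors>
   <connector name="netty-connector">
      <factory-class>org.hornetq.core.remoting.impl.netty.NettyConnectorFactory</factory-class>
      <param key="host" value="192.168.1.101"/>
      <param key="port" value="5445"/>
   </connector>
</connectors>

<acceptors>
   <acceptor name="netty-acceptor">
      <factory-class>org.hornetq.core.remoting.impl.netty.NettyAcceptorFactory</factory-class>
      <param key="host" value="192.168.1.101"/>
      <param key="port" value="5445"/>
   </acceptor>
</acceptors>
```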
Sorry about the late response. My testbed got co-opted for other purposes.
On all three hosts I removed the system property reference for the local node and replaced it with a string literal in the connector and acceptor (configs attached).
After restarting the system, I reproduced my test scenario in the main application and saw the same behavior as before.
All three connectors on each host reference a specific, string-literal bind address and a specific, string-literal port. All three nodes can communicate with each other, but on two of the nodes exactly half of the incoming messages are still getting lost.
Any other ideas?
I ended up fixing this by systematically changing each configuration setting, one by one, until the problem went away.
The offending configuration element was <forward-when-no-consumers>true</forward-when-no-consumers>. Setting that value to false fixed the problem.
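For anyone who hits the same thing: as I understand it, with forward-when-no-consumers set to true the cluster connection load-balances messages round-robin to the other nodes whether or not they have a consumer on the queue, which lines up with the every-other/every-third-message pattern I was seeing. The relevant piece of the cluster connection now looks roughly like this (the connection name and static connector list are specific to my setup):

```xml
<!-- cluster-connection sketch; names and connector refs are placeholders for my setup -->
<cluster-connections>
   <cluster-connection name="my-cluster">
      <address>jms</address>
      <connector-ref>netty-connector</connector-ref>
      <use-duplicate-detection>true</use-duplicate-detection>
      <forward-when-no-consumers>false</forward-when-no-consumers>
      <max-hops>1</max-hops>
      <static-connectors>
         <connector-ref>node-b-connector</connector-ref>
         <connector-ref>node-c-connector</connector-ref>
      </static-connectors>
   </cluster-connection>
</cluster-connections>
```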