Hi Folks, I've got a production 2.2.5_Final system that has one subscriber queue stalled and three subscriber queues apparently working for the same topic (the filters for each subscriber are different). By "stalled" I mean that Hornet does not appear to be delivering any messages to the durable consumers when they connect and ask for messages. Through the JMX console I see the message counters on the "stalled" subscriber queue going up each day, but the consumers/clients say they're not getting messages (and are claiming "no changes on my end this time mate!"). Messages are deposited to the topic by a remote JBoss app that accesses the topic connection factory via a remote naming connection.
Because I'm in prod I can't easily change major stuff like the version of Hornet or JBoss, so I'm looking to get out of the bind we're in with the growing backlog of undelivered messages (assuming they are valid) without losing them along the way, and to put in place process/other changes that will help avoid the problem in the future.
After running printPages and getting 1.9 million lines of output (relating mostly to the problematic topic/address), and doing some fancy stuff like grep | wc, I've learned:
1. approx 1.9 million lines/msgs spread over 2381 pages
2. approx 1.3 million lines/msgs are marked with PG_TX_NOT_FOUND
3. approx 5000 lines/msgs are marked with the ACK flag, the rest are not
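For anyone who wants to reproduce these counts without grep | wc, here is a rough Java sketch of the same tally. The class and method names are mine, not part of Hornet, and the marker strings are passed in as arguments because I don't want to assume the exact text printPages uses for the ACK flag. It just counts the lines of a saved printPages output file that contain each marker.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Stream;

// Hypothetical helper (not part of Hornet): tallies marker occurrences
// in a saved copy of the printPages output.
public class PrintPagesTally {

    // Returns the total line count under "TOTAL" plus, for each marker,
    // the number of lines containing it. A line may match several markers.
    static Map<String, Long> tally(Iterable<String> lines, String... markers) {
        Map<String, Long> counts = new LinkedHashMap<>();
        counts.put("TOTAL", 0L);
        for (String marker : markers) {
            counts.put(marker, 0L);
        }
        for (String line : lines) {
            counts.merge("TOTAL", 1L, Long::sum);
            for (String marker : markers) {
                if (line.contains(marker)) {
                    counts.merge(marker, 1L, Long::sum);
                }
            }
        }
        return counts;
    }

    // Usage: java PrintPagesTally <output-file> <marker> [<marker> ...]
    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: PrintPagesTally <output-file> [marker ...]");
            return;
        }
        String[] markers = new String[args.length - 1];
        System.arraycopy(args, 1, markers, 0, markers.length);
        // ISO-8859-1 maps every byte to a character, so odd bytes in the
        // dumped message bodies cannot abort the read.
        try (Stream<String> lines =
                 Files.lines(Paths.get(args[0]), StandardCharsets.ISO_8859_1)) {
            tally(lines::iterator, markers)
                .forEach((name, count) -> System.out.println(name + ": " + count));
        }
    }
}
```

Streaming the file line by line keeps memory flat even at 1.9 million lines.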
My questions are:
1. Do the above general counts from the output of printPages for the problematic topic/address tell anyone anything that might help us understand what's gone wrong or is continuing to go wrong?
2. Regarding the logic in printPages.java that decides whether to print PG_TX_NOT_FOUND:
if (msg.getTransactionID() >= 0 && !pgTXs.contains(msg.getTransactionID()))
a. What is the above code actually saying / checking for in layman's terms? Is it flagging messages that are somehow still caught up in some sort of XA / transaction (e.g. in flight, must be rolled back)?
b. Does the ongoing addition of messages that printPages reports as PG_TX_NOT_FOUND indicate a problem with config or code execution that is still current, or is it likely a point-in-time failure/corruption that is preventing things from returning to normal?
3. I'm ultimately trying to work out the current state according to Hornet. Do the paging files contain undelivered messages, and is there anything I can do to kickstart/clean things up without losing the messages if they are in a holding pattern?
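To make question 2a concrete, this is my current reading of that condition, pulled out into a standalone sketch. Only the boolean expression comes from printPages.java. The class name, the method name and the sample IDs are made up, and I'm assuming (not certain) that pgTXs is a set of transaction IDs that printPages knows about and that a negative transaction ID means the message was not part of a transaction.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical standalone version of the check, for discussion only.
public class PgTxCheck {

    // True when the message carries a transaction ID (>= 0) that is
    // NOT present in the set of known page transaction IDs.
    static boolean wouldPrintNotFound(long transactionId, Set<Long> pgTXs) {
        return transactionId >= 0 && !pgTXs.contains(transactionId);
    }

    public static void main(String[] args) {
        // Made-up IDs standing in for whatever printPages loads into pgTXs.
        Set<Long> pgTXs = new HashSet<>(Arrays.asList(100L, 101L));

        System.out.println(wouldPrintNotFound(-1L, pgTXs));  // false: no transaction ID
        System.out.println(wouldPrintNotFound(100L, pgTXs)); // false: ID is in the set
        System.out.println(wouldPrintNotFound(999L, pgTXs)); // true: ID has no matching entry
    }
}
```

If that reading is right, my 1.3 million flagged lines are messages whose transaction ID has no matching entry in that set, and what I still don't know is what that means for delivery.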