HornetQ, XAResources and Oracle database locking issues on RHQ 4.9
genman Jan 22, 2014 8:05 PM
I have an HA configuration with quite a lot of agent hosts (2,000). Periodically I see these odd HornetQ errors. I suspect they are related to XAResource and the database, because I see a lot of database locks whenever this happens. The whole server gets stuck with transactions held up, and I have to force a restart.
Logs:
03:06:13,640 WARN [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffff11b24825:57ca474d:52df1c4c:2ee0 in state RUN
05:08:29,284 WARN [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffff11b24825:57ca474d:52df1c4c:2f10 in state RUN
05:35:53,539 WARN [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffff11b24825:57ca474d:52df1c4c:36f8 in state RUN
02:03:32,031 WARN [org.jboss.as.ejb3] (EJB default - 5) JBAS014143: Timer aaf75dff-caeb-4172-9681-fe047faf2cdd is still active, skipping overlapping scheduled execution at: Wed Jan 22 02:03:32 UTC 2014
05:35:53,540 WARN [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffff11b24825:57ca474d:52df1c4c:39ff in state RUN
05:36:09,992 WARN [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffff11b24825:57ca474d:52df1c4c:3b41 in state RUN
05:36:09,992 WARN [org.jboss.as.ejb3] (EJB default - 5) JBAS014143: Timer e1ae3fd6-93b8-4e89-a24e-b4ce0587d0ca is still active, skipping overlapping scheduled execution at: Wed Jan 22 05:36:09 UTC 2014
05:36:14,704 WARN [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffff11b24825:57ca474d:52df1c4c:3b68 in state RUN
05:36:21,907 WARN [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffff11b24825:57ca474d:52df1c4c:3b69 in state RUN
05:36:21,907 WARN [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffff11b24825:57ca474d:52df1c4c:3b76 in state RUN
05:36:21,907 WARN [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffff11b24825:57ca474d:52df1c4c:3b95 in state RUN
05:36:21,908 WARN [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffff11b24825:57ca474d:52df1c4c:3b9b in state RUN
05:36:21,907 WARN [org.hornetq.core.client] (hornetq-failure-check-thread) HQ212107: Connection failure has been detected: HQ119034: Did not receive data from invm:0. It is likely the client has exited or crashed without closing its connection, or the network between the server and client has failed. You also might have configured connection-ttl and client-failure-check-period incorrectly. Please check user manual for more information. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
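For anyone trying to correlate these warnings with database sessions, it can help to pull the timed-out transaction ids out of the server log first. The following is only a rough sketch: the class name ReaperLogScan and the method timedOutTransactions are made up for illustration, and the regular expression assumes the ARJUNA012117 message format exactly as it appears in the log above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of RHQ or JBoss.
class ReaperLogScan {
    // Captures the transaction id and state from an ARJUNA012117 warning.
    // \s+ after the code tolerates a line break inside the entry.
    private static final Pattern REAPER = Pattern.compile(
        "ARJUNA012117:\\s+TransactionReaper::check timeout for TX (\\S+) in state (\\w+)");

    // Returns the ids of all transactions the reaper reported, in log order.
    static List<String> timedOutTransactions(String log) {
        List<String> ids = new ArrayList<>();
        Matcher m = REAPER.matcher(log);
        while (m.find()) {
            ids.add(m.group(1));
        }
        return ids;
    }

    public static void main(String[] args) {
        String sample =
            "03:06:13,640 WARN [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: "
            + "TransactionReaper::check timeout for TX 0:ffff11b24825:57ca474d:52df1c4c:2ee0 in state RUN";
        for (String id : timedOutTransactions(sample)) {
            System.out.println("timed out: " + id);
        }
    }
}
```

Comparing how many distinct ids show up with how many sessions hold locks gives a first hint as to whether a few stuck transactions or many are involved.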
Here are the database locks at the time, from a query against Oracle (note the lock time in the Seconds column):
['Object', 'Terminal', 'Machine', 'Locker', 'Wait', 'Seconds', 'Lockmode', 'Object Type', 'Session ID', 'Serial', 'sid']
('RHQ.RHQ_AFFINITY_GROUP', 'rhq', '-rhq001', 'RHQ', 'ACTIVE', 13504, 'ROW EXCLUSIVE', 'TABLE', 1145, 9745, 1145)
('RHQ.RHQ_AGENT', 'rhq', '-rhq001', 'RHQ', 'ACTIVE', 13504, 'ROW EXCLUSIVE', 'TABLE', 1145, 9745, 1145)
('RHQ.RHQ_FAILOVER_LIST', 'rhq', '-rhq001', 'RHQ', 'ACTIVE', 11346, 'ROW EXCLUSIVE', 'TABLE', 11, 22221, 11)
('RHQ.RHQ_PARTITION_DETAILS', 'rhq', '-rhq001', 'RHQ', 'ACTIVE', 11346, 'ROW EXCLUSIVE', 'TABLE', 11, 22221, 11)
('RHQ.RHQ_PARTITION_EVENT', 'rhq', '-rhq001', 'RHQ', 'ACTIVE', 11346, 'ROW EXCLUSIVE', 'TABLE', 11, 22221, 11)
('RHQ.RHQ_SERVER', 'rhq', '-rhq001', 'RHQ', 'ACTIVE', 13504, 'ROW EXCLUSIVE', 'TABLE', 1145, 9745, 1145)
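The exact query behind the output above is not shown here, so the following is only a sketch of a comparable lock report. It assumes the standard Oracle views v$locked_object, dba_objects and v$session are readable by the connecting user, and it uses v$session.last_call_et as a stand-in for the Seconds column. The class name OracleLockReport, its methods, and the column aliases are made up for illustration.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical helper, not part of RHQ.
class OracleLockReport {
    // Assumed query; the one used for the output above may differ.
    static final String LOCK_QUERY =
        "SELECT o.owner || '.' || o.object_name AS locked_object, "
        + "s.machine, s.status, s.last_call_et AS seconds, "
        + "l.locked_mode, o.object_type, s.sid, s.serial# AS serial_no "
        + "FROM v$locked_object l "
        + "JOIN dba_objects o ON o.object_id = l.object_id "
        + "JOIN v$session s ON s.sid = l.session_id "
        + "ORDER BY s.last_call_et DESC";

    // Maps the numeric LOCKED_MODE from v$locked_object to its usual name.
    static String lockModeName(int mode) {
        switch (mode) {
            case 0: return "NONE";
            case 1: return "NULL";
            case 2: return "ROW SHARE";
            case 3: return "ROW EXCLUSIVE";
            case 4: return "SHARE";
            case 5: return "SHARE ROW EXCLUSIVE";
            case 6: return "EXCLUSIVE";
            default: return "UNKNOWN(" + mode + ")";
        }
    }

    // Prints one line per locked object; the caller supplies an open connection.
    static void printLocks(Connection conn) throws SQLException {
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(LOCK_QUERY)) {
            while (rs.next()) {
                System.out.printf("%s sid=%d serial=%d mode=%s seconds=%d%n",
                    rs.getString("locked_object"), rs.getInt("sid"),
                    rs.getInt("serial_no"), lockModeName(rs.getInt("locked_mode")),
                    rs.getLong("seconds"));
            }
        }
    }
}
```

Sessions that stay near the top of this report for thousands of seconds, like sid 1145 and sid 11 above, are the ones worth matching against the timed-out transactions in the server log.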
I have a couple of thoughts and questions.
1. Is this related at all to distributed transaction support?
2. Is there a way to use simple database connections rather than XA ones?
3. Is there a bug in JBoss EAP related to this at all?
I'm also starting to think that co-locating Cassandra and RHQ on the same machine isn't a good idea, as I suspect the CPU usage spikes Cassandra causes may result in unexpected timeouts on the RHQ side.