2 Replies Latest reply on Jan 28, 2014 2:04 PM by genman

HornetQ, XAResources and Oracle database locking issues on RHQ 4.9

genman Jan 22, 2014 8:05 PM

I run an HA configuration with a large number of agent hosts (2,000). Periodically I see the odd HornetQ errors shown below. I suspect they are related to XAResource and the database, because I see many database locks whenever this happens. The whole server gets stuck with transactions held up, and I have to force a restart.

Logs:

03:06:13,640 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffff11b24825:57ca474d:52df1c4c:2ee0 in state  RUN
05:08:29,284 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffff11b24825:57ca474d:52df1c4c:2f10 in state  RUN
05:35:53,539 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffff11b24825:57ca474d:52df1c4c:36f8 in state  RUN
02:03:32,031 WARN  [org.jboss.as.ejb3] (EJB default - 5) JBAS014143: Timer aaf75dff-caeb-4172-9681-fe047faf2cdd is still active, skipping overlapping scheduled execution at: Wed Jan 22 02:03:32 UTC 2014
05:35:53,540 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffff11b24825:57ca474d:52df1c4c:39ff in state  RUN
05:36:09,992 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffff11b24825:57ca474d:52df1c4c:3b41 in state  RUN
05:36:09,992 WARN  [org.jboss.as.ejb3] (EJB default - 5) JBAS014143: Timer e1ae3fd6-93b8-4e89-a24e-b4ce0587d0ca is still active, skipping overlapping scheduled execution at: Wed Jan 22 05:36:09 UTC 2014
05:36:14,704 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffff11b24825:57ca474d:52df1c4c:3b68 in state  RUN
05:36:21,907 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffff11b24825:57ca474d:52df1c4c:3b69 in state  RUN
05:36:21,907 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffff11b24825:57ca474d:52df1c4c:3b76 in state  RUN
05:36:21,907 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffff11b24825:57ca474d:52df1c4c:3b95 in state  RUN
05:36:21,908 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffff11b24825:57ca474d:52df1c4c:3b9b in state  RUN
05:36:21,907 WARN  [org.hornetq.core.client] (hornetq-failure-check-thread) HQ212107: Connection failure has been detected: HQ119034: Did not receive data from invm:0. It is likely the client has exited or crashed without closing its connection, or the network between the server and client has failed. You also might have configured connection-ttl and client-failure-check-period incorrectly. Please check user manual for more information. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
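
To get a quick picture of how the reaper warnings cluster, the log can be tallied by hour. This is only a rough sketch: the sample lines are copied from the log above, and the regular expression and the `reaper_timeouts_by_hour` helper are my own assumptions about the line layout, not anything RHQ or JBoss ships.

```python
import re
from collections import Counter

# A few lines copied from the server log above (one EJB timer line included
# on purpose to show that unrelated warnings are ignored).
SAMPLE_LOG = """\
03:06:13,640 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffff11b24825:57ca474d:52df1c4c:2ee0 in state  RUN
05:08:29,284 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffff11b24825:57ca474d:52df1c4c:2f10 in state  RUN
02:03:32,031 WARN  [org.jboss.as.ejb3] (EJB default - 5) JBAS014143: Timer aaf75dff-caeb-4172-9681-fe047faf2cdd is still active, skipping overlapping scheduled execution at: Wed Jan 22 02:03:32 UTC 2014
05:35:53,539 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffff11b24825:57ca474d:52df1c4c:36f8 in state  RUN
"""

# Assumed layout: "HH:MM:SS,mmm WARN ... ARJUNA012117: ... for TX <id> in state <STATE>"
REAPER_LINE = re.compile(
    r"^(?P<hour>\d{2}):\d{2}:\d{2},\d{3}\s+WARN\s+.*ARJUNA012117:.*"
    r"for TX (?P<tx>\S+) in state\s+(?P<state>\w+)"
)

def reaper_timeouts_by_hour(log_text):
    """Count transaction reaper timeout warnings per hour of day."""
    counts = Counter()
    for line in log_text.splitlines():
        match = REAPER_LINE.match(line)
        if match:
            counts[match.group("hour")] += 1
    return counts

counts = reaper_timeouts_by_hour(SAMPLE_LOG)
for hour in sorted(counts):
    print(hour, counts[hour])
```

Run against the full server log, a tally like this should show whether the timeouts build up gradually or arrive in a burst right before the HornetQ connection failure.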

Now see the database locks (note the lock time in seconds). This is the output of a query against Oracle:

['Object', 'Terminal', 'Machine', 'Locker', 'Wait', 'Seconds', 'Lockmode', 'Object Type', 'Session ID', 'Serial', 'sid']
('RHQ.RHQ_AFFINITY_GROUP', 'rhq', '-rhq001', 'RHQ', 'ACTIVE', 13504, 'ROW EXCLUSIVE', 'TABLE', 1145, 9745, 1145)
('RHQ.RHQ_AGENT', 'rhq', '-rhq001', 'RHQ', 'ACTIVE', 13504, 'ROW EXCLUSIVE', 'TABLE', 1145, 9745, 1145)
('RHQ.RHQ_FAILOVER_LIST', 'rhq', '-rhq001', 'RHQ', 'ACTIVE', 11346, 'ROW EXCLUSIVE', 'TABLE', 11, 22221, 11)
('RHQ.RHQ_PARTITION_DETAILS', 'rhq', '-rhq001', 'RHQ', 'ACTIVE', 11346, 'ROW EXCLUSIVE', 'TABLE', 11, 22221, 11)
('RHQ.RHQ_PARTITION_EVENT', 'rhq', '-rhq001', 'RHQ', 'ACTIVE', 11346, 'ROW EXCLUSIVE', 'TABLE', 11, 22221, 11)
('RHQ.RHQ_SERVER', 'rhq', '-rhq001', 'RHQ', 'ACTIVE', 13504, 'ROW EXCLUSIVE', 'TABLE', 1145, 9745, 1145)
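
Grouping those rows by session makes the picture easier to read. The snippet below is just a sketch over the rows pasted above; the `locks_by_session` helper is hypothetical and does not query Oracle itself.

```python
import ast
from collections import defaultdict

# Column names and rows copied verbatim from the lock listing above.
HEADER = ['Object', 'Terminal', 'Machine', 'Locker', 'Wait', 'Seconds',
          'Lockmode', 'Object Type', 'Session ID', 'Serial', 'sid']
RAW_ROWS = """
('RHQ.RHQ_AFFINITY_GROUP', 'rhq', '-rhq001', 'RHQ', 'ACTIVE', 13504, 'ROW EXCLUSIVE', 'TABLE', 1145, 9745, 1145)
('RHQ.RHQ_AGENT', 'rhq', '-rhq001', 'RHQ', 'ACTIVE', 13504, 'ROW EXCLUSIVE', 'TABLE', 1145, 9745, 1145)
('RHQ.RHQ_FAILOVER_LIST', 'rhq', '-rhq001', 'RHQ', 'ACTIVE', 11346, 'ROW EXCLUSIVE', 'TABLE', 11, 22221, 11)
('RHQ.RHQ_PARTITION_DETAILS', 'rhq', '-rhq001', 'RHQ', 'ACTIVE', 11346, 'ROW EXCLUSIVE', 'TABLE', 11, 22221, 11)
('RHQ.RHQ_PARTITION_EVENT', 'rhq', '-rhq001', 'RHQ', 'ACTIVE', 11346, 'ROW EXCLUSIVE', 'TABLE', 11, 22221, 11)
('RHQ.RHQ_SERVER', 'rhq', '-rhq001', 'RHQ', 'ACTIVE', 13504, 'ROW EXCLUSIVE', 'TABLE', 1145, 9745, 1145)
"""

def locks_by_session(raw):
    """Group lock rows by (session id, serial) and keep the longest hold time."""
    sessions = defaultdict(lambda: {"seconds": 0, "objects": []})
    for line in raw.strip().splitlines():
        row = dict(zip(HEADER, ast.literal_eval(line)))
        entry = sessions[(row['Session ID'], row['Serial'])]
        entry["seconds"] = max(entry["seconds"], row['Seconds'])
        entry["objects"].append(row['Object'])
    return dict(sessions)

summary = locks_by_session(RAW_ROWS)
for (sid, serial), info in sorted(summary.items()):
    hours = info["seconds"] / 3600
    print(f"session {sid},{serial}: {hours:.2f} h, {', '.join(info['objects'])}")
```

Only two sessions hold all six locks, one for roughly 3.75 hours and the other for roughly 3.15 hours, and every locked table belongs to the HA/agent bookkeeping (agent, server, affinity group, failover list, partition events).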

I have a couple of thoughts and questions.

1. Is this related at all to distributed transaction support?

2. Is there a way to use simple database connections rather than XA ones?

3. Is there a bug in JBoss EAP related to this at all?

I'm sort of thinking co-locating Cassandra and RHQ on the same machine isn't a good idea, as I suspect the CPU usage spike Cassandra causes may result in unexpected timeouts on the RHQ side.

1. Re: HornetQ, XAResources and Oracle database locking issues on RHQ 4.9

tsegismont Jan 28, 2014 11:24 AM (in response to genman)

Hi Elias,

I'm not sure about the database locks, but the HornetQ message is quite common on servers under heavy load (the server is so busy that it cannot send the HornetQ heartbeat in time).

So:
1. This is not related to XA transactions, I think.
2. XA transactions are required, as we use more than one "resource" in the same transaction (queue + database).
3. I don't think so.

Thomas
2. Re: HornetQ, XAResources and Oracle database locking issues on RHQ 4.9

genman Jan 28, 2014 2:04 PM (in response to tsegismont)
My feeling is that the server being so busy sets off the following chain:
1. More garbage collection pauses cause more HTTP requests to hang.
2. More memory is used by the hung requests.
3. More database locks are created and held.
4. More requests wait on those database locks.
5. The server eventually locks up, out of memory.
I've had 20 GB of heap fill up and run out of memory. I don't know if it is caused by Cassandra doing a major compaction, but I'm going to keep Cassandra away from the RHQ server if possible.