2 Replies Latest reply on Jan 28, 2014 2:04 PM by genman

    HornetQ, XAResources and Oracle database locking issues on RHQ 4.9

    genman


      I have an HA configuration with quite a lot of agent hosts (2,000). Periodically I see these odd HornetQ errors, which I think have something to do with XAResource and the database, because I see a lot of database locks when they happen. Basically the whole server gets stuck holding up transactions, and I have to force a restart.

       

      Logs:

       

      03:06:13,640 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffff11b24825:57ca474d:52df1c4c:2ee0 in state  RUN
      05:08:29,284 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffff11b24825:57ca474d:52df1c4c:2f10 in state  RUN
      05:35:53,539 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffff11b24825:57ca474d:52df1c4c:36f8 in state  RUN
      02:03:32,031 WARN  [org.jboss.as.ejb3] (EJB default - 5) JBAS014143: Timer aaf75dff-caeb-4172-9681-fe047faf2cdd is still active, skipping overlapping scheduled execution at: Wed Jan 22 02:03:32 UTC 2014
      05:35:53,540 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffff11b24825:57ca474d:52df1c4c:39ff in state  RUN
      05:36:09,992 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffff11b24825:57ca474d:52df1c4c:3b41 in state  RUN
      05:36:09,992 WARN  [org.jboss.as.ejb3] (EJB default - 5) JBAS014143: Timer e1ae3fd6-93b8-4e89-a24e-b4ce0587d0ca is still active, skipping overlapping scheduled execution at: Wed Jan 22 05:36:09 UTC 2014
      05:36:14,704 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffff11b24825:57ca474d:52df1c4c:3b68 in state  RUN
      05:36:21,907 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffff11b24825:57ca474d:52df1c4c:3b69 in state  RUN
      05:36:21,907 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffff11b24825:57ca474d:52df1c4c:3b76 in state  RUN
      05:36:21,907 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffff11b24825:57ca474d:52df1c4c:3b95 in state  RUN
      05:36:21,908 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffff11b24825:57ca474d:52df1c4c:3b9b in state  RUN
      05:36:21,907 WARN  [org.hornetq.core.client] (hornetq-failure-check-thread) HQ212107: Connection failure has been detected: HQ119034: Did not receive data from invm:0. It is likely the client has exited or crashed without closing its connection, or the network between the server and client has failed. You also might have configured connection-ttl and client-failure-check-period incorrectly. Please check user manual for more information. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
      
      


      Here are the database locks, from a query on Oracle (note the lock times, in seconds):


      ['Object', 'Terminal', 'Machine', 'Locker', 'Wait', 'Seconds', 'Lockmode', 'Object Type', 'Session ID', 'Serial', 'sid']
      ('RHQ.RHQ_AFFINITY_GROUP', 'rhq', '-rhq001', 'RHQ', 'ACTIVE', 13504, 'ROW EXCLUSIVE', 'TABLE', 1145, 9745, 1145)
      ('RHQ.RHQ_AGENT', 'rhq', '-rhq001', 'RHQ', 'ACTIVE', 13504, 'ROW EXCLUSIVE', 'TABLE', 1145, 9745, 1145)
      ('RHQ.RHQ_FAILOVER_LIST', 'rhq', '-rhq001', 'RHQ', 'ACTIVE', 11346, 'ROW EXCLUSIVE', 'TABLE', 11, 22221, 11)
      ('RHQ.RHQ_PARTITION_DETAILS', 'rhq', '-rhq001', 'RHQ', 'ACTIVE', 11346, 'ROW EXCLUSIVE', 'TABLE', 11, 22221, 11)
      ('RHQ.RHQ_PARTITION_EVENT', 'rhq', '-rhq001', 'RHQ', 'ACTIVE', 11346, 'ROW EXCLUSIVE', 'TABLE', 11, 22221, 11)
      ('RHQ.RHQ_SERVER', 'rhq', '-rhq001', 'RHQ', 'ACTIVE', 13504, 'ROW EXCLUSIVE', 'TABLE', 1145, 9745, 1145)
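      Grouping these rows by session makes the pattern easier to see: two Oracle sessions, each holding ROW EXCLUSIVE locks on three RHQ tables for hours. The Python sketch below does that grouping on the rows above and flags sessions whose longest-held lock exceeds a transaction timeout. The 300-second TX_TIMEOUT_SECONDS is an assumed comparison value, not something read from this server's configuration, and stale_sessions is a hypothetical helper.

```python
from collections import defaultdict

# Rows copied from the Oracle lock query output above. Column order:
# object, terminal, machine, locker, wait, seconds, lockmode, object type,
# session id, serial, sid
LOCKS = [
    ("RHQ.RHQ_AFFINITY_GROUP", "rhq", "-rhq001", "RHQ", "ACTIVE", 13504, "ROW EXCLUSIVE", "TABLE", 1145, 9745, 1145),
    ("RHQ.RHQ_AGENT", "rhq", "-rhq001", "RHQ", "ACTIVE", 13504, "ROW EXCLUSIVE", "TABLE", 1145, 9745, 1145),
    ("RHQ.RHQ_FAILOVER_LIST", "rhq", "-rhq001", "RHQ", "ACTIVE", 11346, "ROW EXCLUSIVE", "TABLE", 11, 22221, 11),
    ("RHQ.RHQ_PARTITION_DETAILS", "rhq", "-rhq001", "RHQ", "ACTIVE", 11346, "ROW EXCLUSIVE", "TABLE", 11, 22221, 11),
    ("RHQ.RHQ_PARTITION_EVENT", "rhq", "-rhq001", "RHQ", "ACTIVE", 11346, "ROW EXCLUSIVE", "TABLE", 11, 22221, 11),
    ("RHQ.RHQ_SERVER", "rhq", "-rhq001", "RHQ", "ACTIVE", 13504, "ROW EXCLUSIVE", "TABLE", 1145, 9745, 1145),
]

# Assumption: 300 s used as the transaction timeout to compare lock ages against.
TX_TIMEOUT_SECONDS = 300


def stale_sessions(rows, timeout=TX_TIMEOUT_SECONDS):
    """Group lock rows by (session id, serial) and keep only sessions whose
    longest-held lock exceeds `timeout` seconds."""
    by_session = defaultdict(lambda: {"seconds": 0, "tables": []})
    for row in rows:
        obj, seconds, session_id, serial = row[0], row[5], row[8], row[9]
        entry = by_session[(session_id, serial)]
        entry["seconds"] = max(entry["seconds"], seconds)
        entry["tables"].append(obj)
    return {key: info for key, info in by_session.items() if info["seconds"] > timeout}


for (session_id, serial), info in sorted(stale_sessions(LOCKS).items()):
    hours = info["seconds"] / 3600
    print(f"session {session_id},{serial}: {hours:.2f} h, tables: {', '.join(info['tables'])}")
```

      On the rows above this reports session 11 at about 3.15 hours (failover list, partition details, partition event) and session 1145 at about 3.75 hours (affinity group, agent, server), both far beyond any ordinary transaction timeout.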
      

       

      I have a couple of thoughts and questions.

       

      1. Is this related at all to distributed transaction support?

      2. Is there a way to use simple database connections rather than XA ones?

      3. Is there a bug in JBoss EAP related to this at all?

       

      I'm starting to think that co-locating Cassandra and RHQ on the same machine isn't a good idea, as I suspect the CPU spikes Cassandra causes may lead to unexpected timeouts on the RHQ side.

        • 1. Re: HornetQ, XAResources and Oracle database locking issues on RHQ 4.9
          tsegismont

          Hi Elias,

           

          I'm not sure about the database locks, but the HornetQ message is quite common on servers under heavy load (the server is so busy that it cannot send the HornetQ heartbeat in time).
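          To illustrate the heartbeat point, here is a toy Python model of a TTL-based dead-connection check: every check period the checker asks whether any data arrived within the last TTL, and closes the connection if not. It mirrors the idea behind HornetQ's connection-ttl and client-failure-check-period settings but is not HornetQ's implementation; first_failure and all timing values are hypothetical, not HornetQ defaults.

```python
def first_failure(receive_times_ms, ttl_ms, check_period_ms, end_ms):
    """Simulate a TTL-based dead-connection check.

    receive_times_ms: sorted times (ms) at which any data, including
    heartbeats, arrived on a connection assumed to open at t=0.
    Every check_period_ms the checker asks: has anything arrived within the
    last ttl_ms? Returns the time of the first failed check, or None.
    """
    last_seen = 0
    i = 0
    t = check_period_ms
    while t <= end_ms:
        # Consume every packet that arrived up to this check.
        while i < len(receive_times_ms) and receive_times_ms[i] <= t:
            last_seen = receive_times_ms[i]
            i += 1
        if t - last_seen > ttl_ms:
            return t  # the connection would be closed at this check
        t += check_period_ms
    return None


# Hypothetical values: heartbeat every 30 s, 60 s TTL, check every 30 s.
healthy = list(range(30_000, 300_001, 30_000))
# Same peer, but it stalls (e.g. GC pause or CPU starvation) between 60 s and 180 s.
stalled = [30_000, 60_000, 180_000, 210_000]

print(first_failure(healthy, ttl_ms=60_000, check_period_ms=30_000, end_ms=300_000))
print(first_failure(stalled, ttl_ms=60_000, check_period_ms=30_000, end_ms=300_000))
```

          The healthy peer is never declared dead, while the stalled one is closed at 150 s even though it resumes sending at 180 s. Since the connection in the log is in-VM (invm:0), both ends live in the same JVM, so a JVM that is too busy to send heartbeats fits this explanation.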

           

          So:

          1. This is not related to XA transactions, I think.

          2. XA transactions are required as we use more than one "resource" in the same transaction (queue + database)

          3. I don't think so.
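          Point 2 is the reason XA is used at all: one transaction spans two resources (the HornetQ queue and the database), and XA coordinates them with a two-phase commit. The Python sketch below is a toy model of that protocol, not Narayana/Arjuna's implementation; Participant and two_phase_commit are hypothetical names. It also shows one general way locks can outlive their transaction in any XA setup (a participant that has voted "yes" keeps its locks until the coordinator decides), offered as background rather than a diagnosis of this thread.

```python
class Participant:
    """Toy XA resource (standing in for a HornetQ queue or an Oracle connection)."""

    def __init__(self, name, vote_yes=True):
        self.name = name
        self.vote_yes = vote_yes
        self.state = "active"
        self.holds_locks = True  # work done inside the transaction holds locks

    def prepare(self):
        # Phase 1: vote. A "yes" voter must keep its locks until told the outcome.
        if self.vote_yes:
            self.state = "prepared"
            return True
        self.rollback()
        return False

    def commit(self):
        self.state = "committed"
        self.holds_locks = False

    def rollback(self):
        self.state = "rolled_back"
        self.holds_locks = False


def two_phase_commit(participants, coordinator_stalls=False):
    """Return the transaction outcome: committed, rolled_back or in_doubt."""
    votes = [p.prepare() for p in participants]
    if not all(votes):
        # Any "no" vote aborts the whole transaction.
        for p in participants:
            if p.state == "prepared":
                p.rollback()
        return "rolled_back"
    if coordinator_stalls:
        # Phase 2 never arrives: prepared participants sit in doubt, locks held.
        return "in_doubt"
    for p in participants:
        p.commit()
    return "committed"


queue, db = Participant("queue"), Participant("database")
print(two_phase_commit([queue, db]), db.holds_locks)

queue, db = Participant("queue"), Participant("database")
print(two_phase_commit([queue, db], coordinator_stalls=True), db.state, db.holds_locks)
```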

           

          Thomas

          • 2. Re: HornetQ, XAResources and Oracle database locking issues on RHQ 4.9
            genman

            My feeling is that the server being so busy sets off a cascade:

            • More garbage collection pauses cause more HTTP requests to hang
            • More memory is used by more hung requests.
            • More database locks are created and hang.
            • More requests are waiting for database locks
            • Server eventually locks up, out of memory.
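            The cascade above is essentially a queue whose service rate drops below its arrival rate. The toy Python model below (not a model of RHQ itself; backlog_over_time and all rates are hypothetical) counts in-flight requests per tick, with "pause" ticks standing in for GC pauses or lock waits during which nothing completes.

```python
def backlog_over_time(arrivals_per_tick, capacity_per_tick, pause_ticks, ticks):
    """Return the number of in-flight (waiting) requests after each tick.

    Each tick, `arrivals_per_tick` requests arrive. The server completes up to
    `capacity_per_tick` of them, except on ticks in `pause_ticks` (standing in
    for a GC pause or a lock wait), when it completes none.
    """
    backlog = 0
    history = []
    for t in range(ticks):
        backlog += arrivals_per_tick
        if t not in pause_ticks:
            backlog = max(0, backlog - capacity_per_tick)
        history.append(backlog)
    return history


# Hypothetical rates: 20% headroom, with a two-tick pause at ticks 2 and 3.
print(backlog_over_time(100, 120, {2, 3}, 8))
# No headroom: capacity is below the arrival rate.
print(backlog_over_time(100, 90, set(), 8))
```

            With 20% headroom, a two-tick pause leaves 200 requests waiting, which drain at only 20 per tick, so recovery takes about ten ticks and any further pause in that window stacks on top. Without headroom the backlog, and the memory held by waiting requests, grows without bound, which matches the "eventually out of memory" end state.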

            I've had a 20GB heap fill up and run out of memory. I don't know whether it is caused by Cassandra doing a major compaction, but I'm going to keep Cassandra away from the RHQ server if possible.