12 Replies Latest reply on Sep 16, 2013 4:17 PM by genman

    ORA-02049; ORA-00060 - timeout deleting resources

    genman

      RHQ 4.5.1.

       

      I have resources (platforms) that are old and from servers that haven't connected in about a month.

       

      But I can't seem to uninventory them. It typically ends up causing this:

       

      Caused by: java.sql.BatchUpdateException: ORA-02049: timeout: distributed transaction waiting for lock
      
              at oracle.jdbc.driver.OraclePreparedStatement.executeBatch(OraclePreparedStatement.java:10296)
              at oracle.jdbc.driver.OracleStatementWrapper.executeBatch(OracleStatementWrapper.java:216)
              at org.jboss.resource.adapter.jdbc.CachedPreparedStatement.executeBatch(CachedPreparedStatement.java:476)
              at org.jboss.resource.adapter.jdbc.WrappedStatement.executeBatch(WrappedStatement.java:774)
              at org.hibernate.jdbc.BatchingBatcher.doExecuteBatch(BatchingBatcher.java:48)
              at org.hibernate.jdbc.AbstractBatcher.executeBatch(AbstractBatcher.java:247)
      

       

      Why is the performance for removing resources so bad? I've noticed that just doing a delete can cause the agents to drop from the server as well.

       

      We also see deadlocks doing deletes.

       

      DEADLOCK DETECTED ( ORA-00060 )

       

      [Transaction Deadlock]

       

      The following deadlock is not an ORACLE error. It is a

      deadlock due to user error in the design of an application

      or from issuing incorrect ad-hoc SQL. The following

      information may aid in determining the deadlock:

       

      Deadlock graph:

                             ---------Blocker(s)--------  ---------Waiter(s)---------

      Resource Name          process session holds waits  process session holds waits

      TM-00012c0f-00000000        43     632    SX   SSX       95     765    SX   SSX

      TM-00012c0f-00000000        95     765    SX   SSX       43     632    SX   SSX

       

      session 632: DID 0001-002B-0005EF41     session 765: DID 0001-005F-0003489D

      session 765: DID 0001-005F-0003489D     session 632: DID 0001-002B-0005EF41

       

       

      ----- Information for the OTHER waiting sessions -----

      Session 765:

        sid: 765 ser: 58345 audsid: 86018249 user: 84/RHQ

          flags: (0x45) USR/- flags_idl: (0x1) BSY/-/-/-/-/-

          flags2: (0x40009) -/-/INC

        pid: 95 O/S info: user: oracle, term: UNKNOWN, ospid: 5894

          image: oracle@host

        client details:

          O/S info: user: rhq, term: unknown, ospid: 1234

          machine: rhq001 program: JDBC Thin Client

          application name: JDBC Thin Client, hash value=2546894660

       

        current SQL:

        delete from RHQ_INSTALLED_PACKAGE where ID=:1

       

       

      Information for THIS session:

       

      ----- Current SQL Statement for this session (sql_id=30cb89ztrsk07) -----

       

      delete from RHQ_INSTALLED_PACKAGE where ID=:1

       

       

      I could create some bugs for this but I wanted to know if there were some hints about working around this issue first.

        • 1. Re: ORA-02049; ORA-00060 - timeout deleting resources
          mazz

          >  delete from RHQ_INSTALLED_PACKAGE where ID=:1

           

          IIRC Installed packages is from resource's that reported discovered content (I don't think, though I could be wrong, that this table is touched by create-child actions). We have disabled, by default, all content discovery (has been like this for several releases now) because we've found rarely people care to see this information (like the RPMs installed on a linux box or the jars installed in the JBossAS /lib directory), the reporting and server-storage was inefficient and this was mainly implemented as part of a set of content features that we ended up not implementing.

           

          Perhaps what you should try is first uninventorying the servers under the platforms you want to get rid of, then, once that is done, see if you can remove the platform. Perhaps spreading it out like that would help.

           

          Our recommendation for people (as our defaults now show) is to turn off content discovery (I think this was supported on platform resoruces as well as JBossAS 4 (and possibly 5) resources).

           

          FWIW, the content/bundle subsystems are going to go under a heavy re-design/re-implementation - see https://docs.jboss.org/author/display/RHQ/Bundles+vs.+Content+-+On+Its+Head - hopefully, this will correct many of the shortcomings in the current content subsystem.

          • 2. Re: ORA-02049; ORA-00060 - timeout deleting resources
            mazz

            John Mazzitelli wrote:

            (I don't think, though I could be wrong, that this table is touched by create-child actions).

            I was wrong - quickly looking at the code and it appears InstalledPackages are created when manually creating child resources (the create-child stuff).

            • 3. Re: ORA-02049; ORA-00060 - timeout deleting resources
              genman

              I've been removing the servers one by one using the CLI and it is very slow. Like 1-2 minutes per server, which is sort of ridiculous. These are resources that are about 6 months old, so I suspect there is some synchronous purging process that's holding things up. I have a theory also that some disabled plugin resources in Inventory were causing trouble.

               

              I really don't know what's going on behind the scenes. It would work better IMHO if resources were simply marked as deleted then their data deleted at some other time.

               

              Anyway, what ended up happening during about 30 minutes of doing uninventory, about 30% of the agents were being marked as unavailable. But they were still reporting updating metrics and functional. The agents were getting this:

              2013-02-28 10:56:04,065 INFO  [ClientCommandSenderTask Timer Thread #132] (JBossRemotingRemoteCommunicator)- {JBossRemotingRemoteCommunicator.changing-endpoin

              t}Communicator is changing endpoint from [InvokerLocator [servlet://vp25q03ad-hadoop098.iad.apple.com:7080/jboss-remoting-servlet-invoker/ServerInvokerServlet

              ]] to [InvokerLocator [servlet://vp25q03ad-hadoop098.iad.apple.com:7080/jboss-remoting-servlet-invoker/ServerInvokerServlet]]

              2013-02-28 10:56:04,068 WARN  [ClientCommandSenderTask Timer Thread #132] (AgentMain)- {AgentMain.failover-failed}Failed to failover to another server. Cause: org.jboss.remoting.transport.http.WebServerError: Error received when calling on web server.  Error returned was 400

              2013-02-28 10:56:04,070 WARN  [ClientCommandSenderTask Timer Thread #132] (FailoverFailureCallback)- {AgentMain.too-many-failover-attempts}Too many failover attempts have been made [1]. Exception that triggered the failover: [org.jboss.remoting.transport.http.WebServerError: Error received when calling on web server.  Error returned was 400]

              2013-02-28 10:56:04,071 ERROR [ClientCommandSenderTask Thread #3] (ClientCommandSenderTask)- {ClientCommandSenderTask.send-failed}Failed to send command [Command: type=[remotepojo]; cmd-in-response=[false]; config=[{rhq.agent-name=vp25q03ad-oracle026.iad.apple.com, rhq.externalizable-strategy=AGENT, rhq.security-token=LAFj2sR9h+uwSllLvdxY/wWOw8vPEvOWIUmpUH5F2wT4HQqLBCu38R//yBzdkcfFCTo=, rhq.guaranteed-delivery=true, rhq.send-throttle=true}]; params=[{invocation=NameBasedInvocation[mergeMeasurementReport], targetInterfaceName=org.rhq.core.clientapi.server.measurement.MeasurementServerService}]]. Cause: org.jboss.remoting.transport.http.WebServerError:Error received when calling on web server.  Error returned was 400. Cause: org.jboss.remoting.transport.http.WebServerError: Error received when calling on web server.  Error returned was 400

              The only fix that seemed to work was to restart them.

               

              I suspect something to do with availability not getting recorded in the database got them caught in this state. I don't know the SQL going on behind the scenes but I suspect some deletes are holding things up.

              • 4. Re: ORA-02049; ORA-00060 - timeout deleting resources
                mazz

                > It would work better IMHO if resources were simply marked as deleted then their data deleted at some other time.

                 

                That is how it works. See:

                 

                org.rhq.enterprise.server.resource.ResourceManagerBean.uninventoryResourceInNewTransaction(int)

                 

                and

                 

                org.rhq.enterprise.server.scheduler.jobs.AsyncResourceDeleteJob

                • 5. Re: ORA-02049; ORA-00060 - timeout deleting resources
                  jayshaughnessy

                  When you say delete do you mean uninventory (remove from inventory) or actual resource deletion (physical removal from the agent machine) ?

                   

                  If I read it correctly you're saying you get a deadlock during uninventory of an obsolete platform.  And a different deadlock when trying to delete a server resource.  Or is it always related to that rhq_installed_package error?

                   

                  Mazz is right, uninventory should be pretty fast as it does mark the resources for async cleanup later on.  Although, there are some things that can slow it down and must happen up front, not in the async job, specifically, group withdrawal. Resources in a lot of groups can slow things down.

                   

                  As for delete, that can take time because it goes out and actually performs the agent-side removal.  Assuming the agent is up and connected.

                   

                  Of course both uninventory and delete are going to uninventory child resources as well,  So if the family tree is huge that would also play into it.

                   

                  I don't see how disabled resources could cause any issue, that's really just an availability status.  But I can't really explain some of what you're seeing.  I see maybe one suspicious thing in the delete resource impl, nothing really in univentory.

                  • 6. Re: ORA-02049; ORA-00060 - timeout deleting resources
                    genman

                    I'm actually just calling uninventory. The deadlock I may be unrelated as my DBA caught it at some point when I was probably not doing any deletes.

                     

                    What I do see are a lot of sessions (29!) doing:

                     

                    DELETE

                    FROM

                        RHQ_RESOURCE_ERROR

                    WHERE

                        RESOURCE_ID=:1

                    AND ERROR_TYPE=:2

                     

                    and a few sessions (2) doing:

                     

                    INSERT

                    INTO

                        RHQ_RESOURCE_ERROR

                        (

                            TIME_OCCURRED,

                            RESOURCE_ID,

                            ERROR_TYPE,

                            SUMMARY,

                            DETAIL,

                            ID

                        )

                        VALUES

                        (

                            :1 ,

                            :2 ,

                            :3 ,

                            :4 ,

                            :5 ,

                            :6

                        )

                    These sessions pretty much dominate the database connection activity (30/40 connections).

                     

                    The locktype is TYPE=AE . I don't know what this means exactly. This is Oracle 11g.

                     

                    One issue is we updated our JDK and deleted the old JRE on a lot of hosts without restarting JBoss. Then we're seeing a lot of these:

                    org.rhq.core.pluginapi.inventory.InvalidPluginConfigurationException: Failed to start component for resource Resource[id=65615, uuid=c83f5bd9-c4d7-4363-85be-1194741f34c1, type={JBossAS}JBossAS Server, key=/usr/local/jboss/server/default8480, name=AS hostname:1499 default8480, parent=hostname, version=AS 4.2.2.GA].

                              at org.rhq.core.pc.inventory.InventoryManager.activateResource(InventoryManager.java:1773)

                              at org.rhq.core.pc.inventory.InventoryManager.refreshResourceComponentState(InventoryManager.java:3036)

                              at org.rhq.core.pc.inventory.InventoryManager.processSyncInfo(InventoryManager.java:2781)

                              at org.rhq.core.pc.inventory.InventoryManager.processSyncInfo(InventoryManager.java:2787)

                              at org.rhq.core.pc.inventory.InventoryManager.synchInventory(InventoryManager.java:1105)

                              at org.rhq.core.pc.inventory.InventoryManager.handleReport(InventoryManager.java:1079)

                              at org.rhq.core.pc.inventory.AutoDiscoveryExecutor.call(AutoDiscoveryExecutor.java:129)

                              at org.rhq.core.pc.inventory.AutoDiscoveryExecutor.run(AutoDiscoveryExecutor.java:91)

                              at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)

                              at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)

                              at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)

                              at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)

                              at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)

                              at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)

                              at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)

                              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

                              at java.lang.Thread.run(Thread.java:662)

                    Caused by: org.rhq.core.pluginapi.inventory.InvalidPluginConfigurationException: javaHomePath connection property ('/usr/java/jre1.6.0_31') does not exist.

                              at org.rhq.plugins.jbossas.JBossASServerComponent.validateJavaHomePathProperty(JBossASServerComponent.java:626)

                              at org.rhq.plugins.jbossas.JBossASServerComponent.validatePluginConfiguration(JBossASServerComponent.java:572)

                              at org.rhq.plugins.jbossas.JBossASServerComponent.start(JBossASServerComponent.java:218)

                              at sun.reflect.GeneratedMethodAccessor35.invoke(Unknown Source)

                              at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)

                              at java.lang.reflect.Method.invoke(Method.java:597)

                              at org.rhq.core.pc.inventory.ResourceContainer$ComponentInvocationThread.call(ResourceContainer.java:634)

                              at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)

                              at java.util.concurrent.FutureTask.run(FutureTask.java:138)

                              ... 3 more

                    Probably the plugin needs to not validate the home directory.

                    • 7. Re: ORA-02049; ORA-00060 - timeout deleting resources
                      genman

                      From the RHQ side I see doing uninventory:

                       

                      2013-03-01 22:26:42,325 ERROR [org.rhq.enterprise.server.resource.ResourceManagerBean] Bulk named query delete error for 'ResourceConfigurationUpdate.deleteByResources2' for [11507]: org.hibernate.exception.SQLGrammarException: could not execute update query

                      2013-03-01 22:27:42,353 WARN  [org.hibernate.util.JDBCExceptionReporter] SQL Error: 2049, SQLState: 42000

                      2013-03-01 22:27:42,353 WARN  [org.hibernate.util.JDBCExceptionReporter] SQL Error: 2049, SQLState: 42000

                      2013-03-01 22:27:42,353 ERROR [org.hibernate.util.JDBCExceptionReporter] ORA-02049: timeout: distributed transaction waiting for lock

                      2013-03-01 22:27:42,353 ERROR [org.hibernate.util.JDBCExceptionReporter] ORA-02049: timeout: distributed transaction waiting for lock

                       

                      Plus:

                      2013-03-01 22:31:56,618 ERROR [org.rhq.enterprise.communications.command.client.ClientCommandSenderTask] {ClientCommandSenderTask.send-failed}Failed to send command [Command: type=[remotepojo]; cmd-in-response=[false]; config=[{rhq.send-throttle=true}]; params=[{invocation=NameBasedInvocation[uninventoryResource], targetInterfaceName=org.rhq.core.clientapi.agent.discovery.DiscoveryAgentService}]]. Cause: java.util.concurrent.TimeoutException:null. Cause: java.util.concurrent.TimeoutException

                      2013-03-01 22:31:56,618 WARN  [org.rhq.enterprise.server.resource.ResourceManagerBean]  Unable to inform agent of inventory removal for resource [11508]

                      java.lang.reflect.UndeclaredThrowableException

                                at $Proxy826.uninventoryResource(Unknown Source)

                                at org.rhq.enterprise.server.resource.ResourceManagerBean.uninventoryResource(ResourceManagerBean.java:368)

                                at org.rhq.enterprise.server.resource.ResourceManagerBean.uninventoryResources(ResourceManagerBean.java:259)

                                at sun.reflect.GeneratedMethodAccessor1155.invoke(Unknown Source)

                                at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)

                                at java.lang.reflect.Method.invoke(Method.java:597)

                                at org.jboss.aop.joinpoint.MethodInvocation.invokeNext(MethodInvocation.java:112)

                       

                      Plus a ton of other weird errors.

                       

                      DB Visualizer shows this during the delete:

                       

                      ObjectTerminalLockerProgramWaitLockmodeObject TypeSession IDSerialSID
                      "RHQ.RHQ_AGENT""rhq""RHQ""JDBC Thin Client""ACTIVE""ROW EXCLUSIVE""TABLE"26913861269
                      "RHQ.RHQ_RESOURCE""rhq""RHQ""JDBC Thin Client""ACTIVE""ROW EXCLUSIVE""TABLE"26913861269

                       

                      More locks, the ones for the session are bolded:

                       

                      SIDSERIAL#PROCESSUSERNAMETERMINALTYPELMODELREQUESTOBJECT
                      4631"9459""SYS (orac)""UNKNOWN""TYPE=XR""NULL""NONE""ID1=4 ID2=0"
                      4631"9459""SYS (orac)""UNKNOWN""TYPE=RD""NULL""NONE""ID1=1 ID2=0"
                      4631"9459""SYS (orac)""UNKNOWN""CONTROL FILE""RS""NONE""ID1=0 ID2=0"
                      4301"9457""SYS (orac)""UNKNOWN""REDO THREAD ""X""NONE""THREAD=1"
                      4631"9459""SYS (orac)""UNKNOWN""TYPE=RS""RS""NONE""ID1=25 ID2=1"
                      26913861"1234""RHQ""UNKNOWN""DML/DATA ENQ""RX""NONE""RHQ.RHQ_RESOURCE"
                      4961"9461""SYS (orac)""UNKNOWN""TEMP SEGMENT""RX""NONE""NEW BLOCK ALLOCATION"
                      2321"9445""SYS (orac)""UNKNOWN""TYPE=KD""X""NONE""ID1=0 ID2=0"
                      3311"9451""SYS (orac)""UNKNOWN""TYPE=PW""RX""NONE""ID1=1 ID2=0"
                      26913861"1234""RHQ""UNKNOWN""TRANSAC ENQ""X""NONE""RS+SLOT#720910 WRP#12227"
                      56533893"1234""RHQ""UNKNOWN""TYPE=AE""S""NONE""ID1=100 ID2=0"
                      60015407"1234""RHQ""UNKNOWN""TYPE=AE""S""NONE""ID1=100 ID2=0"
                      66416741"1234""RHQ""UNKNOWN""TYPE=AE""S""NONE""ID1=100 ID2=0"
                      761961"1234""RHQ""UNKNOWN""TYPE=AE""S""NONE""ID1=100 ID2=0"
                      17025625"1234""RHQ""UNKNOWN""TYPE=AE""S""NONE""ID1=100 ID2=0"
                      49960707"1234""RHQ""UNKNOWN""TYPE=AE""S""NONE""ID1=100 ID2=0"
                      2991005"1234""RHQ""UNKNOWN""TYPE=AE""S""NONE""ID1=100 ID2=0"
                      26913861"1234""RHQ""UNKNOWN""TYPE=AE""S""NONE""ID1=100 ID2=0"
                      33415013"1234""RHQ""UNKNOWN""TYPE=AE""S""NONE""ID1=100 ID2=0"
                      13459283"1234""RHQ""UNKNOWN""TYPE=AE""S""NONE""ID1=100 ID2=0"
                      76325323"1234""RHQ""UNKNOWN""TYPE=AE""S""NONE""ID1=100 ID2=0"
                      36529893"1234""RHQ""UNKNOWN""TYPE=AE""S""NONE""ID1=100 ID2=0"
                      1379247"1234""RHQ""UNKNOWN""TYPE=AE""S""NONE""ID1=100 ID2=0"
                      714011"1234""RHQ""UNKNOWN""TYPE=AE""S""NONE""ID1=100 ID2=0"
                      23922011"1234""RHQ""UNKNOWN""TYPE=AE""S""NONE""ID1=100 ID2=0"
                      7322063"1234""RHQ""UNKNOWN""TYPE=AE""S""NONE""ID1=100 ID2=0"
                      69621802"1234""RHQ""UNKNOWN""TYPE=AE""S""NONE""ID1=100 ID2=0"
                      7063953"1234""RHQ""UNKNOWN""TYPE=AE""S""NONE""ID1=100 ID2=0"
                      102401"1234""RHQ""UNKNOWN""TYPE=AE""S""NONE""ID1=100 ID2=0"
                      62854563"1234""RHQ""UNKNOWN""TYPE=AE""S""NONE""ID1=100 ID2=0"
                      53354543"1234""RHQ""UNKNOWN""TYPE=AE""S""NONE""ID1=100 ID2=0"
                      33517229"1234""RHQ""UNKNOWN""TYPE=AE""S""NONE""ID1=100 ID2=0"
                      20612315"1234""RHQ""UNKNOWN""TYPE=AE""S""NONE""ID1=100 ID2=0"
                      40136549"1234""RHQ""UNKNOWN""TYPE=AE""S""NONE""ID1=100 ID2=0"
                      46627501"1234""RHQ""UNKNOWN""TYPE=AE""S""NONE""ID1=100 ID2=0"
                      5621"9465""SYS (orac)""UNKNOWN""TYPE=AE""S""NONE""ID1=100 ID2=0"
                      26913861"1234""RHQ""UNKNOWN""DML/DATA ENQ""RX""NONE""RHQ.RHQ_AGENT"
                      2321"9445""SYS (orac)""UNKNOWN""TYPE=KT""S""NONE""ID1=12537 ID2=0"

                       

                      [http-0.0.0.0-7080-132] DELETE FROM RHQ_RESOURCE_GROUP_RES_EXP_MAP WHERE RESOURCE_ID IN ( ? )

                      [http-0.0.0.0-7080-132] DELETE FROM RHQ_RESOURCE_GROUP_RES_IMP_MAP WHERE RESOURCE_ID IN ( ? )

                      [http-0.0.0.0-7080-132] update RHQ_AGENT set LAST_AVAILABILITY_PING=?, BACKFILLED=0 where NAME=?

                      [http-0.0.0.0-7080-132] update RHQ_AGENT set LAST_AVAILABILITY_PING=?, BACKFILLED=0 where NAME=?

                      [http-0.0.0.0-7080-132] update RHQ_RESOURCE set INVENTORY_STATUS=?, AGENT_ID=null, PARENT_RESOURCE_ID=null, RESOURCE_KEY='deleted' where ID in (?)

                       

                      ...

                       

                      Here's the stack trace (thread name is different but same idea)

                      "http-0.0.0.0-7080-33" daemon prio=10 tid=0x00007fbee8041000 nid=0x41eb waiting on condition [0x00007fbe9f2ec000]

                         java.lang.Thread.State: TIMED_WAITING (parking)

                              at sun.misc.Unsafe.park(Native Method)

                              - parking to wait for  <0x00000007cd75a8f0> (a java.util.concurrent.FutureTask$Sync)

                              at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:196)

                              at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1011)

                              at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1303)

                              at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:227)

                              at java.util.concurrent.FutureTask.get(FutureTask.java:91)

                              at org.rhq.enterprise.communications.command.client.ClientCommandSenderTask.run(ClientCommandSenderTask.java:143)

                              at org.rhq.enterprise.communications.command.client.ClientCommandSender.sendSynch(ClientCommandSender.java:647)

                              at org.rhq.enterprise.communications.command.client.ClientRemotePojoFactory$RemotePojoProxyHandler.invoke(ClientRemotePojoFactory.java:407)

                              at $Proxy826.uninventoryResource(Unknown Source)

                              at org.rhq.enterprise.server.resource.ResourceManagerBean.uninventoryResource(ResourceManagerBean.java:368)

                              at org.rhq.enterprise.server.resource.ResourceManagerBean.uninventoryResources(ResourceManagerBean.java:259)

                       

                      I think the jist of it is:

                      1) RHQ should not wait (forever) to try to uninventory from platforms that are down

                      2) RHQ should not hold database locks while calling a remote resource, especially one that is down

                      • 8. Re: ORA-02049; ORA-00060 - timeout deleting resources
                        genman

                        It's not *forever* but to uninventory a machine that you can't connect to (not the 'connection refused' immediately case) should not take as long as it does. For example, to remove a CPU(!) from a machine that's down, it takes 60+ seconds (maybe 3 minutes?) to try to connect to that agent that hasn't connected in months? Why not bypass all that if the agent isn't actually UP? To me, I would have done the delete, send a message to JMS, then in an MDB send the request to uninventory. That way you don't block the database. Let an MDB do the call.

                        • 9. Re: ORA-02049; ORA-00060 - timeout deleting resources
                          pilhuhn

                          All those calls to enable/disable/delete/... basically go

                           

                                  try {

                                      AgentClient agentClient = agentManager.getAgentClient(agent);

                                      ResourceFactoryAgentService resourceFactoryAgentService = agentClient.getResourceFactoryAgentService();

                                      resourceFactoryAgentService.deleteResource(request);

                           

                           

                                      return persistedHistory;

                                  } catch (CannotConnectException e) {

                           

                          so we try to get the AgentClient, create a remote-POJO and then use the remoted agent service to do the work.

                          And this is (unless told differently) waiting up to org.rhq.enterprise.communications.command.client.ClientRemotePojoFactory#m_timeoutMillis

                           

                          I did not yet find out what this value really is, but just trying to open the socket may take some time, that gets worse for group removals where the resources are on different agents.

                           

                          Perhaps we can shortcut the logic by checking if the agent is currently marked as available or not and don't try to send when it is no up.

                           

                          Currently when I have shut down the agent "snert" (and the server knows about it ) I see this in the log.

                           

                          19:18:16,307 INFO  [org.rhq.enterprise.server.core.AgentManagerBean] (http-/0.0.0.0:7080-1) Agent with name [snert] just went down

                          19:18:40,443 ERROR [org.rhq.enterprise.communications.command.client.ClientCommandSenderTask] (http-/0.0.0.0:7080-5) {ClientCommandSenderTask.send-failed}Failed to send command [Command: type=[remotepojo]; cmd-in-response=[false]; config=[{rhq.send-throttle=true}]; params=[{invocation=NameBasedInvocation[uninventoryResource], targetInterfaceName=org.rhq.core.clientapi.agent.discovery.DiscoveryAgentService}]]. Cause: org.jboss.remoting.CannotConnectException:Can not get connection to server. Problem establishing socket connection for InvokerLocator [socket://172.31.7.7:16163/?backlog=200&clientMaxPoolSize=304&enableTcpNoDelay=true&maxPoolSize=303&numAcceptThreads=1&rhq.communications.connector.rhqtype=agent&socketTimeout=60000] -> java.net.ConnectException:Connection refused. Cause: org.jboss.remoting.CannotConnectException: Can not get connection to server. Problem establishing socket connection for InvokerLocator [socket://172.31.7.7:16163/?backlog=200&clientMaxPoolSize=304&enableTcpNoDelay=true&maxPoolSize=303&numAcceptThreads=1&rhq.communications.connector.rhqtype=agent&socketTimeout=60000]

                          19:18:40,444 WARN  [org.rhq.enterprise.server.resource.ResourceManagerBean] (http-/0.0.0.0:7080-5)  Unable to inform agent of inventory removal for resource [10019]: org.jboss.remoting.CannotConnectException: Can not get connection to server. Problem establishing socket connection for InvokerLocator [socket://172.31.7.7:16163/?backlog=200&clientMaxPoolSize=304&enableTcpNoDelay=true&maxPoolSize=303&numAcceptThreads=1&rhq.communications.connector.rhqtype=agent&socketTimeout=60000]

                                    at org.jboss.remoting.transport.socket.MicroSocketClientInvoker.transport(MicroSocketClientInvoker.java:838) [jboss-remoting-2.5.4.SP4.jar:]

                                    at org.jboss.remoting.MicroRemoteClientInvoker.invoke(MicroRemoteClientInvoker.java:169) [jboss-remoting-2.5.4.SP4.jar:]

                                    at org.jboss.remoting.Client.invoke(Client.java:2084) [jboss-remoting-2.5.4.SP4.jar:]

                                    at org.jboss.remoting.Client.invoke(Client.java:879) [jboss-remoting-2.5.4.SP4.jar:]

                                    at org.rhq.enterprise.communications.command.client.JBossRemotingRemoteCommunicator.rawSend(JBossRemotingRemoteCommunicator.java:514) [rhq-enterprise-comm-4.9.0-SNAPSHOT.jar:4.9.0-SNAPSHOT]

                                    at org.rhq.enterprise.communications.command.client.JBossRemotingRemoteCommunicator.sendWithoutCallbacks(JBossRemotingRemoteCommunicator.java:456) [rhq-enterprise-comm-4.9.0-SNAPSHOT.jar:4.9.0-SNAPSHOT]

                                    at org.rhq.enterprise.communications.command.client.JBossRemotingRemoteCommunicator.sendWithoutInitializeCallback(JBossRemotingRemoteCommunicator.java:475) [rhq-enterprise-comm-4.9.0-SNAPSHOT.jar:4.9.0-SNAPSHOT]

                                    at org.rhq.enterprise.communications.command.client.JBossRemotingRemoteCommunicator.send(JBossRemotingRemoteCommunicator.java:496) [rhq-enterprise-comm-4.9.0-SNAPSHOT.jar:4.9.0-SNAPSHOT]

                                    at org.rhq.enterprise.communications.command.client.AbstractCommandClient.invoke(AbstractCommandClient.java:143) [rhq-enterprise-comm-4.9.0-SNAPSHOT.jar:4.9.0-SNAPSHOT]

                                    at org.rhq.enterprise.communications.command.client.ClientCommandSender.send(ClientCommandSender.java:1091) [rhq-enterprise-comm-4.9.0-SNAPSHOT.jar:4.9.0-SNAPSHOT]

                                    at org.rhq.enterprise.communications.command.client.ClientCommandSenderTask.send(ClientCommandSenderTask.java:229) [rhq-enterprise-comm-4.9.0-SNAPSHOT.jar:4.9.0-SNAPSHOT]

                                    at org.rhq.enterprise.communications.command.client.ClientCommandSenderTask.call(ClientCommandSenderTask.java:107) [rhq-enterprise-comm-4.9.0-SNAPSHOT.jar:4.9.0-SNAPSHOT]

                                    at org.rhq.enterprise.communications.command.client.ClientCommandSenderTask.call(ClientCommandSenderTask.java:55) [rhq-enterprise-comm-4.9.0-SNAPSHOT.jar:4.9.0-SNAPSHOT]

                                    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) [rt.jar:1.7.0_25]

                                    at java.util.concurrent.FutureTask.run(FutureTask.java:166) [rt.jar:1.7.0_25]

                                    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [rt.jar:1.7.0_25]

                                    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [rt.jar:1.7.0_25]

                                    at java.lang.Thread.run(Thread.java:724) [rt.jar:1.7.0_25]

                          Caused by: java.net.ConnectException: Connection refused

                                    at java.net.PlainSocketImpl.socketConnect(Native Method) [rt.jar:1.7.0_25]

                                    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339) [rt.jar:1.7.0_25]

                                    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200) [rt.jar:1.7.0_25]

                                    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182) [rt.jar:1.7.0_25]

                                    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) [rt.jar:1.7.0_25]

                                    at java.net.Socket.connect(Socket.java:579) [rt.jar:1.7.0_25]

                                    at org.jboss.remoting.transport.socket.SocketClientInvoker$1.run(SocketClientInvoker.java:303) [jboss-remoting-2.5.4.SP4.jar:]

                                    at java.security.AccessController.doPrivileged(Native Method) [rt.jar:1.7.0_25]

                                    at org.jboss.remoting.transport.socket.SocketClientInvoker.connect(SocketClientInvoker.java:299) [jboss-remoting-2.5.4.SP4.jar:]

                                    at org.jboss.remoting.transport.socket.SocketClientInvoker.createSocket(SocketClientInvoker.java:210) [jboss-remoting-2.5.4.SP4.jar:]

                                    at org.jboss.remoting.transport.socket.MicroSocketClientInvoker.getConnection(MicroSocketClientInvoker.java:1172) [jboss-remoting-2.5.4.SP4.jar:]

                                    at org.jboss.remoting.transport.socket.MicroSocketClientInvoker.transport(MicroSocketClientInvoker.java:816) [jboss-remoting-2.5.4.SP4.jar:]

                                    ... 17 more

                          • 10. Re: ORA-02049; ORA-00060 - timeout deleting resources
                            pilhuhn
                            • 11. Re: ORA-02049; ORA-00060 - timeout deleting resources
                              genman

                              A connection refused error comes back quickly, it's an issue when the router/host is down. Maybe manually update the IP address to resolve to a host that is firewalled to reproduce?

                               

                              I would also argue things would be better if the database isn't locked while inventory removal is going on. It's dangerous to start a transaction while doing blocking I/O like this.

                               

                              Post 4.5 (4.7?) the issue is mitigated somewhat since we can permanently purge agents from the database, which is really the issue here.

                              • 12. Re: ORA-02049; ORA-00060 - timeout deleting resources
                                genman

                                I continue to see this problem when the agent is on a IP address that is inaccessible and it causes the database to time out, causing availability reports to disappear and the other hosts to appear down as well. (This is 4.5. Later versions have a fix for availability.)

                                 

                                Why not do a ping or something before bothering to call the agent, or call the agent in a new thread? I added a suggested code change in the BZ issue above.