12 Replies Latest reply on Sep 16, 2013 4:17 PM by genman

ORA-02049; ORA-00060 - timeout deleting resources

genman Feb 28, 2013 1:33 PM

RHQ 4.5.1.

I have resources (platforms) that are old and from servers that haven't connected in about a month.

But I can't seem to uninventory them. It typically ends up causing this:

Caused by: java.sql.BatchUpdateException: ORA-02049: timeout: distributed transaction waiting for lock

        at oracle.jdbc.driver.OraclePreparedStatement.executeBatch(OraclePreparedStatement.java:10296)
        at oracle.jdbc.driver.OracleStatementWrapper.executeBatch(OracleStatementWrapper.java:216)
        at org.jboss.resource.adapter.jdbc.CachedPreparedStatement.executeBatch(CachedPreparedStatement.java:476)
        at org.jboss.resource.adapter.jdbc.WrappedStatement.executeBatch(WrappedStatement.java:774)
        at org.hibernate.jdbc.BatchingBatcher.doExecuteBatch(BatchingBatcher.java:48)
        at org.hibernate.jdbc.AbstractBatcher.executeBatch(AbstractBatcher.java:247)

Why is the performance for removing resources so bad? I've noticed that just doing a delete can cause the agents to drop from the server as well.

We also see deadlocks doing deletes.

DEADLOCK DETECTED ( ORA-00060 )

[Transaction Deadlock]

The following deadlock is not an ORACLE error. It is a
deadlock due to user error in the design of an application
or from issuing incorrect ad-hoc SQL. The following
information may aid in determining the deadlock:

Deadlock graph:
                       ---------Blocker(s)-------- ---------Waiter(s)---------
Resource Name          process session holds waits process session holds waits
TM-00012c0f-00000000        43     632    SX   SSX       95     765    SX   SSX
TM-00012c0f-00000000        95     765    SX   SSX       43     632    SX   SSX

session 632: DID 0001-002B-0005EF41     session 765: DID 0001-005F-0003489D
session 765: DID 0001-005F-0003489D     session 632: DID 0001-002B-0005EF41

----- Information for the OTHER waiting sessions -----
Session 765:
sid: 765 ser: 58345 audsid: 86018249 user: 84/RHQ
    flags: (0x45) USR/- flags_idl: (0x1) BSY/-/-/-/-/-
    flags2: (0x40009) -/-/INC
pid: 95 O/S info: user: oracle, term: UNKNOWN, ospid: 5894
    image: oracle@host
client details:
    O/S info: user: rhq, term: unknown, ospid: 1234
    machine: rhq001 program: JDBC Thin Client
    application name: JDBC Thin Client, hash value=2546894660

current SQL:
delete from RHQ_INSTALLED_PACKAGE where ID=:1

Information for THIS session:

----- Current SQL Statement for this session (sql_id=30cb89ztrsk07) -----

delete from RHQ_INSTALLED_PACKAGE where ID=:1

I could create some bugs for this but I wanted to know if there were some hints about working around this issue first.

1. Re: ORA-02049; ORA-00060 - timeout deleting resources

mazz Feb 28, 2013 1:56 PM (in response to genman)

> delete from RHQ_INSTALLED_PACKAGE where ID=:1

IIRC Installed packages is from resource's that reported discovered content (I don't think, though I could be wrong, that this table is touched by create-child actions). We have disabled, by default, all content discovery (has been like this for several releases now) because we've found rarely people care to see this information (like the RPMs installed on a linux box or the jars installed in the JBossAS /lib directory), the reporting and server-storage was inefficient and this was mainly implemented as part of a set of content features that we ended up not implementing.

Perhaps what you should try is first uninventorying the servers under the platforms you want to get rid of, then, once that is done, see if you can remove the platform. Perhaps spreading it out like that would help.

Our recommendation for people (as our defaults now show) is to turn off content discovery (I think this was supported on platform resoruces as well as JBossAS 4 (and possibly 5) resources).

FWIW, the content/bundle subsystems are going to go under a heavy re-design/re-implementation - see https://docs.jboss.org/author/display/RHQ/Bundles+vs.+Content+-+On+Its+Head - hopefully, this will correct many of the shortcomings in the current content subsystem.
Actions
2. Re: ORA-02049; ORA-00060 - timeout deleting resources

mazz Feb 28, 2013 1:59 PM (in response to mazz)

John Mazzitelli wrote:
(I don't think, though I could be wrong, that this table is touched by create-child actions).
I was wrong - quickly looking at the code and it appears InstalledPackages are created when manually creating child resources (the create-child stuff).
Actions
3. Re: ORA-02049; ORA-00060 - timeout deleting resources

genman Feb 28, 2013 4:03 PM (in response to mazz)

I've been removing the servers one by one using the CLI and it is very slow. Like 1-2 minutes per server, which is sort of ridiculous. These are resources that are about 6 months old, so I suspect there is some synchronous purging process that's holding things up. I have a theory also that some disabled plugin resources in Inventory were causing trouble.

I really don't know what's going on behind the scenes. It would work better IMHO if resources were simply marked as deleted then their data deleted at some other time.

Anyway, what ended up happening during about 30 minutes of doing uninventory, about 30% of the agents were being marked as unavailable. But they were still reporting updating metrics and functional. The agents were getting this:
2013-02-28 10:56:04,065 INFO [ClientCommandSenderTask Timer Thread #132] (JBossRemotingRemoteCommunicator)- {JBossRemotingRemoteCommunicator.changing-endpoin
t}Communicator is changing endpoint from [InvokerLocator [servlet://vp25q03ad-hadoop098.iad.apple.com:7080/jboss-remoting-servlet-invoker/ServerInvokerServlet
]] to [InvokerLocator [servlet://vp25q03ad-hadoop098.iad.apple.com:7080/jboss-remoting-servlet-invoker/ServerInvokerServlet]]
2013-02-28 10:56:04,068 WARN [ClientCommandSenderTask Timer Thread #132] (AgentMain)- {AgentMain.failover-failed}Failed to failover to another server. Cause: org.jboss.remoting.transport.http.WebServerError: Error received when calling on web server. Error returned was 400
2013-02-28 10:56:04,070 WARN [ClientCommandSenderTask Timer Thread #132] (FailoverFailureCallback)- {AgentMain.too-many-failover-attempts}Too many failover attempts have been made [1]. Exception that triggered the failover: [org.jboss.remoting.transport.http.WebServerError: Error received when calling on web server. Error returned was 400]
2013-02-28 10:56:04,071 ERROR [ClientCommandSenderTask Thread #3] (ClientCommandSenderTask)- {ClientCommandSenderTask.send-failed}Failed to send command [Command: type=[remotepojo]; cmd-in-response=[false]; config=[{rhq.agent-name=vp25q03ad-oracle026.iad.apple.com, rhq.externalizable-strategy=AGENT, rhq.security-token=LAFj2sR9h+uwSllLvdxY/wWOw8vPEvOWIUmpUH5F2wT4HQqLBCu38R//yBzdkcfFCTo=, rhq.guaranteed-delivery=true, rhq.send-throttle=true}]; params=[{invocation=NameBasedInvocation[mergeMeasurementReport], targetInterfaceName=org.rhq.core.clientapi.server.measurement.MeasurementServerService}]]. Cause: org.jboss.remoting.transport.http.WebServerError:Error received when calling on web server. Error returned was 400. Cause: org.jboss.remoting.transport.http.WebServerError: Error received when calling on web server. Error returned was 400
The only fix that seemed to work was to restart them.

I suspect something to do with availability not getting recorded in the database got them caught in this state. I don't know the SQL going on behind the scenes but I suspect some deletes are holding things up.
Actions
4. Re: ORA-02049; ORA-00060 - timeout deleting resources

mazz Feb 28, 2013 4:15 PM (in response to genman)

> It would work better IMHO if resources were simply marked as deleted then their data deleted at some other time.

That is how it works. See:

org.rhq.enterprise.server.resource.ResourceManagerBean.uninventoryResourceInNewTransaction(int)

and

org.rhq.enterprise.server.scheduler.jobs.AsyncResourceDeleteJob
Actions
5. Re: ORA-02049; ORA-00060 - timeout deleting resources

jayshaughnessy Feb 28, 2013 4:43 PM (in response to genman)

When you say delete do you mean uninventory (remove from inventory) or actual resource deletion (physical removal from the agent machine) ?

If I read it correctly you're saying you get a deadlock during uninventory of an obsolete platform. And a different deadlock when trying to delete a server resource. Or is it always related to that rhq_installed_package error?

Mazz is right, uninventory should be pretty fast as it does mark the resources for async cleanup later on. Although, there are some things that can slow it down and must happen up front, not in the async job, specifically, group withdrawal. Resources in a lot of groups can slow things down.

As for delete, that can take time because it goes out and actually performs the agent-side removal. Assuming the agent is up and connected.

Of course both uninventory and delete are going to uninventory child resources as well, So if the family tree is huge that would also play into it.

I don't see how disabled resources could cause any issue, that's really just an availability status. But I can't really explain some of what you're seeing. I see maybe one suspicious thing in the delete resource impl, nothing really in univentory.
Actions
6. Re: ORA-02049; ORA-00060 - timeout deleting resources

genman Mar 1, 2013 4:27 PM (in response to jayshaughnessy)

I'm actually just calling uninventory. The deadlock I may be unrelated as my DBA caught it at some point when I was probably not doing any deletes.

What I do see are a lot of sessions (29!) doing:

DELETE
FROM
    RHQ_RESOURCE_ERROR
WHERE
    RESOURCE_ID=:1
AND ERROR_TYPE=:2

and a few sessions (2) doing:

INSERT
INTO
    RHQ_RESOURCE_ERROR
    (
        TIME_OCCURRED,
        RESOURCE_ID,
        ERROR_TYPE,
        SUMMARY,
        DETAIL,
        ID
    )
    VALUES
    (
        :1 ,
        :2 ,
        :3 ,
        :4 ,
        :5 ,
        :6
    )
These sessions pretty much dominate the database connection activity (30/40 connections).

The locktype is TYPE=AE . I don't know what this means exactly. This is Oracle 11g.

One issue is we updated our JDK and deleted the old JRE on a lot of hosts without restarting JBoss. Then we're seeing a lot of these:
org.rhq.core.pluginapi.inventory.InvalidPluginConfigurationException: Failed to start component for resource Resource[id=65615, uuid=c83f5bd9-c4d7-4363-85be-1194741f34c1, type={JBossAS}JBossAS Server, key=/usr/local/jboss/server/default8480, name=AS hostname:1499 default8480, parent=hostname, version=AS 4.2.2.GA].
          at org.rhq.core.pc.inventory.InventoryManager.activateResource(InventoryManager.java:1773)
          at org.rhq.core.pc.inventory.InventoryManager.refreshResourceComponentState(InventoryManager.java:3036)
          at org.rhq.core.pc.inventory.InventoryManager.processSyncInfo(InventoryManager.java:2781)
          at org.rhq.core.pc.inventory.InventoryManager.processSyncInfo(InventoryManager.java:2787)
          at org.rhq.core.pc.inventory.InventoryManager.synchInventory(InventoryManager.java:1105)
          at org.rhq.core.pc.inventory.InventoryManager.handleReport(InventoryManager.java:1079)
          at org.rhq.core.pc.inventory.AutoDiscoveryExecutor.call(AutoDiscoveryExecutor.java:129)
          at org.rhq.core.pc.inventory.AutoDiscoveryExecutor.run(AutoDiscoveryExecutor.java:91)
          at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
          at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
          at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
          at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
          at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
          at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
          at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
          at java.lang.Thread.run(Thread.java:662)
Caused by: org.rhq.core.pluginapi.inventory.InvalidPluginConfigurationException: javaHomePath connection property ('/usr/java/jre1.6.0_31') does not exist.
          at org.rhq.plugins.jbossas.JBossASServerComponent.validateJavaHomePathProperty(JBossASServerComponent.java:626)
          at org.rhq.plugins.jbossas.JBossASServerComponent.validatePluginConfiguration(JBossASServerComponent.java:572)
          at org.rhq.plugins.jbossas.JBossASServerComponent.start(JBossASServerComponent.java:218)
          at sun.reflect.GeneratedMethodAccessor35.invoke(Unknown Source)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
          at java.lang.reflect.Method.invoke(Method.java:597)
          at org.rhq.core.pc.inventory.ResourceContainer$ComponentInvocationThread.call(ResourceContainer.java:634)
          at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
          at java.util.concurrent.FutureTask.run(FutureTask.java:138)
          ... 3 more
Probably the plugin needs to not validate the home directory.
Actions

7. Re: ORA-02049; ORA-00060 - timeout deleting resources

genman Mar 1, 2013 6:39 PM (in response to genman)

From the RHQ side I see doing uninventory:

2013-03-01 22:26:42,325 ERROR [org.rhq.enterprise.server.resource.ResourceManagerBean] Bulk named query delete error for 'ResourceConfigurationUpdate.deleteByResources2' for [11507]: org.hibernate.exception.SQLGrammarException: could not execute update query
2013-03-01 22:27:42,353 WARN [org.hibernate.util.JDBCExceptionReporter] SQL Error: 2049, SQLState: 42000
2013-03-01 22:27:42,353 WARN [org.hibernate.util.JDBCExceptionReporter] SQL Error: 2049, SQLState: 42000
2013-03-01 22:27:42,353 ERROR [org.hibernate.util.JDBCExceptionReporter] ORA-02049: timeout: distributed transaction waiting for lock
2013-03-01 22:27:42,353 ERROR [org.hibernate.util.JDBCExceptionReporter] ORA-02049: timeout: distributed transaction waiting for lock

Plus:

2013-03-01 22:31:56,618 ERROR [org.rhq.enterprise.communications.command.client.ClientCommandSenderTask] {ClientCommandSenderTask.send-failed}Failed to send command [Command: type=[remotepojo]; cmd-in-response=[false]; config=[{rhq.send-throttle=true}]; params=[{invocation=NameBasedInvocation[uninventoryResource], targetInterfaceName=org.rhq.core.clientapi.agent.discovery.DiscoveryAgentService}]]. Cause: java.util.concurrent.TimeoutException:null. Cause: java.util.concurrent.TimeoutException
2013-03-01 22:31:56,618 WARN [org.rhq.enterprise.server.resource.ResourceManagerBean] Unable to inform agent of inventory removal for resource [11508]
java.lang.reflect.UndeclaredThrowableException
          at $Proxy826.uninventoryResource(Unknown Source)
          at org.rhq.enterprise.server.resource.ResourceManagerBean.uninventoryResource(ResourceManagerBean.java:368)
          at org.rhq.enterprise.server.resource.ResourceManagerBean.uninventoryResources(ResourceManagerBean.java:259)
          at sun.reflect.GeneratedMethodAccessor1155.invoke(Unknown Source)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
          at java.lang.reflect.Method.invoke(Method.java:597)
          at org.jboss.aop.joinpoint.MethodInvocation.invokeNext(MethodInvocation.java:112)

Plus a ton of other weird errors.

DB Visualizer shows this during the delete:

Object	Terminal	Locker	Program	Wait	Lockmode	Object Type	Session ID	Serial	SID
"RHQ.RHQ_AGENT"	"rhq"	"RHQ"	"JDBC Thin Client"	"ACTIVE"	"ROW EXCLUSIVE"	"TABLE"	269	13861	269
"RHQ.RHQ_RESOURCE"	"rhq"	"RHQ"	"JDBC Thin Client"	"ACTIVE"	"ROW EXCLUSIVE"	"TABLE"	269	13861	269

More locks, the ones for the session are bolded:

SID	SERIAL#	PROCESS	USERNAME	TERMINAL	TYPE	LMODE	LREQUEST	OBJECT
463	1	"9459"	"SYS (orac)"	"UNKNOWN"	"TYPE=XR"	"NULL"	"NONE"	"ID1=4 ID2=0"
463	1	"9459"	"SYS (orac)"	"UNKNOWN"	"TYPE=RD"	"NULL"	"NONE"	"ID1=1 ID2=0"
463	1	"9459"	"SYS (orac)"	"UNKNOWN"	"CONTROL FILE"	"RS"	"NONE"	"ID1=0 ID2=0"
430	1	"9457"	"SYS (orac)"	"UNKNOWN"	"REDO THREAD "	"X"	"NONE"	"THREAD=1"
463	1	"9459"	"SYS (orac)"	"UNKNOWN"	"TYPE=RS"	"RS"	"NONE"	"ID1=25 ID2=1"
269	13861	"1234"	"RHQ"	"UNKNOWN"	"DML/DATA ENQ"	"RX"	"NONE"	"RHQ.RHQ_RESOURCE"
496	1	"9461"	"SYS (orac)"	"UNKNOWN"	"TEMP SEGMENT"	"RX"	"NONE"	"NEW BLOCK ALLOCATION"
232	1	"9445"	"SYS (orac)"	"UNKNOWN"	"TYPE=KD"	"X"	"NONE"	"ID1=0 ID2=0"
331	1	"9451"	"SYS (orac)"	"UNKNOWN"	"TYPE=PW"	"RX"	"NONE"	"ID1=1 ID2=0"
269	13861	"1234"	"RHQ"	"UNKNOWN"	"TRANSAC ENQ"	"X"	"NONE"	"RS+SLOT#720910 WRP#12227"
565	33893	"1234"	"RHQ"	"UNKNOWN"	"TYPE=AE"	"S"	"NONE"	"ID1=100 ID2=0"
600	15407	"1234"	"RHQ"	"UNKNOWN"	"TYPE=AE"	"S"	"NONE"	"ID1=100 ID2=0"
664	16741	"1234"	"RHQ"	"UNKNOWN"	"TYPE=AE"	"S"	"NONE"	"ID1=100 ID2=0"
761	961	"1234"	"RHQ"	"UNKNOWN"	"TYPE=AE"	"S"	"NONE"	"ID1=100 ID2=0"
170	25625	"1234"	"RHQ"	"UNKNOWN"	"TYPE=AE"	"S"	"NONE"	"ID1=100 ID2=0"
499	60707	"1234"	"RHQ"	"UNKNOWN"	"TYPE=AE"	"S"	"NONE"	"ID1=100 ID2=0"
299	1005	"1234"	"RHQ"	"UNKNOWN"	"TYPE=AE"	"S"	"NONE"	"ID1=100 ID2=0"
269	13861	"1234"	"RHQ"	"UNKNOWN"	"TYPE=AE"	"S"	"NONE"	"ID1=100 ID2=0"
334	15013	"1234"	"RHQ"	"UNKNOWN"	"TYPE=AE"	"S"	"NONE"	"ID1=100 ID2=0"
134	59283	"1234"	"RHQ"	"UNKNOWN"	"TYPE=AE"	"S"	"NONE"	"ID1=100 ID2=0"
763	25323	"1234"	"RHQ"	"UNKNOWN"	"TYPE=AE"	"S"	"NONE"	"ID1=100 ID2=0"
365	29893	"1234"	"RHQ"	"UNKNOWN"	"TYPE=AE"	"S"	"NONE"	"ID1=100 ID2=0"
137	9247	"1234"	"RHQ"	"UNKNOWN"	"TYPE=AE"	"S"	"NONE"	"ID1=100 ID2=0"
71	4011	"1234"	"RHQ"	"UNKNOWN"	"TYPE=AE"	"S"	"NONE"	"ID1=100 ID2=0"
239	22011	"1234"	"RHQ"	"UNKNOWN"	"TYPE=AE"	"S"	"NONE"	"ID1=100 ID2=0"
732	2063	"1234"	"RHQ"	"UNKNOWN"	"TYPE=AE"	"S"	"NONE"	"ID1=100 ID2=0"
696	21802	"1234"	"RHQ"	"UNKNOWN"	"TYPE=AE"	"S"	"NONE"	"ID1=100 ID2=0"
70	63953	"1234"	"RHQ"	"UNKNOWN"	"TYPE=AE"	"S"	"NONE"	"ID1=100 ID2=0"
102	401	"1234"	"RHQ"	"UNKNOWN"	"TYPE=AE"	"S"	"NONE"	"ID1=100 ID2=0"
628	54563	"1234"	"RHQ"	"UNKNOWN"	"TYPE=AE"	"S"	"NONE"	"ID1=100 ID2=0"
533	54543	"1234"	"RHQ"	"UNKNOWN"	"TYPE=AE"	"S"	"NONE"	"ID1=100 ID2=0"
335	17229	"1234"	"RHQ"	"UNKNOWN"	"TYPE=AE"	"S"	"NONE"	"ID1=100 ID2=0"
206	12315	"1234"	"RHQ"	"UNKNOWN"	"TYPE=AE"	"S"	"NONE"	"ID1=100 ID2=0"
401	36549	"1234"	"RHQ"	"UNKNOWN"	"TYPE=AE"	"S"	"NONE"	"ID1=100 ID2=0"
466	27501	"1234"	"RHQ"	"UNKNOWN"	"TYPE=AE"	"S"	"NONE"	"ID1=100 ID2=0"
562	1	"9465"	"SYS (orac)"	"UNKNOWN"	"TYPE=AE"	"S"	"NONE"	"ID1=100 ID2=0"
269	13861	"1234"	"RHQ"	"UNKNOWN"	"DML/DATA ENQ"	"RX"	"NONE"	"RHQ.RHQ_AGENT"
232	1	"9445"	"SYS (orac)"	"UNKNOWN"	"TYPE=KT"	"S"	"NONE"	"ID1=12537 ID2=0"

[http-0.0.0.0-7080-132] DELETE FROM RHQ_RESOURCE_GROUP_RES_EXP_MAP WHERE RESOURCE_ID IN ( ? )

[http-0.0.0.0-7080-132] DELETE FROM RHQ_RESOURCE_GROUP_RES_IMP_MAP WHERE RESOURCE_ID IN ( ? )

[http-0.0.0.0-7080-132] update RHQ_AGENT set LAST_AVAILABILITY_PING=?, BACKFILLED=0 where NAME=?

[http-0.0.0.0-7080-132] update RHQ_RESOURCE set INVENTORY_STATUS=?, AGENT_ID=null, PARENT_RESOURCE_ID=null, RESOURCE_KEY='deleted' where ID in (?)

...

Here's the stack trace (thread name is different but same idea)

"http-0.0.0.0-7080-33" daemon prio=10 tid=0x00007fbee8041000 nid=0x41eb waiting on condition [0x00007fbe9f2ec000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0x00000007cd75a8f0> (a java.util.concurrent.FutureTask$Sync)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:196)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1011)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1303)
        at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:227)
        at java.util.concurrent.FutureTask.get(FutureTask.java:91)
        at org.rhq.enterprise.communications.command.client.ClientCommandSenderTask.run(ClientCommandSenderTask.java:143)
        at org.rhq.enterprise.communications.command.client.ClientCommandSender.sendSynch(ClientCommandSender.java:647)
        at org.rhq.enterprise.communications.command.client.ClientRemotePojoFactory$RemotePojoProxyHandler.invoke(ClientRemotePojoFactory.java:407)
        at $Proxy826.uninventoryResource(Unknown Source)
        at org.rhq.enterprise.server.resource.ResourceManagerBean.uninventoryResource(ResourceManagerBean.java:368)
        at org.rhq.enterprise.server.resource.ResourceManagerBean.uninventoryResources(ResourceManagerBean.java:259)

I think the jist of it is:

1) RHQ should not wait (forever) to try to uninventory from platforms that are down

2) RHQ should not hold database locks while calling a remote resource, especially one that is down

8. Re: ORA-02049; ORA-00060 - timeout deleting resources

genman Mar 2, 2013 12:45 AM (in response to genman)

It's not *forever* but to uninventory a machine that you can't connect to (not the 'connection refused' immediately case) should not take as long as it does. For example, to remove a CPU(!) from a machine that's down, it takes 60+ seconds (maybe 3 minutes?) to try to connect to that agent that hasn't connected in months? Why not bypass all that if the agent isn't actually UP? To me, I would have done the delete, send a message to JMS, then in an MDB send the request to uninventory. That way you don't block the database. Let an MDB do the call.
Actions
9. Re: ORA-02049; ORA-00060 - timeout deleting resources

pilhuhn Jun 29, 2013 1:21 PM (in response to genman)

All those calls to enable/disable/delete/... basically go

        try {
            AgentClient agentClient = agentManager.getAgentClient(agent);
            ResourceFactoryAgentService resourceFactoryAgentService = agentClient.getResourceFactoryAgentService();
            resourceFactoryAgentService.deleteResource(request);

            return persistedHistory;
        } catch (CannotConnectException e) {

so we try to get the AgentClient, create a remote-POJO and then use the remoted agent service to do the work.
And this is (unless told differently) waiting up to org.rhq.enterprise.communications.command.client.ClientRemotePojoFactory#m_timeoutMillis

I did not yet find out what this value really is, but just trying to open the socket may take some time, that gets worse for group removals where the resources are on different agents.

Perhaps we can shortcut the logic by checking if the agent is currently marked as available or not and don't try to send when it is no up.

Currently when I have shut down the agent "snert" (and the server knows about it ) I see this in the log.

19:18:16,307 INFO [org.rhq.enterprise.server.core.AgentManagerBean] (http-/0.0.0.0:7080-1) Agent with name [snert] just went down
19:18:40,443 ERROR [org.rhq.enterprise.communications.command.client.ClientCommandSenderTask] (http-/0.0.0.0:7080-5) {ClientCommandSenderTask.send-failed}Failed to send command [Command: type=[remotepojo]; cmd-in-response=[false]; config=[{rhq.send-throttle=true}]; params=[{invocation=NameBasedInvocation[uninventoryResource], targetInterfaceName=org.rhq.core.clientapi.agent.discovery.DiscoveryAgentService}]]. Cause: org.jboss.remoting.CannotConnectException:Can not get connection to server. Problem establishing socket connection for InvokerLocator [socket://172.31.7.7:16163/?backlog=200&clientMaxPoolSize=304&enableTcpNoDelay=true&maxPoolSize=303&numAcceptThreads=1&rhq.communications.connector.rhqtype=agent&socketTimeout=60000] -> java.net.ConnectException:Connection refused. Cause: org.jboss.remoting.CannotConnectException: Can not get connection to server. Problem establishing socket connection for InvokerLocator [socket://172.31.7.7:16163/?backlog=200&clientMaxPoolSize=304&enableTcpNoDelay=true&maxPoolSize=303&numAcceptThreads=1&rhq.communications.connector.rhqtype=agent&socketTimeout=60000]
19:18:40,444 WARN [org.rhq.enterprise.server.resource.ResourceManagerBean] (http-/0.0.0.0:7080-5) Unable to inform agent of inventory removal for resource [10019]: org.jboss.remoting.CannotConnectException: Can not get connection to server. Problem establishing socket connection for InvokerLocator [socket://172.31.7.7:16163/?backlog=200&clientMaxPoolSize=304&enableTcpNoDelay=true&maxPoolSize=303&numAcceptThreads=1&rhq.communications.connector.rhqtype=agent&socketTimeout=60000]
          at org.jboss.remoting.transport.socket.MicroSocketClientInvoker.transport(MicroSocketClientInvoker.java:838) [jboss-remoting-2.5.4.SP4.jar:]
          at org.jboss.remoting.MicroRemoteClientInvoker.invoke(MicroRemoteClientInvoker.java:169) [jboss-remoting-2.5.4.SP4.jar:]
          at org.jboss.remoting.Client.invoke(Client.java:2084) [jboss-remoting-2.5.4.SP4.jar:]
          at org.jboss.remoting.Client.invoke(Client.java:879) [jboss-remoting-2.5.4.SP4.jar:]
          at org.rhq.enterprise.communications.command.client.JBossRemotingRemoteCommunicator.rawSend(JBossRemotingRemoteCommunicator.java:514) [rhq-enterprise-comm-4.9.0-SNAPSHOT.jar:4.9.0-SNAPSHOT]
          at org.rhq.enterprise.communications.command.client.JBossRemotingRemoteCommunicator.sendWithoutCallbacks(JBossRemotingRemoteCommunicator.java:456) [rhq-enterprise-comm-4.9.0-SNAPSHOT.jar:4.9.0-SNAPSHOT]
          at org.rhq.enterprise.communications.command.client.JBossRemotingRemoteCommunicator.sendWithoutInitializeCallback(JBossRemotingRemoteCommunicator.java:475) [rhq-enterprise-comm-4.9.0-SNAPSHOT.jar:4.9.0-SNAPSHOT]
          at org.rhq.enterprise.communications.command.client.JBossRemotingRemoteCommunicator.send(JBossRemotingRemoteCommunicator.java:496) [rhq-enterprise-comm-4.9.0-SNAPSHOT.jar:4.9.0-SNAPSHOT]
          at org.rhq.enterprise.communications.command.client.AbstractCommandClient.invoke(AbstractCommandClient.java:143) [rhq-enterprise-comm-4.9.0-SNAPSHOT.jar:4.9.0-SNAPSHOT]
          at org.rhq.enterprise.communications.command.client.ClientCommandSender.send(ClientCommandSender.java:1091) [rhq-enterprise-comm-4.9.0-SNAPSHOT.jar:4.9.0-SNAPSHOT]
          at org.rhq.enterprise.communications.command.client.ClientCommandSenderTask.send(ClientCommandSenderTask.java:229) [rhq-enterprise-comm-4.9.0-SNAPSHOT.jar:4.9.0-SNAPSHOT]
          at org.rhq.enterprise.communications.command.client.ClientCommandSenderTask.call(ClientCommandSenderTask.java:107) [rhq-enterprise-comm-4.9.0-SNAPSHOT.jar:4.9.0-SNAPSHOT]
          at org.rhq.enterprise.communications.command.client.ClientCommandSenderTask.call(ClientCommandSenderTask.java:55) [rhq-enterprise-comm-4.9.0-SNAPSHOT.jar:4.9.0-SNAPSHOT]
          at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) [rt.jar:1.7.0_25]
          at java.util.concurrent.FutureTask.run(FutureTask.java:166) [rt.jar:1.7.0_25]
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [rt.jar:1.7.0_25]
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [rt.jar:1.7.0_25]
          at java.lang.Thread.run(Thread.java:724) [rt.jar:1.7.0_25]
Caused by: java.net.ConnectException: Connection refused
          at java.net.PlainSocketImpl.socketConnect(Native Method) [rt.jar:1.7.0_25]
          at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339) [rt.jar:1.7.0_25]
          at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200) [rt.jar:1.7.0_25]
          at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182) [rt.jar:1.7.0_25]
          at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) [rt.jar:1.7.0_25]
          at java.net.Socket.connect(Socket.java:579) [rt.jar:1.7.0_25]
          at org.jboss.remoting.transport.socket.SocketClientInvoker$1.run(SocketClientInvoker.java:303) [jboss-remoting-2.5.4.SP4.jar:]
          at java.security.AccessController.doPrivileged(Native Method) [rt.jar:1.7.0_25]
          at org.jboss.remoting.transport.socket.SocketClientInvoker.connect(SocketClientInvoker.java:299) [jboss-remoting-2.5.4.SP4.jar:]
          at org.jboss.remoting.transport.socket.SocketClientInvoker.createSocket(SocketClientInvoker.java:210) [jboss-remoting-2.5.4.SP4.jar:]
          at org.jboss.remoting.transport.socket.MicroSocketClientInvoker.getConnection(MicroSocketClientInvoker.java:1172) [jboss-remoting-2.5.4.SP4.jar:]
          at org.jboss.remoting.transport.socket.MicroSocketClientInvoker.transport(MicroSocketClientInvoker.java:816) [jboss-remoting-2.5.4.SP4.jar:]
          ... 17 more
Actions
10. Re: ORA-02049; ORA-00060 - timeout deleting resources

pilhuhn Jun 29, 2013 1:22 PM (in response to pilhuhn)

See also https://bugzilla.redhat.com/show_bug.cgi?id=918207
Actions
11. Re: ORA-02049; ORA-00060 - timeout deleting resources

genman Jul 1, 2013 3:01 PM (in response to pilhuhn)

A connection refused error comes back quickly, it's an issue when the router/host is down. Maybe manually update the IP address to resolve to a host that is firewalled to reproduce?

I would also argue things would be better if the database isn't locked while inventory removal is going on. It's dangerous to start a transaction while doing blocking I/O like this.

Post 4.5 (4.7?) the issue is mitigated somewhat since we can permanently purge agents from the database, which is really the issue here.
Actions
12. Re: ORA-02049; ORA-00060 - timeout deleting resources

genman Sep 16, 2013 4:17 PM (in response to genman)

I continue to see this problem when the agent is on a IP address that is inaccessible and it causes the database to time out, causing availability reports to disappear and the other hosts to appear down as well. (This is 4.5. Later versions have a fix for availability.)

Why not do a ping or something before bothering to call the agent, or call the agent in a new thread? I added a suggested code change in the BZ issue above.
Actions

Go to original post