6 Replies Latest reply on Dec 12, 2012 9:50 AM by dfradkov

RHQ Server 4.4.0 fails to discover that agent is up after receiving component down alert

dfradkov Dec 5, 2012 3:27 PM

Hello,

We have an issue where all monitored agents report that components are down ("Host down" alert). We were able to trace the "Host down" alerts to legitimate network outages. However, once an outage is resolved, RHQ never receives the component-up alerts. We checked the agents and all of them are up and running. There are no obvious exceptions or messages in the agent logs that would explain why the alerts are not sent.

We have to manually restart every single agent on every single monitored server. Once we do that, we receive the "Host up" alert.

Any ideas why the "Host up" alert is never sent by the agent? Or is it a server issue?

1. Re: RHQ Server 4.4.0 fails to discover that agent is up after receiving component down alert

tsegismont Dec 10, 2012 5:27 AM (in response to dfradkov)

Hi,

How do your resources appear in the RHQ server after the outage is resolved? UP? DOWN? Just to figure out whether the problem comes from the availability report or from the alert subsystem.
Is the problem occurring on all resource types or only some?
How are your availability check intervals configured?

Thanks
2. Re: RHQ Server 4.4.0 fails to discover that agent is up after receiving component down alert

dfradkov Dec 10, 2012 10:04 AM (in response to tsegismont)

Hi,

>How do your resources appear in the RHQ server after the outage is resolved? UP? DOWN? Just to figure out whether the problem comes from the availability report or from the alert subsystem.
Resources appear down.
>Is the problem occurring on all resource types or only some?
We have modified DNS entries on the RHQ server, and it seems that this problem now affects only some servers. It used to affect all servers. When this happens we have to restart the agent; the resource then appears as up on the RHQ Dashboard.
>How are your availability check intervals configured?
Metric collection times vary from one minute to twenty minutes.

Thanks.
3. Re: RHQ Server 4.4.0 fails to discover that agent is up after receiving component down alert

tsegismont Dec 10, 2012 12:34 PM (in response to dfradkov)

>Resources appear down
OK, so the alerting subsystem may not be involved.

>We have modified DNS entries on the RHQ server and it seems that this problem now affects only some servers.
Sounds weird. Any particular error message in server/agent? Can you tell me precisely which resource types continue to show status "down"?
4. Re: RHQ Server 4.4.0 fails to discover that agent is up after receiving component down alert

dfradkov Dec 10, 2012 4:07 PM (in response to tsegismont)

Well, there are a couple of exceptions in the server log.

Partial stack traces for a couple of errors thrown in the last two days:

2012-12-09 00:33:07,470 WARN [org.hibernate.util.JDBCExceptionReporter] SQL Error: 12899, SQLState: 72000
2012-12-09 00:33:07,473 ERROR [org.hibernate.util.JDBCExceptionReporter] ORA-12899: value too large for column "RHQ"."RHQ_PACKAGE_VERSION"."LICENSE_NAME" (actual: 863, maximum: 255)

2012-12-09 00:33:07,473 WARN [org.hibernate.util.JDBCExceptionReporter] SQL Error: 12899, SQLState: 72000
2012-12-09 00:33:07,473 ERROR [org.hibernate.util.JDBCExceptionReporter] ORA-12899: value too large for column "RHQ"."RHQ_PACKAGE_VERSION"."LICENSE_NAME" (actual: 863, maximum: 255)

2012-12-09 00:33:07,473 ERROR [org.hibernate.event.def.AbstractFlushingEventListener] Could not synchronize database state with session
org.hibernate.exception.GenericJDBCException: Could not execute JDBC batch update
        at org.hibernate.exception.SQLStateConverter.handledNonSpecificException(SQLStateConverter.java:103)
        at org.hibernate.exception.SQLStateConverter.convert(SQLStateConverter.java:91)
        at org.hibernate.exception.JDBCExceptionHelper.convert(JDBCExceptionHelper.java:43)
        at org.hibernate.jdbc.AbstractBatcher.executeBatch(AbstractBatcher.java:254)
        at org.hibernate.engine.ActionQueue.executeActions(ActionQueue.java:237)
        at org.hibernate.engine.ActionQueue.executeActions(ActionQueue.java:141)
        at org.hibernate.event.def.AbstractFlushingEventListener.performExecutions(AbstractFlushingEventListener.java:298)
        at org.hibernate.event.def.DefaultFlushEventListener.onFlush(DefaultFlushEventListener.java:27)
        at org.hibernate.impl.SessionImpl.flush(SessionImpl.java:1000)
        at org.hibernate.impl.SessionImpl.managedFlush(SessionImpl.java:338)
        at org.hibernate.ejb.AbstractEntityManagerImpl$1.beforeCompletion(AbstractEntityManagerImpl.java:515)
<--------------------------------------------------------snippet----------------------------------------------------------------------------------------------------->
Caused by: java.sql.BatchUpdateException: ORA-12899: value too large for column "RHQ"."RHQ_PACKAGE_VERSION"."LICENSE_NAME" (actual: 863, maximum: 255)

        at oracle.jdbc.driver.OraclePreparedStatement.executeBatch(OraclePreparedStatement.java:10345)
        at oracle.jdbc.driver.OracleStatementWrapper.executeBatch(OracleStatementWrapper.java:230)
        at org.jboss.resource.adapter.jdbc.CachedPreparedStatement.executeBatch(CachedPreparedStatement.java:476)
        at org.jboss.resource.adapter.jdbc.WrappedStatement.executeBatch(WrappedStatement.java:774)
        at org.hibernate.jdbc.BatchingBatcher.doExecuteBatch(BatchingBatcher.java:48)
        at org.hibernate.jdbc.AbstractBatcher.executeBatch(AbstractBatcher.java:247)
        ... 167 more


2012-12-09 22:11:20,967 WARN [org.jboss.resource.connectionmanager.JBossManagedConnectionPool] Throwable while attempting to get a new connection: null
org.jboss.resource.JBossResourceException: Could not create connection; - nested throwable: (java.sql.SQLRecoverableException: IO Error: Socket read timed out)
        at org.jboss.resource.adapter.jdbc.local.LocalManagedConnectionFactory.createManagedConnection(LocalManagedConnectionFactory.java:190)
        at org.jboss.resource.connectionmanager.InternalManagedConnectionPool.createConnectionEventListener(InternalManagedConnectionPool.java:619)
        at org.jboss.resource.connectionmanager.InternalManagedConnectionPool.getConnection(InternalManagedConnectionPool.java:264)
        at org.jboss.resource.connectionmanager.JBossManagedConnectionPool$BasePool.getConnection(JBossManagedConnectionPool.java:575)
        at org.jboss.resource.connectionmanager.BaseConnectionManager2.getManagedConnection(BaseConnectionManager2.java:347)
        at org.jboss.resource.connectionmanager.BaseConnectionManager2.getManagedConnection(BaseConnectionManager2.java:332)
        at org.jboss.resource.connectionmanager.BaseConnectionManager2.allocateConnection(BaseConnectionManager2.java:402)
        at org.jboss.resource.connectionmanager.BaseConnectionManager2$ConnectionManagerProxy.allocateConnection(BaseConnectionManager2.java:849)
        at org.jboss.resource.adapter.jdbc.WrapperDataSource.getConnection(WrapperDataSource.java:89)
        at org.quartz.utils.JNDIConnectionProvider.getConnection(JNDIConnectionProvider.java:160)
        at org.quartz.utils.DBConnectionManager.getConnection(DBConnectionManager.java:112)
        at org.quartz.impl.jdbcjobstore.JobStoreCMT.getNonManagedTXConnection(JobStoreCMT.java:164)
        at org.quartz.impl.jdbcjobstore.JobStoreSupport.doRecoverMisfires(JobStoreSupport.java:3108)
        at org.quartz.impl.jdbcjobstore.JobStoreSupport$MisfireHandler.manage(JobStoreSupport.java:3887)
        at org.quartz.impl.jdbcjobstore.JobStoreSupport$MisfireHandler.run(JobStoreSupport.java:3907)
Caused by: java.sql.SQLRecoverableException: IO Error: Socket read timed out
        at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:458)
        at oracle.jdbc.driver.PhysicalConnection.<init>(PhysicalConnection.java:546)
        at oracle.jdbc.driver.T4CConnection.<init>(T4CConnection.java:236)
        at oracle.jdbc.driver.T4CDriverExtension.getConnection(T4CDriverExtension.java:32)
        at oracle.jdbc.driver.OracleDriver.connect(OracleDriver.java:521)
        at org.jboss.resource.adapter.jdbc.local.LocalManagedConnectionFactory.createManagedConnection(LocalManagedConnectionFactory.java:172)
        ... 14 more
Caused by: oracle.net.ns.NetException: Socket read timed out
        at oracle.net.ns.Packet.receive(Packet.java:339)
        at oracle.net.ns.NSProtocol.connect(NSProtocol.java:296)
        at oracle.jdbc.driver.T4CConnection.connect(T4CConnection.java:1102)
        at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:320)
        ... 19 more

>Can you tell me precisely which resource types continue to show status "down"?
Servers and server agents.
5. Re: RHQ Server 4.4.0 fails to discover that agent is up after receiving component down alert

tsegismont Dec 11, 2012 6:10 AM (in response to dfradkov)

These errors should not affect the availability subsystem.

Please make sure all your monitored servers have correct DNS forward and reverse mapping.
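
One way to check this could be scripted roughly as follows. This is only a minimal Python sketch, not something shipped with RHQ: the function name `check_dns` is made up for illustration, and it assumes the name you pass in is the same name the other side uses to reach that machine. It resolves a host name to an IP address, resolves that address back to a name, and reports whether the two agree.

```python
import socket


def check_dns(hostname, resolve=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Return (ok, detail) for a forward + reverse DNS round trip on hostname.

    resolve and reverse are injectable so the logic can be tried without a network.
    """
    try:
        ip = resolve(hostname)  # forward lookup: name -> IPv4 address
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return False, f"forward lookup failed: {exc}"
    try:
        primary, aliases, _ = reverse(ip)  # reverse lookup: address -> names
    except OSError as exc:  # socket.herror is a subclass of OSError
        return False, f"{ip}: reverse lookup failed: {exc}"
    # Compare case-insensitively and ignore a trailing dot on fully qualified names.
    names = {n.lower().rstrip(".") for n in [primary, *aliases]}
    if hostname.lower().rstrip(".") in names:
        return True, f"{ip} maps back to {hostname}"
    return False, f"{ip} maps back to {primary}, not {hostname}"
```

For example, `check_dns("agent1.example.com")` (a placeholder name) returns a pair such as `(False, "forward lookup failed: ...")` when the name does not resolve. It is probably worth running it in both directions: on the RHQ server host for each agent's name, and on each agent host for the server's name.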
6. Re: RHQ Server 4.4.0 fails to discover that agent is up after receiving component down alert

dfradkov Dec 12, 2012 9:50 AM (in response to tsegismont)

We are planning to update all DNS entries. Hopefully that will help, as I am running out of ideas.