6 Replies Latest reply on Dec 12, 2012 9:50 AM by dfradkov

    RHQ Server 4.4.0 fails to discover that agent is up after receiving component down alert

    dfradkov

      Hello,

       

      We have an issue where all monitored agents report that components are down ("Host down" alert). We were able to trace the "Host down" alerts to legitimate network outages. However, once an outage is resolved, RHQ never receives the corresponding "component up" alerts. We checked the agents and all of them are up and running. There are no obvious exceptions or messages in the agent logs that would explain why the alerts are not sent.

       

      We have to manually restart every single agent on every single monitored server. Once we do that, we receive the "Host up" alert.

       

      Any ideas why the "Host up" alert is never sent by the agent? Or is it a server issue?

        • 1. Re: RHQ Server 4.4.0 fails to discover that agent is up after receiving component down alert
          tsegismont

          Hi,

           

          How do your resources appear in the RHQ server after the outage is resolved? UP? DOWN? I am just trying to figure out whether the problem comes from the availability report or from the alert subsystem.

          Is the problem occurring on all resource types or only some?

          How are your availability check intervals configured?

           

          Thanks

          • 2. Re: RHQ Server 4.4.0 fails to discover that agent is up after receiving component down alert
            dfradkov

            Hi,

             

            >How do your resources appear in the RHQ server after the outage is resolved? UP? DOWN? I am just trying to figure out whether the problem comes from the availability report or from the alert subsystem.

            Resources appear down.

            >Is the problem occurring on all resource types or only some?

            We have modified DNS entries on the RHQ server, and it seems that this problem now affects only some servers. It used to affect all servers. When it happens, we have to restart the agent; the resource then appears as up on the RHQ Dashboard.

            >How are your availability check intervals configured?

            Metric collection intervals vary from one minute to twenty minutes.

             

            Thanks.

            • 3. Re: RHQ Server 4.4.0 fails to discover that agent is up after receiving component down alert
              tsegismont

              >Resources appear down

              OK, so the alerting subsystem may not be involved.

               

              >We have modified DNS entries on the RHQ server and it seems that this problem now affects only some servers.

              Sounds weird. Are there any particular error messages in the server or agent logs? Can you tell me precisely which resource types continue to show status "down"?

              • 4. Re: RHQ Server 4.4.0 fails to discover that agent is up after receiving component down alert
                dfradkov

                Well, there are a couple of exceptions in the server log.

                 

                Partial stack traces for a couple of errors thrown in the last two days:

                 

                2012-12-09 00:33:07,470 WARN  [org.hibernate.util.JDBCExceptionReporter] SQL Error: 12899, SQLState: 72000

                2012-12-09 00:33:07,473 ERROR [org.hibernate.util.JDBCExceptionReporter] ORA-12899: value too large for column "RHQ"."RHQ_PACKAGE_VERSION"."LICENSE_NAME" (actual: 863, maximum: 255)

                 

                2012-12-09 00:33:07,473 WARN  [org.hibernate.util.JDBCExceptionReporter] SQL Error: 12899, SQLState: 72000

                2012-12-09 00:33:07,473 ERROR [org.hibernate.util.JDBCExceptionReporter] ORA-12899: value too large for column "RHQ"."RHQ_PACKAGE_VERSION"."LICENSE_NAME" (actual: 863, maximum: 255)

                 

                2012-12-09 00:33:07,473 ERROR [org.hibernate.event.def.AbstractFlushingEventListener] Could not synchronize database state with session

                org.hibernate.exception.GenericJDBCException: Could not execute JDBC batch update

                        at org.hibernate.exception.SQLStateConverter.handledNonSpecificException(SQLStateConverter.java:103)

                        at org.hibernate.exception.SQLStateConverter.convert(SQLStateConverter.java:91)

                        at org.hibernate.exception.JDBCExceptionHelper.convert(JDBCExceptionHelper.java:43)

                        at org.hibernate.jdbc.AbstractBatcher.executeBatch(AbstractBatcher.java:254)

                        at org.hibernate.engine.ActionQueue.executeActions(ActionQueue.java:237)

                        at org.hibernate.engine.ActionQueue.executeActions(ActionQueue.java:141)

                        at org.hibernate.event.def.AbstractFlushingEventListener.performExecutions(AbstractFlushingEventListener.java:298)

                        at org.hibernate.event.def.DefaultFlushEventListener.onFlush(DefaultFlushEventListener.java:27)

                        at org.hibernate.impl.SessionImpl.flush(SessionImpl.java:1000)

                        at org.hibernate.impl.SessionImpl.managedFlush(SessionImpl.java:338)

                        at org.hibernate.ejb.AbstractEntityManagerImpl$1.beforeCompletion(AbstractEntityManagerImpl.java:515)

                <--------------------------------------------------------snippet----------------------------------------------------------------------------------------------------->

                Caused by: java.sql.BatchUpdateException: ORA-12899: value too large for column "RHQ"."RHQ_PACKAGE_VERSION"."LICENSE_NAME" (actual: 863, maximum: 255)

                 

                        at oracle.jdbc.driver.OraclePreparedStatement.executeBatch(OraclePreparedStatement.java:10345)

                        at oracle.jdbc.driver.OracleStatementWrapper.executeBatch(OracleStatementWrapper.java:230)

                        at org.jboss.resource.adapter.jdbc.CachedPreparedStatement.executeBatch(CachedPreparedStatement.java:476)

                        at org.jboss.resource.adapter.jdbc.WrappedStatement.executeBatch(WrappedStatement.java:774)

                        at org.hibernate.jdbc.BatchingBatcher.doExecuteBatch(BatchingBatcher.java:48)

                        at org.hibernate.jdbc.AbstractBatcher.executeBatch(AbstractBatcher.java:247)

                        ... 167 more

                   

                 

                 

                2012-12-09 22:11:20,967 WARN  [org.jboss.resource.connectionmanager.JBossManagedConnectionPool] Throwable while attempting to get a new connection: null

                org.jboss.resource.JBossResourceException: Could not create connection; - nested throwable: (java.sql.SQLRecoverableException: IO Error: Socket read timed out)

                        at org.jboss.resource.adapter.jdbc.local.LocalManagedConnectionFactory.createManagedConnection(LocalManagedConnectionFactory.java:190)

                        at org.jboss.resource.connectionmanager.InternalManagedConnectionPool.createConnectionEventListener(InternalManagedConnectionPool.java:619)

                        at org.jboss.resource.connectionmanager.InternalManagedConnectionPool.getConnection(InternalManagedConnectionPool.java:264)

                        at org.jboss.resource.connectionmanager.JBossManagedConnectionPool$BasePool.getConnection(JBossManagedConnectionPool.java:575)

                        at org.jboss.resource.connectionmanager.BaseConnectionManager2.getManagedConnection(BaseConnectionManager2.java:347)

                        at org.jboss.resource.connectionmanager.BaseConnectionManager2.getManagedConnection(BaseConnectionManager2.java:332)

                        at org.jboss.resource.connectionmanager.BaseConnectionManager2.allocateConnection(BaseConnectionManager2.java:402)

                        at org.jboss.resource.connectionmanager.BaseConnectionManager2$ConnectionManagerProxy.allocateConnection(BaseConnectionManager2.java:849)

                        at org.jboss.resource.adapter.jdbc.WrapperDataSource.getConnection(WrapperDataSource.java:89)

                        at org.quartz.utils.JNDIConnectionProvider.getConnection(JNDIConnectionProvider.java:160)

                        at org.quartz.utils.DBConnectionManager.getConnection(DBConnectionManager.java:112)

                        at org.quartz.impl.jdbcjobstore.JobStoreCMT.getNonManagedTXConnection(JobStoreCMT.java:164)

                        at org.quartz.impl.jdbcjobstore.JobStoreSupport.doRecoverMisfires(JobStoreSupport.java:3108)

                        at org.quartz.impl.jdbcjobstore.JobStoreSupport$MisfireHandler.manage(JobStoreSupport.java:3887)

                        at org.quartz.impl.jdbcjobstore.JobStoreSupport$MisfireHandler.run(JobStoreSupport.java:3907)

                Caused by: java.sql.SQLRecoverableException: IO Error: Socket read timed out

                        at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:458)

                        at oracle.jdbc.driver.PhysicalConnection.<init>(PhysicalConnection.java:546)

                        at oracle.jdbc.driver.T4CConnection.<init>(T4CConnection.java:236)

                        at oracle.jdbc.driver.T4CDriverExtension.getConnection(T4CDriverExtension.java:32)

                        at oracle.jdbc.driver.OracleDriver.connect(OracleDriver.java:521)

                        at org.jboss.resource.adapter.jdbc.local.LocalManagedConnectionFactory.createManagedConnection(LocalManagedConnectionFactory.java:172)

                        ... 14 more

                Caused by: oracle.net.ns.NetException: Socket read timed out

                        at oracle.net.ns.Packet.receive(Packet.java:339)

                        at oracle.net.ns.NSProtocol.connect(NSProtocol.java:296)

                        at oracle.jdbc.driver.T4CConnection.connect(T4CConnection.java:1102)

                        at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:320)

                        ... 19 more

                >Can you tell me precisely which resource types continue to show status "down"?

                Servers and server agents.

                • 5. Re: RHQ Server 4.4.0 fails to discover that agent is up after receiving component down alert
                  tsegismont

                  These errors should not affect the availability subsystem.

                   

                  Please make sure all your monitored servers have correct forward and reverse DNS mappings.
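
                  A minimal sketch of such a check, assuming Python is available on the host where you run it. The function name check_dns and the host names passed on the command line are hypothetical, not part of RHQ. It resolves a name to an address, resolves that address back to a name, and reports whether the two agree.

```python
import socket
import sys


def check_dns(hostname, forward=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Return (ok, detail) for a forward lookup followed by a reverse lookup.

    `forward` and `reverse` default to the standard resolver calls and can be
    replaced with stand-ins for testing.
    """
    try:
        ip = forward(hostname)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return False, f"{hostname}: forward lookup failed: {exc}"

    try:
        ptr_name, aliases, _ = reverse(ip)
    except OSError as exc:  # socket.herror is a subclass of OSError
        return False, f"{hostname} -> {ip}: reverse lookup failed: {exc}"

    # Compare case-insensitively and ignore a trailing dot on FQDNs.
    wanted = hostname.lower().rstrip(".")
    found = {name.lower().rstrip(".") for name in [ptr_name, *aliases]}
    if wanted in found:
        return True, f"{hostname} -> {ip} -> {ptr_name}"
    return False, f"{hostname} -> {ip} -> {ptr_name} (name mismatch)"


if __name__ == "__main__":
    # Usage (host names are placeholders): python check_dns.py host1 host2 ...
    for host in sys.argv[1:]:
        ok, detail = check_dns(host)
        print("OK  " if ok else "FAIL", detail)
```

                  Run it on the RHQ server host against each monitored host name, and on a monitored host against the RHQ server name, since each side has to resolve the other. Use the fully qualified names that the agents and server are actually configured with, otherwise a short name will be reported as a mismatch.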

                  • 6. Re: RHQ Server 4.4.0 fails to discover that agent is up after receiving component down alert
                    dfradkov

                    We are planning to update all DNS entries. Hopefully that will help, as I am running out of ideas.