4.13: After RHQ agent operation restart, auth-token is lost
pathduck Jan 6, 2015 6:37 AMHello,
I have noticed with 4.13, that when I do an operational restart of RHQ Agents through the RHQ UI, the property rhq.agent.security-token is lost from prefs.xml and the agent cannot authenticate to the server.
The following is logged:
2015-01-06 10:55:08,793 INFO [RHQ Agent Restart Thread] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.shutting-down}Agent is being shut down... 2015-01-06 10:55:08,796 INFO [RHQ Agent Restart Thread] (org.rhq.core.pc.PluginContainer)- Plugin container is being shutdown... 2015-01-06 10:55:08,796 INFO [RHQ Primary Server Switchover Thread] (org.rhq.enterprise.agent.AgentMain)- {PrimaryServerSwitchoverThread.stopped}The primary server switchover thread has stopped. 2015-01-06 10:55:08,796 INFO [RHQ Agent Restart Thread] (org.rhq.core.pc.PluginContainer)- Plugin container shutdown will wait up to 300 seconds for shut down background threads to terminate. 2015-01-06 10:55:08,809 INFO [RHQ Agent Restart Thread] (rhq.core.pc.content.ContentManager)- Shutting down Content Manager... 2015-01-06 10:55:08,819 INFO [MeasurementManager.collector-1] (rhq.core.pc.measurement.MeasurementManager)- Shutting down measurement collection... 2015-01-06 10:55:09,538 INFO [RHQ Agent Restart Thread] (org.rhq.core.pc.PluginContainer)- All shut down background threads have terminated (less than 1 second elapsed). 2015-01-06 10:55:09,539 INFO [RHQ Agent Restart Thread] (org.rhq.core.pc.PluginContainer)- Plugin container is now shutdown. 2015-01-06 10:55:10,446 ERROR [ClientCommandSenderTask Thread #1] (enterprise.communications.command.client.ClientCommandSenderTask)- {ClientCommandSenderTask.send-failed}Failed to send command [Command: type=[remotepojo]; cmd-in-response=[false]; config=[{rhq.agent-name=b27apvl016.preprod.local, rhq.externalizable-strategy=AGENT, rhq.security-token=Nr+JcEVqoTQ2v5QUuNi14pNT3DvQv+EprvDBZpJ06+RWlNwPOcvM6lw9z0Kt7Izf4BM=, rhq.guaranteed-delivery=true, rhq.send-throttle=true}]; params=[{invocation=NameBasedInvocation[operationSucceeded], targetInterfaceName=org.rhq.core.clientapi.server.operation.OperationServerService}]]. Cause: java.lang.InterruptedException:null. Cause: java.lang.InterruptedException 2015-01-06 10:55:10,448 WARN [ClientCommandSenderTask Thread #1] (enterprise.communications.command.client.ClientCommandSenderTask)- {ClientCommandSenderTask.queuing-failed-command}The command that failed has its guaranteed-delivery flag set so it is being queued again 2015-01-06 10:55:44,236 INFO [RHQ Agent Restart Thread] (org.rhq.enterprise.communications.ServiceContainer)- {ServiceContainer.shutting-down}Service container shutting down... 2015-01-06 10:55:44,238 INFO [RHQ Agent Restart Thread] (org.rhq.enterprise.communications.ServiceContainer)- {ServiceContainer.shutdown}Service container shut down - no longer accepting incoming commands 2015-01-06 10:55:44,239 INFO [RHQ Agent Restart Thread] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.shut-down}Agent has been shut down 2015-01-06 10:55:44,244 INFO [RHQ Agent Restart Thread] (org.rhq.enterprise.communications.ServiceContainer)- {ServiceContainer.global-concurrency-limit-disabled}Global concurrency limit has been disabled - there is no limit to the number of incoming commands allowed 2015-01-06 10:55:44,262 INFO [RHQ Agent Restart Thread] (org.rhq.enterprise.communications.ServiceContainer)- {ServiceContainer.started}Service container started - ready to accept incoming commands 2015-01-06 10:55:44,262 INFO [RHQ Agent Restart Thread] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.no-auto-detect}Server auto-detection is not enabled - starting the poller immediately 2015-01-06 10:56:44,263 WARN [RHQ Agent Restart Thread] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.startup-registration-failed-retry}A server appears to be up, but the agent is not yet registered. Will wait again. wait counter=[4] 2015-01-06 10:56:44,264 WARN [RHQ Agent Restart Thread] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.startup-registration-failed-retry}A server appears to be up, but the agent is not yet registered. Will wait again. wait counter=[3] 2015-01-06 10:56:44,264 WARN [RHQ Agent Restart Thread] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.startup-registration-failed-retry}A server appears to be up, but the agent is not yet registered. Will wait again. wait counter=[2] 2015-01-06 10:56:44,264 WARN [RHQ Agent Restart Thread] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.startup-registration-failed-retry}A server appears to be up, but the agent is not yet registered. Will wait again. wait counter=[1] 2015-01-06 10:56:44,264 WARN [RHQ Agent Restart Thread] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.startup-registration-failed-retry}A server appears to be up, but the agent is not yet registered. Will wait again. wait counter=[0] 2015-01-06 10:56:44,265 FATAL [RHQ Agent Restart Thread] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.startup-error}The agent encountered an error during startup and must abort org.rhq.core.clientapi.server.core.AgentRegistrationException: The agent cannot register with the server. Admin intervention needed! at org.rhq.enterprise.agent.AgentMain.start(AgentMain.java:744) at org.rhq.enterprise.agent.AgentManagement$2$1.run(AgentManagement.java:156) at org.rhq.enterprise.agent.AgentManagement$2$1.run(AgentManagement.java:148) at java.security.AccessController.doPrivileged(Native Method) at org.rhq.enterprise.agent.AgentManagement$2.run(AgentManagement.java:148) at java.lang.Thread.run(Thread.java:662) 2015-01-06 10:56:44,265 INFO [RHQ Agent Restart Thread] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.shutting-down}Agent is being shut down... 2015-01-06 10:56:44,265 INFO [RHQ Agent Restart Thread] (org.rhq.core.pc.PluginContainer)- Plugin container is already shut down. 2015-01-06 10:56:44,266 WARN [RHQ Agent Restart Thread] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.failed-to-shutdown-component}Agent failed to shutdown component [Plugin Container]. Cause: java.lang.NullPointerException:null 2015-01-06 10:56:54,267 ERROR [RHQ Agent Restart Thread] (enterprise.communications.command.client.ClientCommandSenderTask)- {ClientCommandSenderTask.send-failed}Failed to send command [Command: type=[remotepojo]; cmd-in-response=[false]; config=[{rhq.agent-name=b27apvl016.preprod.local, rhq.externalizable-strategy=AGENT, rhq.timeout=10000}]; params=[{invocation=NameBasedInvocation[agentIsShuttingDown], targetInterfaceName=org.rhq.core.clientapi.server.core.CoreServerService}]]. Cause: java.util.concurrent.TimeoutException:null. Cause: java.util.concurrent.TimeoutException 2015-01-06 10:56:54,268 WARN [RHQ Agent Restart Thread] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.server-shutdown-notification-failure}Agent failed to notify the server of the pending shutdown. Cause: java.lang.reflect.UndeclaredThrowableException:null -> java.util.concurrent.TimeoutException:null 2015-01-06 10:56:54,268 INFO [RHQ Server Polling Thread] (enterprise.communications.command.client.ServerPollingThread)- {ServerPollingThread.server-online}The server has come back online; client has been told to start sending commands again 2015-01-06 10:56:54,269 ERROR [ClientCommandSenderTask Thread #0] (enterprise.communications.command.client.ClientCommandSenderTask)- {ClientCommandSenderTask.send-failed}Failed to send command [Command: type=[remotepojo]; cmd-in-response=[false]; config=[{rhq.guaranteed-delivery=true, rhq.agent-name=b27apvl016.preprod.local, rhq.security-token=Nr+JcEVqoTQ2v5QUuNi14pNT3DvQv+EprvDBZpJ06+RWlNwPOcvM6lw9z0Kt7Izf4BM=, rhq.retry=1, rhq.externalizable-strategy=AGENT, rhq.send-throttle=true}]; params=[{invocation=NameBasedInvocation[operationSucceeded], targetInterfaceName=org.rhq.core.clientapi.server.operation.OperationServerService}]]. Cause: java.lang.InterruptedException:null. Cause: java.lang.InterruptedException 2015-01-06 10:56:54,270 WARN [ClientCommandSenderTask Thread #0] (enterprise.communications.command.client.ClientCommandSenderTask)- {ClientCommandSenderTask.queuing-failed-command}The command that failed has its guaranteed-delivery flag set so it is being queued again 2015-01-06 10:57:54,273 INFO [RHQ Agent Restart Thread] (org.rhq.enterprise.communications.ServiceContainer)- {ServiceContainer.shutting-down}Service container shutting down... 2015-01-06 10:57:54,276 INFO [RHQ Agent Restart Thread] (org.rhq.enterprise.communications.ServiceContainer)- {ServiceContainer.shutdown}Service container shut down - no longer accepting incoming commands 2015-01-06 10:57:54,276 INFO [RHQ Agent Restart Thread] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.shut-down}Agent has been shut down 2015-01-06 11:00:09,978 INFO [ClientCommandSenderTask Timer Thread #2] (enterprise.communications.command.client.JBossRemotingRemoteCommunicator)- {JBossRemotingRemoteCommunicator.changing-endpoint}Communicator is changing endpoint from [InvokerLocator [servlet://b27apvl004.preprod.local:7080/jboss-remoting-servlet-invoker/ServerInvokerServlet/?generalizeSocketException=true]] to [InvokerLocator [servlet://b27apvl004.preprod.local:7080/jboss-remoting-servlet-invoker/ServerInvokerServlet/?generalizeSocketException=true]] 2015-01-06 11:00:09,978 WARN [ClientCommandSenderTask Timer Thread #2] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.failover-failed}Failed to failover to another server. Cause: java.lang.NullPointerException 2015-01-06 11:00:09,980 WARN [ClientCommandSenderTask Timer Thread #2] (org.rhq.enterprise.agent.FailoverFailureCallback)- {AgentMain.too-many-failover-attempts}Too many failover attempts have been made [1]. Exception that triggered the failover: [org.jboss.remoting.CannotConnectException: Can not connect http client invoker after 1 attempt(s)] 2015-01-06 11:00:09,985 INFO [ClientCommandSenderTask Timer Thread #1] (enterprise.communications.command.client.JBossRemotingRemoteCommunicator)- {JBossRemotingRemoteCommunicator.changing-endpoint}Communicator is changing endpoint from [InvokerLocator [servlet://b27apvl004.preprod.local:7080/jboss-remoting-servlet-invoker/ServerInvokerServlet/?generalizeSocketException=true]] to [InvokerLocator [servlet://b27apvl004.preprod.local:7080/jboss-remoting-servlet-invoker/ServerInvokerServlet/?generalizeSocketException=true]] 2015-01-06 11:00:09,990 WARN [ClientCommandSenderTask Timer Thread #1] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.failover-failed}Failed to failover to another server. Cause: java.lang.NullPointerException 2015-01-06 12:20:39,028 INFO [main] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.identify-version}Version=[RHQ 4.13.0], Build Number=[07169c1], Build Date=[Dec 2, 2014 11:50 PM] 2015-01-06 12:20:39,084 INFO [main] (org.rhq.enterprise.communications.ServiceContainer)- {ServiceContainer.global-concurrency-limit-disabled}Global concurrency limit has been disabled - there is no limit to the number of incoming commands allowed 2015-01-06 12:20:39,228 INFO [main] (org.rhq.enterprise.communications.ServiceContainer)- {ServiceContainer.started}Service container started - ready to accept incoming commands 2015-01-06 12:20:39,228 INFO [main] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.no-auto-detect}Server auto-detection is not enabled - starting the poller immediately 2015-01-06 12:20:40,524 ERROR [RHQ Agent Registration Thread] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.agent-registration-rejected}The server has rejected the agent registration request. Cause: [org.rhq.core.clientapi.server.core.AgentRegistrationException:The agent [b27apvl016.preprod.local] is attempting to re-register without a security token. Please consult an administrator to obtain the agent's proper security token and restart the agent with the option "-Drhq.agent.security-token=<the valid security token>". An administrator can find the agent's security token by navigating to the GUI page "Administration (Topology) > Agents" and drilling down to this specific agent. You will see the long security token string there. For more information, read: https://docs.jboss.org/author/display/RHQ/Agent+Registration] 2015-01-06 12:20:40,525 ERROR [RHQ Agent Registration Thread] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.agent-registration-retry}Will retry the agent registration request soon... 2015-01-06 12:21:10,588 ERROR [RHQ Agent Registration Thread] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.agent-registration-rejected}The server has rejected the agent registration request. Cause: [org.rhq.core.clientapi.server.core.AgentRegistrationException:The agent [b27apvl016.preprod.local] is attempting to re-register without a security token. Please consult an administrator to obtain the agent's proper security token and restart the agent with the option "-Drhq.agent.security-token=<the valid security token>". An administrator can find the agent's security token by navigating to the GUI page "Administration (Topology) > Agents" and drilling down to this specific agent. You will see the long security token string there. For more information, read: https://docs.jboss.org/author/display/RHQ/Agent+Registration] 2015-01-06 12:21:10,589 ERROR [RHQ Agent Registration Thread] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.agent-registration-retry}Will retry the agent registration request soon... 2015-01-06 12:21:39,244 WARN [main] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.startup-registration-failed-retry}A server appears to be up, but the agent is not yet registered. Will wait again. wait counter=[4] 2015-01-06 12:21:39,245 WARN [main] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.startup-registration-failed-retry}A server appears to be up, but the agent is not yet registered. Will wait again. wait counter=[3] 2015-01-06 12:21:39,245 WARN [main] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.startup-registration-failed-retry}A server appears to be up, but the agent is not yet registered. Will wait again. wait counter=[2] 2015-01-06 12:21:39,245 WARN [main] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.startup-registration-failed-retry}A server appears to be up, but the agent is not yet registered. Will wait again. wait counter=[1] 2015-01-06 12:21:39,246 WARN [main] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.startup-registration-failed-retry}A server appears to be up, but the agent is not yet registered. Will wait again. wait counter=[0] 2015-01-06 12:21:39,247 FATAL [main] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.startup-error}The agent encountered an error during startup and must abort org.rhq.core.clientapi.server.core.AgentRegistrationException: The agent cannot register with the server. Admin intervention needed! at org.rhq.enterprise.agent.AgentMain.start(AgentMain.java:744) at org.rhq.enterprise.agent.AgentMain.main(AgentMain.java:461) 2015-01-06 12:21:39,250 INFO [main] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.shutting-down}Agent is being shut down...
I also noticed that after agent was stopped, a file 'prefs.tmp' existed in the same folder, $HOME/.java/.userPrefs/rhq-agent/default. This was removed after starting it again.
Not sure if this applies when doing the agent restart with the shell script.
It does not happen every time apparently, so hard to nail down, for instance I restarted all agents on 15 servers and 4 of them failed registration after restart and I had to manually register again.
Potentially this can mean a *lot* of work if you have a lot of platforms and a lot of them are suddenly without agents.
I want to check if this is a known bug or if I should make a BZ.
Stian