6 Replies Latest reply on Jan 9, 2015 5:35 AM by Stian Lund

    4.13: After RHQ agent operation restart, auth-token is lost

    Stian Lund Expert

      Hello,

      I have noticed with 4.13, that when I do an operational restart of RHQ Agents through the RHQ UI, the property rhq.agent.security-token is lost from prefs.xml and the agent cannot authenticate to the server.

       

      The following is logged:

       

      2015-01-06 10:55:08,793 INFO  [RHQ Agent Restart Thread] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.shutting-down}Agent is being shut down...
      2015-01-06 10:55:08,796 INFO  [RHQ Agent Restart Thread] (org.rhq.core.pc.PluginContainer)- Plugin container is being shutdown...
      2015-01-06 10:55:08,796 INFO  [RHQ Primary Server Switchover Thread] (org.rhq.enterprise.agent.AgentMain)- {PrimaryServerSwitchoverThread.stopped}The primary server switchover thread has stopped.
      2015-01-06 10:55:08,796 INFO  [RHQ Agent Restart Thread] (org.rhq.core.pc.PluginContainer)- Plugin container shutdown will wait up to 300 seconds for shut down background threads to terminate.
      2015-01-06 10:55:08,809 INFO  [RHQ Agent Restart Thread] (rhq.core.pc.content.ContentManager)- Shutting down Content Manager...
      2015-01-06 10:55:08,819 INFO  [MeasurementManager.collector-1] (rhq.core.pc.measurement.MeasurementManager)- Shutting down measurement collection...
      2015-01-06 10:55:09,538 INFO  [RHQ Agent Restart Thread] (org.rhq.core.pc.PluginContainer)- All shut down background threads have terminated (less than 1 second elapsed).
      2015-01-06 10:55:09,539 INFO  [RHQ Agent Restart Thread] (org.rhq.core.pc.PluginContainer)- Plugin container is now shutdown.
      2015-01-06 10:55:10,446 ERROR [ClientCommandSenderTask Thread #1] (enterprise.communications.command.client.ClientCommandSenderTask)- {ClientCommandSenderTask.send-failed}Failed to send command [Command: type=[remotepojo]; cmd-in-response=[false]; config=[{rhq.agent-name=b27apvl016.preprod.local, rhq.externalizable-strategy=AGENT, rhq.security-token=Nr+JcEVqoTQ2v5QUuNi14pNT3DvQv+EprvDBZpJ06+RWlNwPOcvM6lw9z0Kt7Izf4BM=, rhq.guaranteed-delivery=true, rhq.send-throttle=true}]; params=[{invocation=NameBasedInvocation[operationSucceeded], targetInterfaceName=org.rhq.core.clientapi.server.operation.OperationServerService}]]. Cause: java.lang.InterruptedException:null. Cause: java.lang.InterruptedException
      2015-01-06 10:55:10,448 WARN  [ClientCommandSenderTask Thread #1] (enterprise.communications.command.client.ClientCommandSenderTask)- {ClientCommandSenderTask.queuing-failed-command}The command that failed has its guaranteed-delivery flag set so it is being queued again
      2015-01-06 10:55:44,236 INFO  [RHQ Agent Restart Thread] (org.rhq.enterprise.communications.ServiceContainer)- {ServiceContainer.shutting-down}Service container shutting down...
      2015-01-06 10:55:44,238 INFO  [RHQ Agent Restart Thread] (org.rhq.enterprise.communications.ServiceContainer)- {ServiceContainer.shutdown}Service container shut down - no longer accepting incoming commands
      2015-01-06 10:55:44,239 INFO  [RHQ Agent Restart Thread] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.shut-down}Agent has been shut down
      2015-01-06 10:55:44,244 INFO  [RHQ Agent Restart Thread] (org.rhq.enterprise.communications.ServiceContainer)- {ServiceContainer.global-concurrency-limit-disabled}Global concurrency limit has been disabled - there is no limit to the number of incoming commands allowed
      2015-01-06 10:55:44,262 INFO  [RHQ Agent Restart Thread] (org.rhq.enterprise.communications.ServiceContainer)- {ServiceContainer.started}Service container started - ready to accept incoming commands
      2015-01-06 10:55:44,262 INFO  [RHQ Agent Restart Thread] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.no-auto-detect}Server auto-detection is not enabled - starting the poller immediately
      2015-01-06 10:56:44,263 WARN  [RHQ Agent Restart Thread] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.startup-registration-failed-retry}A server appears to be up, but the agent is not yet registered. Will wait again. wait counter=[4]
      2015-01-06 10:56:44,264 WARN  [RHQ Agent Restart Thread] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.startup-registration-failed-retry}A server appears to be up, but the agent is not yet registered. Will wait again. wait counter=[3]
      2015-01-06 10:56:44,264 WARN  [RHQ Agent Restart Thread] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.startup-registration-failed-retry}A server appears to be up, but the agent is not yet registered. Will wait again. wait counter=[2]
      2015-01-06 10:56:44,264 WARN  [RHQ Agent Restart Thread] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.startup-registration-failed-retry}A server appears to be up, but the agent is not yet registered. Will wait again. wait counter=[1]
      2015-01-06 10:56:44,264 WARN  [RHQ Agent Restart Thread] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.startup-registration-failed-retry}A server appears to be up, but the agent is not yet registered. Will wait again. wait counter=[0]
      2015-01-06 10:56:44,265 FATAL [RHQ Agent Restart Thread] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.startup-error}The agent encountered an error during startup and must abort
      org.rhq.core.clientapi.server.core.AgentRegistrationException: The agent cannot register with the server. Admin intervention needed!
          at org.rhq.enterprise.agent.AgentMain.start(AgentMain.java:744)
          at org.rhq.enterprise.agent.AgentManagement$2$1.run(AgentManagement.java:156)
          at org.rhq.enterprise.agent.AgentManagement$2$1.run(AgentManagement.java:148)
          at java.security.AccessController.doPrivileged(Native Method)
          at org.rhq.enterprise.agent.AgentManagement$2.run(AgentManagement.java:148)
          at java.lang.Thread.run(Thread.java:662)
      2015-01-06 10:56:44,265 INFO  [RHQ Agent Restart Thread] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.shutting-down}Agent is being shut down...
      2015-01-06 10:56:44,265 INFO  [RHQ Agent Restart Thread] (org.rhq.core.pc.PluginContainer)- Plugin container is already shut down.
      2015-01-06 10:56:44,266 WARN  [RHQ Agent Restart Thread] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.failed-to-shutdown-component}Agent failed to shutdown component [Plugin Container]. Cause: java.lang.NullPointerException:null
      2015-01-06 10:56:54,267 ERROR [RHQ Agent Restart Thread] (enterprise.communications.command.client.ClientCommandSenderTask)- {ClientCommandSenderTask.send-failed}Failed to send command [Command: type=[remotepojo]; cmd-in-response=[false]; config=[{rhq.agent-name=b27apvl016.preprod.local, rhq.externalizable-strategy=AGENT, rhq.timeout=10000}]; params=[{invocation=NameBasedInvocation[agentIsShuttingDown], targetInterfaceName=org.rhq.core.clientapi.server.core.CoreServerService}]]. Cause: java.util.concurrent.TimeoutException:null. Cause: java.util.concurrent.TimeoutException
      2015-01-06 10:56:54,268 WARN  [RHQ Agent Restart Thread] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.server-shutdown-notification-failure}Agent failed to notify the server of the pending shutdown. Cause: java.lang.reflect.UndeclaredThrowableException:null -> java.util.concurrent.TimeoutException:null
      2015-01-06 10:56:54,268 INFO  [RHQ Server Polling Thread] (enterprise.communications.command.client.ServerPollingThread)- {ServerPollingThread.server-online}The server has come back online; client has been told to start sending commands again
      2015-01-06 10:56:54,269 ERROR [ClientCommandSenderTask Thread #0] (enterprise.communications.command.client.ClientCommandSenderTask)- {ClientCommandSenderTask.send-failed}Failed to send command [Command: type=[remotepojo]; cmd-in-response=[false]; config=[{rhq.guaranteed-delivery=true, rhq.agent-name=b27apvl016.preprod.local, rhq.security-token=Nr+JcEVqoTQ2v5QUuNi14pNT3DvQv+EprvDBZpJ06+RWlNwPOcvM6lw9z0Kt7Izf4BM=, rhq.retry=1, rhq.externalizable-strategy=AGENT, rhq.send-throttle=true}]; params=[{invocation=NameBasedInvocation[operationSucceeded], targetInterfaceName=org.rhq.core.clientapi.server.operation.OperationServerService}]]. Cause: java.lang.InterruptedException:null. Cause: java.lang.InterruptedException
      2015-01-06 10:56:54,270 WARN  [ClientCommandSenderTask Thread #0] (enterprise.communications.command.client.ClientCommandSenderTask)- {ClientCommandSenderTask.queuing-failed-command}The command that failed has its guaranteed-delivery flag set so it is being queued again
      2015-01-06 10:57:54,273 INFO  [RHQ Agent Restart Thread] (org.rhq.enterprise.communications.ServiceContainer)- {ServiceContainer.shutting-down}Service container shutting down...
      2015-01-06 10:57:54,276 INFO  [RHQ Agent Restart Thread] (org.rhq.enterprise.communications.ServiceContainer)- {ServiceContainer.shutdown}Service container shut down - no longer accepting incoming commands
      2015-01-06 10:57:54,276 INFO  [RHQ Agent Restart Thread] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.shut-down}Agent has been shut down
      2015-01-06 11:00:09,978 INFO  [ClientCommandSenderTask Timer Thread #2] (enterprise.communications.command.client.JBossRemotingRemoteCommunicator)- {JBossRemotingRemoteCommunicator.changing-endpoint}Communicator is changing endpoint from [InvokerLocator [servlet://b27apvl004.preprod.local:7080/jboss-remoting-servlet-invoker/ServerInvokerServlet/?generalizeSocketException=true]] to [InvokerLocator [servlet://b27apvl004.preprod.local:7080/jboss-remoting-servlet-invoker/ServerInvokerServlet/?generalizeSocketException=true]]
      2015-01-06 11:00:09,978 WARN  [ClientCommandSenderTask Timer Thread #2] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.failover-failed}Failed to failover to another server. Cause: java.lang.NullPointerException
      2015-01-06 11:00:09,980 WARN  [ClientCommandSenderTask Timer Thread #2] (org.rhq.enterprise.agent.FailoverFailureCallback)- {AgentMain.too-many-failover-attempts}Too many failover attempts have been made [1]. Exception that triggered the failover: [org.jboss.remoting.CannotConnectException: Can not connect http client invoker after 1 attempt(s)]
      2015-01-06 11:00:09,985 INFO  [ClientCommandSenderTask Timer Thread #1] (enterprise.communications.command.client.JBossRemotingRemoteCommunicator)- {JBossRemotingRemoteCommunicator.changing-endpoint}Communicator is changing endpoint from [InvokerLocator [servlet://b27apvl004.preprod.local:7080/jboss-remoting-servlet-invoker/ServerInvokerServlet/?generalizeSocketException=true]] to [InvokerLocator [servlet://b27apvl004.preprod.local:7080/jboss-remoting-servlet-invoker/ServerInvokerServlet/?generalizeSocketException=true]]
      2015-01-06 11:00:09,990 WARN  [ClientCommandSenderTask Timer Thread #1] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.failover-failed}Failed to failover to another server. Cause: java.lang.NullPointerException
      
      2015-01-06 12:20:39,028 INFO  [main] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.identify-version}Version=[RHQ 4.13.0], Build Number=[07169c1], Build Date=[Dec 2, 2014 11:50 PM]
      2015-01-06 12:20:39,084 INFO  [main] (org.rhq.enterprise.communications.ServiceContainer)- {ServiceContainer.global-concurrency-limit-disabled}Global concurrency limit has been disabled - there is no limit to the number of incoming commands allowed
      2015-01-06 12:20:39,228 INFO  [main] (org.rhq.enterprise.communications.ServiceContainer)- {ServiceContainer.started}Service container started - ready to accept incoming commands
      2015-01-06 12:20:39,228 INFO  [main] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.no-auto-detect}Server auto-detection is not enabled - starting the poller immediately
      2015-01-06 12:20:40,524 ERROR [RHQ Agent Registration Thread] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.agent-registration-rejected}The server has rejected the agent registration request. Cause: [org.rhq.core.clientapi.server.core.AgentRegistrationException:The agent [b27apvl016.preprod.local] is attempting to re-register without a security token. Please consult an administrator to obtain the agent's proper security token and restart the agent with the option "-Drhq.agent.security-token=<the valid security token>". An administrator can find the agent's security token by navigating to the GUI page "Administration (Topology) > Agents" and drilling down to this specific agent. You will see the long security token string there. For more information, read: https://docs.jboss.org/author/display/RHQ/Agent+Registration]
      2015-01-06 12:20:40,525 ERROR [RHQ Agent Registration Thread] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.agent-registration-retry}Will retry the agent registration request soon...
      2015-01-06 12:21:10,588 ERROR [RHQ Agent Registration Thread] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.agent-registration-rejected}The server has rejected the agent registration request. Cause: [org.rhq.core.clientapi.server.core.AgentRegistrationException:The agent [b27apvl016.preprod.local] is attempting to re-register without a security token. Please consult an administrator to obtain the agent's proper security token and restart the agent with the option "-Drhq.agent.security-token=<the valid security token>". An administrator can find the agent's security token by navigating to the GUI page "Administration (Topology) > Agents" and drilling down to this specific agent. You will see the long security token string there. For more information, read: https://docs.jboss.org/author/display/RHQ/Agent+Registration]
      2015-01-06 12:21:10,589 ERROR [RHQ Agent Registration Thread] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.agent-registration-retry}Will retry the agent registration request soon...
      2015-01-06 12:21:39,244 WARN  [main] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.startup-registration-failed-retry}A server appears to be up, but the agent is not yet registered. Will wait again. wait counter=[4]
      2015-01-06 12:21:39,245 WARN  [main] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.startup-registration-failed-retry}A server appears to be up, but the agent is not yet registered. Will wait again. wait counter=[3]
      2015-01-06 12:21:39,245 WARN  [main] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.startup-registration-failed-retry}A server appears to be up, but the agent is not yet registered. Will wait again. wait counter=[2]
      2015-01-06 12:21:39,245 WARN  [main] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.startup-registration-failed-retry}A server appears to be up, but the agent is not yet registered. Will wait again. wait counter=[1]
      2015-01-06 12:21:39,246 WARN  [main] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.startup-registration-failed-retry}A server appears to be up, but the agent is not yet registered. Will wait again. wait counter=[0]
      2015-01-06 12:21:39,247 FATAL [main] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.startup-error}The agent encountered an error during startup and must abort
      org.rhq.core.clientapi.server.core.AgentRegistrationException: The agent cannot register with the server. Admin intervention needed!
          at org.rhq.enterprise.agent.AgentMain.start(AgentMain.java:744)
          at org.rhq.enterprise.agent.AgentMain.main(AgentMain.java:461)
      2015-01-06 12:21:39,250 INFO  [main] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.shutting-down}Agent is being shut down...
      
      
      

       

      I also noticed that after agent was stopped, a file 'prefs.tmp' existed in the same folder, $HOME/.java/.userPrefs/rhq-agent/default. This was removed after starting it again.

      Not sure if this applies when doing the agent restart with the shell script.

       

      It does not happen every time apparently, so hard to nail down, for instance I restarted all agents on 15 servers and 4 of them failed registration after restart and I had to manually register again.

       

      Potentially this can mean a *lot* of work if you have a lot of platforms and a lot of them are suddenly without agents.

       

      I want to check if this is a known bug or if I should make a BZ.

       

      Stian

        • 1. Re: 4.13: After RHQ agent operation restart, auth-token is lost
          Stian Lund Expert

          Attaching RHQ Agent logs with DEBUG enabled during a restart from RHQ UI and loss of auth-token.

          • 2. Re: 4.13: After RHQ agent operation restart, auth-token is lost
            mazz Master

            Would you happen to be running multiple agents on the same machine? Or running agents on machines that have a shared file system (where, say, your home directory is NFS mounted)? The FAQ talks about this - can't share the same preferences location for more than one agent.

             

            Just a thought.

             

            The fact it doesn't happen every time makes it seem to me that something is up with the preference persistence store.

            • 3. Re: 4.13: After RHQ agent operation restart, auth-token is lost
              Stian Lund Expert

              Hi, thanks for the reply.

               

              This is a very basic setup and there are no shared file systems involved.

               

              - Agent installed in /opt/rhq/rhq-agent

              - Running as user with ownership of /opt/rhq

              - prefs.xml saved in /home/user/.java... with write for the user.

              - There are different partitions (logical volumes) but /home is on /. /opt is it's own LV.

               

              However it's running on Vmware so I hope it's not a subtle bug in Vmware that causes this.

               

              I am pretty sure this has only been happening with RHQ 4.13 and I did restart of agents all the time in 4.12 and earlier without this happening.

               

              Also, why would the prefs.xml file need to be touched at all during agent restart?

               

              Stian

              • 4. Re: 4.13: After RHQ agent operation restart, auth-token is lost
                Stian Lund Expert

                At the moment I am struggling with consistently reproducing this.

                 

                I am able to reproduce it if I restart all the 16 agents in the same operation but this leaves a mess to clean up afterwards. For instance I could be left with 12/16 agents down...

                 

                So I have tried creating a small group of four agents and restart this every five minutes. But at the moment I've not seen it again.

                 

                Also note this is not something that just happened once or twice. This has happened for me every time I've restarted all the agent for an RHQ environment, on a number of agents.

                 

                I suspect maybe since it seems to occur with restarts of many at once, this has possibly something to do with threads and concurrency.

                 

                Stian

                • 5. Re: 4.13: After RHQ agent operation restart, auth-token is lost
                  mazz Master

                  > I suspect maybe since it seems to occur with restarts of many at once, this has possibly something to do with threads and concurrency.

                   

                  Since each agent will want to access its own preferences store (on the file system) but each agent is on its own file system, I'm not sure how concurrency in the RHQ server or agent could be causing the problem (unless the concurrency problem is in the file system implementation itself, which I would find hard to believe too )

                   

                  I looked at the agent code and didn't see anything pop out at me that would cause this. I seems somehow the preferences store is getting corrupted. Did you look in the prefs.xml file itself under ~/.java? Is that file completely empty? or does it have everything except for the rhq.agent.security-token entry?

                  • 6. Re: 4.13: After RHQ agent operation restart, auth-token is lost
                    Stian Lund Expert

                    The file is still there, under /home/jbossadm/.java/.userPrefs/rhq-agent/default, it's just missing the token.

                      <entry key="rhq.agent.register-with-server-at-startup" value="true"/>

                      <entry key="rhq.agent.server-auto-detection" value="false"/>

                      <entry key="rhq.agent.server.alias" value="rhqserver"/>

                     

                    Usually it's below 'register-with-server'.

                     

                    Do you guys have any installations with like 10 or more agents you could try to reproduce by just attempting to restart all the agents at once.

                     

                    I have been unable to repro with small groups, there is a larger chance of it occuring the larger amount of agents being restarted it seems.

                    I just did a restart of 15 agents and five of them failed to register.

                     

                    Stian