3 Replies Latest reply on Jan 28, 2010 6:46 PM by mazz

    Issues communicating between Jopr agent and it's server

    letme know Newbie

      Hi there,

       

      We are facing an issue while communicating between JOPR agent and server.  Below are the details. Any kind of help is appreciated.

       

                      First time Agent setup was not smooth and had to make agent discovery with >discovery –f command which force to discover. Once server imported the configuration, inventory application will show this "agent up" for 15 min (agents max quite time allowed) and then after application will show "agent down" because of no communication from agent. Every time we hit command >discovery –f server will show client up for 15 min.  

       

      Jopr Server IP : 10.151.6.200

      Jopr agent IP   :10.151.6.147 - Working

      Jopr agent IP   :10.151.6.152 - Working

      Jopr agent IP   :10.151.4.140 - Working

      Jopr agent IP   :10.151.4.42 – issue described above

      Jopr agent IP   :10.151.4.57 – issue described above

       

      Environment : Windows 2003 / Jopr 2.3.0 / Jopr agent 1.3.0 / JBossAS 4.0.3SP1 / Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_14-b03), Java HotSpot(TM) Client VM (build 1.5.0_14-b03, mixed mode)

       

      Agent Log : no log

       

      Server Log:

       

      2010-01-27 18:28:49,096 INFO  [org.rhq.enterprise.server.core.CoreServerServiceImpl] Agent [XXX][1.3.0(5072)] would like to connect to this server

      2010-01-27 18:28:49,158 INFO  [org.rhq.enterprise.server.core.CoreServerServiceImpl] Agent [XXX] has connected to this server at Wed Jan 27 18:28:49 CST 2010

      2010-01-27 18:28:50,392 INFO  [org.rhq.enterprise.server.core.CoreServerServiceImpl] Got agent registration request for existing agent: XXX[10.151.4.23:16163][1.3.0(5072)] - Will not regenerate a new token

      2010-01-27 18:29:28,033 INFO  [org.rhq.enterprise.server.discovery.DiscoveryServerServiceImpl] Processed AV:[XXX][4][full] - need full=[false] in (31)ms

      2010-01-27 18:29:29,189 INFO  [org.rhq.enterprise.server.discovery.DiscoveryServerServiceImpl] Processed AV:[XXX][12][full] - need full=[false] in (47)ms

      2010-01-27 18:29:42,986 INFO  [org.rhq.enterprise.server.cloud.instance.CacheConsistencyManagerBean] gsd1w168s.CORP.ROOT.GLOBAL took [47]ms to reload cache for 1 agents

      2010-01-27 18:46:42,792 WARN  [org.rhq.enterprise.server.core.AgentManagerBean] Have not heard from agent [XXX] since [Wed Jan 27 18:29:29 CST 2010]. Will be backfilled since we suspect it is down

      2010-01-27 18:44:42,793 WARN  [org.rhq.enterprise.server.core.AgentManagerBean] Have not heard from agent [XXX] since [Wed Jan 27 18:29:29 CST 2010]. Will be backfilled since we suspect it is down

      2010-01-27 18:45:42,792 WARN  [org.rhq.enterprise.server.core.AgentManagerBean] Have not heard from agent [XXX] since [Wed Jan 27 18:29:29 CST 2010]. Will be backfilled since we suspect it is down

        • 1. Re: Issues communicating between Jopr agent and it's server
          mazz Master

          in this log message:

           

          2010-01-27 18:28:50,392 INFO  [org.rhq.enterprise.server.core.CoreServerServiceImpl] Got agent registration request for existing agent: XXX[10.151.4.23:16163][1.3.0(5072)] - Will not regenerate a new token

           

          the IP address "10.151.4.23" does not match any of those that you listed. So, that's something to investigate - unsure what that would indicate, but the fact it is an IP that you didn't mention raised a red flag.

           

          This message:

           

          2010-01-27 18:46:42,792 WARN  [org.rhq.enterprise.server.core.AgentManagerBean] Have not heard from agent [XXX] since [Wed Jan 27 18:29:29 CST 2010]. Will be backfilled since we suspect it is down

           

          means there is no availability report that was received from that agent which successfully made it to the database.

           

          You need to get your agent log for agent XXX on IP 10.151.4.23 and look for any error messages in there. Make sure your agents can resolve your Server's public endpoint addresses (Administration>HighAvailability>Server - and look at the servers endpoint addresses, make sure all agents can resolve and see those hosts).

           

          Execute the agent prompt command "failover -l" and see what the agent's failover list says and make sure all the servers in the list are resolveable/reachable from the agent machine (you can get this list by looking at <agent-install-dir>/data/failover-list.dat too)

          • 2. Re: Issues communicating between Jopr agent and it's server
            letme know Newbie

            Thanks for the reply John. We made sure the endpoint addresses are indeed reachable from the agent to server and vice-versa. But see the below error messages in the agent.log! Could you suggest as to where the issue could be?

             

            *******************************

            2010-01-23 14:35:47,460 ERROR [ResourceDiscoveryComponent.invoker.daemon-1] (org.rhq.plugins.perftest.ScenarioManager)- Cannot find scenario name. Make sure the property on.perftest.scenario is set.

            2010-01-23 14:42:27,905 ERROR [WorkerThread#0[10.151.6.200:1644]] (jboss.remoting.transport.socket.ServerThread)- Worker thread initialization failure

            java.lang.reflect.InvocationTargetException

            at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

            at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)

            at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)

            at java.lang.reflect.Constructor.newInstance(Constructor.java:494)

            at org.jboss.remoting.transport.socket.ServerThread.createServerSocketWrapper(ServerThread.java:755)

            at org.jboss.remoting.transport.socket.ServerThread.dorun(ServerThread.java:375)

            at org.jboss.remoting.transport.socket.ServerThread.run(ServerThread.java:173)

            Caused by: java.io.StreamCorruptedException: invalid stream header

            at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:764)

            at java.io.ObjectInputStream.<init>(ObjectInputStream.java:277)

            at org.jboss.remoting.loading.ObjectInputStreamWithClassLoader.<init>(ObjectInputStreamWithClassLoader.java:95)

            at org.jboss.remoting.serialization.impl.java.JavaSerializationManager.createInput(JavaSerializationManager.java:54)

            at org.jboss.remoting.marshal.serializable.SerializableUnMarshaller.getMarshallingStream(SerializableUnMarshaller.java:72)

            at org.jboss.remoting.marshal.serializable.SerializableUnMarshaller.getMarshallingStream(SerializableUnMarshaller.java:55)

            at org.jboss.remoting.transport.socket.ClientSocketWrapper.createInputStream(ClientSocketWrapper.java:179)

            at org.jboss.remoting.transport.socket.ClientSocketWrapper.createStreams(ClientSocketWrapper.java:162)

            at org.jboss.remoting.transport.socket.ClientSocketWrapper.<init>(ClientSocketWrapper.java:66)

            at org.jboss.remoting.transport.socket.ServerSocketWrapper.<init>(ServerSocketWrapper.java:46)

            ... 7 more

            2010-01-25 02:03:40,625 ERROR [RHQ Server Polling Thread] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.time-not-synced}The server and agent clocks are not in sync. Server=[1264406665034][January 25, 2010 2:04:25 AM CST], Agent=[1264406620625][January 25, 2010 2:03:40 AM CST]

            2010-01-27 18:16:25,901 INFO  [RHQ Server Polling Thread] (enterprise.communications.command.client.JBossRemotingRemoteCommunicator)- {JBossRemotingRemoteCommunicator.changing-endpoint}Communicator is changing endpoint from [InvokerLocator [servlet://xxxxxxxxx:80/jboss-remoting-servlet-invoker/ServerInvokerServlet]] to [InvokerLocator [servlet://xxxxxxxxx:80/jboss-remoting-servlet-invoker/ServerInvokerServlet]]

            2010-01-27 18:16:25,901 WARN  [RHQ Server Polling Thread] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.failover-failed}Failed to failover to another server. Cause: org.jboss.remoting.transport.http.WebServerError: Error received when calling on web server.  Error returned was 400

            2010-01-27 18:16:25,901 WARN  [RHQ Server Polling Thread] (org.rhq.enterprise.agent.FailoverFailureCallback)- {AgentMain.too-many-failover-attempts}Too many failover attempts have been made [1]. Exception that triggered the failover: [org.jboss.remoting.transport.http.WebServerError: Error received when calling on web server.  Error returned was 400]

            2010-01-27 18:16:25,901 WARN  [RHQ Server Polling Thread] (enterprise.communications.command.client.ServerPollingThread)- {ServerPollingThread.server-offline}The server has gone offline; client has been told to stop sending commands

            2010-01-27 18:17:32,089 ERROR [RHQ Server Polling Thread] (enterprise.communications.command.client.JBossRemotingRemoteCommunicator)- {JBossRemotingRemoteCommunicator.init-callback-failed}The initialize callback has failed. It will be tried again. Cause: org.jboss.remoting.CannotConnectException:Can not connect http client invoker. Connection refused: connect. -> java.net.ConnectException:Connection refused: connect. Cause: org.jboss.remoting.CannotConnectException: Can not connect http client invoker. Connection refused: connect.

            *************************************
            • 3. Re: Issues communicating between Jopr agent and it's server
              mazz Master

              > Caused by: java.io.StreamCorruptedException: invalid stream header

              > at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:764)

               

              that is very odd - you had some corruption in the data flowing between server and agent. We have an inside joke around here to "check your MTU" whenever any network problems occur but seriously, make sure your networking configuration is OK. I once had a problem when I tried to send simple data between a virtual machine guest and a host machine and the corruption occurred because my network adapter's MTU setting was too high (hence the joke that we blame everyone on MTU ). I doubt that's the cause of this problem, but it is odd that you are getting an exception saying the streamed data got corrupted.

               

              > 2010-01-25 02:03:40,625 ERROR [RHQ Server Polling Thread] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.time-not-synced}The

              > server and agent clocks are not in sync. Server=[1264406665034][January 25, 2010 2:04:25 AM CST], Agent=[1264406620625]

              > [January 25, 2010 2:03:40 AM CST]

               

              Probably unrelated, but you will get odd problems to occur later if not now when your clocks between your server and agent are not in sync. Use NTP to make sure the machines on your network all think its the same time (or at least are in sync within a few seconds of each other). See http://rhq-project.org/display/JOPR2/RHQ+Server+Installation+Preparation#RHQServerInstallationPreparation-Synchronizedmachineclocks

               

              > 2010-01-27 18:16:25,901 INFO  [RHQ Server Polling Thread]

              > (enterprise.communications.command.client.JBossRemotingRemoteCommunicator)- {JBossRemotingRemoteCommunicator.changing-

              > endpoint}Communicator is changing endpoint from [InvokerLocator [servlet://xxxxxxxxx:80/jboss-remoting-servlet-invoker

              > /ServerInvokerServlet]] to [InvokerLocator [servlet://xxxxxxxxx:80/jboss-remoting-servlet-invoker/ServerInvokerServlet]]

               

              This is part of the HA support - if the agent fails to communicate successfully to a server, the agent will attempt to "failover" to another server you have in your "HA Server Cloud". This message is telling you the comm system inside the agent detected a comm failure and is trying to switch over from the one server to that second server.  I think you only have 1 server, but even though you aren't using HA (which stands for High Availability) the agent will still run through this failover code - it just will have nothing to failover to and will bomb out fairly quickly (as you see it did here).

               

              I see you are going over port 80 - which is different than the default 7080. I assume you did this so your agents can tunnel through a firewall over the standard HTTP port 80. Make sure that firewall isn't screwing up your data traffic. If you did not mean for your agent traffic to go over port 80, then this is probably the problem. In the GUI, look at Administration>High Availability>Server and make sure all the server's public endpoints and ports are as you expect them to be.

               

              > 2010-01-27 18:16:25,901 WARN  [RHQ Server Polling Thread] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.failover-failed}

              > Failed to failover to another server. Cause: org.jboss.remoting.transport.http.WebServerError: Error received when calling

              > on web server.  Error returned was 400

               

              This is why the agent thinks it can't communicate with the server. The message transport spit out an error - the agent's comm code tried to talk to the server (at the port you see in the previous log message) but it got back error 400. Find out what that's all about - I never saw a 400 come out of the server - you sure that XXX:80 endpoint isn't some web server you have installed?