11 Replies Latest reply on Oct 16, 2013 9:10 AM by mazz

    RHQ Server 4.9 setting MAINTENANCE mode automatically

    genman

      I have a bad Cassandra node which I'm attempting to remove due to bad hardware.

       

      However, RHQ is somehow flipping MAINT mode on my servers, which is causing the agents to not act correctly.

       

      I'm not sure the reason for this, as it's causing lots of grief. Should I file a bug? Propose a fix?

        • 1. Re: RHQ Server 4.9 setting MAINTENANCE mode automatically
          john.sanda

          Do you have other nodes up with which your servers can communicate?

          • 2. Re: RHQ Server 4.9 setting MAINTENANCE mode automatically
            mazz

            This was done purposefully. That code went in about a month or two ago - if the storage cluster is down or the server cannot communicate with it, the server will go into MAINTENANCE mode. It will go back to NORMAL mode when the storage node cluster comes back online.

             

            Stefan wrote that code; he can tell you more about which situations trigger this and when it should go back to NORMAL.

            • 3. Re: RHQ Server 4.9 setting MAINTENANCE mode automatically
              mazz

              Oh, and it should NOT happen just because one of your N storage nodes is down. If only one node is down but you have other storage nodes up, running, and available in the cluster, the server should see that it can still talk to the cluster and it won't go into MAINT mode. At least that's how I think it works. Again, Stefan would know more and can explain it better.

              • 4. Re: RHQ Server 4.9 setting MAINTENANCE mode automatically
                nstefan

                Elias,

                 

                How many nodes do you have in your storage cluster? Can you please check how much disk space each storage node uses? When the server goes into maintenance mode, how many storage nodes are still up?
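
                A quick way to check both, assuming a default storage node layout (the data directory path below is only illustrative - use whatever data dir your node was installed with):

                $ du -sh /var/lib/rhq/storage/data                               # disk used by the storage data dir (path may differ)
                $ <rhq-server-dir>/rhq-storage/bin/nodetool -p 7299 status       # per-node load and up/down state for the cluster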

                 

                 

                Thank you,

                Stefan

                • 5. Re: Re: RHQ Server 4.9 setting MAINTENANCE mode automatically
                  genman

                  Yeah, I had two storage nodes in the cluster and two RHQ servers. One storage node was timing out, so maybe it was affecting the overall performance.

                   

                  18:19:44,217 ERROR [org.rhq.server.metrics.MetricsServer] (New I/O worker #90) An error occurred while inserting raw data MeasurementDataNumeric[name=Native.SwapInfo.free, value=4.294959104E9, scheduleId=466492, timestamp=1381515505284]: com.datastax.driver.core.exceptions.DriverInternalError: An unexpected error occured server side: com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.

                   

                  And:

                   

                  18:19:30,707 ERROR [org.jboss.as.ejb3.invocation] (http-/0.0.0.0:7080-79) JBAS014134: EJB Invocation failed on component StorageNodeManagerBean for method public abstract org.rhq.core.domain.util.PageList org.rhq.enterprise.server.cloud.StorageNodeManagerLocal.getStorageNodeComposites(): javax.ejb.EJBException: com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra timeout during read query at consistency ONE (1 responses were required but only 0 replica responded)

                   

                  The other undesirable MAINT issue was that agents were connecting and disconnecting, getting shuffled around. I would have expected the agents to simply stop trying to connect at all.

                   

                  The other issue, which I'd like to see covered in a FAQ, is that when adding a storage node I sometimes get an endless spew of errors like this while deploying (this is not the same condition as above):

                   

                   INFO [HANDSHAKE-/17.176.208.118] 2013-10-11 23:36:42,727 OutboundTcpConnection.java (line 399) Handshaking version with /17.176.xxx
                   INFO [HANDSHAKE-/17.176.208.118] 2013-10-11 23:36:42,727 OutboundTcpConnection.java (line 408) Cannot handshake version with /17.176.xxx
                  

                   

                  Anyway, you might have luck reproducing the problem with some bad hard disks you have lying around ;-)

                  • 6. Re: Re: RHQ Server 4.9 setting MAINTENANCE mode automatically
                    john.sanda

                    In some stress testing I have been doing lately, I have also been hitting read timeouts on writes, as you have. The timeout occurs during an authorization check, which adds a non-trivial amount of overhead. You wind up doing reads on a lot of writes, which we definitely want to avoid. I opened https://bugzilla.redhat.com/show_bug.cgi?id=1017372 to help alleviate this some. Increasing the value of the permissions_validity_in_ms property in cassandra.yaml should help some. What is the heap size for your storage nodes? Increasing the heap can help as well. What kind of read performance are you getting with your disks? What does hdparm report, e.g.,

                     

                    $ sudo hdparm -t /dev/sda
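
                    If you want to try the permissions_validity_in_ms change, it is a one-line edit in cassandra.yaml on each storage node. The value below is only an illustration (the Cassandra default is 2000 ms), not a tuned recommendation:

                    # cassandra.yaml - cache permission checks longer so fewer writes trigger auth reads
                    permissions_validity_in_ms: 60000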

                     

                    The handshake messages should only happen when you deploy a node. That is because the storage nodes are configured to only gossip with known, trusted nodes. Nodes gossip (internode communication) every second. You will see these messages during the deployment process. Once the node has completed the bootstrap phase of the deployment, you will stop seeing them.

                     

                    Your suggestion of a FAQ is an excellent one, and in fact, it is near the top of my TODO list. I will be adding a FAQ page to https://docs.jboss.org/author/display/RHQ/RHQ+Storage+Cluster+Administration in the next few days.

                    • 7. Re: RHQ Server 4.9 setting MAINTENANCE mode automatically
                      genman

                      It looks like the hard drives (there are 6 or so) are reading fast, so it's not that. It could be a bad network card or something, as I have seen the shell prompt hang sometimes even when nothing is running.

                       

                      My heap size is around 5GB. I have about 32GB on this host, so it's not starved for memory.

                       

                      I haven't had much luck reliably deploying new storage nodes. Sometimes it works fine, but other times I get into this state. It looks like it failed to run 'Announce' on the first node. Thanks to your hint that it is a trust issue, I figured out how to manually run the Announce command. Do you handle the case where the storage node gets erased (forcibly) and then installed again in the same place? It seems like 'Deploy' works on the new node, but 'Announce' fails on the old node(s)? (It looked like the admin 'Announce' timed out.)

                       

                      One weird thing I see is, if the agent is already installed, installing a new storage node installs another agent (though in my case it fails), and somehow I get a 'second' agent with an address of 'null'. So basically there are two agents on one machine: the one I installed (via my RPM) and the rogue one. It'd be great if the installer could see there's already a working agent installed and not give me grief.

                       

                      The other minor grief is that rhqctl (and other RHQ scripts) can't figure out where the JVM is. On EL6 (and EL5) Java is in /usr/java/default, or /etc/alternatives/java if you will. See also: https://bugzilla.redhat.com/show_bug.cgi?id=788704
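
                      A stopgap is to point the scripts at the system JVM by hand before invoking them; which environment variable each script actually honors is worth double-checking in the script headers, so the name below is only a generic assumption:

                      # export the JVM location before running rhqctl (variable name is an assumption -
                      # some RHQ scripts read their own RHQ_*_JAVA_HOME variables instead)
                      export JAVA_HOME=/usr/java/default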

                      • 8. Re: RHQ Server 4.9 setting MAINTENANCE mode automatically
                        genman

                        Okay, outside of some issues I've seen, I do need to know how to install using, say, 6 separate disks for the data dir. It seems the data dir can only be a single directory right now.

                        • 9. Re: RHQ Server 4.9 setting MAINTENANCE mode automatically
                          john.sanda

                          Elias Ross wrote:

                           

                          I haven't had much luck reliably deploying new storage nodes. Sometimes it works fine, but other times I get into this state. It looks like it failed to run 'Announce' on the first node.

                          The announce operation should be very fast. It updates the rhq-storage-auth.conf file and makes a JMX call to force the storage node to reload the file. I would be surprised if it times out. It will fail, though, if the node (on which the operation is running) is down.
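
                          For reference, rhq-storage-auth.conf (under the storage node's conf directory) is just a plain list of the trusted storage node addresses, one per line; the addresses below are only illustrative:

                          10.16.23.59
                          10.16.23.60
                          10.16.23.61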

                           

                           

                          Elias Ross wrote:

                           

                          Do you handle the case when the storage node gets erased (forcibly) then installed again in the same place?

                          How are you re-installing? If you go through the normal install process, I would expect the deployment process to get skipped assuming the node has the same IP address as it did before. You should see a message in server.log like,

                           

                          INFO  [org.rhq.enterprise.server.cloud.StorageNodeManagerBean] (http-jsanda-slow-2.bc.jonqe.lab.eng.bos.redhat.com/10.16.23.59:7080-2) StorageNode[id=1001, address=10.16.23.59, cqlPort=9142, operationMode=INSTALLED, mtime=1381143130656] is an existing storage node. No cluster maintenance is necessary.

                           

                          I do not think that the node will be able to join the cluster. The fastest way to check is to go to <rhq-server-dir>/rhq-storage/bin and run ./nodetool -p 7299 status. If the node is not part of the cluster, then you can make the following manual config changes to get it to join:

                          1. In cassandra.yaml, edit the seeds property to list the IP addresses of other storage nodes (do not include the node being re-installed)
                          2. Update rhq-storage-auth.conf so that it includes the IP addresses of all storage nodes
                          3. Stop the storage node
                          4. Purge the data directory and commit log
                          5. Restart the node

                           

                          This should force the node to bootstrap into the cluster without going through the deployment process.
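
                          A minimal sketch of those steps on the re-installed node - the seed addresses, directories, and the use of rhqctl to bounce just the storage node are illustrative assumptions, so adjust them to your install (the seeds, data, and commitlog paths come from that node's cassandra.yaml):

                          # cassandra.yaml - seeds list the *other* nodes only (addresses are examples)
                          seed_provider:
                              - class_name: org.apache.cassandra.locator.SimpleSeedProvider
                                parameters:
                                    - seeds: "10.16.23.60,10.16.23.61"

                          # then stop the node, purge data and commit log, and start it again
                          rhqctl stop --storage
                          rm -rf <data-dir>/* <commitlog-dir>/*
                          rhqctl start --storage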

                           

                          Elias Ross wrote:

                           

                          Okay, outside of some issues I've seen, I do need to know how to install using, say, 6 separate disks for the data dir. It seems data dir can only be one element now.

                          Hopefully we will add support for this in RHQ 4.10. You can make the change manually for now. You need to edit the data_file_directories property in cassandra.yaml. It should look like,

                           

                          data_file_directories:

                              - /data/dir/1

                              - /data/dir/2

                              - /data/dir/3

                          • 10. Re: RHQ Server 4.9 setting MAINTENANCE mode automatically
                            genman

                            My issues are probably due to the agents not running correctly.

                             

                            When I install the storage node on a host, I already have my RHQ agent running, but it tries to install and run another agent. If it can't install, it seems to do something odd anyway, causing another (rogue) agent process to be discovered. Can the installer not do this?

                             

                            Thanks for the hints on working with the internal node tools.

                            • 11. Re: RHQ Server 4.9 setting MAINTENANCE mode automatically
                              mazz

                              OK, this is an interesting use case I don't think RHQ handles right now.

                               

                              To confirm: You already had a managed box with an agent running on it from an earlier version, and now you are going to put a Storage Node on it?

                               

                              I will say that if you install a Storage Node, you must install an agent with it USING RHQCTL. It's expected that rhqctl install --storage will lay down its own agent. You shouldn't be running another agent on that box. But I don't think we contemplated the use case where you have managed resources on a box where an agent already exists and is managing those resources, and you then put a storage node on that same box.

                               

                              We might have to think about this. But for right now, I'm pretty sure you can't have a separate agent installed and then expect rhqctl install --storage to "reuse" or upgrade that agent.

                               

                              Perhaps you need to do rhqctl upgrade --from-agent-dir /your/old/agent --storage? Perhaps that will work - maybe rhqctl will see that you already have the agent and upgrade it, and since there is no existing storage install it will just lay down a new one. This is something to test. And if it doesn't work, I think we might need to fix this to make that work. We might need to write this use case up in a separate BZ.
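
                              In other words, something along these lines is worth trying first (untested - verify the exact flags against rhqctl --help on your install):

                              # untested: upgrade the existing agent in place while adding the storage node
                              rhqctl upgrade --from-agent-dir /your/old/agent --storage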