0 Replies Latest reply on Dec 19, 2008 6:46 AM by mazz

    HA repartitioning

    mazz

      Someone had a question regarding the HA functionality of Jopr after watching the demo at https://docs.jbosson.redhat.com/confluence/display/JON2/Demo-HighAvailability

      The question was, "When taking a server out of 'maintenance mode' back into 'normal' mode, are agents going to start talking to that new server automatically?"

      The short answer is yes. But I'll explain in detail here since this is an important concept.

      Yes, some agents (not all) are reassigned and will start talking to that server automatically, within some amount of time. Here's what happens:

      When you take a server out of "maintenance mode", you are re-establishing it as a "normal" server within the cloud. As such, our partitioning algorithm will immediately kick off and re-arrange the agent partitioning - failover lists are re-calculated and some agents will be assigned this "new" server (the one you just took out of MM) as their primary server. Thus, the server will begin taking the load of some agents as the agents switch over to use this "new" server.
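
      To make the "some agents (not all)" point concrete, here is a toy sketch of a repartitioning pass. It is not Jopr's actual partitioning algorithm, which I'm not reproducing here. It simply rotates the list of NORMAL-mode servers per agent, and the names (repartition, the "NORMAL" / "MAINTENANCE" strings, the agent and server names) are made up for illustration.

```python
def repartition(agents, servers):
    """Toy repartitioning: build one failover list per agent.

    agents  -- iterable of agent names (hypothetical)
    servers -- dict of server name -> mode string (hypothetical values
               "NORMAL" or "MAINTENANCE")

    Only servers in NORMAL mode are eligible. The first entry of each
    returned list is that agent's primary server. Simple rotation is an
    assumption for illustration, not the real algorithm.
    """
    normal = sorted(name for name, mode in servers.items() if mode == "NORMAL")
    if not normal:
        return {agent: [] for agent in agents}

    failover_lists = {}
    for i, agent in enumerate(sorted(agents)):
        k = i % len(normal)
        # Rotate so that primaries are spread evenly across the servers.
        failover_lists[agent] = normal[k:] + normal[:k]
    return failover_lists


agents = ["agent1", "agent2", "agent3", "agent4"]

# server2 is in maintenance mode: every agent's primary is server1.
before = repartition(agents, {"server1": "NORMAL", "server2": "MAINTENANCE"})

# server2 goes back to NORMAL: the lists are recalculated and some
# (not all) agents now have server2 as their primary.
after = repartition(agents, {"server1": "NORMAL", "server2": "NORMAL"})
```

      In this toy run, agent2 and agent4 end up with server2 at the top of their lists, while agent1 and agent3 keep server1 as their primary.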

      However, the agents won't immediately switch over to using this "new" server unless you do something specific (see below). Instead, the agents continue happily talking to the servers they were already talking to. But every hour (it's an hour by default, but it's configurable) each agent will ask the server it is currently talking to, "Hey, what's my failover list? Is it different from before? Do I have a new primary server I should be talking to?" The agent's current server answers by sending down to the agent the new failover list, which now includes the "new" server. The server at the top of the failover list (i.e. the primary server) is the one the agent will switch to (if it's different from the one it is currently talking to). In other words, every hour, the agent checks whether it's talking to its primary server and, if it isn't, switches to it.
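
      The periodic check described above boils down to something like the following sketch. This is not the agent's real code. The Agent class, check_failover_list, and the fetch_failover_list callback (standing in for the call to whichever server the agent is currently connected to) are all hypothetical names used only to show the logic.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Agent:
    # Hypothetical stand-in for an agent; not the real agent class.
    name: str
    current_server: str
    failover_list: List[str] = field(default_factory=list)

    def check_failover_list(
        self, fetch_failover_list: Callable[[str], List[str]]
    ) -> bool:
        """Ask the current server for the latest failover list and switch
        to the primary (first entry) if it differs from the current server.

        Returns True if the agent switched servers, False otherwise.
        """
        new_list = list(fetch_failover_list(self.name))
        self.failover_list = new_list
        if not new_list:
            # Nothing to switch to; stay where we are.
            return False

        primary = new_list[0]
        if primary != self.current_server:
            self.current_server = primary
            return True
        return False


# The agent is talking to server1, but after repartitioning its
# failover list now has server2 at the top.
agent = Agent(name="agent2", current_server="server1")
switched = agent.check_failover_list(lambda name: ["server2", "server1"])

# A later check finds the agent already on its primary, so nothing changes.
switched_again = agent.check_failover_list(lambda name: ["server2", "server1"])
```

      The same check runs either way. The hourly job and an on-demand request differ only in when the check is triggered.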

      Therefore, within an hour (with the default setting) of taking a server out of MM into NORMAL mode, all of the agents that were reassigned to it will have switched over, and the "new" server will be completely integrated into the cloud and taking its full share of the load.

      Now, if you don't want to wait this hour, there is an agent operation you can execute via the Operations tab in the UI for RHQ Agent resources ("downloadFailoverList" or something like that) - this tells the agent, "don't wait for an hour to pass, immediately ask the server for your failover list and switch to your primary server now". The agent will then switch to the primary server, and not wait for its hourly job to run.