16 Replies Latest reply on Oct 7, 2010 10:42 AM by kabirkhan

    ServerManager-Server(-ProcessManager) communication

    kabirkhan

      A change in how messages are passed between the ServerManager (SM) and Server instances was mentioned on IRC. Currently the SM and Server processes connect to the ProcessManager (PM) which then routes the messages to the appropriate process. The new idea is that Servers connect directly to SM, leaving the PM solely in charge of starting, stopping and reconnecting processes.

       

      I think the flow for how that would work is something like the following. Stuff I'm unsure about is in bold

       

      1. PM starts up and listens on a socket on a port (PPM) for connections from the processes it manages.
      2. PM starts SM passing in PPM, PM’s host address and ‘ServerManager’ as the name
        1. SM records PPM, PM’s host address and ‘ServerManager’ in a file
        2. SM opens a socket on a different port (PSM) which listens for connections from the Server processes and from the Domain Controller (DC).
        3. SM initiates communication with PM, by connecting to port PPM. The first command it sends is 'STARTED ServerManager <PSM> <SM_ADDRESS>', which helps PM associate the socket with the correct ManagedProcess.
        4. For each Server configured in SM:
          1. SM tells PM to add Server
          2. SM tells PM to start Server process
            1. PM launches the Server process, passing in PPM, PM_ADDRESS, PSM, SM_ADDRESS and the SERVER_NAME
            2. Server initiates communication with PM, by connecting to port PPM. The first command it sends is 'STARTED <SERVER_NAME>', which helps PM associate the socket with the correct ManagedProcess.
            3. Server starts listening for commands on the PM socket.
            4. Server initiates communication with SM, by connecting to port PSM. The first command it sends is 'AVAILABLE <SERVER_NAME>', which helps SM associate the socket with the correct Server proxy.
            5. Server starts listening for commands on the SM socket
          3. SM sends the ‘START serverConfig’ message to the server via the Server’s socket
            1. Server parses the serverConfig, starts up and sends to SM either
              1. ‘STARTED’ if successful.
              2. ‘START_FAILED’ if failed
                1. SM tells PM to stop process???
                2. SM tells PM to remove process???
      3. While a ManagedProcess is registered as started in PM
        1. PM regularly pings process on the process's socket (Or should SM instead perhaps pick up on when the Server socket is closed?)
          1. Server or SM process sends ping back
        2. If a reply is not received from the process or the process's socket is closed:
          1. For Server processes, PM stops the ManagedProcess and sends ‘DOWN <SERVER_NAME>’ to SM on the PM-SM socket
            1. SM does 2.4.2 and 2.4.3 for the Server according to its respawn policy if it has not initiated shutdown of that server
              1. After more retries than the respawn policy max
                1. SM tells PM to stop process
                2. SM tells PM to remove process
          2. For ServerManager???
      4. To shutdown a server
        1. SM sends ‘SHUTDOWN’ to server.
          1. Server closes down
          2. Server sends ‘STOPPED’ to SM.
          3. SM tells PM to stop the Server process
          4. SM tells PM to remove the Server process
      5. Closing down everything
        1. Shutdown hook in PM sends 'SHUTDOWN' message to SM
          1. For each server
            1. do step 4
          2. SM sends 'STOPPED' command to PM
          3. PM stops and removes SM process
      6. Restarting SM
        1. SM process is stopped by
          1. Message from PM?
          2. Message from DC?
          3. Process is killed
        2. SM is down...
        3. SM process is started
          1. SM reads PPM, PM address and process name from file (or should it be restarted via PM? In which case this could be passed in as in 2.)
          2. See 2.2
          3. See 2.3
          4. Some differentiator is needed to not do 2.4. SM sends “RESTARTED” command to PM
            1. For each Server process PM sends ‘SM_RESTARTED <PSM> <SM_ADDRESS>’
              1. Server reconnects to SM as in 2.4.2.4
              2. SM sends STATUS to Server
              3. Server responds with STARTED, START_FAILED etc.
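
A minimal sketch of how the first line on a new connection could be parsed so that PM can associate the socket with the right ManagedProcess, as in steps 2.3 and 2.4.2.2 above. The class name `ProtocolMessage` and the whitespace-separated framing are assumptions for illustration, not the actual implementation.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical value type for one protocol line, e.g. "STARTED ServerManager 9999 127.0.0.1". */
final class ProtocolMessage {
    final String command;
    final List<String> args;

    private ProtocolMessage(String command, List<String> args) {
        this.command = command;
        this.args = args;
    }

    /** Splits a line on whitespace: the first token is the command, the rest are arguments. */
    static ProtocolMessage parse(String line) {
        String[] tokens = line.trim().split("\\s+");
        if (tokens[0].isEmpty()) {
            throw new IllegalArgumentException("Empty protocol line");
        }
        List<String> args = Arrays.asList(tokens).subList(1, tokens.length);
        return new ProtocolMessage(tokens[0], Collections.unmodifiableList(args));
    }

    /** For a STARTED handshake, the first argument names the process that opened the socket. */
    String processName() {
        if (!"STARTED".equals(command) || args.isEmpty()) {
            throw new IllegalStateException("Not a STARTED handshake: " + command);
        }
        return args.get(0);
    }
}
```

PM would call `ProtocolMessage.parse(firstLine).processName()` once per accepted connection and use the result as the lookup key for the ManagedProcess.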

       

       

      I'm not really clear on what initiates 6 and what the steps should be there

      I think SM should be responsible for the respawning of servers rather than PM which is what does that at the moment.

       

                • 1. Re: ServerManager-Server(-ProcessManager) communication
                  brian.stansberry

                  Agreed on SM should be responsible for respawning servers. But I'll comment further in another post.

                   

                  Re:

                   

                  2.4.3.2.1 and .2 -- those seem correct. Otherwise there is a kind of zombie process. I suppose 2.4.3.2.2 (the remove) could be skipped if the SM knew it was going to restart the process.

                   

                  3.1 I like the idea of the PM being responsible for failure detection. For one thing it has the stdio streams as a fallback to confirm server failure. Contrived example: admin restarts interface lo so all the socket connections break. But the servers are all still working and can reconnect.

                   

                  3.2.2 I think the PM should restart the SM. The PM has no interface beyond whatever it reads in from the command line. And the servers expect to be managed via the SM. So if the SM is allowed to die and not restart, the entire set of processes can't be managed except via a kill command.

                   

                  The original design was to always have a running SM. We then split out the PM as a separate process just to

                  • Allow the SM to be upgraded/patched w/o requiring Servers to shut down. The PM would be dead simple with no dependencies and thus wouldn't need to be patched.
                  • Make the Servers more reliable by having the process that consumes their stdio streams as simple as possible. So bugs in SM would not cause Servers to crash.

                   

                  5 Besides the shutdown hook in the PM, the other reasons for closing down everything are

                  • SM receives a command from the DC telling it to do so
                  • If the SM exposes a remote management interface, it receives a command via that telling it to do so

                   

                  6.1.1 I don't think this applies. The PM has no remote interface that would let it receive an instruction to restart the SM. And I don't think any internal state change in the PM would trigger such a thing.

                   

                  6.3.1 IMO it would be restarted by the PM.

                   

                  6.3.4 Could the differentiator be a simple param passed by the PM to the SM on the command line?

                   

                  Something to think about though is that the SM needs to get its internal state consistent with the actual state of affairs. What if it thinks there are 3 servers but actually there are 4? (That would be odd.) Or there are only two -- what triggers it to realize it never got a new connection from the 3rd? Right now the SM doesn't keep much state, but as we flesh it out it probably will (e.g. a copy of each Server's Standalone config object). Following a restart it probably shouldn't just assume its state is consistent with the Server. That can be checked easily enough by getting the result of Standalone.elementHash() from the server and comparing it to its own value.

                   

                  One kinda hacky thought is if the PM knows the ManagedProcess is already started, it could ignore 2.4.1 and when it receives 2.4.2 it sends an SM_RESTARTED message to the Server (instead of starting it). That tells the server to connect to the SM.

                   

                  Another thing to think about with 2.4 is whether we want to support concurrent startup of servers. Also, we need some mechanism to control the order of server start across the domain. That could just be following the order of server-group elements in the domain.xml, and then the order of server elements in the host.xml.

                  • 2. Re: ServerManager-Server(-ProcessManager) communication
                    brian.stansberry

                    Re: respawning Servers, why would a Server die?

                     

                    1) User kills the Server process via an OS tool. Why would they do that?

                     

                    a) By mistake. Restarting is helpful in this case.
                    b) Because the user doesn't understand our admin interfaces. Here restarting is helpful since they will need to learn to use the management interfaces if they actually want the server to stay dead.

                    c) Because the DC is dead and can't be revived or the SM is misbehaving, so the user can't shut down or restart the server via the normal admin interface. Here we don't know whether the user wanted a restart.

                     

                    2) Some other fatal error occurs. Should we let a human try and figure out what happened? Maybe bringing the server back up will just increase instability in the overall domain.

                     

                    Looking at that, I see no clear answer whether restarting makes sense. So, how about punting? Add an auto-restart flag to the <server> element. Default is true.

                    • 3. Re: ServerManager-Server(-ProcessManager) communication
                      kabirkhan

                      The respawn policy should be configurable: https://jira.jboss.org/browse/JBAS-8390
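
A minimal sketch of what a configurable respawn decision could look like, combining the auto-restart flag with a retry limit. The class `RespawnPolicy` and its `maxRetries` parameter are hypothetical; the real policy tracked in JBAS-8390 may be shaped quite differently.

```java
/** Hypothetical respawn rule: restart only if auto-restart is on and the retry budget is not used up. */
final class RespawnPolicy {
    private final boolean autoRestart;
    private final int maxRetries;

    RespawnPolicy(boolean autoRestart, int maxRetries) {
        if (maxRetries < 0) {
            throw new IllegalArgumentException("maxRetries must be >= 0");
        }
        this.autoRestart = autoRestart;
        this.maxRetries = maxRetries;
    }

    /** @param retriesSoFar number of respawn attempts already made for this server */
    boolean shouldRespawn(int retriesSoFar) {
        return autoRestart && retriesSoFar < maxRetries;
    }
}
```

When `shouldRespawn` returns false, SM would tell PM to remove the process rather than start it again.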

                      • 4. Re: ServerManager-Server(-ProcessManager) communication
                        kabirkhan

                        I have implemented the following but still need to test some corner cases

                         

                        1. PM starts up and listens on a socket on a port (PPM) for connections from the processes it manages.
                        2. PM starts SM passing in PPM, PM’s host address and ‘ServerManager’ as the name
                          1. SM opens a socket on a different port (PSM) which listens for connections from the Server processes and from the Domain Controller (DC).
                          2. SM initiates communication with PM, by connecting to port PPM. The first command it sends is 'CONNECTED ServerManager', which helps PM associate the socket with the correct ManagedProcess.
                          3. For each Server configured in SM:
                            1. SM tells PM to add Server
                            2. SM tells PM to start Server process
                              1. PM launches the Server process, passing in PPM, PM_ADDRESS, PSM, SM_ADDRESS and the SERVER_NAME
                              2. Server initiates communication with PM, by connecting to port PPM. The first command it sends is 'CONNECTED <SERVER_NAME>', which helps PM associate the socket with the correct ManagedProcess.
                              3. Server starts listening for commands on the PM socket.
                              4. Server initiates communication with SM, by connecting to port PSM. The first command it sends is 'CONNECTED <SERVER_NAME>', which helps SM associate the socket with the correct Server proxy.
                              5. Server sends the ‘SERVER_AVAILABLE’ command on the SM socket
                              6. Server starts listening for commands on the SM socket
                            3. SM sends the ‘START_SERVER serverConfig’ message to the server via the Server’s socket
                              1. Server parses the serverConfig, starts up and sends to SM either
                                1. ‘SERVER_STARTED’ if successful.
                                2. ‘SERVER_START_FAILED’ if failed
                                  1. SM tells PM to stop process
                                  2. If Server auto-restart=true and number of retries is < respawn_policy_max SM repeats 2.3.2 and 2.3.3 - The respawn policy should be configurable (https://jira.jboss.org/browse/JBAS-8390)
                                  3. Otherwise tell PM to remove process
                        3. While a ManagedProcess is registered as started in PM
                          1. Processes are connected to the PM socket
                          2. The ManagedProcess monitors whether the process is still alive (with a thread doing Process.waitFor())
                            1. If a Server process goes down PM stops the process and sends ‘DOWN <SERVER_NAME>’ to SM on the PM-SM connection.
                              1. SM respawns the server process according to the rules in 2.3.3.2.
                            2. If the ServerManager process goes down PM respawns it as in 6.3
                        4. To shut down a server
                          1. SM sends ‘STOP_SERVER’ to server.
                            1. Server closes down
                            2. Server sends ‘SERVER_STOPPED’ to SM.
                            3. SM tells PM to stop the Server process
                            4. SM tells PM to remove the Server process
                        5. Closing down everything
                          1. Message to shutdown comes from
                            1. SM gets SHUTDOWN command from DC or management interface
                            2. Shutdown hook in PM
                              1. Send ‘SHUTDOWN_SERVERS’ command to SM
                                1. For each server do 4 to close it down
                              2. PM sends 'SHUTDOWN' message to SM which closes down SM as in 6
                        6. Restarting SM
                          1. SM process is stopped by
                            1. Message from DC
                            2. Process is killed
                          2. SM is down...
                          3. SM process is started
                            1. PM starts SM passing in PPM, PM’s host address and ‘ServerManager’ as the name along with the -restarted-server-manager flag.
                            2. See 2.1
                            3. See 2.2
                            4. SM sends the ‘RECONNECT_SERVERS <SM_ADDRESS> <PSM>’ command to PM
                              1. For each Server process PM sends ‘RECONNECT_SERVER_MANAGER <SM_ADDRESS> <PSM>’
                                1. Server reconnects to SM as in 2.3.2.4
                                2. Server sends ‘SERVER_RECONNECT_STATUS <Current_State>’ to SM
                                  1. if the server is not in the starting, started, stopping or stopped state (I added some basic state management) SM does 2.3.3
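
The decision in 6.3.4.1.2.1 could be sketched roughly as follows. Only the starting, started, stopping and stopped states come from the description above; the other enum constants and the class name `ReconnectHandler` are assumptions made for illustration.

```java
import java.util.EnumSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical server states; only STARTING, STARTED, STOPPING and STOPPED are named in the thread. */
enum ServerState { BOOTING, AVAILABLE, STARTING, STARTED, STOPPING, STOPPED, FAILED }

final class ReconnectHandler {
    // States in which a restarted SM leaves the server alone.
    private static final Set<ServerState> NO_ACTION = EnumSet.of(
            ServerState.STARTING, ServerState.STARTED, ServerState.STOPPING, ServerState.STOPPED);

    /** Parses "SERVER_RECONNECT_STATUS <Current_State>" and says whether SM should send START_SERVER. */
    static boolean needsStart(String statusLine) {
        String[] tokens = statusLine.trim().split("\\s+");
        if (tokens.length != 2 || !"SERVER_RECONNECT_STATUS".equals(tokens[0])) {
            throw new IllegalArgumentException("Unexpected message: " + statusLine);
        }
        ServerState state = ServerState.valueOf(tokens[1].toUpperCase(Locale.ROOT));
        return !NO_ACTION.contains(state);
    }
}
```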
                        • 5. Re: ServerManager-Server(-ProcessManager) communication
                          kabirkhan

                          Once this is all properly tested and I get the domain controller to work in my branch I wonder what to do next. We have two sets of connections at the moment:

                           

                          1) Server processes connect to SM via a socket

                          2) Server/SM processes connect to PM via a socket

                           

                          If 1) goes down I was originally going to suggest routing the commands via 2) instead, but I think it makes a lot more sense to send the RECONNECT_SERVERS command via 2 instead? The simplest way would be in SM to stop and start the SM listening socket whenever the direct connection to ANY server goes down.

                           

                          If 2) goes down I am not sure what to do. I have thrown away all the stdio stuff that was originally there. But if 2) falls over I should switch to stdio for that process and route the commands through there and also invent some commands to make the process reconnect to the PM socket. Does anybody know if there are any problems with suddenly starting to consume a process's standard output and write to its standard input, and then stopping again?

                          • 6. Re: ServerManager-Server(-ProcessManager) communication
                            brian.stansberry

                            I agree it makes more sense to handle a failure of 1) by using the PM to tell the server(s) to reconnect. Stopping and starting the SM's listening socket gives me a bit of a queasy feeling though; it could interrupt other on-going communication. Is the advantage mainly simplicity (stop and start the socket triggers the existing 6.3.4 logic above instead of requiring the SM to send a command to the PM)?

                             

                            If 2) goes down, using stdio as needed to send the commands that trigger reconnecting sounds fine. I don't know any reason why using stdio again would be a problem. The PM needs to consume each process' stdout and stderr anyway.

                             

                            I don't think it's good though to try and route commands over stdio besides the ones needed to get the socket communication going again. Otherwise we face the problem of dealing with junk sent via stdout that led to our using sockets. One thing though is the PM can safely use each child process' stdin to send commands to that process.

                            • 7. Re: ServerManager-Server(-ProcessManager) communication
                              kabirkhan

                              Brian Stansberry wrote:

                               

                              I agree it makes more sense to handle a failure of 1) by using the PM to tell the server(s) to reconnect. Stopping and starting the SM's listening socket gives me a bit of a queasy feeling though; it could interrupt other on-going communication. Is the advantage mainly simplicity (stop and start the socket triggers the existing 6.3.4 logic above instead of requiring the SM to send a command to the PM)?

                              I suggested restarting the listener for two reasons. It sounded simpler, and I am not sure in what situations the SM listener would be well and truly broken. But I can definitely send a command to PM instead (and then if we find scenarios where we need to restart SM's listener I'll deal with those later).

                              The PM needs to consume each process' stdout and stderr anyway.

                              Not sure what you mean here? We still consume stderr for logging but I have done away with the stdout consumption. If you mean for monitoring whether the process is still alive, that is now done by a simple Process.waitFor().
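
A rough sketch of the kind of liveness monitoring meant here: a thread that blocks in Process.waitFor() and reports when the process exits. The class name `ProcessWatcher` and the callback shape are hypothetical, not the actual ManagedProcess code.

```java
import java.util.function.IntConsumer;

final class ProcessWatcher {
    /** Starts a daemon thread that blocks in Process.waitFor() and passes the exit code to onExit. */
    static Thread watch(Process process, IntConsumer onExit) {
        Thread t = new Thread(() -> {
            try {
                onExit.accept(process.waitFor());
            } catch (InterruptedException e) {
                // The watcher was cancelled; restore the flag and stop watching.
                Thread.currentThread().interrupt();
            }
        }, "process-watcher");
        t.setDaemon(true);
        t.start();
        return t;
    }
}
```

In PM the callback would be where the process is marked as stopped and, for a Server, where the DOWN message to SM is sent.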

                              I don't think it's good though to try and route commands over stdio besides the ones needed to get the socket communication going again. Otherwise we face the problem of dealing with junk sent via stdout that led to our using sockets. One thing though is the PM can safely use each child process' stdin to send commands to that process.

                              I agree. To keep things simple:

                              a) PM will only listen for input from processes via the process sockets

                              b) PM will only send data to processes via stdin

                               

                              If the socket goes down PM will send a message to the process via its stdin to reconnect to its socket. For b) I am currently using the socket to push commands from PM->Process and was going to switch over to stdin when the socket goes down, but it will be a lot simpler, with less switching between different communication mechanisms, if we just use stdin all the time.
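
For b), a minimal sketch of pushing line-based commands to a child through its stdin. `StdinCommandSender` is a hypothetical name, and the newline-terminated framing is an assumption; the stream passed in would typically be `process.getOutputStream()`, which feeds the child's standard input.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

final class StdinCommandSender {
    private final Writer writer;

    /** @param childStdin typically process.getOutputStream(), which is connected to the child's stdin */
    StdinCommandSender(OutputStream childStdin) {
        this.writer = new BufferedWriter(new OutputStreamWriter(childStdin, StandardCharsets.UTF_8));
    }

    /** Writes one command per line and flushes immediately so the child sees it without delay. */
    synchronized void send(String command) throws IOException {
        writer.write(command);
        writer.write('\n');
        writer.flush();
    }
}
```

The flush after every command matters here, since a buffered command that never reaches the child would look like a hung process.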

                              • 8. Re: ServerManager-Server(-ProcessManager) communication
                                brian.stansberry

                                Kabir Khan wrote:

                                 

                                Brian Stansberry wrote:

                                 

                                I agree it makes more sense to handle a failure of 1) by using the PM to tell the server(s) to reconnect. Stopping and starting the SM's listening socket gives me a bit of a queasy feeling though; it could interrupt other on-going communication. Is the advantage mainly simplicity (stop and start the socket triggers the existing 6.3.4 logic above instead of requiring the SM to send a command to the PM)?

                                I suggested restarting the listener for two reasons. It sounded simpler, and I am not sure in what situations the SM listener would be well and truly broken. But I can definitely send a command to PM instead (and then if we find scenarios where we need to restart SM's listener I'll deal with those later).

                                 

                                Sounds good.

                                 

                                The PM needs to consume each process' stdout and stderr anyway.

                                Not sure what you mean here? We still consume stderr for logging but I have done away with the stdout consumption. If you mean for monitoring whether the process is still alive, that is now done by a simple Process.waitFor().

                                 

                                A key thing the PM needs to do is handle the last sentence in this bit from the java.lang.Process class javadoc:

                                 

                                "The created subprocess does not have its own terminal or console. All its standard io (i.e. stdin, stdout, stderr) operations will be redirected to the parent process through three streams (getOutputStream(), getInputStream(), getErrorStream()). The parent process uses these streams to feed input to and get output from the subprocess. Because some native platforms only provide limited buffer size for standard input and output streams, failure to promptly write the input stream or read the output stream of the subprocess may cause the subprocess to block, and even deadlock."

                                 

                                So, whatever comes out of stdout or stderr needs to be consumed promptly.

                                 

                                Using stdin all the time for PM -> Process sounds good.
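
A minimal sketch of draining a child's output promptly so it cannot block on a full pipe buffer. The class name `StreamPump` and the line-based sink are assumptions for illustration; one pump would be started for `process.getInputStream()` (the child's stdout) and another for `process.getErrorStream()`.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;

final class StreamPump {
    /** Starts a daemon thread that reads the stream line by line until end of stream. */
    static Thread pump(InputStream in, Consumer<String> sink, String threadName) {
        Thread t = new Thread(() -> {
            try (BufferedReader reader =
                         new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    sink.accept(line); // e.g. forward to a logger
                }
            } catch (IOException e) {
                // Stream closed, usually because the child exited; nothing more to drain.
            }
        }, threadName);
        t.setDaemon(true);
        t.start();
        return t;
    }
}
```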

                                • 9. Re: ServerManager-Server(-ProcessManager) communication
                                  kabirkhan

                                  Brian Stansberry wrote:

                                   

                                  A key thing the PM needs to do is handle the last sentence in this bit from the java.lang.Process class javadoc:

                                   

                                  "The created subprocess does not have its own terminal or console. All its standard io (i.e. stdin, stdout, stderr) operations will be redirected to the parent process through three streams (getOutputStream(), getInputStream(), getErrorStream()). The parent process uses these streams to feed input to and get output from the subprocess. Because some native platforms only provide limited buffer size for standard input and output streams, failure to promptly write the input stream or read the output stream of the subprocess may cause the subprocess to block, and even deadlock."

                                  Thanks, that is kind of what I was worried about regarding stdout/-err. I should have read the javadoc!

                                  • 10. Re: ServerManager-Server(-ProcessManager) communication
                                    kabirkhan

                                    Kabir Khan wrote:

                                    But I can definitely send a command to PM instead (and then if we find scenarios where we need to restart SM's listener I'll deal with those later).

                                    When SM detects that a server's socket connection is down it now sends 'RECONNECT_SERVER <SERVER_NAME> <SM_ADDRESS> <PSM>' to PM which then tells the server to reconnect (it results in 6.3.4.1 for the server process)
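
A small sketch of how the 'RECONNECT_SERVER <SERVER_NAME> <SM_ADDRESS> <PSM>' line could be built and parsed. The class `ReconnectServerCommand` is hypothetical and assumes whitespace-separated fields with no spaces inside the server name or address.

```java
/** Hypothetical representation of "RECONNECT_SERVER <SERVER_NAME> <SM_ADDRESS> <PSM>". */
final class ReconnectServerCommand {
    static final String NAME = "RECONNECT_SERVER";

    final String serverName;
    final String smAddress;
    final int smPort;

    ReconnectServerCommand(String serverName, String smAddress, int smPort) {
        this.serverName = serverName;
        this.smAddress = smAddress;
        this.smPort = smPort;
    }

    /** Builds the line SM sends to PM. */
    String format() {
        return NAME + " " + serverName + " " + smAddress + " " + smPort;
    }

    /** Parses the line on the PM side; rejects anything that is not a well-formed command. */
    static ReconnectServerCommand parse(String line) {
        String[] tokens = line.trim().split("\\s+");
        if (tokens.length != 4 || !NAME.equals(tokens[0])) {
            throw new IllegalArgumentException("Not a " + NAME + " command: " + line);
        }
        return new ReconnectServerCommand(tokens[1], tokens[2], Integer.parseInt(tokens[3]));
    }
}
```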

                                    • 11. Re: ServerManager-Server(-ProcessManager) communication
                                      kabirkhan

                                      Jason,

                                       

                                      Back from holidays. In my topic branch I am going to revert the work you did for http://github.com/jbossas/jboss-as/commit/7f3f5d0f86c4c329d0dd4a041df6ebc9889c5975 since now all ManagedProcess->actual process communications happen via the process's stdin

                                      • 12. Re: ServerManager-Server(-ProcessManager) communication
                                        jason.greene

                                        Actually I meant to post about that, but never got around to it. Basically the problem is that we need to use stdin to ship an early logging context (and any other pre-bootstrap data) as soon as the process starts. Brian, David, and I talked last week about the points above on switching all communications to stdin, and really it doesn't offer much advantage over a TCP connection over loopback. If that connection drops, both parties will know right away.

                                        • 13. Re: ServerManager-Server(-ProcessManager) communication
                                          kabirkhan

                                          I got rid of the ProcessManager send/broadcast methods taking a String list.  I think I can get rid of the methods taking byte[] as well? Nothing is using them as DC<->SM and SM<->Server use direct communication, and PM<->Process should use the PM protocol. Any objections?

                                          • 14. Re: ServerManager-Server(-ProcessManager) communication
                                            dmlloyd

                                            As far as I'm concerned, any unused methods can get removed without confirmation.  If someone is using such a method locally, they can reinstate it as part of their merge.  Forgiveness > permission in this case. 
