27 Replies. Latest reply on Dec 30, 2014 10:23 AM by jbertram
      • 15. Re: Re: JMS clustering replication mode : Cluster doesn't start with security disabled
        abhiram123

        I made a test package which is a blueprint of our deployment. I am able to reproduce the same scenarios with this test. The zip file contains 4 folders, one for each server. I have tested this using 4 different machines (2 for the cluster, 1 for the MDB, 1 for the topic publisher). I recommend you use the same server topology.

         

        Server 1 : Live

        Edit the JBoss Home directory and the host name in the start-live.bat file and run the batch file.

         

        Server 2 : Backup

        Edit the JBoss Home directory and the host name in the start-backup.bat file and run the batch file.

         

        Server 3 : MDB

        Edit the JBoss Home directory, host name and connection parameters in the start-mdb.bat file and run the batch file.

         

        Server 4 : Topic publisher

        Edit the PROVIDER_URL in the messenger.properties file and run publisher.bat. It publishes the messages to the topics mentioned in the properties file.

         

        Let me know your findings.

        • 16. Re: Re: JMS clustering replication mode : Cluster doesn't start with security disabled
          gaohoward

          Thank you. I'll give it a try.

          • 17. Re: JMS clustering replication mode : Cluster doesn't start with security disabled
            abhiram123

            Hey, any updates? We are blocked on our GA release because of this.

            • 18. Re: JMS clustering replication mode : Cluster doesn't start with security disabled
              gaohoward

              Hi,

               

              I tried to simplify your test case using standalone HornetQ servers. It shows that fail-back works in my environment. I tried both the latest 2.3.21 and 2.3.0.CR1, and both work as expected. Note that in replication mode the backup is expected to stop after fail-back.

              The periodic warnings are caused by the XA recovery manager. You need to remove the pooled-connection-factory from your backup config:

               

                              <pooled-connection-factory name="hornetq-ra">
                                  <transaction mode="xa"/>
                                  <connectors>
                                      <connector-ref connector-name="in-vm"/>
                                  </connectors>
                                  <entries>
                                      <entry name="java:/JmsXA"/>
                                  </entries>
                              </pooled-connection-factory>

               

              If you are using the community version, I strongly recommend that you upgrade HornetQ to the latest release. There have been a lot of bug fixes since CR1.

               

              Howard

              • 19. Re: JMS clustering replication mode : Cluster doesn't start with security disabled
                abhiram123

                I have already mentioned that fail-back works as expected in case of a server shutdown but not in case of a network disconnect. Have you tested the network failure scenario? Network failure is the most probable thing to happen in a customer environment. Why is it expected that the backup server should stop after fail-back? Is it a bug in 2.3.0.CR1 that is fixed in later versions? We can't expect the customer to restart the backup server every time a failover occurs. There is no mention of network failure in any of the documentation regarding failover. Is it expected that the same node id message gets printed continuously if a network failure occurs on the live server and it then comes back into the network? I request you to test the network failure scenario, as that is the case we are having problems with.

                Also, as I mentioned in my previous comment, I have tried updating the HornetQ version to the latest one but without any success. I have only copied the jars I mentioned in one of my previous comments. Do I need to update the JBoss version as well?

                • 20. Re: JMS clustering replication mode : Cluster doesn't start with security disabled
                  gaohoward

                  I only tested a simple version of the case. It's not easy for me to set up a real test environment like yours. I didn't test the network failure, but I think the behavior of a replication-mode backup is the same. The only difference could be that, if the backup becomes live after a network failure while your old live is still running, you need to shut the old live down manually and let the clients fail over to the new live server. You can restart the live after the network is restored.

                  The backup has to stop after fail-back, because (here I quote the user manual)

                   

                  "otherwise the live server has no means to know whether there was a fail-over or not, and if there was if the server that took its duties is still running or not"

                   

                  So I'm curious why you keep getting "same node id" messages when your network has a problem, but not if you shut down the live. Either way the backup always gets the connection failure error. Can you upload the server logs of the two scenarios for comparison?

                   

                  Regarding the last question, I don't think you need to update the JBoss version.

                  • 21. Re: Re: JMS clustering replication mode : Cluster doesn't start with security disabled
                    abhiram123

                    I have the DEBUG level logs for all three servers (Live, Backup and MDB) for both the server shutdown and network failure scenarios. I performed the following steps:

                     

                    Server shutdown

                     

                    1) Started Live, Backup, MDB and Publisher utility in that order - Backup was announced and MDB started receiving messages.

                    2) Killed the Live server - Backup became live after some time and MDB started receiving messages once the backup became live.

                    3) Started the original Live server - Original Live server became live and the backup server was stopped; MDB still receiving messages.

                     

                    Network Failure

                     

                    1) Started Live, Backup, MDB and Publisher utility in that order - Backup was announced and MDB started receiving messages.

                    2) Live server was taken out of the network - Backup became live after some time and MDB started receiving messages once the backup became live.

                    3) Enabled the network on the original Live server - Same node id message gets logged continuously on all three servers, but the MDB is still receiving messages.

                    • 22. Re: Re: JMS clustering replication mode : Cluster doesn't start with security disabled
                      gaohoward

                      Thanks for the logs. I'll take a look at them. Meanwhile, can you tell me whether you ever stopped the live server during the Network Failure scenario, or did you just leave it running the whole time?

                       

                      Howard

                      • 23. Re: Re: JMS clustering replication mode : Cluster doesn't start with security disabled
                        abhiram123

                        I didn't stop the live server during that time. We test on Windows VMs, so I can't even access the live server while it's out of the network.

                        • 24. Re: JMS clustering replication mode : Cluster doesn't start with security disabled
                          gaohoward

                          Unfortunately, in such a situation you have to shut down the live manually and restart it. Currently, what the HornetQ live does in case of a network failure is just stop the replication and start it again when the network recovers.

                          One way to deal with it is to avoid such a split brain situation by setting a proper (longer) connection ttl and failure check period (ping interval) on the cluster connection. (However, there is a bug regarding this, so you need to use the latest version for it to work.)
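
                          As a rough illustration of that setting (a sketch only, not taken from the test package in this thread), a cluster-connection in hornetq-configuration.xml can carry check-period and connection-ttl elements, both in milliseconds. The names my-cluster, netty and dg-group1 are placeholders, and the values are purely illustrative; they would need tuning for the actual network and depend on a release where these settings are honoured.

                          ```xml
                          <cluster-connections>
                              <!-- "my-cluster", "netty" and "dg-group1" are placeholder names -->
                              <cluster-connection name="my-cluster">
                                  <address>jms</address>
                                  <connector-ref>netty</connector-ref>
                                  <!-- how often the connection is checked (ping interval), in ms; illustrative value -->
                                  <check-period>10000</check-period>
                                  <!-- how long to wait without data before the connection is considered dead, in ms; illustrative value -->
                                  <connection-ttl>120000</connection-ttl>
                                  <discovery-group-ref discovery-group-name="dg-group1"/>
                              </cluster-connection>
                          </cluster-connections>
                          ```

                          The idea is that connection-ttl should be longer than the network glitches you expect to ride out, with check-period comfortably shorter than connection-ttl, at the cost of slower detection of a genuinely dead live server.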

                           

                          Howard

                          • 25. Re: JMS clustering replication mode : Cluster doesn't start with security disabled
                            rsinghal

                            Thanks, Howard, for helping us out. Are you suggesting that HornetQ/JBoss is designed this way, or is this a bug? I assume that if someone sets up live-backup for high availability, it should continue to work despite multiple network failures/glitches and should not require manual intervention to restart one of the nodes.

                             

                            Please point us to the JBoss version that has all these fixes so that it works seamlessly.

                             

                            Thanks,

                            Ravi

                            • 26. Re: JMS clustering replication mode : Cluster doesn't start with security disabled
                              gaohoward

                              Yes, this is what is implemented so far. I wouldn't say it's a bug, but I think it's possible we could improve this in the future. There seems to be no simple way to handle such a 'split brain' situation fully automatically, and HornetQ tries to deal with it as best it can, for example by using a quorum algorithm. As I suggested, you'd better set the connection ttl and ping values to suit your actual network situation (for example, if the network is slow and has long glitches, you would probably use larger connection ttl/ping values).

                               

                              If you are using community releases of JBoss and HornetQ, I suggest you update both to their latest stable releases. Only EAP releases come with a specific version combination, because they have strict procedures for the integration.

                               

                              Good luck

                              • 27. Re: JMS clustering replication mode : Cluster doesn't start with security disabled
                                jbertram

                                There are ways to mitigate "split brain" situations, but that requires a larger cluster.  See the paragraph regarding "split brain" at the end of section 39.1.2.

                                 

                                Yours is the pathological case where there is only 1 live and 1 backup and they get disconnected from each other long enough that the connection-ttl elapses.  You may consider switching to shared-storage if you find the network connection between the live and backup server is not reliable.
