Bridge reconnection stopping node rejoining clu...| JBoss.org Content Archive (Read Only)

15. Re: Bridge reconnection stopping node rejoining cluster

grantlittle Nov 15, 2010 10:58 PM (in response to gaohoward)

Hi Yong,

Thanks for the update.

In the first scenario I'm guessing that the bridge which previously had a connection to the backup server is regularly trying to re-establish that connection. Once it succeeds (after the backup is restarted), it makes the backup server go active before the live server is started. When the live server is then started, it attempts to connect to the backup, which is now marked as active and is therefore no longer a backup server!

In the live/backup non-clustered scenario (my second scenario) I'm guessing that the client (producer) tries to re-establish a connection to the live server but can't, so it then tries the backup. That causes the backup to go active before the live server is started, and the client establishes a connection with the backup.

This sounds like it could be an awkward one to fix: you need clients to be able to connect to the backup if they fail to connect to the live (to ensure a reliable production environment), but at the same time this feature stops the live server from starting up. Not sure what the best approach is.

I read somewhere on the forums that the requirement to always start the backup before the live might be changing. If what I've described is actually happening then these changes may have an effect on this situation. Do you know what the changes are in this regard?

16. Re: Bridge reconnection stopping node rejoining cluster

gaohoward Nov 16, 2010 12:52 AM (in response to grantlittle)

Hi Grant,

I think you are right. I have created a JIRA to track this.

https://jira.jboss.org/browse/HORNETQ-572

I think it should be a simple change to reverse the startup sequence of backup/live. But for the moment I'm not sure how much it will affect other use cases.

Thanks

17. Re: Bridge reconnection stopping node rejoining cluster

grantlittle Nov 16, 2010 1:28 AM (in response to gaohoward)

Hi Yong,

I have just had a look at the code. What I think is happening is this: when the backup server receives a packet of type CREATESESSION (sent either by a bridge trying to re-establish its connection, or by a client trying to reconnect while the live server is down), it checks whether the server is a backup, which it is. It then activates the server. This means that when the live comes up it can no longer connect to the backup.

I'm not sure how easy it is to change the startup order (or if this is in the plan). What I suggest as a simpler approach for now is that the backup does not accept session connection attempts for a specified time, which allows the live server to connect. Once the live server has established a connection, the backup is set to allow new sessions; if the live server doesn't connect within the specified time, the backup also starts to accept connections.
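To make the idea concrete, here is a minimal sketch of such a grace-period gate. It is only an illustration under assumptions: `BackupSessionGate`, `onLiveConnected` and `mayAcceptSession` are hypothetical names, not HornetQ classes or methods, and this is not the patch itself. The clock is injected so the timing can be exercised without sleeping.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.LongSupplier;

/**
 * Hypothetical gate (not part of HornetQ): a backup refuses session
 * creation until either the live server has connected or a grace
 * period has elapsed since the backup started.
 */
final class BackupSessionGate {
    private final long graceMillis;
    private final LongSupplier clock;   // assumed millisecond time source
    private final long startedAt;
    private final AtomicBoolean liveConnected = new AtomicBoolean(false);

    BackupSessionGate(long graceMillis, LongSupplier clock) {
        this.graceMillis = graceMillis;
        this.clock = clock;
        this.startedAt = clock.getAsLong();
    }

    /** Call once the live server has established its connection to the backup. */
    void onLiveConnected() {
        liveConnected.set(true);
    }

    /** True if a session-creation request may be accepted right now. */
    boolean mayAcceptSession() {
        boolean graceExpired = clock.getAsLong() - startedAt >= graceMillis;
        return liveConnected.get() || graceExpired;
    }
}
```

In a real server the clock would typically be `System::currentTimeMillis`, and a rejected request would have to be answered in a way that makes the client retry later rather than give up; how that maps onto the actual packet handling is not shown here.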

I'm currently writing a patch to do this which I will attach to the JIRA when it's finished.

I definitely feel that there are better ways of doing this but we can't use HornetQ as is with this defect in a production system so this is better than nothing for now. If the HornetQ developers have a better approach then that is great. I don't know the code base that well so I will leave it in their hands.

Grant

18. Re: Bridge reconnection stopping node rejoining cluster

grantlittle Nov 17, 2010 2:12 AM (in response to grantlittle)

It appears that this JIRA has been rejected and closed as a design feature. Can someone explain why this was the chosen design?

It unfortunately requires us to have a complete application layer outage to stop the many producers and consumers we have from attempting to reconnect to HornetQ when trying to restart the live/backup pair (after a failover).

Having to restart an entire application layer to simply allow for queue reconnection doesn't seem like a very good HA solution.

We have written a workaround patch to help alleviate some of the exhibited problems with this issue. It is far from ideal and I believe there are better ways of doing it but hopefully the new HA solutions being proposed in HornetQ will fix some of these issues. I was going to attach it to the JIRA but as it is closed it appears I can't attach it anymore. I will add it here in case it helps someone out.

ReplicationCantConnect.patch.zip 2.2 KB

19. Re: Bridge reconnection stopping node rejoining cluster

clebert.suconic Nov 17, 2010 8:17 AM (in response to grantlittle)

With the new HA, the backup and live are protected by a file lock, so that won't be an issue with the new failover. That's why we closed that JIRA. (It's moot with the new failover.)
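As a rough illustration of what coordinating through a file lock can look like in general, here is a sketch using the standard `java.nio` file-locking API. It is an assumption-laden example, not HornetQ's implementation: `LiveLock`, `tryAcquire` and `isHeld` are hypothetical names, and the lock file location is whatever path both nodes can see.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper (not a HornetQ class): whoever holds the lock acts as live. */
final class LiveLock implements AutoCloseable {
    private final FileChannel channel;
    private final FileLock lock;

    private LiveLock(FileChannel channel, FileLock lock) {
        this.channel = channel;
        this.lock = lock;
    }

    /** Returns a held lock, or null if the lock is already held elsewhere. */
    static LiveLock tryAcquire(Path lockFile) {
        try {
            FileChannel ch = FileChannel.open(lockFile,
                    StandardOpenOption.CREATE, StandardOpenOption.WRITE);
            FileLock l;
            try {
                l = ch.tryLock();   // null when another process holds the lock
            } catch (OverlappingFileLockException e) {
                l = null;           // already held within this same JVM
            }
            if (l == null) {
                ch.close();
                return null;        // caller stays in backup mode and retries later
            }
            return new LiveLock(ch, l);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    boolean isHeld() {
        return lock.isValid();
    }

    /** Releases the lock so that the other node can take over. */
    @Override
    public void close() {
        try {
            lock.release();
            channel.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With a scheme like this, a client connecting to the backup cannot by itself make the backup go active; only obtaining the lock would. Note that it relies on both nodes seeing the same file, and that file-lock behaviour on network file systems depends on the platform.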

20. Re: Bridge reconnection stopping node rejoining cluster

grantlittle Nov 17, 2010 7:48 PM (in response to clebert.suconic)

Thanks Clebert,

A file lock - does this mean there must be a shared file system between the live and backup pair?

The reason we are seeing this issue is that we have chosen replication (shared disk introduces a single point of failure). As far as I am aware, the issue only appears under replication, when a client tries to establish a connection (because the live is unavailable) and thereby activates the backup server before the live server has been restarted and has received a response to its request to initialise replication.

The reason I ask is that I have actually been using the trunk code base (which I believe is 2.2.x) and am still seeing this issue. If the code has not been committed/written yet then that is understandable, but if it has been committed to Subversion then it appears there may still be a problem. Maybe there is a different way of configuring the nodes that I am not aware of as part of the new HA features?

21. Re: Bridge reconnection stopping node rejoining cluster

clebert.suconic Nov 17, 2010 11:18 PM (in response to grantlittle)

I'm not sure yet about the schedule for replication. We will get the shared file system done first. There are redundant disks that won't introduce a single point of failure. (I know, I know, not everybody has the budget for those. We will get replication back; it's just that we will do first things first.)

22. Re: Bridge reconnection stopping node rejoining cluster

grantlittle Nov 17, 2010 11:53 PM (in response to clebert.suconic)

Hi Clebert,

So when you say you will look at the replication later, do you still mean as part of 2.2.0? I would presume so.

23. Re: Bridge reconnection stopping node rejoining cluster

jombo Nov 24, 2010 10:10 PM (in response to grantlittle)

I think that is because when your backup server starts for the first time, it is added to the cluster and the other live node makes a core bridge connection to it. So when you restart the backup, it will be activated, and the live server will throw an exception like that when it is started.