0 Replies Latest reply on May 5, 2008 6:09 PM by dmurphy2

Providing Safe Buddy Replication Failover

dmurphy2 May 5, 2008 6:09 PM

Hi - these questions relate to establishing the safe operation of buddy replication under AS 4.0.5.

Selection of Buddies
Say we have nodes a1, a2 and a3, booted in that order. What we see is that when a2 starts, it forms a buddy pair with a1. Then, when a3 starts, a1 becomes the backup for a3. So in this scenario a1 is backing up two nodes and a3 is backing up none.

So memory utilization across the nodes is unbalanced. (We now have logging around the session replication listener to analyse this behaviour.)

This seems to be broken. Is this how buddy replication is supposed to select buddies, or do we have a config problem somewhere? What we need is for each node to carry the same backup overhead (memory, CPU etc.) so that the cost of providing replication is spread evenly across the cluster.
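To make the imbalance concrete, here is a minimal sketch in plain Python (not JBoss code). The function names are made up for illustration, and the a1 -> a2 entry in `observed` is my reading of "buddy pair" above. It compares the assignment we observe with a simple ring in which each node is backed up by the next node in boot order.

```python
from collections import Counter


def ring_buddies(nodes):
    """Assign each node's backup to the next node in the list, wrapping around."""
    return {node: nodes[(i + 1) % len(nodes)] for i, node in enumerate(nodes)}


def backup_load(assignment, nodes):
    """Count how many nodes each node is backing up."""
    counts = Counter(assignment.values())
    return {node: counts.get(node, 0) for node in nodes}


nodes = ["a1", "a2", "a3"]

# What we observe after booting a1, a2, a3 in order (owner -> backup node).
# The a1 -> a2 entry is an assumption based on "forms a buddy pair".
observed = {"a1": "a2", "a2": "a1", "a3": "a1"}

# What we would like: a ring, so every node backs up exactly one other node.
desired = ring_buddies(nodes)

print(backup_load(observed, nodes))  # {'a1': 2, 'a2': 1, 'a3': 0}
print(backup_load(desired, nodes))   # {'a1': 1, 'a2': 1, 'a3': 1}
```

The second result is the even distribution we are hoping buddy selection can give us.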

Failover Operation
Having read the JBoss documentation, I still need to understand more about the basic operation of failover. Currently we have no replication, so if an app server node goes down we lose ~25% of users but the other 75% stay pretty much operational. In practice, we will have a cluster of 6 app servers, and during peak times we would see 2000-3000 users per node, so up to 18K concurrent users in all.

What concerns me about using buddy replication is that if a node goes down, we could bring other nodes down as well, since they have to rapidly take over the work of the node that failed (a sort of domino effect). After reading the doc I still don't have a solid understanding of how this process works or what risks we might face.
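As a rough back-of-the-envelope check on the domino concern, the sketch below works out the user load per surviving node. It assumes the failed node's users are spread evenly over the survivors and ignores any extra cost of replication itself. The function name is hypothetical; the figures are the peak numbers above.

```python
def load_after_failure(node_count, users_per_node, failed=1):
    """Users per surviving node if the failed nodes' users are spread evenly."""
    survivors = node_count - failed
    if survivors <= 0:
        raise ValueError("no surviving nodes")
    total_users = node_count * users_per_node
    return total_users / survivors


before = 3000  # peak users per node with all 6 nodes up
after_one = load_after_failure(node_count=6, users_per_node=before)
after_two = load_after_failure(node_count=6, users_per_node=before, failed=2)

print(after_one)  # 3600.0 users per node, a 20% increase
print(after_two)  # 4500.0 users per node, a 50% increase
```

So each survivor needs roughly 20% headroom at peak to absorb one failure, and considerably more if a second node follows.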

Assume a2 backs up a1, a3 backs up a2 and a1 backs up a3. This is buddy replication with one backup buddy. All nodes are fronted by an F5 load balancer that provides sticky sessions and will redirect a user to a random node if the node with its original session fails.

So what, in detail, happens if a1 goes down? After the failure of a1, the F5 will direct some of a1's users to a2 and some to a3.

1) How does the cluster determine who is the new primary owner of a1's session data? Hopefully it will decide to use a2 since it already has a copy of a1's session cache.

2) For users directed to a3 by the F5 - how does a3 now populate its session cache to service those newly arriving users?

3) I assume the cluster also now picks a new buddy for a3, since it lost its buddy a1. In this case it will have to be a2, since there are no other nodes. So the question is - what is the impact (network, CPU etc.) on a2 and a3 while a2 is established as a3's new buddy? What we are worried about is that both a2 and a3 suddenly have a large group of new users to support while also taking the resource hit of replicating each other's session state.
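The scenario behind questions 1-3 can be written down as a toy model. This is only a restatement of the setup described above, with a hypothetical function name; it says nothing about how JBoss actually resolves these cases, which is what I am asking.

```python
def impact_of_failure(backups, failed):
    """Classify the surviving nodes after `failed` goes down.

    `backups` maps each owner node to the node holding its backup copy.
    """
    survivors = [node for node in backups if node != failed]
    holder = backups[failed]  # node that already has the failed node's sessions
    return {
        "has_backup_copy": [n for n in survivors if n == holder],
        "must_fetch_state": [n for n in survivors if n != holder],
        "needs_new_buddy": [n for n in survivors if backups[n] == failed],
    }


# a2 backs up a1, a3 backs up a2, a1 backs up a3 (owner -> backup node).
backups = {"a1": "a2", "a2": "a3", "a3": "a1"}

result = impact_of_failure(backups, failed="a1")
print(result)
# {'has_backup_copy': ['a2'], 'must_fetch_state': ['a3'], 'needs_new_buddy': ['a3']}
```

In this layout a3 is hit twice: it has to obtain a1's session state for the users the F5 sends its way, and it also has to re-establish its own backup on a2.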

Failover Best Practices
What are the buddy replication 'best practices' that we should follow to provide safe and reliable failover in a heavily loaded cluster?