      • 30. Re: Austin Clustering meeting thoughts

        Just in case we have a naming problem.

        When I say a "hot backup" I mean it will transparently failover
        to that backup. I don't mean the backup can be used while the original
        is still intact.

        This is different to the current JBossMQ HASingleton which I describe
        as a "cold backup" because the failover is not transparent to the client.
        It must make the effort to reconnect to the failover node when it is told
        the original one has failed.

        • 31. Re: Austin Clustering meeting thoughts
          marklittle

           

          "adrian@jboss.org" wrote:
          "mark.little@jboss.com" wrote:

          I agree, but that doesn't negate the original issue (or my reading of it): how to ensure that the node that is hosting the queue is highly available?


          I think you mean the queue is HA? :-)
          The node is just a JVM that can crash at any time.


          Actually both are valid ;-) but in this case, you're correct: my terminology was sloppy.


          I said before there needs to be a "hot" backup. That is provided
          either by replication or by a shared persistent store/logs.


          Yes. Like I said much earlier, the passive replication technique makes more sense here too because of the non-deterministic nature of the entity being replicated.


          This introduces the other major problem: that of messages/transactions getting temporarily "lost" until the crashed node recovers any
          prepared transactions.

          e.g. In the store/forward protocol, it is possible that the client
          has been told the message will be delivered, but it won't actually
          be delivered until the node it sent the message to has recovered
          and forwarded the message to the real destination.

          This does not break the JMS spec, which just guarantees delivery.
          It doesn't guarantee when, or even that you can send a message
          to a queue and then instantaneously re-retrieve it.
          JMS has quite weak "atomic" requirements in that respect.

          send() just means it will be delivered at some point.


          Yeah, I think we're in agreement :-)
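
          To make the send() point concrete, here's a minimal sketch against the plain JMS 1.1 API (the JNDI names are hypothetical; any provider will do): a send() that returns successfully doesn't imply the message is immediately re-retrievable.

          // Sketch only: illustrates the weak "atomic" guarantee discussed above.
          // The JNDI names are hypothetical.
          import javax.jms.*;
          import javax.naming.InitialContext;

          public class SendIsNotAtomicRetrieve {
              public static void main(String[] args) throws Exception {
                  InitialContext ctx = new InitialContext();
                  ConnectionFactory cf = (ConnectionFactory) ctx.lookup("ConnectionFactory"); // hypothetical
                  Queue queue = (Queue) ctx.lookup("queue/testQueue");                        // hypothetical

                  Connection con = cf.createConnection();
                  con.start();
                  Session session = con.createSession(false, Session.AUTO_ACKNOWLEDGE);

                  // send() returning only means the provider has accepted the message
                  // for eventual delivery...
                  session.createProducer(queue).send(session.createTextMessage("hello"));

                  // ...so an immediate receive with a short timeout may legitimately return
                  // null, e.g. while a store-and-forward node still holds the message.
                  Message m = session.createConsumer(queue).receive(100);
                  System.out.println(m == null ? "not visible yet" : "already delivered");

                  con.close();
              }
          }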

          • 32. Re: Austin Clustering meeting thoughts
            marklittle

             

            "adrian@jboss.org" wrote:
            Just in case we have a naming problem.

            When I say a "hot backup" I mean it will transparently failover
            to that backup. I don't mean the backup can be used while the original
            is still intact.


            We're on the same page. Passive replication approaches (primary copy, coordinator-cohort, leader-follower) are along those lines. The backup(s) are kept up-to-date with respect to the primary (obviously there is scope for some inconsistencies, but that depends upon how often the checkpoints occur - transaction boundaries are frequently used). You need some mechanism to ensure split-brain syndrome doesn't occur (different clients see different views of who's the primary), but that's straightforward to accomplish. But the backups cannot be used until the primary has failed (unless you're using something like coordinator-cohort, where things get tricky with different clients seeing different primaries because we're using virtual synchrony).
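
            Roughly, the shape of the primary-copy checkpointing I have in mind is something like this (a sketch only; all the interfaces and names are hypothetical):

            // Sketch only: primary-copy passive replication. The primary does the
            // real work and pushes a state checkpoint to the backups at transaction
            // boundaries; the backups stay passive until failover.
            interface Backup {
                void applyCheckpoint(byte[] state); // overwrite backup state with the primary's
            }

            class PrimaryCopyQueue {
                private final java.util.List<Backup> backups;
                private final java.util.Deque<String> messages = new java.util.ArrayDeque<String>();

                PrimaryCopyQueue(java.util.List<Backup> backups) { this.backups = backups; }

                synchronized void send(String msg) {
                    messages.addLast(msg);  // the primary applies the operation...
                    checkpoint();           // ...then brings the backups up to date
                }

                private void checkpoint() {
                    byte[] state = String.join("\n", messages).getBytes();
                    for (Backup b : backups) {
                        b.applyCheckpoint(state);
                    }
                }
            }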


            This is different to the current JBossMQ HASingleton which I describe
            as a "cold backup" because the failover is not transparent to the client.
            It must make the effort to reconnect to the failover node when it is told
            the original one has failed.


            Well something has to reconnect in both cases, unless you rely on a NAT (or similar approach) to mask the failure outside of the client. An intelligent proxy, for instance. But ultimately there's a reconnection going on at some level in the system ;-)

            In the replication world, "cold backup" is often used to mean a replica/machine that is started only when the primary has failed and must initialise its state from some out-of-date store: essentially it comes up in the initial state the primary had before it began to work.

            • 33. Re: Austin Clustering meeting thoughts

               

              "mark.little@jboss.com" wrote:


              This is different to the current JBossMQ HASingleton which I describe
              as a "cold backup" because the failover is not transparent to the client.
              It must make the effort to reconnect to the failover node when it is told
              the original one has failed.


              Well something has to reconnect in both cases, unless you rely on a NAT (or similar approach) to mask the failure outside of the client. An intelligent proxy, for instance. But ultimately there's a reconnection going on at some level in the system ;-)

              In the replication world, "cold backup" is often used to mean a replica/machine that is started only when the primary has failed and must initialise its state from some out-of-date store: essentially it comes up in the initial state the primary had before it began to work.


              Yeah. I use it with a similar meaning. In the JBossMQ case,
              any JMS session on the client has to be restarted
              from the beginning.
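
              From the client's point of view it looks something like this - a sketch using the standard JMS ExceptionListener; the lookupFailoverFactory()/lookupQueue() helpers are hypothetical stand-ins for however the client discovers the failover node:

              // Sketch of the non-transparent ("cold backup") failover: when the
              // connection dies the client must reconnect to the failover node and
              // rebuild its JMS session from the beginning. The lookup helpers are
              // hypothetical.
              import javax.jms.*;

              public class ReconnectingClient implements ExceptionListener {
                  private Connection connection;
                  private Session session;

                  public void connect(ConnectionFactory factory, Queue queue) throws JMSException {
                      connection = factory.createConnection();
                      connection.setExceptionListener(this); // told when the original node has failed
                      session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
                      session.createConsumer(queue);         // any previous session state is gone
                      connection.start();
                  }

                  public void onException(JMSException failure) {
                      try {
                          // The client itself has to make the effort to reconnect.
                          connect(lookupFailoverFactory(), lookupQueue());
                      } catch (JMSException e) {
                          // retry later, give up, alert an admin, etc.
                      }
                  }

                  // Hypothetical: however the client locates the failover node/queue.
                  private ConnectionFactory lookupFailoverFactory() { throw new UnsupportedOperationException(); }
                  private Queue lookupQueue() { throw new UnsupportedOperationException(); }
              }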

              • 34. Re: Austin Clustering meeting thoughts
                belaban

                 

                "mark.little@jboss.com" wrote:

                You need some mechanism to ensure split-brain syndrome doesn't occur (different clients see different views of who's the primary), but that's straightforward to accomplish


                I disagree; the split-brain syndrome is one of the hardest problems in CS to solve. How do you suggest we go about it?

                I don't see how you can prevent different clients from having different primary hosts. All you need is a network partition which separates the primary from the backup(s) and goes right through the 'middle' of the clients too.
                Are you hinting at a Primary Partition approach, where the non-majority partition simply shuts down? Or do you envisage a merge after the partition has healed?



                • 35. Re: Austin Clustering meeting thoughts
                  marklittle

                   

                  "bela@jboss.com" wrote:
                  "mark.little@jboss.com" wrote:

                  You need some mechanism to ensure split-brain syndrome doesn't occur (different clients see different views of who's the primary), but that's straightforward to accomplish


                  I disagree; the split-brain syndrome is one of the hardest problems in CS to solve. How do you suggest we go about it?


                  I don't see why? If you use something like weighted voting then it's trivial. OK, you need 2N+1 replicas to tolerate N failures, but you don't get split brain. Period.


                  I don't see how you can prevent different clients from having different primary hosts. All you need is a network partition which separates the primary from the backup(s) and goes right through the 'middle' of the clients too.


                  Easily. In coordinator-cohort you do have different primaries but that's because Ken defined it when they were building ISIS and it used virtual synchrony. In primary copy, which dates back to the late 1960's (haven't got my PhD to hand, so I'm not even going to bother trying to remember the references), all clients saw the same primary. The same is the case for leader-follower based approaches. Having all clients see the same primary is in fact easier than the other approach.


                  Are you hinting at a Primary Partition approach, where the non-majority partition simply shuts down? Or do you envisage a merge after the partition has healed?


                  I wasn't hinting at anything ;-) I thought it was obvious. Partition healing is required, and approaches to it are described in the literature. But they do often require semantic information about the data. The obvious approach which doesn't is that the minority partition (which can obviously continue to service read-only data) gets a new checkpoint when the partition heals. That's the approach we used in Arjuna back in the mid 1990s.

                  • 36. Re: Austin Clustering meeting thoughts
                    belaban

                     

                    "mark.little@jboss.com" wrote:

                    I don't see why? If you use something like weighted voting then it's trivial. OK, you need 2N+1 replicas to tolerate N failures, but you don't get split brain. Period.



                    So this doesn't work for a 2-node cluster, which is what 60% of our customers run.

                    • 37. Re: Austin Clustering meeting thoughts
                      marklittle

                      In my experience, customers will run what you tell them is needed to get the requirements they want. Availability isn't proportional to the number of replicas. Increased performance isn't predicated on having more replicas. Why are they using 2 machines? Probably because it's better than using 1 (or so they think). If you tell them that they can have increased availability by running a single machine (just buy from XYZ rather than ABC) then I bet they'd do that. So if you tell them that, in general, they need 5 machines to get 99.99999% availability, then they'll go for that too.

                      Although I've never come across a deployment of JGroups, I don't see why it would be any different from the systems I have seen, such as RTR (from HP) and Tandem (fail-stop), in deployments like telco and stock exchange. These places are prepared to deploy whatever it takes, because a failure costs them a lot of money (and hence, they have a lot of money to throw at the hardware side of the equation).

                      I would contend that if our customers are only using 2 machines because it's "better" than 1, then they really don't understand the problem.

                      • 38. Re: Austin Clustering meeting thoughts
                        belaban

                         

                        "mark.little@jboss.com" wrote:
                        I don't see why? If you use something like weighted voting then it's trivial. OK, you need 2N+1 replicas to tolerate N failures, but you don't get split brain. Period.


                        So how do you solve the problem where you have 5 nodes attached to a switch, and the switch crashes? Nobody can communicate with anyone else; this leads to a split-brain situation, where a simple majority doesn't help.
                        Secondary networks or shared disks won't help either; they will reduce the probability of a split-brain situation but they can never reduce it to 0.


                        • 39. Re: Austin Clustering meeting thoughts
                          marklittle

                           

                          "bela@jboss.com" wrote:
                          "mark.little@jboss.com" wrote:
                          I don't see why? If you use something like weighted voting then it's trivial. OK, you need 2N+1 replicas to tolerate N failures, but you don't get split brain. Period.


                          So how do you solve the problem where you have 5 nodes attached to a switch, and the switch crashes? Nobody can communicate with anyone else; this leads to a split-brain situation, where a simple majority doesn't help.
                          Secondary networks or shared disks won't help either; they will reduce the probability of a split-brain situation but they can never reduce it to 0.


                          First thing: it's badly replicated! That looks like a single point of failure there ;-) Every deployment I know of in places like finance carefully examines its environment to remove single points of failure in hardware: multiple SCSI controllers, multiple disks, multiple network switches, etc.

                          However, if for some reason you have to have a single point of failure in this way, then yes you'll end up in a situation where you can't do any work. Fix it: either repair the switch or, better yet, put more effort into carefully constructing your network to avoid these failure scenarios (or at least reduce the probability). Hey, switches are cheap when compared to machines!

                          BTW, replication, as with *any* fault tolerance technique, is always probabilistic: you can never get the probability of a failure down to zero - it is physically impossible to do so (entropy will always get in the way, and that's a fundamental law of the universe). This is why simply throwing more replicas at a problem can actually reduce the availability of a system. http://citeseer.ist.psu.edu/little94replica.html

                          • 40. Re: Austin Clustering meeting thoughts
                            belaban

                            Okay, so we're on the same page then, because your previous comment led me to think that you thought split-brain could be prevented... :-)

                            In one of my previous companies I worked on a WAN-based 2-node system (no chance to add another node, because 1 node was a SUN ES6500, costing half a mill and up), connected via 2-3 independent networks.
                            On top of that, a shared arbiter resource, *and* a dial-up connection for additional arbitration.
                            Yes, we still provided for manual (admin on the phone to another admin) intervention in case we suspected we had a split brain.

                            • 41. Re: Austin Clustering meeting thoughts
                              marklittle

                               

                              "bela@jboss.com" wrote:
                              Okay, so we're on the same page then, because your previous comment led me to think that you thought split-brain could be prevented... :-)


                              It can. Just to be sure we're talking about the same thing: split-brain is where two (or more) partitions of a replica group work independently because they think the other partition is "dead". We need to avoid that for all sorts of different reasons. Weighted voting avoids that. In the scenario you outlined with the network switch, I thought you were saying that none of the nodes could see each other, in which case no work could be done. And that's what I was agreeing to :-) In that case, there is no majority (you'd need a partition with 3 nodes to do any work), though as I said before, each node could service read-only data, i.e., state that can never be changed anyway.


                              In one of my previous companies I worked on a WAN-based 2-node system (no chance to add another node, because 1 node was a SUN ES6500, costing half a mill and up), connected via 2-3 independent networks.
                              On top of that, a shared arbiter resource, *and* a dial-up connection for additional arbitration.
                              Yes, we still provided for manual (admin on the phone to another admin) intervention in case we suspected we had a split brain.


                              Why not just say 2N+1 replicas and you need N+1 in a group to do any work ;-) Obviously you could weight the replicas so that some had more than a unitary value and perhaps some had less, but the basic principle is the same.
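
                              In code, the voting rule is essentially one line - a sketch of a weighted-voting quorum check (the weights and node names are purely illustrative):

                              // Illustrative weighted-voting check: a partition may only do work
                              // if the weights of its reachable replicas exceed half of the total
                              // weight. With all weights equal to 1 and 2N+1 replicas, this is
                              // exactly "you need N+1 nodes in a group to do any work".
                              import java.util.Map;
                              import java.util.Set;

                              public class QuorumCheck {
                                  static boolean hasQuorum(Map<String, Integer> weights, Set<String> reachable) {
                                      int total = weights.values().stream().mapToInt(Integer::intValue).sum();
                                      int mine = reachable.stream().mapToInt(n -> weights.getOrDefault(n, 0)).sum();
                                      return mine * 2 > total; // strictly more than half the votes
                                  }

                                  public static void main(String[] args) {
                                      // 5 equally weighted replicas (2N+1 with N = 2): a 3-node partition wins.
                                      Map<String, Integer> w = Map.of("a", 1, "b", 1, "c", 1, "d", 1, "e", 1);
                                      System.out.println(hasQuorum(w, Set.of("a", "b", "c"))); // true
                                      System.out.println(hasQuorum(w, Set.of("d", "e")));      // false
                                  }
                              }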

                              • 42. Re: Austin Clustering meeting thoughts
                                timfox

                                 

                                "adrian@jboss.org" wrote:


                                The problem with load balancing the queue is that you need
                                to replicate the queue. So you are doing the same work over the network
                                anyway (probably a lot more, since you need to slow down the
                                cluster with the overhead of the locking/ordering guarantee).

                                In most cases, this is likely to be redundant anyway.
                                Simply forwarding the client to the singleton means you can
                                use an in-memory lock, and messages only go to other nodes
                                (besides the backup(s)) as required.


                                Not sure why multicasting the message to all nodes would be significantly less performant than sending the message to one node as required - but I am not a network/JGroups expert.

                                I accept that the total ordering protocol implementation itself would introduce some overhead, but I am not qualified to say how much of an issue that would be.

                                If the performance overhead is small, and the total ordering protocol implementation scales well with the number of nodes (I don't know if this is true), then I would tend to go for scalability rather than straight performance, since a better total ordering protocol can always be provided by JGroups and plugged in if necessary without affecting our code.

                                It strikes me that channeling all traffic to the single node which hosts a singleton queue instance is going to cause problems when/if we have hundreds of nodes in the cluster: contention on the queue locks, the memory required on that node to support all the consumers, the number of requests that node can process per second, and the amount of network traffic going through that node.

                                Another issue, which I pointed out in the initial post in this thread and which Adrian has referred to, is: if a node crashes, which node takes over responsibility for the messages on the singleton node that crashed?

                                Do we wait for the sysadmin to bring the node up again? What if the machine is fried and they have no other box? This could be a big pain in the arse.

                                By replicating the queues we would have no such problem.

                                I guess it all boils down to how much overhead the total ordering introduces over straight unicast, and whether it is a general limitation of total ordering or just a limitation of the particular total ordering implementation. If the limitation is just due to the current implementation I am not so bothered, since it can always be improved; if it is more fundamental then that is a different issue.

                                If the total ordering overhead is not a problem then it strikes me we can provide a much simpler and more scalable solution by actively replicating the queue using total ordering and passively replicating the rest. Failover and recovery would be much simpler.
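
                                To be concrete about what I mean by actively replicating the queue, something like this - a sketch only; TotalOrderChannel is a hypothetical stand-in for a JGroups channel configured with a total-order protocol:

                                // Sketch of active replication via totally-ordered multicast: every
                                // node holds a copy of the queue and applies the same operations in
                                // the same order, so the copies stay identical and failover needs no
                                // special recovery step. TotalOrderChannel is hypothetical.
                                import java.util.ArrayDeque;
                                import java.util.Deque;
                                import java.util.function.Consumer;

                                interface TotalOrderChannel {
                                    void multicast(String operation);    // delivered to ALL nodes in the same order
                                    void onDeliver(Consumer<String> handler);
                                }

                                class ReplicatedQueue {
                                    private final Deque<String> messages = new ArrayDeque<String>();
                                    private final TotalOrderChannel channel;

                                    ReplicatedQueue(TotalOrderChannel channel) {
                                        this.channel = channel;
                                        channel.onDeliver(this::apply);  // every replica applies every operation
                                    }

                                    // Client-facing operations just multicast; there is no single owner node.
                                    void send(String msg) { channel.multicast("SEND " + msg); }
                                    void acknowledge()    { channel.multicast("ACK"); }

                                    // Identical delivery order everywhere keeps the replicas identical
                                    // without any distributed locking.
                                    private synchronized void apply(String op) {
                                        if (op.startsWith("SEND ")) {
                                            messages.addLast(op.substring(5));
                                        } else if (op.equals("ACK")) {
                                            messages.pollFirst();
                                        }
                                    }
                                }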

                                It worries me a little that if we go for a singleton queue solution then this is going to come back and bite us in the future when we deploy on a very large cluster - and it will be very hard to fix since it goes deep into the design.

                                My worries may be misplaced, so I look forward to being corrected by you guys ;)

                                • 43. Re: Austin Clustering meeting thoughts
                                  marklittle

                                  I don't know what protocol JGroups uses, but usually these things require multiple phases (multiple message exchanges) between participants to determine ordering. There are plenty of papers on the subject.
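
                                  For the simplest scheme (a fixed sequencer) the extra exchange looks roughly like this - a sketch with hypothetical interfaces, and I'm not claiming this is what JGroups actually does:

                                  // Sketch of a fixed-sequencer total-ordering scheme, just to show where
                                  // the extra message exchange comes from: a sender first forwards its
                                  // message to the sequencer, which assigns the global order and only then
                                  // broadcasts it to everyone. All interfaces are hypothetical.
                                  interface Broadcaster {
                                      void broadcastToAll(long sequenceNumber, String message);
                                  }

                                  class Sequencer {
                                      private long nextSeq = 0;
                                      private final Broadcaster broadcaster;

                                      Sequencer(Broadcaster broadcaster) { this.broadcaster = broadcaster; }

                                      // Phase 1: a node forwards its message here (one network hop).
                                      // Phase 2: the sequencer stamps the order and broadcasts it (second hop).
                                      synchronized void submit(String message) {
                                          broadcaster.broadcastToAll(nextSeq++, message);
                                      }
                                  }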
