1 2 3 Previous Next 43 Replies Latest reply on May 10, 2006 9:14 AM by marklittle

    Austin Clustering meeting thoughts

    timfox

      In preparation to our meeting on Wednesday in Austin (I believe Adrian, Brian S, Ovidiu and myself are attending), here's a brief summary of the situation as I see it and my thoughts on it, to kick start the meeting a little and especially for the benefit of those we are not completely au fait with the current state of messaging clustering.

      We currently have some clustering code in the current codebase which is currently disabled. It is based around the concept of a "distributed destination". (A destination is a queue or a topic).

      The term "distributed" really relates to the fact that the destination is load-balanced across multiple nodes in a cluster, it does *not* provide high availability features which would have to be layered on top.

      Messages can be sent to a queue/topic via any node in the cluster and consumed with a consumer attached to any node in the cluster.

      The way distributed destinations work is described here: http://wiki.jboss.org/wiki/Wiki.jsp?page=JBossMessagingCore so I won't re-iterate here.

      Some possible issues with current approach:

      1) Does not intrinsically handle high availability. Messages are only stored in memory on one node at any one time, so if that node goes down then the messages are lost.
      This is not a fault of the design per se, since HA was always meant to be layered on separately with this solution, however dealing with HA separately makes the solution more complex since you have to a) implement distributed destinations, then b) implement some kind of replication for HA.

      2) Joining/leaving cluster. If a node is lost from the cluster then other nodes do not take on responsibility for handling the persistent messages for the lost node. This is because messages are effectively pinned to a node. This means the persistent messages for a node will never be processed until that node is recovered (which may never occur).

      An alternative solution would be to forget the current idea of a "distributed destination" and have a solution based on replication (actually the current solution would have to employ replication *anyway* (to provide HA)).

      The idea here would be that every node in the cluster is pretty much an exact replica of every other node. I.e. each node has the same queues, subscriptions and transactional state as every other node. In other words we replicate all the relevant state machines between the nodes.

      To ensure the state machines are replicated correctly we would have to ensure that each node receives messages in the same order, so a total ordering protocol would have to be used. I believe JGroups currently has 2 protocols which provide total ordering.

      With this solution, joining/leaving and recovery becomes much simpler since messages aren't pinned to individual nodes.

      Also HA is "built in" since the solution is based on replication- failing over from one node to another is simple since the second node is already a replica of the failed node.

      Potential issues / things to investigate

      1) More network traffic since every state machine change needs to be broadcast over the cluster. Not sure if this is an issue - hopefully Bela/Brian have some insights here.

      1) Larger memory footprint for a node since all messages are stored on all nodes. However this is no more than for a single server "cluster".

      Another avenue we could explore is how we actually do the replication.

      We could either use JGroups directly to replicate state machines, or perhaps we could use JBossCache (?), not being an expert on JBoss Cache this is one of the things I am interested in discussing more.

        • 1. Re: Austin Clustering meeting thoughts
          arvinder

          Hi Tim,

          I've started looking at wrapping messaging around the microcontainer. I am unsure how clustering will affect this or be introduced. For example, if the lowest layer is the Microcontainer, there will need to be a bridge between it and other Mc's within the cluster, etc.

          Anyway, just another thought to discuss if you guys feel its worth it.

          thanks
          arvinder

          • 2. Re: Austin Clustering meeting thoughts
            belaban

            I'm going to talk to Brian today to brief him on my thoughts regarding clustering in Messaging.
            Comments below.

            "timfox" wrote:


            The idea here would be that every node in the cluster is pretty much an exact replica of every other node. I.e. each node has the same queues, subscriptions and transactional state as every other node. In other words we replicate all the relevant state machines between the nodes.

            To ensure the state machines are replicated correctly we would have to ensure that each node receives messages in the same order, so a total ordering protocol would have to be used. I believe JGroups currently has 2 protocols which provide total ordering.


            Yes, JGroups has 2 total order protocols: TOTAL-TOKEN (based on a rotating token, written by Vladimir) and SEQUENCER (based on sending all messages unicast to a coordinator, who them multicasts on behalf of a member).

            SEQUENCER is new, I wrote it ca half a year ago. There are some unit tests, and it seems to work pretty well. I get ca 6500 1K messages/sec for a 2 node cluster in a 100Mbps network.
            We would work closely with you guys to make sure SEQUENCER gets real robust. Not to say that it isn't, but it can yet get better.
            TOTAL-TOKEN is currently unmaintained, and (shoudl you decide to use it), we have to think about whether Vlad can support it, or whether we rewrite it. It requires virtual synchrony which itself has not been updated for quite some time in JGroups (I wrote it in 1999).



            Potential issues / things to investigate

            1) More network traffic since every state machine change needs to be broadcast over the cluster. Not sure if this is an issue - hopefully Bela/Brian have some insights here.


            This shouldn't be an issue; JGroups can actually batch changes, e.g. wait for 30K of replication data to accumulate or 20ms, whichever comes first, and then send 1 larger message.



            1) Larger memory footprint for a node since all messages are stored on all nodes. However this is no more than for a single server "cluster".


            Should you decide to use JBossCache, you could use Buddy Replication, which does away with the send-everything-to-everybody concept.



            • 3. Re: Austin Clustering meeting thoughts
              marklittle

              You could always look at passive-replication (primary does the work and checkpoints to the backups). Although this has the disadvantage that failover is slower than in an active replication protocol such as available copies or weighted voting, it does have the advantage that it does not require all of the replicas to be deterministic and use a reliable (ordered) delivery protocol. When you start adding services like transactions and databases into the pool, ensuring that replicas are deterministic becomes extremely difficult/impossible (there's no such thing as simultanaeity, so clocks are never going to be synchronized, database locks and contention can be different the more clients you throw at the system etc.)

              • 4. Re: Austin Clustering meeting thoughts
                timfox

                I'm thinking we could probably do a bit of both active and passive.

                I think we'd only really need active replication for adding/removing messages from distributed queue/subscription instances. (I.e. replicate queue/subscription state machine).

                For the rest (mainly transactional state) my initial feeling is we could probably get away with passive (probably via JBossCache), since no other node needs to know the transactional state unless failover occurs.

                This gives us the added advantage of being able to configure the failover group separately from the "active" cluster group. E.g. we could use buddy replication for the passive and JGroups directly for the active.

                Need to flesh this out more though.

                • 5. Re: Austin Clustering meeting thoughts
                  belaban

                  This works too, but has the disadvantage that all clients have to access the primary all the times.
                  Backup hosts cannot be accessed (safe maybe for reading?), so this is a centralized solution.
                  Advantage: simpler implementation.

                  • 6. Re: Austin Clustering meeting thoughts
                    marklittle

                    As soon as you make the entity that you wish to be replicated complex, something like primary copy replication (or even coordinator cohort/leader-follower) has immediate advantages: correctness being just one of them.

                    • 7. Re: Austin Clustering meeting thoughts
                      marklittle

                      There are always tradeoffs. Ensuring determinism is practically impossible with COTS operating system, software and hardware. Unless you go the Tandem route and control everything. That's why all major deployments of replicated systems use a hot-standby approach.

                      As I said, there are problems with fail-over, but using something like a NAT you can alleviate them to some degree. In the end, it comes down to correctness though: can you ensure that the entities being replicated actively are deterministic? In Java, with a very non-deterministic JVM out-of-the-box, that's difficult enough to achieve. You start adding transactions, a database, network traffic, time dependencies etc. and I'd say that it's pretty near impossible to prove the correctness of such a system, let alone ensure we can debug any problems.

                      • 8. Re: Austin Clustering meeting thoughts
                        timfox

                         

                        "bela@jboss.com" wrote:
                        This works too, but has the disadvantage that all clients have to access the primary all the times.
                        Backup hosts cannot be accessed (safe maybe for reading?), so this is a centralized solution.
                        Advantage: simpler implementation.


                        A centralized (like HA singleton) solution is no good for us.

                        However I don't see why a combination of active and passive forces us to such a solution.

                        If we actively replicate the queue/subscription state machine using a total ordering, then we can support multiple producers/consumers concurrently on the same queue/subscription on different nodes (i.e. non centralized).

                        Then we use passive replication to replicate the transactional state using JbossCache (for instance)...

                        I guess I'm missing something.


                        • 9. Re: Austin Clustering meeting thoughts
                          marklittle

                          Do you need to replicate all messages? Going back to my Arjuna days helping the Arjuna Messaging Service people in their HA implementation: if the messages are volatile and best-effort (no ack), why bother replicating them? If the queue fails, then the client can resend.

                          • 10. Re: Austin Clustering meeting thoughts
                            timfox

                            One of the features of JBoss Messaging is that we guarantee no loss of non persistent messages on failure of a node - this is configurable though - but we need to provide it if requested.

                            • 11. Re: Austin Clustering meeting thoughts
                              marklittle

                              Sure, but in many large throughput deployments of messaging I've seen (stock trading, banking etc), they want volatile messages to be quick and failure of delivery for them isn't a problem because by the time the failure is detected, a new message from the system is on its way anyway with more up-to-date information (e.g., stock value).

                              • 12. Re: Austin Clustering meeting thoughts

                                Guys, I know I've told all you before about this, except Mark Little
                                (and I might have only mentioned it in passing to Tim?).

                                JMS does not require total ordering, except as it appears
                                for one client connection/session.

                                If you have competing senders there is no guarantee how the race
                                will be resolved so there is no point introducing total ordering.
                                This is true for one node as it is for clustered nodes.

                                e.g. You can have two clients sending messages A and B to a topic
                                simultaneously. Two topic subscriptions can see these messages
                                in different orders.

                                All that is required for HA JMS is:
                                1) You have a singleton location that is in charge of a queue/topic
                                subscription
                                2) A client can connect to the singleton location regardless of which
                                machine it actually initially connects to
                                3) You replicate/move the messages to the singleton location in a guaranteed way (for persistent mesasges and durable destinations only). i.e. persist/forward/ack/delete
                                4) You duplicate the singleton location under the singleton's location control
                                (for consistency) this can either be total replication across every node or it could be buddy replication or some other mechanism like shared database.

                                Other "nice to haves":
                                1) queues/topic subscriptions are "load balanced"/moved across the cluster
                                to avoid all the work done on one node. Load balancing the front end remoting connection is pointless and inefficient. Making it HA is not.
                                2) the singleton location(s) can be on the clients (serverless jms)

                                Non-requirements
                                1) Total ordering across all clients - not a requirement of the spec and too expensive
                                2) Load balancing front end connection (it should find the singleton location and stick to it).

                                Difficult problem:
                                A session that contains messages for multiple singleton locations on
                                different machines.

                                Possible solutions:
                                A) Do DTM style processing such that all the locations that take part in the transaction such that the changes occur immediately.
                                B) Choose one of the nodes in the session and do store and forward
                                to the other nodes
                                i.e. the initial node persists the requested changes that are not local in
                                the first transaction. It then uses new transactions to forward the
                                messages or acks to the other nodes

                                (B) is less prone failure problems and is simpler in that when you
                                are dealing with servers you can setup transaction logs to do
                                what obviously requires XA
                                (A) is also possible if you set up sub co-ordinators on the servers
                                but that assumes a DTM

                                This info is probably not complete. See previous comments for
                                detailed discussion of the individual isses.

                                • 13. Re: Austin Clustering meeting thoughts
                                  timfox

                                  If you have a distributed topic (distributed across n nodes) handling the stock updates, a particular client is attached to node A, and when it sends a message a client attached to another node B needs to receive the message.

                                  So node A broadcasts the message to the cluster, so node B gets the message then sends it (currently point to point - but we can broadcast here too eventually) to the consumer.

                                  So you need to broadcast them anyway (unless you have an HA singleton) just so node B can get the message.

                                  If the messages are no persistent which they are likely to be for this use case then there's no persistence hit so performance should be ok.

                                  • 14. Re: Austin Clustering meeting thoughts

                                     

                                    "mark.little@jboss.com" wrote:
                                    Sure, but in many large throughput deployments of messaging I've seen (stock trading, banking etc), they want volatile messages to be quick and failure of delivery for them isn't a problem because by the time the failure is detected, a new message from the system is on its way anyway with more up-to-date information (e.g., stock value).


                                    "Best effort" is trivial (i.e. non-persistent or non-durable).
                                    A compliant implementation could just drop the message during the send. :-)

                                    It is the guaranteed **ONLY ONCE** delivery that is hard.


                                    1 2 3 Previous Next