9 Replies Latest reply on Feb 17, 2011 2:37 AM by Bela Ban

    JBCACHE-816 vs JGRP-105

    Galder Zamarreño Master

      In the last few days Manik and I have been discussing http://jira.jboss.com/jira/browse/JBCACHE-816 and as a result of that, we've also started discussing http://jira.jboss.com/jira/browse/JGRP-105

      1.- Looking at 816 I thought, is there a real need to resolve this issue at the JBC level? What can JBC bring to the table to resolve this?

      Re 1.- JBC is a specialist at keeping stuff in memory. Will this proxy keep the data transferred between DCs in memory? I don't see any point in doing this. Local cache instances within DCs will already have this information, as propagated by the GR. This led to a second question.

      2.- Could this be fully resolved by the enhanced GR (outcome of JGRP-105)?

      Re 2.- Resolving this at the GR level is always going to be faster than doing it at the JBC level via CacheListeners. Besides, using JBC and CacheListeners as a mere tunnel seems like overkill to me.

      3.- In my discussion with Manik, he replied that we might not want all local DC multicast traffic to be broadcast to other DCs, as this would slow things down. We might only want, for example, JBC CRUD methods (put, remove, etc.) to be broadcast to other DCs.

      Re 3.- To resolve this, the enhanced GR could provide some kind of filtering that determines which traffic to propagate and which not.

      4.- Manik also mentioned that we would like local DC multicast to be synchronous, whereas inter-DC (inter-GR) communications would be asynchronous.

      Re 4.- I guess this could be configured at the GR level?

      Summary: I'd like to add the following requirements to JGRP-105:
      - be able to filter which traffic gets propagated to other DCs, e.g. only cache CRUD methods. JGroups could define an interface for that, which JBC would implement based on its requirements.
      - configurable (sync/async) communications for intra-DC (typically IP multicast) and inter-DC (inter-GR, typically TCP based)

      Finally, JGRP-105 should also provide:
      - GR failover. We need to support failover of a GR so that another node in the local DC takes over.
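      As an illustration, the filtering interface mentioned above could look something like the sketch below. All names here (RelayFilter, CrudOnlyFilter) are hypothetical, not an actual JGroups or JBC API:

```java
// Hypothetical filtering contract JGroups could define; JBC would supply
// an implementation matching its own replication methods.
interface RelayFilter {
    // Decide whether a replicated method call should be propagated to remote DCs.
    boolean shouldRelay(String methodName);
}

// Illustrative JBC-side implementation: only relay CRUD-style cache operations.
class CrudOnlyFilter implements RelayFilter {
    private static final java.util.Set<String> CRUD =
        new java.util.HashSet<>(java.util.Arrays.asList("put", "remove", "clear", "move"));

    @Override
    public boolean shouldRelay(String methodName) {
        return CRUD.contains(methodName);
    }
}
```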

      Thoughts?

        • 1. Re: JBCACHE-816 vs JGRP-105
          Bela Ban Master

          Please add those requirements directly to JGRP-105. Vlad and I will discuss 105 next week, including scope.

          First, we intend to improve ConnectionTable and use it directly in GossipRouter (we currently use duplicate code for GR connection mgmt).

          Second, we'll look at new GR functionality, like a dual-faced router, router federation, and router clusters including failover. This may or may not make it into 2.6.

          Let me know if you want to be part of that call.

          • 2. Re: JBCACHE-816 vs JGRP-105
            Galder Zamarreño Master

            Ok, will do.

            Yeah, I'd like to be part of the call.

            • 3. Re: JBCACHE-816 vs JGRP-105
              Galder Zamarreño Master

              A month ago, Bela, Vladimir, Jimmy and I discussed the suggestions I proposed in this forum thread. Apologies for not posting the notes earlier.

              First of all, we discussed the possible use cases, with catastrophe recovery as the main one. In such cases, a specific data centre might get hit by a tornado, earthquake, etc., and users then need to be directed to another data centre so that they can carry on with their work. Data correctness between different data centres is not crucial. We already had some customers asking for such functionality.

              The second use case would be the creation of G2G structures which could be interesting for VoIP solutions.

              Both use cases could be resolved at the GR level via JGRP-105 with the addition of filtering for inter-group/data-centre communication and a configurable communications layer for intra- and inter-group/data-centre comms. The main problem right now is that the GR needs a lot of refactoring. Vladimir is going to work on this, but chances are we won't be able to get it done until JGroups 2.7/2.8.

              The first use case (catastrophic failover) is a higher priority than the G2G use case, so we discussed a solution based around JBCACHE-816, so that we can cater for those customers while JGRP-105 gets resolved. To sum up, here's the use case we want to resolve:

              Use Case:
              - 1 primary centre and 1 secondary centre
              - unidirectional from primary to secondary at any one moment in time, with the assumption that all clients go to the primary
              - asynchronous communication between centres; it's not about 100% data reliability
              - the aim is for the business to be able to resume operations on the secondary in case of failover or upgrade.
              - upon manual failover (upgrade of the primary centre) or catastrophic failover, manual intervention is required to switch a potential lb (load balancer) to the secondary centre, where clients will be directed.

              Solution:

              Interceptor vs listener vs HA-Singleton:
              - based on a configuration option, an interceptor (preferred) or cache listener could be created that encapsulates the solution; this is preferred to a standalone solution based on HA-Singleton.
              - an interceptor allows for greater flexibility and simplicity than a cache listener solution in terms of relaying data between DCs (data centres), filtering, transactional work and state transfer propagation.
              - this interceptor is only active if its node is the coordinator of the local data centre; use a technique similar to what the Singleton Store Cache Loader uses.

              Mux Channel:
              - the interceptor starts a mux channel, which the coordinator of each data centre joins (the inter data centre mux channel).
              - if a split occurs, coordinators could attempt to rejoin the inter-DC cluster.
              - the mux channel would most likely be configured with TUNNEL so that it can talk to an intermediate GossipRouter; use the TUNNEL definition from stacks.xml.

              Relaying:
              - in the context of this use case, relaying refers to inter-DC communication.
              - relaying could be done periodically, or per transaction/operation (put, remove, clear, etc.).
              - relaying would be asynchronous, as data correctness is not paramount and it gives better performance.
              - we're assuming that all clients go to one data centre and that, upon catastrophe/upgrade, all clients are manually redirected to a different data centre; so when a DC (data centre) channel member intercepts a relevant local data centre replication event, it relays it. That means any node in the inter-DC cluster can relay information, which, given the assumption above, simplifies the design.
              - enabling any DC cluster member to relay simplifies the solution for situations where the inter-DC link fails rather than nodes failing. In this case, clients are moved manually to a different DC, but relaying would continue to happen from any surviving DC without any further action.
              - a DC cluster member reads what it gets via the mux channel (i.e. what the primary data centre sends), converts it into an invocation and passes it up.
              - an inter-DC relay ping-pong effect should be avoided, so that inter-DC changes that are applied locally do not bounce back to other DCs. This requires further thought.
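              One conceivable way to break the ping-pong, sketched below under the assumption that each relayed modification carries the id of the data centre it originated in; RelayedMod and Relayer are illustrative names, not real JBC classes:

```java
// A modification tagged with the data centre it originated in.
class RelayedMod {
    final String originDc;
    final String operation;
    RelayedMod(String originDc, String operation) {
        this.originDc = originDc;
        this.operation = operation;
    }
}

class Relayer {
    private final String localDc;
    Relayer(String localDc) { this.localDc = localDc; }

    // Only modifications that originated locally are forwarded to other DCs;
    // anything received from a remote DC is applied locally but never bounced back.
    boolean shouldForward(RelayedMod mod) {
        return localDc.equals(mod.originDc);
    }
}
```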

              Transactions:
              - inter-DC communication is asynchronous, so 1PC always.
              - intra-DC will likely be synchronous, but could be asynchronous.
              - upon commit (if sync intra-DC) or prepare (if async intra-DC), take the list of modifications and relay them.
              - on rollback, do nothing.
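              The transaction handling above could be sketched like this, assuming a hypothetical TxRelay component that buffers modifications per transaction, relays the whole list on commit and drops it on rollback:

```java
class TxRelay {
    // Modifications buffered per transaction id, pending commit or rollback.
    private final java.util.Map<String, java.util.List<String>> pending =
        new java.util.HashMap<>();
    // Stand-in for the asynchronous inter-DC channel the mods are handed to.
    final java.util.List<String> relayed = new java.util.ArrayList<>();

    void onModification(String txId, String op) {
        pending.computeIfAbsent(txId, k -> new java.util.ArrayList<>()).add(op);
    }

    // On commit, relay the buffered list of modifications (1PC, async).
    void onCommit(String txId) {
        java.util.List<String> mods = pending.remove(txId);
        if (mods != null) relayed.addAll(mods);
    }

    // On rollback, do nothing except drop the buffered modifications.
    void onRollback(String txId) {
        pending.remove(txId);
    }
}
```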

              2nd in line:
              - within a data centre, the second in line could maintain a list of modifications within the interceptor.
              - if the second in line becomes master, it would join the inter-DC cluster and could replay the modifications after receiving the state. That would guarantee that any modifications that were not delivered because the DC coordinator failed are applied.

              State Transfer:
              - if a new node starts and it's the first in its local data centre, the inter data centre interceptor is active and joins the mux channel, potentially requesting a state transfer from the coordinator. The coordinator of the inter-DC cluster does not necessarily have to be the primary relayer, but inter-DC cluster members should have the same data.
              - if a new node starts and it's not the first in its local data centre, standard local state transfer rules apply.
              - streaming state transfer at the inter data centre level would require further thought, as potential intermediate firewalls would come into play.

              • 4. Re: JBCACHE-816 vs JGRP-105
                Manik Surtani Master

                 

                "galder.zamarreno@jboss.com" wrote:
                A month ago, Bela, Vladimir, Jimmy and I discussed the suggestions I proposed in this forum thread. Apologies for not posting the notes earlier.


                I was wondering where this went! :-)

                "galder.zamarreno@jboss.com" wrote:

                Interceptor vs listener vs HA-Singleton:
                - based on a configuration option, an interceptor (preferred) or cache listener could be created that encapsulates the solution; this is preferred to a standalone solution based on HA-Singleton.
                - an interceptor allows for greater flexibility and simplicity than a cache listener solution in terms of relaying data between DCs (data centres), filtering, transactional work and state transfer propagation.
                - this interceptor is only active if its node is the coordinator of the local data centre; use a technique similar to what the Singleton Store Cache Loader uses.


                Why is the interceptor approach better than the cache listener one? The reason I say this is that, from an integration perspective, a cache listener is far less tightly coupled to JBC internals than an interceptor.

                In terms of achieving goals, I don't see why this is hard:

                1) A CL should be registered on every cache instance.
                2) Querying viewChange events will tell the CL if it is the intra-ds coord (and hence whether or not to relay stuff to the inter-ds group).
                3) The CL registers a channel for the inter-ds group and listens for method invocations, which it applies to the cache (if it is the coord) or temporarily caches in a Collection, removing them on commit (if it is the 2nd in line).
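                A minimal sketch of step 2, assuming the usual JGroups convention that the first member of a view is the coordinator; DcCoordListener and its methods are illustrative, not the actual JBC CacheListener API:

```java
class DcCoordListener {
    private volatile boolean coordinator;
    private final String localAddress;

    DcCoordListener(String localAddress) {
        this.localAddress = localAddress;
    }

    // Invoked on each view change with the ordered member list; the first
    // member of the view is taken to be the intra-DC coordinator.
    void viewChanged(java.util.List<String> members) {
        coordinator = !members.isEmpty() && members.get(0).equals(localAddress);
    }

    // Only the coordinator relays modifications to the inter-DC group.
    boolean shouldRelay() {
        return coordinator;
    }
}
```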

                "galder.zamarreno@jboss.com" wrote:

                Relaying:
                - in the context of this use case, relaying refers to inter-DC communication.
                - relaying could be done periodically, or per transaction/operation (put, remove, clear, etc.).
                - relaying would be asynchronous, as data correctness is not paramount and it gives better performance.


                Could use a replication queue.

                "galder.zamarreno@jboss.com" wrote:

                - an inter-DC relay ping-pong effect should be avoided, so that inter-DC changes that are applied locally do not bounce back to other DCs. This requires further thought.


                The Cache Listener would only relay stuff to other DCs if it is marked as being in the "active" DC, given that the DC switchover would be manual. This would prevent the "DC ping pong" you describe.

                "galder.zamarreno@jboss.com" wrote:

                State Transfer:
                - if a new node starts and it's the first in its local data centre, the inter data centre interceptor is active and joins the mux channel, potentially requesting a state transfer from the coordinator. The coordinator of the inter-DC cluster does not necessarily have to be the primary relayer, but inter-DC cluster members should have the same data.
                - if a new node starts and it's not the first in its local data centre, standard local state transfer rules apply.
                - streaming state transfer at the inter data centre level would require further thought, as potential intermediate firewalls would come into play.


                This needs careful thought, especially where blocking, WAN links and large states are concerned. An entire new ds coming up could block the inter-ds group for quite a while, and this could throttle the inter-ds proxy on the active ds. Even if this is async, if we're talking of several GBs of data over a WAN link, this could mean the inter ds group being blocked for hours, which could easily lead to the inter-ds proxy throwing OOMEs on queued async calls.

                I think we need something better WRT state transfer before we can think of applying this to a WAN scenario. Perhaps something where state is chunked and delivered in several bursts of, for example, under 50MB at a time, so that the group isn't blocked for too long. I don't have a solution here (yet), just thinking aloud - I know it's not an easy problem to solve.
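                The chunking idea could be sketched as below; this is just an illustration of splitting a serialized state into bounded chunks (50 MB in the example above, tiny here), not an actual JGroups facility:

```java
class StateChunker {
    // Split the serialized state into chunks of at most chunkSize bytes,
    // so each inter-DC burst is bounded and the group is never blocked
    // on one monolithic transfer.
    static java.util.List<byte[]> chunk(byte[] state, int chunkSize) {
        java.util.List<byte[]> chunks = new java.util.ArrayList<>();
        for (int off = 0; off < state.length; off += chunkSize) {
            int len = Math.min(chunkSize, state.length - off);
            chunks.add(java.util.Arrays.copyOfRange(state, off, off + len));
        }
        return chunks;
    }
}
```

                The receiver would reassemble the chunks in order; flow control between bursts is the part that would still need real design work.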

                • 5. Re: JBCACHE-816 vs JGRP-105
                  Bela Ban Master

                  It is better to have an interceptor because CacheListeners don't know about transactions (maybe they do now?). So PUT1, PUT2, PUT3 in a TX would all get replicated with a CacheListener, and, when rolling back the TX, the UNDO ops would get replicated too.
                  This is not the case with an interceptor.
                  Yes, interceptors are tied to JBossCache, but this solution *is* for JBossCache, so no worries from my side.

                  • 6. Re: JBCACHE-816 vs JGRP-105
                    Manik Surtani Master

                    From 2.x, CacheListeners do have transaction context. They also get notified of transaction boundary events like a commit or rollback.

                    • 7. Re: JBCACHE-816 vs JGRP-105
                      Galder Zamarreño Master

                       

                      "manik.surtani@jboss.com" wrote:
                      From 2.x, CacheListeners do have transaction context. They also get notified of transaction boundary events like a commit or rollback.



                      Thanks for the heads up. With these callbacks included, I see a CacheListener-based solution as feasible.

                      • 8. Re: JBCACHE-816 vs JGRP-105
                        Galder Zamarreño Master

                         

                        "manik.surtani@jboss.com" wrote:
                        "galder.zamarreno@jboss.com" wrote:
                        A month ago, Bela, Vladimir, Jimmy and I discussed the suggestions I proposed in this forum thread. Apologies for not posting the notes earlier.


                        I was wondering where this went! :-)


                        I was hiding it secretly ;)

                        My fault :$

                        "manik.surtani@jboss.com" wrote:
                        "galder.zamarreno@jboss.com" wrote:

                        Interceptor vs listener vs HA-Singleton:
                        - based on a configuration option, an interceptor (preferred) or cache listener could be created that encapsulates the solution; this is preferred to a standalone solution based on HA-Singleton.
                        - an interceptor allows for greater flexibility and simplicity than a cache listener solution in terms of relaying data between DCs (data centres), filtering, transactional work and state transfer propagation.
                        - this interceptor is only active if its node is the coordinator of the local data centre; use a technique similar to what the Singleton Store Cache Loader uses.


                        Why is the interceptor approach better than the cache listener one? The reason I say this is that, from an integration perspective, a cache listener is far less tightly coupled to JBC internals than an interceptor.

                        In terms of achieving goals, I don't see why this is hard:

                        1) A CL should be registered on every cache instance.
                        2) Querying viewChange events will tell the CL if it is the intra-ds coord (and hence whether or not to relay stuff to the inter-ds group).
                        3) The CL registers a channel for the inter-ds group and listens for method invocations, which it applies to the cache (if it is the coord) or temporarily caches in a Collection, removing them on commit (if it is the 2nd in line).


                        +1 on the basis that transactional callbacks are now available.

                        "manik.surtani@jboss.com" wrote:
                        "galder.zamarreno@jboss.com" wrote:

                        Relaying:
                        - in the context of this use case, relaying refers to inter-DC communication.
                        - relaying could be done periodically, or per transaction/operation (put, remove, clear, etc.).
                        - relaying would be asynchronous, as data correctness is not paramount and it gives better performance.


                        Could use a replication queue.


                        Indeed. I'll look at the replication queue solution currently available within JBC and see if it could be used by the listener. Regardless, queue-based replication (time- or size-based) seems to fit this use case better than per-operation/tx replication.

                        "manik.surtani@jboss.com" wrote:
                        "galder.zamarreno@jboss.com" wrote:

                        - an inter-DC relay ping-pong effect should be avoided, so that inter-DC changes that are applied locally do not bounce back to other DCs. This requires further thought.


                        The Cache Listener would only relay stuff to other DCs if it is marked as being in the "active" DC, given that the DC switchover would be manual. This would prevent the "DC ping pong" you describe.


                        I've thought about such a solution: by default, the coordinator of the inter-DC cluster is the active node, or relayer. The user can change this manually at runtime to suit their needs. Any change to who the active relayer is would require consensus in the cluster.

                        "manik.surtani@jboss.com" wrote:
                        "galder.zamarreno@jboss.com" wrote:

                        State Transfer:
                        - if a new node starts and it's the first in its local data centre, the inter data centre interceptor is active and joins the mux channel, potentially requesting a state transfer from the coordinator. The coordinator of the inter-DC cluster does not necessarily have to be the primary relayer, but inter-DC cluster members should have the same data.
                        - if a new node starts and it's not the first in its local data centre, standard local state transfer rules apply.
                        - streaming state transfer at the inter data centre level would require further thought, as potential intermediate firewalls would come into play.


                        This needs careful thought, especially where blocking, WAN links and large states are concerned. An entire new ds coming up could block the inter-ds group for quite a while, and this could throttle the inter-ds proxy on the active ds. Even if this is async, if we're talking of several GBs of data over a WAN link, this could mean the inter ds group being blocked for hours, which could easily lead to the inter-ds proxy throwing OOMEs on queued async calls.

                        I think we need something better WRT state transfer before we can think of applying this to a WAN scenario. Perhaps something where state is chunked and delivered in several bursts of, for example, under 50MB at a time, so that the group isn't blocked for too long. I don't have a solution here (yet), just thinking aloud - I know it's not an easy problem to solve.


                        If we have the concept of a relayer or active node in the inter-DC cluster, could a new node request the state from a non-active node? We can't guarantee that all inter-DC nodes will have the same state, but at least, if we request it from a non-active inter-DC node, we avoid the active node having to produce the state. This does not resolve the issue that the active node will not be able to forward messages while the state transfer is ongoing in the inter-DC cluster, but it at least reduces the burden on it. If there's no non-active node in the inter-DC cluster, then we'd have no choice but to request the state from the active node.

                        This solution assumes that state transfer can be requested from any node in a cluster, not necessarily the coordinator. I'm not sure whether that's currently possible. It would probably need some coding in the listener to negotiate this.

                        • 9. JBCACHE-816 vs JGRP-105
                          Bela Ban Master

                          To update this discussion, I've meanwhile added a RELAY protocol to JGroups. This is issue JGRP-747 [1], which supersedes/replaces JGRP-105.

                           

                          I've outlined the design of RELAY in a blog post [2]. In a nutshell, RELAY bridges 2 local clusters into a big virtual cluster. The local clusters are completely autonomous; it was paramount in the design to avoid one local cluster having to block on the other.

                           

                          RELAY converts remote addresses to local ones and vice versa, and forwards (relays) traffic from a local cluster to the remote cluster. To be more specific, multicasts are forwarded (with the sender getting wrapped into a ProxyAddress) and unicasts are either sent locally (dest is not a remote address) or forwarded to the other cluster (dest is remote).
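                          The routing rule described above can be sketched as follows; this is an illustration of the decision logic only, not the actual RELAY implementation (in JGroups, a null destination conventionally means a multicast):

```java
class RelayRouting {
    // Multicasts (dest == null) are always forwarded to the remote cluster;
    // unicasts are forwarded only when the destination is a remote (proxy)
    // address, otherwise they stay within the local cluster.
    static boolean forwardToRemoteCluster(String dest, java.util.Set<String> remoteAddrs) {
        return dest == null || remoteAddrs.contains(dest);
    }
}
```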

                           

                          This works pretty well so far (as of Feb 2011), and even RPCs across the virtual cluster (with unicast replies) work. I'm currently working on making Infinispan work in DIST mode over RELAY.

                           

                          I'll present this at JBossWorld 2011 (Boston in May).

                           

                          [1] https://issues.jboss.org/browse/JGRP-747

                          [2] http://belaban.blogspot.com/2010/11/clustering-between-different-sites.html