4 Replies Latest reply on Nov 3, 2008 7:02 AM by Manik Surtani

    advice on cache (mis)use for transaction log store

    Jonathan Halliday Master

      Hello JBossCache gurus

      I've been toying with the idea of using in-memory replication across a cluster to preserve transaction log information for JBossTS, as an alternative to writing it to disk. What follows is my current thinking on the matter, which you can poke holes in at your leisure.

      'Memory is the new disk', so let's use it as such...

      Transaction log entries are typically short lived (just the interval between the prepare and commit phases if all goes well) but must survive a node failure. Or, depending on the degree of user paranoia, perhaps multiple node failures. The size is not much - a few hundred bytes per transaction. Writing the tx log to RAID is a major bottleneck for transaction systems and hence app servers.

      JBossTS already has a pluggable interface for transaction log ('ObjectStore') implementations, so writing one based on JBossCache is not too difficult. The relative performance of this approach compared to the existing file system or database stores remains to be seen. Of course it largely depends on the disk and network hardware and utilization. I should be able to get some preliminary numbers without too much work, but first I need to decide what configurations to test...

      Clearly the number of replicas is critical - it must be high enough to ensure at least one node will survive any outage, but low enough to perform well.

      Writes must be synchronous for obvious reasons, but ideally a node that is up should not halt just because another member of the cluster is down. That model would preserve information but reduce availability, which is undesirable.

      So my first question is: does the cache support a mode somewhere between async and sync, say 'return when at least M of N nodes have acked'? I can get something similar with buddy replication, but it's not quite the model I want - if more than M nodes are available they should be used. Similarly the crash of one buddy should not halt the system if there is an additional node available such that the total live number remains more than M. Perhaps I can do this only with the raw JGroups API, not the cache?
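      To make the semantics concrete: what I'm after is a latch-style wait, where the writer blocks until M acks arrive and any further acks are gravy. (If memory serves, raw JGroups comes close with its GroupRequest modes - GET_FIRST, GET_MAJORITY, GET_N, GET_ALL - though none is driven by cache replication.) Purely illustrative sketch in plain JDK, class and method names mine:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Illustrative only: 'return when at least M of N nodes have acked'.
// In a real implementation ack() would be driven by JGroups responses;
// here it is just a method call.
public class QuorumWrite {
    private final CountDownLatch acks;

    public QuorumWrite(int m) {
        this.acks = new CountDownLatch(m); // quorum size, not cluster size
    }

    // One call per acking node; acks beyond the quorum are harmless.
    public void ack() {
        acks.countDown();
    }

    // Blocks until M acks arrive or the timeout expires.
    public boolean awaitQuorum(long timeout, TimeUnit unit) throws InterruptedException {
        return acks.await(timeout, unit);
    }

    public static void main(String[] args) throws InterruptedException {
        QuorumWrite w = new QuorumWrite(3);  // need 3 acks out of, say, 5 replicas
        w.ack(); w.ack(); w.ack();           // only 3 nodes respond in time
        System.out.println("quorum reached: " + w.awaitQuorum(1, TimeUnit.SECONDS));
        // prints "quorum reached: true"
    }
}
```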

      Also, are there any numbers on performance as a function of group size, particularly when mixing nodes on the same or different network segments? I'm thinking that getting independent failure characteristics for the nodes will probably require a distributed cluster, such that the nodes are on different power supplies etc. Having all the nodes in the same rack probably provides a false sense of security...

      On a similar note, whilst cache puts must be synchronous, my design can tolerate asynchronous removes. Is such a hybrid configuration possible?

      Transaction log entries fall into two groups: the ones for transactions that complete cleanly and the ones for transaction that go wrong. The former set is much larger and its members have a lifetime of at most a few seconds. The failure set is much smaller (hopefully empty!) but entries may persist indefinitely.

      I'm thinking of setting up the cache loaders such that the eviction time is longer than the expected lifetime of members of the first group. What I want to achieve is this:

      1. Synchronous write of an entry to at least N in-memory replicas.

      2. If the transaction works, removal (possibly asynchronous) of that information from the cluster.

      3. If the transaction fails, a write of the entry to disk for longer-term storage.

      Critically this is not the same as having all writes go through to disk. Is it possible to configure the cache loaders to write only on eviction?

      Or I guess there is another possibility: since the loader's writes are asynchronous with respect to cache puts, is it possible to have it try to write everything, but intelligently remove queued writes from its work list if the corresponding node is removed before the write for its 'put' is made? That would effectively let the disk operate at its maximum throughput without (subject to the size limit of the work queue) throttling the in-memory replication. It thus provides an extra layer of assurance compared to in-memory-only copies, but without the performance hit of synchronous disk writes.
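      In plain JDK terms (illustrative only, names mine, real disk I/O elided), that work-list idea is something like:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: a write-behind work list where a remove that beats
// the disk writer cancels the corresponding queued write.
public class WriteBehindLog {
    private final Map<String, byte[]> pendingWrites = new ConcurrentHashMap<String, byte[]>();

    // Cache put: queue the log record for an eventual disk write.
    public void put(String txId, byte[] record) {
        pendingWrites.put(txId, record);
    }

    // Transaction completed cleanly: cancel the queued write if still pending.
    public void remove(String txId) {
        pendingWrites.remove(txId);
    }

    // Disk writer thread: drain whatever is still outstanding.
    // Returns how many records actually reached 'disk'.
    public int flush() {
        int written = 0;
        for (String txId : pendingWrites.keySet()) {
            byte[] record = pendingWrites.remove(txId);
            if (record != null) {
                // writeToDisk(txId, record);  // real I/O elided
                written++;
            }
        }
        return written;
    }
}
```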

      Also, it is vital to ensure there is no circular dependency between the cache and the transaction manager. I'm assuming this can be achieved simply by ensuring there is no transaction context on the thread at the time the cache API is called. Or does it use JTA transactions anywhere internally?

      One final question: Am I totally mad, or only mildly demented?

      Thanks

      Jonathan.

        • 1. Re: advice on cache (mis)use for transaction log store
          Elias Ross Master

          I would probably start with a configuration that is close to your ideal. A prototype, I suppose. Then you could be more specific about what you need.

          Something that would work for a first round, though perhaps not ideally, is to configure a synchronous replicated cache with an async cache loader, say the BerkeleyDB or JDBM one. Add in some eviction and you'd get something close to what you need.

          3.2 will have a better replication strategy.

          Sync put with async removal is a feature you could maybe raise in JIRA?

          • 2. Re: advice on cache (mis)use for transaction log store
            Bela Ban Master

            Wait, isn't it possible in JBossCache to configure the cache to be sync, but make individual calls async? I think this is done with an Option, or whatever Manik calls it these days.
            This gets stored in an InvocationContext (that's the name!) and the ReplInterceptor then makes the call asynchronous.

            • 3. Re: advice on cache (mis)use for transaction log store
              Bela Ban Master

              Hi Jonathan,

              this is a borderline case of using JBC, which was designed for high read/low write scenarios.
              If you have a high number of writes, I wouldn't use JBC; I'd use JGroups directly.
              So I understand writes have to be synchronous, or at least a majority of nodes need to ack a write before it can return, and removes can be async.
              When do you use reads? Just in the exceptional case, right?
              Bela

              • 4. Re: advice on cache (mis)use for transaction log store
                Manik Surtani Master

                Hi, sorry for not responding to this sooner. Answers inline:

                "jhalliday" wrote:

                Clearly the number of replicas is critical - it must be high enough to ensure at least one node will survive any outage, but low enough to perform well.

                Writes must be synchronous for obvious reasons, but ideally a node that is up should not halt just because another member of the cluster is down. That model would preserve information but reduce availability, which is undesirable.


                 I am guessing that you will have session affinity, i.e., for the non-failing cases it will always be one instance that works on a single transaction log. Hence I would recommend using buddy replication, in sync mode (as per your requirement). BR also allows you to tune how many backup copies are stored, and since the number of backups is fixed, your system will scale well.

                "jhalliday" wrote:

                Similarly the crash of one buddy should not halt the system if there is an additional node available such that the total live number remains more than M.


                 The crash of a buddy will not halt the system. It will just attempt to find an alternate buddy. Even if you end up with just one node in the system, it will still run, albeit logging some severe warnings that you don't have anywhere to back up to! :-)

                "jhalliday" wrote:

                 Also, are there any numbers on performance as a function of group size, particularly when mixing nodes on the same or different network segments? I'm thinking that getting independent failure characteristics for the nodes will probably require a distributed cluster, such that the nodes are on different power supplies etc. Having all the nodes in the same rack probably provides a false sense of security...


                BR allows you to provide hints when selecting buddies (see the buddy group cfg attribute) so that the system will prefer buddies in the same group. You can then create groups that span racks, e.g., one on each rack.
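                 For reference, the buddy replication stanza looks roughly like this (recalled from the 2.x config format; the property values and pool name are illustrative - the buddyPoolName is the hint mentioned above, since nodes advertising the same pool name are preferred as buddies):

```xml
<attribute name="BuddyReplicationConfig">
   <config>
      <buddyReplicationEnabled>true</buddyReplicationEnabled>
      <buddyLocatorClass>org.jboss.cache.buddyreplication.NextMemberBuddyLocator</buddyLocatorClass>
      <buddyLocatorProperties>
         numBuddies = 2
         ignoreColocatedBuddies = true
      </buddyLocatorProperties>
      <!-- nodes advertising the same pool name are preferred as buddies -->
      <buddyPoolName>rack-b</buddyPoolName>
   </config>
</attribute>
```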

                "jhalliday" wrote:

                On a similar note, whilst cache puts must be synchronous, my design can tolerate asynchronous removes. Is such a hybrid configuration possible?


                Option.setForceAsynchronous() allows you to set this on a per-invocation basis.
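                 Sketching against the 2.x API from memory (cache, fqn etc. are placeholders), the per-call override would look something like:

```java
// The cache is configured REPL_SYNC, so puts replicate synchronously:
cache.put(fqn, "txRecord", serializedRecord);

// ...but this particular remove is flipped to async for just this one call:
cache.getInvocationContext().getOptionOverrides().setForceAsynchronous(true);
cache.remove(fqn);
```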

                "jhalliday" wrote:

                Critically this is not the same as having all writes go through to disk. Is it possible to configure the cache loaders to write only on eviction?


                Yes. Set passivation to true in your cache loader cfg.
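                 Roughly, in the 2.x cache loader stanza (recalled from memory; the BerkeleyDB loader class and its location property are illustrative):

```xml
<attribute name="CacheLoaderConfiguration">
   <config>
      <!-- passivation=true: entries hit the loader only on eviction,
           not on every put -->
      <passivation>true</passivation>
      <shared>false</shared>
      <cacheloader>
         <class>org.jboss.cache.loader.bdbje.BdbjeCacheLoader</class>
         <properties>
            location=/var/txlog/store
         </properties>
      </cacheloader>
   </config>
</attribute>
```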

                "jhalliday" wrote:

                 Also, it is vital to ensure there is no circular dependency between the cache and the transaction manager. I'm assuming this can be achieved simply by ensuring there is no transaction context on the thread at the time the cache API is called. Or does it use JTA transactions anywhere internally?


                Yes. Just suspend any JTA transactions before making cache calls.
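                 Something along these lines, where tm is the JTA TransactionManager and the other names are placeholders:

```java
Transaction suspended = tm.suspend();   // detach any in-flight JTA tx from this thread
try {
    cache.put(logFqn, "record", serializedTxRecord);
} finally {
    if (suspended != null) {
        tm.resume(suspended);           // reattach it afterwards
    }
}
```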

                "jhalliday" wrote:

                One final question: Am I totally mad, or only mildly demented?


                No, this sounds pretty interesting. :-)

                 Re: Bela's comment about this being write-mostly and hence not suited to a cache - I disagree, because you have session affinity and the dataset cached by each instance will not overlap. So you don't have concurrent writers to the same dataset across the cluster; hence my suggestion of buddy replication. This feels a lot like HTTP session replication IMO, where only one instance really needs the data; the backup is just for when servers die and things get ugly.

                Cheers
                Manik