5 Replies Latest reply on Jun 24, 2007 11:12 AM by timfox

    DB performances for large number of messages

    garu

      Hi all,
      I'd like to discuss the DB organization, because I'm concerned that with the current DB structure Messaging won't be able to handle the volume of data I'm going to pump into it.
      More generally, I think Messaging in its current incarnation is not well suited to handling large volumes of data.
      In a few words, I need Messaging to become a sort of multiplexer: taking messages from a single source, it should be able to distribute them to different subscribers. Of these subscribers, one will always be active and the others (at least two, but there could be more) will be active only when needed. This means I have to guarantee that messages on topics are persisted until the subscribers read them or until a certain amount of time has passed.
      No problem so far, but I have to handle tens of topics and hundreds of millions of messages a day. It's not a problem of data size (the average message is less than 200 bytes), just of the number of messages.
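      To make the fan-out above concrete, here is a minimal JMS sketch of one of the on-demand subscribers, using a durable subscription so that messages survive while it is offline (the JNDI names and client ID are just placeholders, not our real configuration):

      // Minimal sketch of one on-demand subscriber: a durable subscription on a flow topic.
      // The JNDI names and client ID below are placeholders.
      import javax.jms.*;
      import javax.naming.InitialContext;

      public class FlowSubscriber
      {
         public static void main(String[] args) throws Exception
         {
            InitialContext ic = new InitialContext();
            ConnectionFactory cf = (ConnectionFactory) ic.lookup("ConnectionFactory");
            Topic flowTopic = (Topic) ic.lookup("topic/sensorFlowA");

            Connection conn = cf.createConnection();
            conn.setClientID("on-demand-subscriber-1");        // required for durable subscriptions
            Session session = conn.createSession(false, Session.AUTO_ACKNOWLEDGE);

            // Messages published while this subscriber is down are kept until it comes back (or they expire)
            TopicSubscriber sub = session.createDurableSubscriber(flowTopic, "on-demand-sub-1");
            conn.start();

            Message m = sub.receive(5000);                     // the always-active subscriber would loop here
            if (m instanceof TextMessage)
            {
               System.out.println(((TextMessage) m).getText());
            }

            conn.close();
         }
      }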
      Currently, unless I've missed something, all messages for all queues/topics are handled with only two tables, jbm_msg and jbm_msg_ref, and this is the limiting factor.
      I don't know how experienced you are with DBMSs, but I know from experience that no matter what DBMS engine you have under the covers, when a table begins to fill with tens or hundreds of millions of rows, performance goes down the kitchen sink.
      We currently have a legacy system (not JMS, just C programs) that, to avoid performance bottlenecks, divides the message flows across different (partitioned) tables, so that each table stays small in terms of row count and insert times stay low; but to obtain the same subdivision I'd need a Messaging instance for each flow.
      If I tried to propose such an architecture, I'd be killed on the spot!
      Obviously I'm not thinking of a single instance handling the whole data flow - that would be unsafe, to say the least - but on the other hand I can't have an instance for each flow.

      What I'm thinking about, and proposing, is a mechanism by which, when you deploy a queue/topic, you can ask that it be allocated on a different set of tables than the default one. This means that by choosing the queue/topic I send to, I implicitly choose the tables the messages are written to, allowing performance tuning for large numbers of messages.

      I'd like to know your opinion on that.

      Thanks, Gabriele

        • 1. Re: DB performances for large number of messages
          timfox

          Yes, this is all in the roadmap. Please take a look at JIRA.

          Different persistence managers for different destinations:

          http://jira.jboss.com/jira/browse/JBMESSAGING-373

          Also the following are of interest:

          http://jira.jboss.com/jira/browse/JBMESSAGING-406

          http://jira.jboss.com/jira/browse/JBMESSAGING-574

          If you want to volunteer for 373 then be my guest :) It will help it get done more quickly.

          • 2. Re: DB performances for large number of messages
            timfox

            One other thing you should think about is your application design.

            In general, high volume data is *not* persistent. E.g. a market data feed, where messages are shot out very fast with low latency and the application is designed to cope with message loss. If it doesn't get a particular message, that's fine: it will get another one shortly whose value (the price) will override the previous one anyway.

            Quite often you can "design out" persistence at the application level.

            If your data is non-persistent then the system will be able to handle much larger volumes and will scale much better, since it doesn't have to go through the bottleneck - the database.
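            To illustrate, here's a minimal producer-side sketch of what non-persistent publishing looks like (the JNDI names are placeholders only):

            // Sketch: publishing non-persistent messages - they never touch the persistence tables.
            import javax.jms.*;
            import javax.naming.InitialContext;

            public class NonPersistentPublisher
            {
               public static void main(String[] args) throws Exception
               {
                  InitialContext ic = new InitialContext();
                  ConnectionFactory cf = (ConnectionFactory) ic.lookup("ConnectionFactory");
                  Topic topic = (Topic) ic.lookup("topic/marketData");

                  Connection conn = cf.createConnection();
                  Session session = conn.createSession(false, Session.AUTO_ACKNOWLEDGE);
                  MessageProducer producer = session.createProducer(topic);

                  // Non-persistent delivery: the server does not write the message to the database
                  producer.setDeliveryMode(DeliveryMode.NON_PERSISTENT);

                  producer.send(session.createTextMessage("price update"));
                  conn.close();
               }
            }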

            So I'd ask the question - does the data really need to be persistent? How long do your messages live for?

            Perhaps you could explain your use cases in more detail?


            • 3. Re: DB performances for large number of messages
              garu

              Hi Tim,
              thanks for the info; as usual, when all else fails, read the instructions :)
              About the application: it's not that simple to explain.
              It's not a single application but a group of applications that handle many different data flows from sensors distributed over a geographical network.
              These flows arrive over different protocols (raw UDP, UDP with an application protocol, TCP, FTP, etc.), with different priorities and with rates that are not constant through the day but show very high peaks.
              JMS is seen as the means by which we can decouple the network from the data consumers, and the JMS persistence DB must guarantee that once a message has been acknowledged to the originator, it remains available to subscribers for the time defined for that flow.
              Those persistence times vary from a few hours to two or three days, depending on the particular flow and on the minimum time span we need for the data to become a significant measurement set.
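              On the producer side that retention would map naturally onto the JMS time-to-live, something like this fragment (the retention values per flow are of course ours to define; javax.jms imports assumed):

              // One producer per flow, with the flow's retention expressed as a time-to-live
              static MessageProducer createFlowProducer(Session session, Topic flowTopic, long retentionMs)
                 throws JMSException
              {
                 MessageProducer producer = session.createProducer(flowTopic);
                 producer.setDeliveryMode(DeliveryMode.PERSISTENT); // must survive until read or expired
                 producer.setTimeToLive(retentionMs);               // e.g. three days = 3L * 24 * 60 * 60 * 1000
                 return producer;
              }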
              Then there is another class of messages, alarm messages, on which people's safety can depend, so you can imagine how carefully those messages must be handled.

              I've just scratched the surface here, but the unifying guidelines are that we must guarantee we do not lose any messages, and that the system must be able to handle traffic peaks much higher than the average flow rate without slowing down the flows.

              As I said in my first post, the legacy system I wrote in C several years ago is still in production, but now we need to feed other channels and that legacy system is no longer suited to it, so we chose JMS as the decoupling layer that would allow us to feed as many channels as we need.
              Obviously my objective is to keep as few messages as possible parked in the Messaging persistence DB for as short a time as possible, but I'll nonetheless have to move hundreds of millions of messages a day into and out of the persistence DB.
              I can partition the flows across different Messaging instances so that they use different DBs, but I cannot reach 1:1 granularity; only the ability to associate a destination with specific msg and msg_ref tables can give me the flexibility to handle the forecast volumes.

              • 4. Re: DB performances for large number of messages
                garu

                Continuing from my previous post.

                I'd like to follow your suggestion and write the patch for this myself - after all, after many years, coding is still fun for me - but I have such a tight schedule at work that I cannot commit to or promise anything.

                I tried to figure out how it works now (as a colleague said many years ago, when there were still 300 baud modems around: it's not written the way I would have written it, so obviously it doesn't work :) - I'm kidding, it's actually very well written) and I've got some ideas.

                - The Queue and Topic MBeans should have a couple of attributes like StoreName and RefStoreName. They could both be read/write, with defaults of jbm_msg and jbm_msg_ref in Queue-xmbean.xml and Topic-xmbean.xml, or perhaps only the first one could be r/w and RefStoreName could be derived by composing StoreName + "_ref".
                - Those attributes should be stored in a couple of columns in jbm_postoffice, so that when a queue is associated with a channel it is also associated with a specific StoreName and RefStoreName.
                - Those attributes should be exposed from Binding, Queue, and Topic so that the persistence manager can prime its SQL statements, which currently use jbm_msg and jbm_msg_ref, with the correct table names to read and write for that specific channel (see the sketch after this list).
                - Whenever a binding is added to the post office table, the store tables must be created if they are not already present (other channels may already be writing to them).
                - Whenever a binding is removed from the post office table, the store tables should be removed if they are empty (this could be controlled by a flag attribute on the queue/topic MBean; it would not be worth removing them if the destination is going to be reused).
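                Just to show what I mean by "priming" the statements, here is a rough sketch - this is not the real JBoss Messaging code, and the column lists are placeholders; it only illustrates keying the SQL on the per-channel store names instead of the hard-coded jbm_msg / jbm_msg_ref:

                // Hypothetical sketch only - not the actual persistence manager code.
                public class StoreNames
                {
                   private final String storeName;     // per-destination message table, default "jbm_msg"
                   private final String refStoreName;  // per-destination reference table, default "jbm_msg_ref"

                   public StoreNames(String storeName)
                   {
                      this.storeName = storeName;
                      this.refStoreName = storeName + "_ref";   // RefStoreName derived from StoreName, as proposed
                   }

                   // The persistence manager would build its prepared statements from these names per channel
                   public String insertMessageSQL()
                   {
                      return "INSERT INTO " + storeName + " (message_id, headers, payload) VALUES (?, ?, ?)";
                   }

                   public String insertReferenceSQL()
                   {
                      return "INSERT INTO " + refStoreName + " (channel_id, message_id, state) VALUES (?, ?, ?)";
                   }
                }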

                Obviously this is the result of just a quick look over the code, so I'm sure I've missed many subtleties, but I'm interested to know what you think.

                • 5. Re: DB performances for large number of messages
                  timfox

                  I think the way this should be done is to enable the user to configure a specific destination to use a specific persistence manager.

                  This would involve an extra (optional) attribute "PersistenceManager" on the destination.

                  When doing any database operations associated with a particular destination, we just need to make sure the correct persistence manager is used; then (hopefully) pretty much none of the rest of the code needs to change.
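                  Roughly this shape - just a sketch, "MessageStore" here is a stand-in and not the real persistence manager interface:

                  // Sketch only: each destination can be bound to its own store, with a fallback to the default
                  import java.util.Map;
                  import java.util.concurrent.ConcurrentHashMap;

                  public class PersistenceManagerRegistry
                  {
                     public interface MessageStore
                     {
                        void store(String destination, byte[] message);
                     }

                     private final MessageStore defaultStore;
                     private final Map<String, MessageStore> byDestination = new ConcurrentHashMap<String, MessageStore>();

                     public PersistenceManagerRegistry(MessageStore defaultStore)
                     {
                        this.defaultStore = defaultStore;
                     }

                     public void register(String destinationName, MessageStore store)
                     {
                        byDestination.put(destinationName, store);
                     }

                     // All database work for a destination goes through the store bound to it
                     public MessageStore storeFor(String destinationName)
                     {
                        MessageStore store = byDestination.get(destinationName);
                        return store != null ? store : defaultStore;
                     }
                  }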

                  There is the following complication.

                  When doing transactional operations (sending/acking) against more than one persistence manager in the same transaction, we need the whole operation to be atomic.

                  So we can either

                  1) Implement our own XAResource to handle this (not recommended)

                  2) Leverage any XA capabilities of the underlying database.

                  If we make sure we use an XADatasource for each database, we should be able to leverage the JCA adapter to handle transaction enlistment.
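                  Stripped of the JCA layer, the bare JTA picture is roughly this sketch (how the XADataSources are configured is left out, and the table names are made up):

                  // Sketch: one transaction spanning two databases via their XA resources.
                  // Inside the app server the JCA adapter would do the enlistment for us.
                  import java.sql.Connection;
                  import java.sql.PreparedStatement;
                  import javax.sql.XAConnection;
                  import javax.sql.XADataSource;
                  import javax.transaction.Transaction;
                  import javax.transaction.TransactionManager;

                  public class TwoStoreTx
                  {
                     public static void sendAndAck(TransactionManager tm, XADataSource ds1, XADataSource ds2)
                        throws Exception
                     {
                        tm.begin();
                        Transaction tx = tm.getTransaction();

                        XAConnection xac1 = ds1.getXAConnection();
                        XAConnection xac2 = ds2.getXAConnection();

                        // Enlist both databases so the insert and the ack commit (or roll back) together
                        tx.enlistResource(xac1.getXAResource());
                        tx.enlistResource(xac2.getXAResource());

                        Connection c1 = xac1.getConnection();
                        Connection c2 = xac2.getConnection();

                        PreparedStatement insert =
                           c1.prepareStatement("INSERT INTO jbm_msg_flow_a (message_id, payload) VALUES (?, ?)");
                        insert.setLong(1, 1L);
                        insert.setBytes(2, new byte[0]);
                        insert.executeUpdate();

                        PreparedStatement ack =
                           c2.prepareStatement("DELETE FROM jbm_msg_ref_flow_b WHERE message_id = ?");
                        ack.setLong(1, 2L);
                        ack.executeUpdate();

                        tm.commit();   // two-phase commit across both XA resources
                     }
                  }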

                  Of course, if the underlying database does not support XA we cannot do this.

                  Going forward I think this will be less of a problem, since we will support local file-based persistence stores so each node in the cluster can have its own storage.