25 Replies · Latest reply on Sep 13, 2010 6:46 PM by clebert.suconic

    OOM errors when using chained diverts

    ronnys

      Hi,

       

      we need to distribute messages to 2 different destinations. Originally we wanted to use a simple topic and 2 durable subscribers w/o filters, but due to the nature of the paging mechanism, this does not work very well if the backlogs of the two subscribers diverge too much (i.e. the gap gets too large), as the subscriber with the bigger backlog stalls the subscriber with the smaller backlog.

       

      So I tried diverts and (ab-?)used them in the following way:

       

      • Client posts to queue A
      • HornetQ exclusively diverts any messages coming in on queue A to queue B
      • HornetQ non-exclusively diverts any messages coming in on queue B to queue C

       

      When posting messages to queue A and reading them concurrently from queue B, the HornetQ server fails with OOM errors (tested with the latest trunk version as of Friday, July 16th). Server config files are attached; the memory and Java settings are the defaults, except that remote JMX access is enabled. The queues used are named jms.queue.msgduplicate*.

       

      In order to reproduce the issue:

      • delete HornetQ data directory and start HornetQ; watch memory consumption
      • start Publisher
        • LoadTestPublisher org.jnp.interfaces.NamingContextFactory jnp://host:port ConnectionFactory                       queue/msgduplicate_t1_inbound 241200000 10 2048 5000 0
      • Wait until the publisher has posted 200,000 messages, then start the consumer (do not consume messages from the other queue!)
        • LoadTestConsumer org.jnp.interfaces.NamingContextFactory jnp://host:port ConnectionFactory                       queue/msgduplicate_t1_outbound_1 241200000 5000
      • The server should consume a lot of memory and dump OOM errors within the next 2 minutes.

       

      Still looking for a better way to distribute messages to multiple different queues ...

       

      Best regards,

      Ronny

        • 1. Re: OOM errors when using chained diverts
          clebert.suconic

          You should instead divert from address-1 to diverted-address-1 and diverted-address-2.

           

           

          From your configs:

          
          <diverts>
             <!-- Move all msgs from msgduplicate_t1_inbound to msgduplicate_t1_outbound_1 -->
             <divert name="msgduplicate-t1-divert-1">
                <address>jms.queue.msgduplicate_t1_inbound</address>
                <forwarding-address>jms.queue.msgduplicate_t1_outbound_1</forwarding-address>
                <exclusive>true</exclusive>
             </divert>

             <!-- Copy all msgs from msgduplicate_t1_outbound_1 to msgduplicate_t1_outbound_2 -->
             <divert name="msgduplicate-t1-divert-2">
                <address>jms.queue.msgduplicate_t1_inbound</address>
                <forwarding-address>jms.queue.msgduplicate_t1_outbound_2</forwarding-address>
                <exclusive>false</exclusive>
             </divert>
          </diverts>

           

          And I don't see any paging config for the outbound addresses, so messages will build up in memory. You should probably configure them to page.
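          For illustration, the two suggestions combined would look something like the sketch below in hornetq-configuration.xml: two exclusive diverts fanning out from the single inbound address, plus address-settings that enable paging for the outbound addresses. The element names follow the standard HornetQ configuration schema already used above; the size limits are placeholder values, not recommendations:

```xml
<diverts>
   <!-- fan out: two exclusive diverts, both reading from the inbound address -->
   <divert name="msgduplicate-t1-divert-1">
      <address>jms.queue.msgduplicate_t1_inbound</address>
      <forwarding-address>jms.queue.msgduplicate_t1_outbound_1</forwarding-address>
      <exclusive>true</exclusive>
   </divert>
   <divert name="msgduplicate-t1-divert-2">
      <address>jms.queue.msgduplicate_t1_inbound</address>
      <forwarding-address>jms.queue.msgduplicate_t1_outbound_2</forwarding-address>
      <exclusive>true</exclusive>
   </divert>
</diverts>

<address-settings>
   <!-- page the outbound addresses to disk once they hold ~10 MB in memory -->
   <address-setting match="jms.queue.msgduplicate_t1_outbound_#">
      <max-size-bytes>10485760</max-size-bytes>
      <page-size-bytes>1048576</page-size-bytes>
      <address-full-policy>PAGE</address-full-policy>
   </address-setting>
</address-settings>
```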
          • 2. Re: OOM errors when using chained diverts
            ronnys

            Hi Clebert,

             

            thanks a lot. Using 2 exclusive diverts from the source queue worked. Btw, the paging settings were fine; the server nevertheless died due to OOM. Maybe something to take a look at.

             

            Best regards,

            Ronny

            • 3. Re: OOM errors when using chained diverts
              ronnys

              Hi Clebert,

               

              unfortunately the OOM issue is still present in r9489, it just takes a while to appear. I'm using 2 exclusive diverts on a single queue now in order to duplicate the inbound messages. When running the load test (send to the inbound queue, receive from both outbound queues), HornetQ fails after about 10 min with the following exception, which causes errors on the client side. The process memory consumption grows dramatically beyond the specified -Xmx size (1 GB) up to 2 GB, at which point it starts failing.

               

              java.lang.OutOfMemoryError: Direct buffer memory
                  at java.nio.Bits.reserveMemory(Bits.java:633)
                  at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:95)
                  at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:288)
                  at sun.nio.ch.Util.getTemporaryDirectBuffer(Util.java:57)
                  at sun.nio.ch.IOUtil.read(IOUtil.java:205)
                  at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:144)
                  at org.hornetq.core.journal.impl.NIOSequentialFile.read(NIOSequentialFile.java:181)
                  at org.hornetq.core.journal.impl.NIOSequentialFile.read(NIOSequentialFile.java:174)
                  at org.hornetq.core.paging.impl.PageImpl.read(PageImpl.java:99)
                  at org.hornetq.core.paging.impl.PagingStoreImpl.readPage(PagingStoreImpl.java:623)
                  at org.hornetq.core.paging.impl.PagingStoreImpl.access$400(PagingStoreImpl.java:62)
                  at org.hornetq.core.paging.impl.PagingStoreImpl$DepageRunnable.run(PagingStoreImpl.java:1175)
                  at org.hornetq.utils.OrderedExecutorFactory$OrderedExecutor$1.run(OrderedExecutorFactory.java:100)
                  at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
                  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
                  at java.lang.Thread.run(Thread.java:619)

               

              Paging is configured for the inbound address (jms.queue.msgduplicate_divert_inbound) and both outbound addresses (jms.queue.msgduplicate_divert_outbound_*). The messages are clearly paged out to disk for the outbound addresses (the directories are there, the files are changing), so I assume the paging configuration is correct.

               

              Configuration files, load test class etc. are attached. You need to run the load test like this:

               

              java LoadTest jndi_hq_61099_msgduplicate_divert_inbound.properties jndi_hq_61099_msgduplicate_divert_outbound_1.properties,jndi_hq_61099_msgduplicate_divert_outbound_2.properties stopfile 2500000 25000 5 25 10 2048 5000 1000000 true

               

              The properties files are attached as samples.

               

              Best regards,
              Ronny

              • 4. Re: OOM errors when using chained diverts
                clebert.suconic

                 Sun's IO will use a bunch of cached native buffers... it seems you would need more memory.

                 

                I'm looking at changing the way I read page files. I will need some time testing it.

                • 5. Re: OOM errors when using chained diverts
                  clebert.suconic

                  This is actually a bug in the JDK...

                   

                  Sun is using a soft-reference map to cache native buffers... but that doesn't work.

                   

                  A native buffer only has a few bytes allocated in the VM itself, while the real memory is allocated through malloc, which doesn't affect the GC counters (and a GC is what would eventually free up the soft-referenced cache).

                   

                  A SoftReference only gets cleared when the system needs more heap memory, but a native buffer exposes only a few bytes to the JDK, hiding most of its footprint in the native world.

                   

                  As a result, if you read a different size each time (as we do on paging), you just keep growing the cache (i.e. a leak).

                   

                   

                  Now, I haven't found anything about this in the bug reports. I will talk internally to someone on Red Hat's JDK team.
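                  The failure mode described above can be sketched with a toy cache (the names below are illustrative, not the real sun.nio.ch.Util code): a size-keyed map of soft-referenced direct buffers gains one native allocation per distinct read size, while the GC, which only sees the tiny wrapper objects on the heap, feels no pressure to clear the soft references:

```java
import java.lang.ref.SoftReference;
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

public class BufferCacheSketch {
    // Illustrative stand-in for the JDK's temporary-buffer cache:
    // one soft-referenced direct buffer per requested size.
    private final Map<Integer, SoftReference<ByteBuffer>> cache = new HashMap<>();

    ByteBuffer get(int size) {
        SoftReference<ByteBuffer> ref = cache.get(size);
        ByteBuffer buf = (ref == null) ? null : ref.get();
        if (buf == null) {
            // native malloc: the heap barely notices, so no GC pressure builds up
            buf = ByteBuffer.allocateDirect(size);
            cache.put(size, new SoftReference<>(buf));
        }
        buf.clear();
        return buf;
    }

    int cachedEntries() { return cache.size(); }

    public static void main(String[] args) {
        BufferCacheSketch cache = new BufferCacheSketch();
        // Page reads of varying sizes -> one native allocation per distinct size.
        for (int size = 1024; size <= 10 * 1024; size += 1024) {
            cache.get(size);
        }
        System.out.println(cache.cachedEntries());  // prints 10 (~55 KB native held)
    }
}
```

In the real JDK cache the effect is the same: reads of varying sizes keep adding entries, and the native memory behind them stays invisible to the collector.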

                  • 6. Re: OOM errors when using chained diverts
                    clebert.suconic

                    I have implemented a workaround in PageImpl. I will be instantiating my own native buffer instead of sending a regular ByteBuffer, which avoids the naive caching in the JDK.

                     

                     

                    I will be running more tests now, but I have already been able to run your test for 3 hours.
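                    The workaround can be sketched as follows (a minimal standalone example, not the actual PageImpl code): if the caller hands FileChannel.read() a direct buffer it allocated itself, the JDK has no need to fetch a cached temporary direct buffer, which is what it does when given a heap buffer:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class DirectReadSketch {
    // Read a whole file through one explicitly allocated direct buffer.
    // Because the buffer is already direct, FileChannel.read() copies into it
    // straight from the OS, bypassing the JDK's size-keyed temporary-buffer cache.
    public static byte[] readWithDirectBuffer(Path file) throws IOException {
        int size = (int) Files.size(file);
        ByteBuffer direct = ByteBuffer.allocateDirect(size);
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            while (direct.hasRemaining() && ch.read(direct) >= 0) {
                // keep reading until the buffer is full or EOF
            }
        }
        direct.flip();
        byte[] out = new byte[direct.remaining()];
        direct.get(out);
        return out;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("page", ".dat");
        Files.write(tmp, "hello page file".getBytes(StandardCharsets.UTF_8));
        byte[] back = readWithDirectBuffer(tmp);
        System.out.println(new String(back, StandardCharsets.UTF_8));  // prints "hello page file"
        Files.delete(tmp);
    }
}
```

The trade-off is that the native allocation is now under the caller's control and lifetime, rather than hidden inside a JDK-level cache.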

                    • 7. Re: OOM errors when using chained diverts
                      timfox

                      Clebert, you should talk to Trustin about this. I remember he has done a lot of experimenting with native vs non-native buffers in the past.

                      • 8. Re: OOM errors when using chained diverts
                        ronnys

                        Hi Clebert,

                         

                        thanks for the quick solution. The memory consumption looks fine now. I'll keep the test running overnight or for a few days.

                         

                        Best regards,

                        Ronny

                        • 9. Re: OOM errors when using chained diverts
                          ronnys

                          Hi Clebert,

                           

                          the test ran fine for ~14 hours; afterwards the server failed with

                           

                          [Thread-13 (group:HornetQ-server-threads280564915-1075051425)] 00:37:51,204 WARNING [org.hornetq.core.journal.impl.JournalImpl]  Couldn't open a file in 60 Seconds
                          java.lang.Exception: Warning: Couldn't open a file in 60 Seconds
                              at org.hornetq.core.journal.impl.JournalImpl.enqueueOpenFile(JournalImpl.java:3325)
                              at org.hornetq.core.journal.impl.JournalImpl.moveNextFile(JournalImpl.java:3267)
                              at org.hornetq.core.journal.impl.JournalImpl.appendRecord(JournalImpl.java:3095)
                              at org.hornetq.core.journal.impl.JournalImpl.appendAddRecordTransactional(JournalImpl.java:1158)
                              at org.hornetq.core.persistence.impl.journal.JournalStorageManager.storeMessageTransactional(JournalStorageManager.java:567)
                              at org.hornetq.core.postoffice.impl.PostOfficeImpl.processRoute(PostOfficeImpl.java:904)
                              at org.hornetq.core.postoffice.impl.PostOfficeImpl.route(PostOfficeImpl.java:668)
                              at org.hornetq.core.postoffice.impl.PostOfficeImpl.route(PostOfficeImpl.java:536)
                              at org.hornetq.core.paging.impl.PagingStoreImpl.onDepage(PagingStoreImpl.java:1007)
                              at org.hornetq.core.paging.impl.PagingStoreImpl.readPage(PagingStoreImpl.java:636)
                              at org.hornetq.core.paging.impl.PagingStoreImpl.access$400(PagingStoreImpl.java:62)
                              at org.hornetq.core.paging.impl.PagingStoreImpl$DepageRunnable.run(PagingStoreImpl.java:1175)
                              at org.hornetq.utils.OrderedExecutorFactory$OrderedExecutor$1.run(OrderedExecutorFactory.java:100)
                              at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
                              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
                              at java.lang.Thread.run(Thread.java:619)

                           

                          while the client was trying to receive messages. The client failed with

                           

                          Aug 4, 2010 12:38:17 AM org.hornetq.core.logging.impl.JULLogDelegate warn
                          WARNING: Connection failure has been detected: Did not receive data from server for org.hornetq.core.remoting.impl.netty.NettyConnection@6c241f31[local= /10.115.151.40:56320, remote=ttwsrvdev02.dev.ttw/10.115.151.40:61445] [code=3]
                          Aug 4, 2010 12:38:47 AM org.hornetq.core.logging.impl.JULLogDelegate error
                          SEVERE: Failed to handle failover
                          HornetQException[errorCode=3 message=Timed out waiting for response when sending packet 32]
                              at org.hornetq.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:277)
                              at org.hornetq.core.client.impl.ClientSessionImpl.handleFailover(ClientSessionImpl.java:874)
                              at org.hornetq.core.client.impl.FailoverManagerImpl.reconnectSessions(FailoverManagerImpl.java:817)
                              at org.hornetq.core.client.impl.FailoverManagerImpl.failoverOrReconnect(FailoverManagerImpl.java:718)
                              at org.hornetq.core.client.impl.FailoverManagerImpl.handleConnectionFailure(FailoverManagerImpl.java:580)
                              at org.hornetq.core.client.impl.FailoverManagerImpl.access$600(FailoverManagerImpl.java:72)
                              at org.hornetq.core.client.impl.FailoverManagerImpl$DelegatingFailureListener.connectionFailed(FailoverManagerImpl.java:1150)
                              at org.hornetq.core.protocol.core.impl.RemotingConnectionImpl.callFailureListeners(RemotingConnectionImpl.java:482)
                              at org.hornetq.core.protocol.core.impl.RemotingConnectionImpl.fail(RemotingConnectionImpl.java:254)
                              at org.hornetq.core.client.impl.FailoverManagerImpl$PingRunnable$1.run(FailoverManagerImpl.java:1208)
                              at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
                              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
                              at java.lang.Thread.run(Thread.java:619)

                           

                          but I guess the latter exception is normal. The client did not receive any new messages after the error, even after reconnecting to the server. The HornetQ version was r9495; I'm going to test with r9502 again.

                           

                          No other errors were dumped and there was still plenty of space on the file system.

                           

                          After checking the HornetQ source code I guess I should have gotten a thread dump of the server; will do that next time if it happens again.

                           

                          Best regards,
                          Ronny

                          • 10. Re: OOM errors when using chained diverts
                            clebert.suconic

                            No other errors in the logs?

                             

                            Running out of files could lead to a case where the open failed and you didn't have an openedFile available.

                             

                            The leak I fixed on page shouldn't affect any of this.

                             

                            Please take a thread dump if this happens again. I have had your test running for 20 hours already, and I restarted it after another day of running.

                            • 11. Re: OOM errors when using chained diverts
                              ronnys

                              Hi Clebert,

                               

                              the issue just happened again with r9513 after ~24h.

                              No other errors in the logs?

                               

                              Running out of files could lead to a case where the open failed and you didn't have an openedFile available.

                               

                              No other errors besides the ones I already posted; the full HornetQ log file is attached for reference (the test ran unattended over the weekend and had already failed on Saturday morning).

                               

                              Please take a thread dump if this happens again. I have had your test running for 20 hours already, and I restarted it after another day of running.

                               

                              I took a series of thread dumps (every 5s for 3min). Attached. Hope they're helpful.

                               

                              Best regards,

                              Ronny

                              • 12. Re: OOM errors when using chained diverts
                                clebert.suconic

                                Thanks a lot,

                                 

                                 

                                I believe this is something I introduced when I changed some of the moveNext logic on cleanup.

                                 

                                I have removed the possibility of the wait lock during close, which will completely eliminate this problem.

                                 

                                I will keep your test and a couple of other tests running for the next few days.

                                 

                                 

                                Also: I have done some tweaks to compacting and cleanup. When the system was compacting the first files (I call this cleanup), it would always delete a file instead of possibly reusing it (when numFiles < minFiles).

                                 

                                I could run your test for a long period without creating any files whatsoever. I set minFiles = 100, and the system never created any files; it always reused the same files, which should improve throughput on your system.

                                 

                                If you look at the performance chapter, we suggest keeping minFiles tuned to the maximum number of files used by your system. I would suggest you increase that number.
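                                For reference, the journal file count is set in hornetq-configuration.xml; the value below mirrors the minFiles = 100 used in the test above and is illustrative, not a recommendation:

```xml
<!-- pre-allocate and keep reusing this many journal files;
     tune it toward the peak number of files your load actually uses -->
<journal-min-files>100</journal-min-files>
```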

                                • 13. Re: OOM errors when using chained diverts
                                  ronnys

                                  Hi Clebert,

                                   

                                  I believe this is something I introduced when I changed some of the moveNext logic on cleanup.

                                   

                                  I have removed the possibility of the wait lock during close, which will completely eliminate this problem.

                                   

                                  I will keep your test and a couple of other tests running for the next few days.

                                   

                                  thanks a lot for the fix. I'll rerun my test as well.

                                   

                                  Also: I have done some tweaks to compacting and cleanup. When the system was compacting the first files (I call this cleanup), it would always delete a file instead of possibly reusing it (when numFiles < minFiles).

                                   

                                  I could run your test for a long period without creating any files whatsoever. I set minFiles = 100, and the system never created any files; it always reused the same files, which should improve throughput on your system.

                                   

                                  If you look at the performance chapter, we suggest keeping minFiles tuned to the maximum number of files used by your system. I would suggest you increase that number.

                                   

                                  Thanks for the hint, I'll give it a try.

                                   

                                  Best regards,

                                  Ronny

                                  • 14. Re: OOM errors when using chained diverts
                                    ronnys

                                    Hi Clebert,

                                     

                                    I ran my test again starting Tuesday last week against r9522. Unfortunately, HornetQ failed again with the "[org.hornetq.core.journal.impl.JournalImpl]  Couldn't open a file in 60 Seconds" error after ~79h. It had been in this error state for ~56h when I came in this morning. The recommended change to the minFiles parameter has not yet been applied, so as not to skew the test.

                                     

                                    New thread dump series & full log attached again. Don't be confused by the timestamps in the log; the server clock has been off by 6 hours since the last reboot, which shouldn't have had any effect on the test. The dump should be ready to be loaded into Samurai.

                                     

                                    I saw quite a few changes in SVN since last Tuesday. Please let me know if I should run another test with the latest revision.

                                     

                                    Best regards,
                                    Ronny
