11 Replies Latest reply on Mar 18, 2010 8:14 AM by sv_srinivaas

    Cluster message redistribution / lost messages

    sv_srinivaas

      Hi,

      I have an issue with HornetQ 2.0.0.GA and JBoss 5.1.0 GA where messages are sometimes not picked up by the consumer even though redistribution is enabled. Sometimes messages are lost when there are no consumers. This happens only on Linux, not on Windows.

       

      I have a queue cluster of two nodes, nodeq1 and nodeq2. I have another cluster of two nodes, nodeb1 and nodeb2, where my MDBs are deployed.

       

      Steps to replicate the issue

      Started both the queue nodes nodeq1 and nodeq2
      Started one consumer node nodeb2 for MDB2 to consume from nodeq2
      Sent 4 messages and all got consumed by MDB2 from nodeb2

       

      Started nodeb1 also for MDB1 to consume from nodeq1 (and anyway MDB2 should consume from nodeq2)
      Sent 4 messages and two got consumed by MDB1 and two by MDB2

       

      Stopped nodeb1
      Sent 4 messages and all 4 got consumed by MDB2

       

      Stopped nodeb2 as well
      Sent 4 messages, and I encountered 3 different scenarios

       

      Scenario1: Sometimes all 4 messages were sent to nodeq2 and all remained unconsumed
      Scenario2: Sometimes 2 messages were sent to nodeq1 and two messages to nodeq2 and all remained unconsumed
      Scenario3: Sometimes 4 messages were sent to nodeq1 and ALSO 4 messages to nodeq2, and all of them were shown as consumed in the JMX console even though there were no consumers on either queue node to pick up the messages. The exception below was also reported in both queue logs, for Scenario3 only

       

      2010-02-18 06:12:51,113 WARN  [org.hornetq.core.client.impl.FailoverManagerImpl] (Thread-8 (group:HornetQ-client-global-threads-717964854)) Failed to connect to server.
      2010-02-18 06:12:59,251 WARN  [org.hornetq.core.postoffice.impl.PostOfficeImpl] (New I/O server worker #1-2) Duplicate message detected - message will not be routed
      2010-02-18 06:12:59,859 WARN  [org.hornetq.core.postoffice.impl.PostOfficeImpl] (New I/O server worker #1-2) Duplicate message detected - message will not be routed
      2010-02-18 06:13:00,663 WARN  [org.hornetq.core.postoffice.impl.PostOfficeImpl] (New I/O server worker #1-2) Duplicate message detected - message will not be routed
      2010-02-18 06:13:01,488 WARN  [org.hornetq.core.postoffice.impl.PostOfficeImpl] (New I/O server worker #1-2) Duplicate message detected - message will not be routed

       

      Started nodeb2 again for MDB2 to consume from nodeq2
      Sent 4 messages: two were sent to nodeq1 and two to nodeq2. Only the two messages on nodeq2 were consumed by MDB2, and the remaining 4 messages on nodeq1 remained unconsumed even though redistribution is enabled on both nodeq1 and nodeq2. (We have set forward-when-no-consumers to false and redistribution-delay to 0 on both nodes.)

       

      I've also attached the XML files for all 4 nodes. Please help.

        • 1. Re: Cluster message redistribution / lost messages
          timfox

          You know the drill by now

           

          Please attach a test program that demonstrates the issue, etc.....

          • 2. Re: Cluster message redistribution / lost messages
            timfox

            It would be easier to replicate this without using MDBs.

             

            You could write a simple program that creates connections to the two respective nodes, creates consumers, and closes / opens them in the order you describe so the problem can be replicated.

            • 3. Re: Cluster message redistribution / lost messages
              sv_srinivaas

              Tim, I tried replicating this issue without MDBs, but then I don't see the lost-message issue at all with normal receive methods. In fact, I created two queues, requestQ and mdbRequestQ, with a servlet consuming from requestQ and an MDB consuming from mdbRequestQ.

               

              Then I tested the message redistribution feature by sending a few messages to both requestQ and mdbRequestQ, and things worked fine.

               

              Then I killed / restarted the consumer nodes (nodeb1 or nodeb2) a couple of times at random (as the issue doesn't always happen in the same sequence). I tested the application with a few more messages and everything worked.

               

              Then I killed both consumer nodes (i.e. nodeb1 and nodeb2) and sent two messages to both requestQ and mdbRequestQ. This time the messages were load balanced in requestQ, whereas in the JMX console for mdbRequestQ I could see duplicate messages: nodeq1 and nodeq2 had two messages each, and the messages were also shown as consumed.

               

              I'd like to know how the messages can be in a consumed state when there were no MDBs to consume them, and also why the messages are getting redistributed when there were no consumers on either queue node. The exception below was thrown at this stage.

               

              2010-02-18 06:12:51,113 WARN [org.hornetq.core.client.impl.FailoverManagerImpl] (Thread-8 (group:HornetQ-client-global-threads-717964854)) Failed to connect to server.
              2010-02-18 06:12:59,251 WARN [org.hornetq.core.postoffice.impl.PostOfficeImpl] (New I/O server worker #1-2) Duplicate message detected - message will not be routed
              2010-02-18 06:12:59,859 WARN [org.hornetq.core.postoffice.impl.PostOfficeImpl] (New I/O server worker #1-2) Duplicate message detected - message will not be routed
              2010-02-18 06:13:00,663 WARN [org.hornetq.core.postoffice.impl.PostOfficeImpl] (New I/O server worker #1-2) Duplicate message detected - message will not be routed
              2010-02-18 06:13:01,488 WARN [org.hornetq.core.postoffice.impl.PostOfficeImpl] (New I/O server worker #1-2) Duplicate message detected - message will not be routed

               

              I've attached the sender Java client code and the MDB code (I've already sent the XML files for all 4 nodes). Please let me know if you need any other information.

               

              OS version : Linux 2.6.18-164.9.1.el5

              Java : jdk 1.6.0.17

               

              Note: I don't have this issue on Windows, where my JDK version is 1.6.0.05

              • 4. Re: Cluster message redistribution / lost messages
                sv_srinivaas

                Hi, if I override <journal-type> and set it to NIO instead of AIO (on Linux), then I don't get any issues with redistribution or lost messages and everything works fine, but with AIO I get those issues.

                 

                Do I need to do anything specific for redistribution to work with AIO?

                • 5. Re: Cluster message redistribution / lost messages
                  timfox

                  What distro/kernel version are you using?

                   

                  Also, what file system are you using to store your journal?

                  • 6. Re: Cluster message redistribution / lost messages
                    sv_srinivaas

                    Tim,

                     

                    We just checked, and it looks like there is a difference in the kernel minor version between the two queue nodes. We'll make it the same on both nodes and try again anyway.

                     

                    nodeq1:
                    Filesystem: ext3
                    Linux Distribution: RHEL 5.4
                    Kernel: 2.6.18-164.9.1.el5


                    nodeq2:
                    Filesystem: ext3
                    Linux Distribution: RHEL 5.4
                    Kernel: 2.6.18-164.el5

                     

                     

                    • 7. Re: Cluster message redistribution / lost messages
                      timfox
                      Are you sure you're not storing your journal on a network drive, e.g. NFS?
                      • 8. Re: Cluster message redistribution / lost messages
                        sv_srinivaas
                        Tim, I don't think we are using a network drive, but is there an easy way to find out if we are using one? Thanks!
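
One way to check this from Java itself is to ask the JVM which file store backs the journal directory. The following is only a minimal sketch: it assumes Java 7 or later (newer than the JDK 1.6 mentioned in this thread), the class and method names are hypothetical, the journal directory is passed as a command-line argument, and the list of network file system type names is illustrative rather than exhaustive.

```java
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class JournalFileStoreCheck {

    // File system type names commonly reported for network mounts.
    // Assumption: this list is illustrative, not exhaustive.
    private static final Set<String> NETWORK_TYPES = new HashSet<>(
            Arrays.asList("nfs", "nfs4", "cifs", "smbfs", "smb3"));

    /** Returns the type of the file store holding the given path, e.g. "ext3". */
    public static String fileStoreType(Path dir) throws IOException {
        FileStore store = Files.getFileStore(dir);
        return store.type();
    }

    /** True if the type name matches one of the known network file system names. */
    public static boolean looksLikeNetworkFileSystem(String type) {
        return NETWORK_TYPES.contains(type.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical usage: pass the journal directory as the first argument.
        Path journalDir = Paths.get(args.length > 0 ? args[0] : ".");
        String type = fileStoreType(journalDir);
        System.out.println(journalDir.toAbsolutePath() + " is on a file store of type " + type);
        if (looksLikeNetworkFileSystem(type)) {
            System.out.println("This looks like a network file system.");
        }
    }
}
```

Running it against the journal directory prints the file store type, which can then be compared with what the sysadmin reports.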
                        • 9. Re: Cluster message redistribution / lost messages
                          timfox

                          Ask your sysadmin

                           

                          BTW, you said previously your filesystem was ext3, if you know that, then you know you're not using NFS right?

                           

                          Since you're not sure, I can only assume you're not sure about your previous answer.

                          • 10. Re: Cluster message redistribution / lost messages
                            sv_srinivaas

                            Tim,

                             

                            Our file system is ext3 only. I checked with the admin and confirmed that.

                            • 11. Re: Cluster message redistribution / lost messages
                              sv_srinivaas

                              Hi, this issue is resolved now. It has nothing to do with Windows or Linux; the Linux node was simply running low on memory (due to a memory leak in the application), and hence messages were being paged to disk, which I was not aware of. No issues now. Thanks!
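
For anyone hitting similar symptoms, a low-memory condition like this can be spotted early by checking heap usage from inside the JVM. The sketch below uses only the standard `java.lang.management` API. It assumes the memory in question is the JVM heap, the class name is hypothetical, and the 80% warning threshold is an arbitrary illustrative value, not a HornetQ setting.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class HeapHeadroomCheck {

    // Assumption: 0.80 is an arbitrary warning threshold chosen for illustration.
    private static final double WARN_FRACTION = 0.80;

    /** Returns used / max as a fraction between 0 and 1. */
    public static double usedFraction(long usedBytes, long maxBytes) {
        if (maxBytes <= 0) {
            throw new IllegalArgumentException("maxBytes must be positive");
        }
        return (double) usedBytes / maxBytes;
    }

    public static void main(String[] args) {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        // getMax() returns -1 when the maximum is undefined; fall back to committed memory.
        long max = heap.getMax() > 0 ? heap.getMax() : heap.getCommitted();
        double fraction = usedFraction(heap.getUsed(), max);
        System.out.printf("Heap used: %d of %d bytes (%.1f%%)%n",
                heap.getUsed(), max, fraction * 100.0);
        if (fraction > WARN_FRACTION) {
            System.out.println("Warning: heap usage is above the threshold.");
        }
    }
}
```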