11 Replies Latest reply on Mar 18, 2010 8:14 AM by sv_srinivaas

Cluster message redistribution / lost messages

sv_srinivaas Feb 18, 2010 5:58 AM

Hi,

I have an issue with HornetQ 2.0.0.GA and JBoss 5.1.0 GA where messages are sometimes not picked up by the consumer even though redistribution is enabled. Sometimes messages are lost when there are no consumers. This happens only on Linux, not on Windows.

I have a queue cluster of two nodes nodeq1 and nodeq2. I have another cluster of two nodes nodeb1 and nodeb2 where I have my MDBs deployed.

Steps to replicate the issue

Started both the queue nodes nodeq1 and nodeq2
Started one consumer node nodeb2 for MDB2 to consume from nodeq2
Sent 4 messages and all got consumed by MDB2 from nodeb2

Started nodeb1 also for MDB1 to consume from nodeq1 (and anyway MDB2 should consume from nodeq2)
Sent 4 messages and two got consumed by MDB1 and two by MDB2

Stopped nodeb1
Sent 4 messages and all 4 got consumed by MDB2

Stopped nodeb2 as well
Sent 4 messages, and I encountered 3 different scenarios

Scenario 1: Sometimes all 4 messages were sent to nodeq2 and all remained unconsumed.
Scenario 2: Sometimes 2 messages were sent to nodeq1 and 2 messages to nodeq2, and all remained unconsumed.
Scenario 3: Sometimes 4 messages were sent to nodeq1 and ALSO 4 messages to nodeq2, and all of them were shown as consumed in the JMX console even though neither queue node had any consumers to pick them up. The warnings below were reported in both queue logs, for Scenario 3 only.

2010-02-18 06:12:51,113 WARN [org.hornetq.core.client.impl.FailoverManagerImpl] (Thread-8 (group:HornetQ-client-global-threads-717964854)) Failed to connect to server.
2010-02-18 06:12:59,251 WARN [org.hornetq.core.postoffice.impl.PostOfficeImpl] (New I/O server worker #1-2) Duplicate message detected - message will not be routed
2010-02-18 06:12:59,859 WARN [org.hornetq.core.postoffice.impl.PostOfficeImpl] (New I/O server worker #1-2) Duplicate message detected - message will not be routed
2010-02-18 06:13:00,663 WARN [org.hornetq.core.postoffice.impl.PostOfficeImpl] (New I/O server worker #1-2) Duplicate message detected - message will not be routed
2010-02-18 06:13:01,488 WARN [org.hornetq.core.postoffice.impl.PostOfficeImpl] (New I/O server worker #1-2) Duplicate message detected - message will not be routed

Started nodeb2 again for MDB2 to consume from nodeq2
Sent 4 messages; two were sent to nodeq1 and two to nodeq2. Only the two messages on nodeq2 were consumed by MDB2, and the remaining 4 messages in nodeq1 remained unconsumed even though redistribution is enabled on both nodeq1 and nodeq2. (We have set forward-when-no-consumers to false and redistribution-delay to 0 on both nodes.)

I've also attached the XML configuration files for all 4 nodes. Please help.

hornetq-xmls.zip 23.7 KB

1. Re: Cluster message redistribution / lost messages

timfox Feb 18, 2010 6:00 AM (in response to sv_srinivaas)

You know the drill by now

Please attach a test program that demonstrates the issue, etc.....
2. Re: Cluster message redistribution / lost messages

timfox Feb 18, 2010 6:35 AM (in response to timfox)

It would be easier to replicate this without using MDBs.

You could write a simple program that creates connections to the two respective nodes, creates consumers, and closes / opens them in the order you describe so the problem can be replicated.
3. Re: Cluster message redistribution / lost messages

sv_srinivaas Feb 22, 2010 6:25 AM (in response to timfox)
Tim, I tried replicating this issue without MDBs, but with normal receive methods I don't see the lost-message issue at all. I created two queues, requestQ and mdbRequestQ, with a servlet consuming from requestQ and an MDB consuming from mdbRequestQ.

Then I tested the message redistribution feature by sending a few messages to both requestQ and mdbRequestQ, and things worked fine.

Then I killed / restarted the consumer nodes (nodeb1 or nodeb2) a couple of times at random (as the problem doesn't always happen in the same sequence). I tested the application with a few more messages and everything worked.

Then I killed both consumer nodes (i.e. nodeb1 and nodeb2) and sent two messages to both requestQ and mdbRequestQ. This time the messages were load balanced in requestQ, whereas in the JMX console for mdbRequestQ I could see duplicate messages on both nodeq1 and nodeq2: each had two messages, and both messages were shown as consumed.

I'd like to know how the messages can be in a consumed state when there were no MDBs to consume them, and also why the messages are being redistributed when neither queue node has any consumers. The warnings below were logged at this stage.

2010-02-18 06:12:51,113 WARN [org.hornetq.core.client.impl.FailoverManagerImpl] (Thread-8 (group:HornetQ-client-global-threads-717964854)) Failed to connect to server.
2010-02-18 06:12:59,251 WARN [org.hornetq.core.postoffice.impl.PostOfficeImpl] (New I/O server worker #1-2) Duplicate message detected - message will not be routed
2010-02-18 06:12:59,859 WARN [org.hornetq.core.postoffice.impl.PostOfficeImpl] (New I/O server worker #1-2) Duplicate message detected - message will not be routed
2010-02-18 06:13:00,663 WARN [org.hornetq.core.postoffice.impl.PostOfficeImpl] (New I/O server worker #1-2) Duplicate message detected - message will not be routed
2010-02-18 06:13:01,488 WARN [org.hornetq.core.postoffice.impl.PostOfficeImpl] (New I/O server worker #1-2) Duplicate message detected - message will not be routed

I've attached the sender Java client code and the MDB code (I've already sent the XML files for all 4 nodes). Please let me know if you need any other information.

OS version: Linux 2.6.18-164.9.1.el5
Java: jdk 1.6.0.17

Note: I don't have this issue on Windows, where my JDK version is 1.6.0.05

MDBListener.java.zip 707 bytes

MySender.java.zip 1.1 KB
4. Re: Cluster message redistribution / lost messages

sv_srinivaas Feb 23, 2010 2:19 AM (in response to sv_srinivaas)

Hi, if I override <journal-type> and set it to NIO instead of AIO (on Linux), I don't get any issues with redistribution or lost messages and everything works fine, but with AIO I get those issues.

Do I need to do anything specific for redistribution to work with AIO?
5. Re: Cluster message redistribution / lost messages

timfox Feb 23, 2010 4:30 AM (in response to sv_srinivaas)

What distro/kernel version are you using?

Also, what file system are you using to store your journal?
6. Re: Cluster message redistribution / lost messages

sv_srinivaas Feb 23, 2010 5:06 AM (in response to timfox)

Tim,

We just checked, and it looks like the kernel minor version differs between the two queue nodes. We'll make it the same on both nodes and try again anyway.

nodeq1:
Filesystem: ext3
Linux Distribution: RHEL 5.4
Kernel: 2.6.18-164.9.1.el5

nodeq2:
Filesystem: ext3
Linux Distribution: RHEL 5.4
Kernel: 2.6.18-164.el5
7. Re: Cluster message redistribution / lost messages

timfox Feb 23, 2010 5:26 AM (in response to sv_srinivaas)

Are you sure you're not storing your journal on a network drive, e.g. NFS?
8. Re: Cluster message redistribution / lost messages

sv_srinivaas Feb 23, 2010 5:53 AM (in response to timfox)

Tim, I don't think we are using a network drive, but is there an easy way to find out if we are? Thanks!
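
For reference, one possible way to check this from Java is to ask the JVM which file store backs the journal directory. This is only a sketch under assumptions: it needs Java 7 or later (`java.nio.file` is not available on the JDK 1.6 mentioned in this thread), the class name `JournalFsCheck` and the list of "network" type names are made up for illustration, and the journal directory path must be supplied by the caller.

```java
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class JournalFsCheck {
    // Assumption: type names that usually indicate a network file system.
    // The list is illustrative, not exhaustive.
    private static final List<String> NETWORK_TYPES =
            Arrays.asList("nfs", "nfs4", "cifs", "smbfs");

    // Returns true if the given file system type name is in the assumed list.
    public static boolean isLikelyNetworkFs(String fsType) {
        return fsType != null
                && NETWORK_TYPES.contains(fsType.toLowerCase(Locale.ROOT));
    }

    // Returns the type reported by the file store that holds the directory,
    // for example "ext3" or "nfs" on Linux.
    public static String fileSystemType(Path dir) throws IOException {
        FileStore store = Files.getFileStore(dir);
        return store.type();
    }

    public static void main(String[] args) throws IOException {
        // Pass the journal directory as the first argument; defaults to ".".
        Path journalDir = Paths.get(args.length > 0 ? args[0] : ".");
        String type = fileSystemType(journalDir);
        System.out.println(journalDir.toAbsolutePath() + " is on file system type: " + type);
        System.out.println("Likely network file system: " + isLikelyNetworkFs(type));
    }
}
```

On a JDK 1.6 machine, running `df -T` against the journal directory from a shell should report the same file system type without any code.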
9. Re: Cluster message redistribution / lost messages

timfox Feb 23, 2010 6:06 AM (in response to sv_srinivaas)

Ask your sysadmin

BTW, you said previously your filesystem was ext3, if you know that, then you know you're not using NFS right?

Since you're not sure, I can only assume you're not sure about your previous answer.
10. Re: Cluster message redistribution / lost messages

sv_srinivaas Feb 23, 2010 6:48 AM (in response to timfox)

Tim,

Our file system is ext3 only. I checked with our admin and confirmed that.
11. Re: Cluster message redistribution / lost messages

sv_srinivaas Mar 18, 2010 8:14 AM (in response to sv_srinivaas)

Hi, this issue is resolved now. It had nothing to do with Windows or Linux; the Linux node was simply running low on memory (due to a memory leak in the application), so messages were being paged to disk without my being aware of it. No issues now. Thanks!
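
Since the root cause turned out to be a node running low on memory, a small heap check like the following might help spot that condition earlier. It is a minimal sketch, not part of HornetQ: the class name `HeapUsageCheck` and the 0.9 warning threshold are arbitrary assumptions, it only reports the heap of the JVM it runs in (so it would have to be called from inside the application being watched), and it does not inspect HornetQ's paging state.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class HeapUsageCheck {
    // Assumption: warn when more than 90% of the maximum heap is in use.
    private static final double WARN_THRESHOLD = 0.9;

    // Fraction of the heap in use, or -1.0 when the maximum is undefined
    // (MemoryUsage.getMax() returns -1 in that case).
    public static double usedFraction(long usedBytes, long maxBytes) {
        if (maxBytes <= 0) {
            return -1.0;
        }
        return (double) usedBytes / maxBytes;
    }

    // Reads the current heap usage of this JVM via the standard MemoryMXBean.
    public static double currentHeapUsedFraction() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return usedFraction(heap.getUsed(), heap.getMax());
    }

    public static void main(String[] args) {
        double fraction = currentHeapUsedFraction();
        if (fraction < 0) {
            System.out.println("Maximum heap size is undefined for this JVM.");
        } else {
            System.out.println("Heap used fraction: " + fraction);
            if (fraction > WARN_THRESHOLD) {
                System.out.println("WARNING: heap usage is above the assumed threshold.");
            }
        }
    }
}
```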