30 Replies. Latest reply on Nov 18, 2009 12:48 PM by ataylor

    Race condition found by failure in GroupingFailoverReplicationTest

    jmesnil

      This is related to an intermittent failure of GroupingFailoverReplicationTest.testGroupingLocalHandlerFails.
      The test sometimes fails because the remote queue binding is added twice to the same server.

      Here are some logs I added to understand what's going on:

      INFO: ADDED BINDING uuid=51ab42be-ce05-11de-b662-001c42000009, notif.51aaa67a-ce05-11de-b662-001c42000009519b3d24-ce05-11de-b662-001c42000009, distance=0, backup=false, server=519b3d24-ce05-11de-b662-001c42000009
      INFO: ADDED BINDING uuid=51acf06f-ce05-11de-b662-001c42000009, queue0519b3d24-ce05-11de-b662-001c42000009, distance=0, backup=false, server=519b3d24-ce05-11de-b662-001c42000009
      INFO: SEND INFO TO QUEUE from 519b3d24-ce05-11de-b662-001c42000009
      INFO: ROUTE 51ae01e0-ce05-11de-b662-001c42000009 BINDING ADDED For queue0519b3d24-ce05-11de-b662-001c42000009 in 519b3d24-ce05-11de-b662-001c42000009
      INFO: CLEAR ALL BINDINGS
      INFO: adding 51ae01e0-ce05-11de-b662-001c42000009 binding for cluster queue0519b3d24-ce05-11de-b662-001c42000009, to 51a687c7-ce05-11de-b662-001c42000009, distance=0, backup=false
      INFO: ADDED BINDING uuid=51b3ce41-ce05-11de-b662-001c42000009, queue0519b3d24-ce05-11de-b662-001c42000009, distance=1, backup=false, server=51a687c7-ce05-11de-b662-001c42000009
      INFO: adding 51acf06f-ce05-11de-b662-001c42000009 binding for cluster queue0519b3d24-ce05-11de-b662-001c42000009, to 51a687c7-ce05-11de-b662-001c42000009, distance=0, backup=false
      ATTENTION: Remote queue binding queue0519b3d24-ce05-11de-b662-001c42000009 has already been bound in the post office. Most likely cause for this is you have a loop in your cluster due to cluster max-hops being too large or you have multiple cluster connections to the same nodes using overlapping addresses

      519b3d24 <-> server #0
      51a687c7 <-> server #1
      notif.51aaa67a <-> the bridge's notification consumer
      queue0519b3d24 <-> the queue

      Here is what happens when the test fails:

      1.1 on server #0, a local binding is added for queue0519b3d24
      -> a notification (51acf06f) is sent
      1.2 on server #1, the bridge creates a notification queue notif.51aaa67a
      -> it sends a management message invoking sendQueueInfoToQueue
      2. server #0 replies to this message by sending a BINDING_ADDED message (51ae01e0)
      3. server #1 clears all its bindings
      4. server #1 receives the BINDING_ADDED message sent from sendQueueInfoToQueue (51ae01e0)
      -> it adds the remote queue binding for queue0519b3d24
      5. server #1 receives the BINDING_ADDED *notification* (51acf06f)
      -> EXCEPTION! The binding has already been added

      There is a notificationLock in PostOfficeImpl which should prevent the messages from
      sendQueueInfoToQueue from being received *before* the notification when the local binding is added.
      However, the messages and the notification do not follow the same code path:
      the notification is routed like a regular message while the messages sent by sendQueueInfoToQueue
      use routeDirect() instead.
      I need to find out whether a message routed directly (i.e. from sendQueueInfoToQueue) can overtake a message routed normally (i.e. the notification) even though routeDirect() is called after route().

        • 1. Re: Race condition found by failure in GroupingFailoverReplicationTest
          jmesnil

          ok, some more findings

          The issue is that messages are routed from separate threads and are not routed in the expected order.

          * in one thread, the BINDING_ADDED notification (202d41cd) is routed normally
          -> it enters PostOffice.processRoute() first
          * in another thread, PostOffice.sendQueueInfoToQueue routes a BINDING_ADDED message (202e051e) directly
          -> it enters PostOffice.processRoute() second

          However, the BINDING_ADDED message (202e051e) finishes processing first and is added to the queue
          before the BINDING_ADDED notification (202d41cd).

          The order is not what we expect...

          But the real issue is the ordering of the CLEAR_BINDINGS message sent by PostOffice.sendQueueInfoToQueue() and the BINDING_ADDED message.

          If the CLEAR_BINDINGS message is received before the BINDING_ADDED notification, the binding will be added twice to the PostOffice -> the exception will occur whether it is the BINDING_ADDED notification or the BINDING_ADDED message which is received first by the consumer.

          The root issue is that the CLEAR_BINDINGS message can be delivered to the consumer first even if the BINDING_ADDED notification is sent first.
          These two messages are processed by separate threads and there is no guarantee that BINDING_ADDED will be routed and added to the queue first even though processRoute() is called for it first.

          Another issue which aggravates the race condition is that the notifications are *durable* messages while the messages sent by PostOffice.sendQueueInfoToQueue() are *not durable* -> the notifications spend even more time being processed by the PostOffice, making it more likely that the messages sent by PostOffice.sendQueueInfoToQueue() are delivered first.
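
          To make the race concrete, here is a minimal standalone sketch (my own model with made-up names, not HornetQ code): the "durable" path enqueues its message only when a simulated replication response arrives, while the "non-durable" path enqueues immediately, so the message routed second typically arrives first.

           import java.util.Queue;
           import java.util.concurrent.ConcurrentLinkedQueue;
           import java.util.concurrent.ExecutorService;
           import java.util.concurrent.Executors;
           import java.util.concurrent.TimeUnit;

           public class RouteOrderRace
           {
              static final Queue<String> queue = new ConcurrentLinkedQueue<String>();
              static final ExecutorService replicationResponses = Executors.newSingleThreadExecutor();

              // routed first, but enqueued only when the fake replication response arrives
              static void routeDurable(final String msg)
              {
                 replicationResponses.submit(new Runnable()
                 {
                    public void run()
                    {
                       sleep(10);      // simulate the round-trip to the backup server
                       queue.add(msg); // the afterReplicated() callback adds the reference
                    }
                 });
              }

              // routed second, but enqueued immediately: nothing to wait for
              static void routeNonDurable(String msg)
              {
                 queue.add(msg);
              }

              public static void main(String[] args) throws Exception
              {
                 routeDurable("BINDING_ADDED notification"); // route() called first
                 routeNonDurable("CLEAR_BINDINGS message");  // routeDirect() called second
                 replicationResponses.shutdown();
                 replicationResponses.awaitTermination(1, TimeUnit.SECONDS);
                 // typically prints [CLEAR_BINDINGS message, BINDING_ADDED notification]
                 System.out.println(queue);
              }

              static void sleep(long ms)
              {
                 try { Thread.sleep(ms); } catch (InterruptedException ignored) { }
              }
           }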

          For the record, here are the logs which show the race condition:

          INFO: ADDED BINDING uuid=20297139-ce17-11de-bb74-001c42000009, notif.20292318-ce17-11de-bb74-001c420000092013ed62-ce17-11de-bb74-001c42000009, distance=0, backup=false, server=2013ed62-ce17-11de-bb74-001c42000009
          INFO: ADDED BINDING uuid=202d41cd-ce17-11de-bb74-001c42000009, queue02013ed62-ce17-11de-bb74-001c42000009, distance=0, backup=false, server=2013ed62-ce17-11de-bb74-001c42000009
          INFO: PROCESS ROUTE for 202d41cd-ce17-11de-bb74-001c42000009 true null
          INFO: SEND INFO 202e051e-ce17-11de-bb74-001c42000009
          INFO: PROCESS ROUTE for 202e051e-ce17-11de-bb74-001c42000009 true null
          INFO: PROCESSING ROUTE TOOK 4 202e051e-ce17-11de-bb74-001c42000009 1257870958719
          INFO: ADDED 202e051e-ce17-11de-bb74-001c42000009 to notif.20292318-ce17-11de-bb74-001c42000009
          INFO: DELIVERING MESSAGE 202e051e-ce17-11de-bb74-001c42000009
          INFO: PROCESSING ROUTE TOOK 6 202d41cd-ce17-11de-bb74-001c42000009 1257870958720
          INFO: ADDED 202d41cd-ce17-11de-bb74-001c42000009 to notif.20292318-ce17-11de-bb74-001c42000009
          INFO: CLEAR BINDINGS
          INFO: DIRECT DELIVERED HANDLED 202e051e-ce17-11de-bb74-001c42000009 org.hornetq.core.server.impl.ServerConsumerImpl@2781b3d4
          INFO: DELIVERING MESSAGE 202d41cd-ce17-11de-bb74-001c42000009
          INFO: DIRECT DELIVERED HANDLED 202d41cd-ce17-11de-bb74-001c42000009 org.hornetq.core.server.impl.ServerConsumerImpl@2781b3d4
          INFO: adding 202e051e-ce17-11de-bb74-001c42000009 binding for cluster queue02013ed62-ce17-11de-bb74-001c42000009, to 202181f5-ce17-11de-bb74-001c42000009, distance=0, backup=false
          INFO: ADDED BINDING uuid=2031ae9f-ce17-11de-bb74-001c42000009, queue02013ed62-ce17-11de-bb74-001c42000009, distance=1, backup=false, server=202181f5-ce17-11de-bb74-001c42000009
          INFO: adding 202d41cd-ce17-11de-bb74-001c42000009 binding for cluster queue02013ed62-ce17-11de-bb74-001c42000009, to 202181f5-ce17-11de-bb74-001c42000009, distance=0, backup=false
          ATTENTION: Remote queue binding queue02013ed62-ce17-11de-bb74-001c42000009 has already been bound in the post office. Most likely cause for this is you have a loop in your cluster due to cluster max-hops being too large or you have multiple cluster connections to the same nodes using overlapping addresses
          



          • 2. Re: Race condition found by failure in GroupingFailoverReplicationTest
            jmesnil

            A simple question so that I can finish this task tomorrow:

            Why are messages sent by PostOffice.sendQueueInfoToQueue routed directly?

            They seem a little bastardized: they look like notifications (except they're not durable), they are handled like notifications, but they are routed directly instead of using the NotificationService.
            Why can't they be sent like regular notifications?

            • 3. Re: Race condition found by failure in GroupingFailoverReplicationTest
              timfox

               

              "jmesnil" wrote:
              A simple question so that I can finish this task tomorrow:

              Why are messages sent by PostOffice.sendQueueInfoToQueue routed directly?

              They seem a little bastardized: they look like notifications (except they're not durable), they are handled like notifications, but they are routed directly instead of using the NotificationService.
              Why can't they be sent like regular notifications?


              Because they must only be sent to the consumer that requested them.

              If you routed them through the post office they would be sent to all queues registered at that address, which would mean they would arrive at consumers on different nodes which had not requested them.
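
              To illustrate the difference, here is a toy sketch (hypothetical code, not HornetQ's API): routing by address fans out to every queue bound to it, while routing directly targets only the one queue that requested the info.

               import java.util.*;
               import java.util.concurrent.ConcurrentLinkedQueue;

               public class FanOutVsDirect
               {
                  // address -> all queues bound to that address (one per interested node)
                  private final Map<String, List<Queue<String>>> bindings = new HashMap<String, List<Queue<String>>>();

                  void bind(String address, Queue<String> queue)
                  {
                     List<Queue<String>> queues = bindings.get(address);
                     if (queues == null)
                     {
                        queues = new ArrayList<Queue<String>>();
                        bindings.put(address, queues);
                     }
                     queues.add(queue);
                  }

                  // normal routing: every binding for the address receives the message
                  void route(String address, String msg)
                  {
                     List<Queue<String>> queues = bindings.get(address);
                     if (queues != null)
                     {
                        for (Queue<String> q : queues)
                        {
                           q.add(msg);
                        }
                     }
                  }

                  // direct routing: only the queue that asked for the info receives it
                  void routeDirect(Queue<String> target, String msg)
                  {
                     target.add(msg);
                  }

                  public static void main(String[] args)
                  {
                     FanOutVsDirect po = new FanOutVsDirect();
                     Queue<String> requester = new ConcurrentLinkedQueue<String>();
                     Queue<String> otherNode = new ConcurrentLinkedQueue<String>();
                     po.bind("notif.address", requester);
                     po.bind("notif.address", otherNode);

                     po.route("notif.address", "BINDING_ADDED notification"); // both queues get it
                     po.routeDirect(requester, "queue info");                 // only the requester gets it

                     System.out.println("requester: " + requester); // [BINDING_ADDED notification, queue info]
                     System.out.println("otherNode: " + otherNode); // [BINDING_ADDED notification]
                  }
               }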

              • 4. Re: Race condition found by failure in GroupingFailoverReplicationTest
                timfox

                 

                "jmesnil" wrote:
                ok, some more findings

                The issue is that messages are routed from separate threads and are not routed in the expected order.

                * in one thread, the BINDING_ADDED notification (202d41cd) is routed normally
                -> it enters PostOffice.processRoute() first
                * in another thread, PostOffice.sendQueueInfoToQueue routes a BINDING_ADDED message (202e051e) directly
                -> it enters PostOffice.processRoute() second


                I'm trying to understand how this can be the case.

                Both managementService.sendNotification() and PostOffice.sendQueueInfoToQueue() require the notification lock before they can execute, so how can they happen concurrently as you suggest?

                I guess I am missing something here...

                • 5. Re: Race condition found by failure in GroupingFailoverReplicationTest
                  jmesnil

                   

                  "timfox" wrote:


                  I'm trying to understand how this can be the case.

                  Both managementService.sendNotification() and PostOffice.sendQueueInfoToQueue() require the notification lock before they can execute, so how can they happen concurrently as you suggest?

                  I guess I am missing something here...


                  Holding the notification lock is not enough to guarantee that the notification will be added to the queue before the sendQueueInfoToQueue messages.

                  The management service will hold the lock, route the notification, add a Runnable via StorageManager.afterReplicated(), and release the lock. The same goes for PostOffice.sendQueueInfoToQueue.

                   storageManager.afterReplicated(new Runnable()
                   {
                      public void run()
                      {
                         addReferences(refs);
                      }
                   });
                  


                  The lock ensures that the notification and the messages are routed in the correct order.
                  However, the messages are effectively added to the queue only later, when the Runnable is called back outside the notification lock.
                  When the test fails, the sendQueueInfoToQueue messages are added to the queue *before* the notification even though the notification was routed first.
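
                  In other words, a minimal sketch of the pattern (hypothetical names standing in for PostOfficeImpl and the StorageManager): the lock strictly orders the route() calls, but each add runs later in a callback, after the lock has been released, so the lock alone cannot order the arrivals.

                   import java.util.concurrent.ExecutorService;
                   import java.util.concurrent.Executors;
                   import java.util.concurrent.TimeUnit;

                   public class LockDoesNotOrderCallbacks
                   {
                      interface StorageManagerLike { void afterReplicated(Runnable action); }

                      private final Object notificationLock = new Object();

                      void routeUnderLock(final String msg, StorageManagerLike storageManager)
                      {
                         synchronized (notificationLock)
                         {
                            // routing work here is strictly ordered by the lock ...
                            storageManager.afterReplicated(new Runnable()
                            {
                               public void run()
                               {
                                  // ... but the add runs later, outside the lock, on another thread
                                  System.out.println("added " + msg + " on thread " + Thread.currentThread().getId());
                               }
                            });
                         } // lock released; nothing orders one callback relative to another
                      }

                      public static void main(String[] args) throws Exception
                      {
                         // two threads standing in for replication responses arriving independently
                         final ExecutorService responses = Executors.newFixedThreadPool(2);
                         StorageManagerLike storageManager = new StorageManagerLike()
                         {
                            public void afterReplicated(Runnable action) { responses.submit(action); }
                         };

                         LockDoesNotOrderCallbacks po = new LockDoesNotOrderCallbacks();
                         po.routeUnderLock("notification", storageManager); // routed first
                         po.routeUnderLock("queue info", storageManager);   // routed second, may be added first
                         responses.shutdown();
                         responses.awaitTermination(1, TimeUnit.SECONDS);
                      }
                   }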

                  I am now tracing the code related to the ReplicationManager to understand why the sendQueueInfoToQueue replication is not completed after the notification replication...



                  • 6. Re: Race condition found by failure in GroupingFailoverReplicationTest
                    timfox

                    Ah, you're talking about the new replication code that Clebert added?

                    • 7. Re: Race condition found by failure in GroupingFailoverReplicationTest
                      jmesnil

                      I am confused by the thread activity related to replication.

                      When GroupingFailoverReplicationTest.testGroupingLocalHandlerFails fails, messages are routed from separate server sessions (one for the management notification consumer, one for the client creating the queue).

                      I added some logging to PostOfficeImpl.processRoute():

                       if (storageManager.isReplicated())
                       {
                          log.info("IN PROCESS ROUTE " + message.getStringProperty("foobar") + " " + Thread.currentThread().getId());
                          storageManager.afterReplicated(new Runnable()
                          {
                             public void run()
                             {
                                log.info("AFTER REPLICATION " + message.getStringProperty("foobar") + " " + Thread.currentThread().getId());
                                addReferences(refs);
                             }
                          });
                       }
                      


                      It logs which thread is executing PostOffice.processRoute() and which thread is executing the runnable.
                      When the test fails, the logs are:

                      INFO: IN PROCESS ROUTE 353320f5-cf9a-11de-b688-001c42000009 82
                      INFO: IN PROCESS ROUTE 35351cc6-cf9a-11de-b688-001c42000009 83
                      INFO: AFTER REPLICATION 35351cc6-cf9a-11de-b688-001c42000009 83
                      INFO: AFTER REPLICATION 353320f5-cf9a-11de-b688-001c42000009 64
                      


                      As you can see, there are two threads calling PostOffice.processRoute() (#82 & #83).
                      Thread #83 finishes replication before thread #82 -> messages are added to the queue out of order. This is the issue causing the test to fail.
                      A second issue is that the thread which executes the runnable added by thread #82 is thread #64, not thread #82!



                      • 8. Re: Race condition found by failure in GroupingFailoverReplicationTest
                        timfox

                        It's not a problem if replication responses are executed on different threads. Don't expect them to be executed on the same thread.

                        But it is a problem if they are executed out of order.

                        • 9. Re: Race condition found by failure in GroupingFailoverReplicationTest
                          jmesnil

                          It's not clear yet, but the issue may be that the replication path is not the same for both types of messages.

                          The messages sent from PostOffice.sendQueueInfoToQueue are not replicated and do not trigger any replication (the bridge notification queue is non-durable). Hence, the "replication" process completes quite fast for them: they have nothing to wait for from the backup server. The thread calling this method does not wait for any replication from the backup server.

                          The notification message is not replicated either, but its executing thread waits for the replication of a REPLICATION_APPEND packet. The notification message will not be added to the queue until that packet has been replicated on the backup server.
                          When the test fails, it is because the replication response for the REPLICATION_APPEND arrives late and the CLEAR_BINDINGS message has already been added to the queue.

                          I need to understand under which conditions the failure occurs, what the REPLICATION_APPEND packet corresponds to, whether it is normal to wait for the replication of this packet before adding the notification message, etc.


                          • 10. Re: Race condition found by failure in GroupingFailoverReplicationTest
                            timfox

                            Hmm, this sounds like a more general problem with the replication code. From what I can gather from your explanation, it seems that durable messages are replicated but non-durable ones aren't?

                            If this is true, then this will cause ordering problems, e.g. if I send the following messages (D = durable, N = non-durable) from a producer:

                            D1 N1 D2 N2 D3 N3 D4 N4

                            Then on the server the non-durable ones will be added to the queue immediately, but the durable ones will only be added when the replication response comes back.

                            So you'll probably end up with the messages in the queue in the following order:

                            N1 N2 N3 N4 D1 D2 D3 D4

                            Which is incorrect and breaks JMS ordering guarantees.

                            The quick fix for this would be to always replicate a packet even for non-durable messages - we can create a dummy replication packet which actually does nothing on the backup server, but this guarantees that all messages are added to the queue in the correct order, since they will only be added when the replication responses return.
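
                            A rough sketch of that idea (my own simplification with hypothetical names, not the actual patch): if every message, durable or not, waits for a response on the same channel to the backup, the responses come back in send order and the adds stay ordered.

                             import java.util.Queue;
                             import java.util.concurrent.ConcurrentLinkedQueue;
                             import java.util.concurrent.ExecutorService;
                             import java.util.concurrent.Executors;
                             import java.util.concurrent.TimeUnit;

                             public class OrderedReplication
                             {
                                static final Queue<String> queue = new ConcurrentLinkedQueue<String>();

                                // single channel to the backup: responses return in the order packets were sent
                                static final ExecutorService channelToBackup = Executors.newSingleThreadExecutor();

                                static void route(final String msg, boolean durable)
                                {
                                   // durable -> real REPLICATION_APPEND; non-durable -> dummy no-op packet.
                                   // Either way, the add only happens when the in-order response comes back.
                                   final String packet = durable ? "REPLICATION_APPEND" : "DUMMY";
                                   channelToBackup.submit(new Runnable()
                                   {
                                      public void run()
                                      {
                                         // the backup does nothing for a DUMMY packet; the response alone orders the add
                                         queue.add(msg + " (via " + packet + ")");
                                      }
                                   });
                                }

                                public static void main(String[] args) throws Exception
                                {
                                   route("D1", true);
                                   route("N1", false);
                                   route("D2", true);
                                   route("N2", false);
                                   channelToBackup.shutdown();
                                   channelToBackup.awaitTermination(1, TimeUnit.SECONDS);
                                   System.out.println(queue); // send order preserved: D1, N1, D2, N2
                                }
                             }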

                            • 11. Re: Race condition found by failure in GroupingFailoverReplicationTest
                              clebert.suconic

                              Yes... it sounds like a bug:


                              I have created a JIRA: https://jira.jboss.org/jira/browse/HORNETQ-218

                              • 12. Re: Race condition found by failure in GroupingFailoverReplicationTest
                                clebert.suconic

                                 

                                I need to understand under which conditions the failure occurs, what the REPLICATION_APPEND packet corresponds to, whether it is normal to wait for the replication of this packet before adding the notification message, etc.



                                @Jeff: Just to answer your question, REPLICATION_APPEND is sent by the ReplicatedJournal.

                                Any time you call StorageManager.append, the REPLICATION_APPEND packet will be sent to the ReplicationEndpoint on the backup node.
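
                                For what it's worth, a simplified sketch of that mechanism (hypothetical interfaces, not the real ReplicatedJournal): every append is mirrored to the backup as a REPLICATION_APPEND-style packet in addition to the local write.

                                 public class ReplicatedJournalSketch
                                 {
                                    interface Journal { void appendAddRecord(long id, byte[] record); }
                                    interface Channel { void send(String packetType, long id, byte[] record); }

                                    private final Journal localJournal;
                                    private final Channel channelToBackup;

                                    ReplicatedJournalSketch(Journal localJournal, Channel channelToBackup)
                                    {
                                       this.localJournal = localJournal;
                                       this.channelToBackup = channelToBackup;
                                    }

                                    public void appendAddRecord(long id, byte[] record)
                                    {
                                       // every append is mirrored to the backup's ReplicationEndpoint ...
                                       channelToBackup.send("REPLICATION_APPEND", id, record);
                                       // ... and written to the live node's own journal
                                       localJournal.appendAddRecord(id, record);
                                    }

                                    public static void main(String[] args)
                                    {
                                       ReplicatedJournalSketch journal = new ReplicatedJournalSketch(
                                          new Journal() { public void appendAddRecord(long id, byte[] r) { System.out.println("local append " + id); } },
                                          new Channel() { public void send(String type, long id, byte[] r) { System.out.println(type + " " + id + " -> backup"); } });
                                       journal.appendAddRecord(1, new byte[0]);
                                    }
                                 }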

                                • 13. Re: Race condition found by failure in GroupingFailoverReplicationTest
                                  clebert.suconic

                                  I already have a fix, but I couldn't commit it today. If you guys are interested in it, I have attached a patch to the JIRA.

                                  • 14. Re: Race condition found by failure in GroupingFailoverReplicationTest
                                    clebert.suconic

                                    I got a 100% testsuite pass with the patch I attached to https://jira.jboss.org/jira/browse/HORNETQ-218


                                    However, GroupingFailoverReplicationTest takes 93 seconds per test. It always passes for me, but it's taking a huge amount of time to close the connection and to stop the server (the executor is taking forever to shut down). However, it looks like this was already happening without my changes.

