Race condition found by failure in GroupingFailoverReplicati
jmesnil Nov 10, 2009 9:59 AMThis is related to intermittent failure on GroupingFailoverReplicationTest.testGroupingLocalHandlerFails.
The test sometimes fail because the remote queue binding is added twice to the same server.
Here are some logs I added to understand what's going on:
INFO: ADDED BINDING uuid=51ab42be-ce05-11de-b662-001c42000009, notif.51aaa67a-ce05-11de-b662-001c42000009519b3d24-ce05-11de-b662-001c42000009, distance=0, backup=false, server=519b3d24-ce05-11de-b662-001c42000009
INFO: ADDED BINDING uuid=51acf06f-ce05-11de-b662-001c42000009, queue0519b3d24-ce05-11de-b662-001c42000009, distance=0, backup=false, server=519b3d24-ce05-11de-b662-001c42000009
INFO: SEND INFO TO QUEUE from 519b3d24-ce05-11de-b662-001c42000009
INFO: ROUTE 51ae01e0-ce05-11de-b662-001c42000009 BINDING ADDED For queue0519b3d24-ce05-11de-b662-001c42000009 in 519b3d24-ce05-11de-b662-001c42000009
INFO: CLEAR ALL BINDINGS
INFO: adding 51ae01e0-ce05-11de-b662-001c42000009 binding for cluster queue0519b3d24-ce05-11de-b662-001c42000009, to 51a687c7-ce05-11de-b662-001c42000009, distance=0, backup=false
INFO: ADDED BINDING uuid=51b3ce41-ce05-11de-b662-001c42000009, queue0519b3d24-ce05-11de-b662-001c42000009, distance=1, backup=false, server=51a687c7-ce05-11de-b662-001c42000009
INFO: adding 51acf06f-ce05-11de-b662-001c42000009 binding for cluster queue0519b3d24-ce05-11de-b662-001c42000009, to 51a687c7-ce05-11de-b662-001c42000009, distance=0, backup=false
ATTENTION: Remote queue binding queue0519b3d24-ce05-11de-b662-001c42000009 has already been bound in the post office. Most likely cause for this is you have a loop in your cluster due to cluster max-hops being too large or you have multiple cluster connections to the same nodes using overlapping addresses
519b3d24 <-> server #0
51a687c7 <-> server #1
notif.51aaa67a <-> the bridge's notification consumer
queue0519b3d24 <-> the queue
Here is what happens when the test fails:
1.1 on server #0, local binding is added for queue0519b3d24
-> sent notification 51acf06f
1.2 on server #1, the bridge creates a notif queue notif.51aaa67a
-> it sends a management message "sendInfoQueue"
2. server #0 replies to the "sendInfoQueue" by sending a BINDING_ADDED message (51ae01e0)
3. server #1 clear all its bindings
3. server #1 receive the BINDING_ADDED message sent from sendInfoToQueue (51ae01e0)
-> it will add the remote queue binding for queue0519b3d24
4. server #1 receive the BINDING_ADDED *notification* (51acf06f)
-> EXCEPTION! binding has already been added
There is a notificationLock in PostOfficeImpl which should prevent to receive the messages
from "sendInfoToQueue" *before* the notification when the local binding is added.
However, the messages and the notification do not walk the same code path.
Notification will be routed like a regular messages while messages sent by sendInfoQueue
will use routeDirect() instead.
I need to find out if it is possible that message routed directly (i.e. from sendInfoQueue ) will go before the message routed normally (i.e. the notifications) even if the routeDirect() is called after route().