1 Reply Latest reply on Mar 16, 2011 8:59 AM by jeroen.v

Cluster connection and bridge not detecting stopped node

jeroen.v Feb 24, 2011 10:19 AM

Hi,

We currently have a HornetQ cluster set up with 2 nodes. They autodiscover each other using multicast. Once discovered, the HornetQ nodes correctly connect to each other using a bridge. For all of this we more or less use the standard config.

Several clients are connected to this cluster, and they all use multicast discovery. Some of the clients use the JCA connector, but I don't think this is relevant. The clients send a request on one queue and expect a response on another queue.

When one HornetQ node goes down, the clients seem to recover correctly and connect to another node in the cluster. In the meantime they might lose some messages, but this is not really important for us. Once the situation is stable again on the client side, we see that exactly half of the messages are not arriving. After some digging in the log files I traced it back to the cluster connection. It seems that the node list of the cluster connection is never cleaned up, so load balancing continues across both nodes, including the stopped one. With 2 nodes in the cluster this explains why exactly half of the messages go missing.
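To illustrate the "exactly half" reasoning, here is a toy model of round-robin distribution over a node list that still contains a stopped node. This is only a sketch of the arithmetic, not HornetQ's actual load-balancing code; the class name, method name, and node names are made up for the example.

```java
import java.util.Arrays;
import java.util.List;

// Toy model only: NOT HornetQ code. It shows why a stale entry in a
// two-node round-robin list makes every second message go to the dead node.
final class StaleRoundRobinDemo {

    /** Counts how many of `total` messages are routed to the node that is down. */
    static int countRoutedToDeadNode(List<String> nodeList, String deadNode, int total) {
        int lost = 0;
        for (int i = 0; i < total; i++) {
            // Plain round-robin: message i goes to entry i mod list size.
            String target = nodeList.get(i % nodeList.size());
            if (target.equals(deadNode)) {
                lost++;
            }
        }
        return lost;
    }

    public static void main(String[] args) {
        // Stale list: node-B has stopped but was never removed.
        List<String> staleList = Arrays.asList("node-A", "node-B");
        System.out.println(countRoutedToDeadNode(staleList, "node-B", 1000)); // prints 500

        // Cleaned-up list: only the surviving node remains.
        List<String> cleanedList = Arrays.asList("node-A");
        System.out.println(countRoutedToDeadNode(cleanedList, "node-B", 1000)); // prints 0
    }
}
```

With the stale two-entry list, 500 of 1000 messages are routed to the stopped node, which matches the 50% we observe. Once the list only contains the surviving node, nothing is misrouted.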

I can correct this situation manually by using JMX to stop the cluster connection. After this, no messages are lost anymore. Even when I start the cluster connection again, it loses no messages.

My question now is: how can I configure the cluster connection so that this situation does not happen anymore?

I thought that I could change some parameters on the bridge that is created, but I'm not able to configure this since the bridge is created automatically for me. I would like to override the parameter reconnect-attempts, because by default it is -1. Maybe this is the reason the bridge never stops?
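For reference, the manual workaround can be scripted roughly like this with the standard javax.management API. This is a sketch under assumptions: the ObjectName pattern (org.hornetq:module=Core,type=ClusterConnection,name="...") is what I believe the default HornetQ JMX domain uses, and the cluster connection name and JMX service URL are placeholders, so check all three against your own server (for example in JConsole) before relying on it.

```java
import javax.management.MBeanServerConnection;
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Sketch only: stops and restarts a cluster connection through its MBean.
final class ClusterConnectionJmx {

    /**
     * Builds the MBean name for a cluster connection.
     * ASSUMPTION: the domain and key layout below match your HornetQ setup.
     */
    static ObjectName objectNameFor(String clusterConnectionName) {
        try {
            return new ObjectName("org.hornetq:module=Core,type=ClusterConnection,name="
                    + ObjectName.quote(clusterConnectionName));
        } catch (MalformedObjectNameException e) {
            throw new IllegalArgumentException(e);
        }
    }

    /** Invokes the no-argument "stop" and then "start" operations on the MBean. */
    static void restart(String jmxServiceUrl, String clusterConnectionName) throws Exception {
        JMXConnector connector = JMXConnectorFactory.connect(new JMXServiceURL(jmxServiceUrl));
        try {
            MBeanServerConnection server = connector.getMBeanServerConnection();
            ObjectName name = objectNameFor(clusterConnectionName);
            server.invoke(name, "stop", new Object[0], new String[0]);
            server.invoke(name, "start", new Object[0], new String[0]);
        } finally {
            connector.close();
        }
    }
}
```

A call would look something like ClusterConnectionJmx.restart("service:jmx:rmi:///jndi/rmi://localhost:3000/jmxrmi", "my-cluster"), where both the URL and the name "my-cluster" are hypothetical values.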

I also do not understand why the cluster connection is not stopped entirely since there are no more nodes available. HornetQ should see this via the discovery, no?

Extra info: HornetQ 2.1.2-FINAL, Hotspot JRE 1.6.0_23

Thanks for your help, Jeroen

1. Cluster connection and bridge not detecting stopped node

jeroen.v Mar 16, 2011 8:59 AM (in response to jeroen.v)

We found a solution to this issue. Apparently it only occurs when a node crashes and there is just one node left in the cluster. If you set up a cluster with 3 or more nodes and make sure that at least 2 nodes survive a crash, you will not face this issue.
I guess this is because of how the software is built. Maybe someone can elaborate on how this works exactly?
I would be interested ...

Jeroen