
    Cluster never attempts core-bridge and many messages are never delivered

    david.berkman

      I'm using what I think is a simple enough static setup (we operate in the EC2 cloud, so no discovery before 2.3.0-Alpha), but I see no attempt to create the core bridges in the logs, and I subsequently get many messages that simply fail to arrive at their queue listeners as well, with no errors reported at all. I have also tried using the jgroups-based cluster under 2.3.0-Alpha, with the same results. No attempt to create core bridges, no warnings or errors, and many messages fail to be delivered (once again with no errors). With a single queue everything works fine.

       

      I would like self-discovery (JGroups under 2.3.0-Alpha) but would settle for the static setup working in 2.2.14 or 2.3.0-Alpha. If someone can tell me what log level I need to be at, and what I should be looking for to help debug this, I would be grateful. If anyone can point me to a config that works, I would be ecstatic.

       

      Here is my current config (please leave it out of any replies unless you are marking corrections)...

       

      hornetq-jms.xml (same for all 3 servers)

      -------------------------------------------------------------------------

      <configuration xmlns="urn:hornetq" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:hornetq /schema/hornetq-jms.xsd">

        <!--the connection factory used by the example-->

        <connection-factory name="ConnectionFactory">

          <connectors>

            <connector-ref connector-name="netty-connector"/>

          </connectors>

          <entries>

            <entry name="ConnectionFactory"/>

          </entries>

          <connection-load-balancing-policy-class-name>

            com.glu.epoxy.substrate.jms.hornetq.RoundRobinConnectionLoadBalancingPolicy

          </connection-load-balancing-policy-class-name>

       

          <ha>true</ha>

          <!-- Pause 1 second between connect attempts -->

          <retry-interval>1000</retry-interval>

       

          <!-- Multiply subsequent reconnect pauses by this multiplier. This can be used to

               implement an exponential back-off. For our purposes we just set to 1.0 so each reconnect

               pause is the same length -->

          <retry-interval-multiplier>1.0</retry-interval-multiplier>

       

          <!-- Try reconnecting an unlimited number of times (-1 means "unlimited") -->

          <reconnect-attempts>-1</reconnect-attempts>

       

          <!-- Fail on no data reception, server - 30 seconds, client - 10 seconds -->

          <connection-ttl>30000</connection-ttl>

          <client-failure-check-period>10000</client-failure-check-period>

       

          <!-- Do not buffer messages on a consumer -->

          <consumer-window-size>0</consumer-window-size>

        </connection-factory>

       

        <!--the queue used by the example-->

        <queue name="volatile.epoxy.game.protocol.requestQueue">

          <entry name="/queue/volatile/epoxy/game/protocol/requestQueue"/>

          <durable>false</durable>

        </queue>

        <topic name="volatile.epoxy.game.protocol.responseTopic">

          <entry name="/topic/volatile/epoxy/game/protocol/responseTopic"/>

        </topic>

       

      </configuration>

       

       

      hornetq-configuration.xml for server 0

      -------------------------------------------------------------------------

      <configuration xmlns="urn:hornetq" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:hornetq /schema/hornetq-configuration.xsd">

       

        <clustered>true</clustered>

        <cluster-password>P@ssw0rd!</cluster-password>

        <bindings-directory>/usr/local/hornetqdata/bindings</bindings-directory>

        <journal-directory>/usr/local/hornetqdata/journal</journal-directory>

        <large-messages-directory>/usr/local/hornetqdata/largemessages</large-messages-directory>

        <paging-directory>/usr/local/hornetqdata/paging</paging-directory>

        <journal-type>ASYNCIO</journal-type>

       

        <!-- Management Settings -->

        <jmx-management-enabled>true</jmx-management-enabled>

        <message-counter-enabled>true</message-counter-enabled>

        <message-counter-max-day-history>7</message-counter-max-day-history>

        <message-counter-sample-period>10000</message-counter-sample-period>

       

        <!-- Connectors -->

        <connectors>

          <connector name="netty-connector">

            <factory-class>org.hornetq.core.remoting.impl.netty.NettyConnectorFactory</factory-class>

            <param key="host" value="g2esb-1.aws.glu.com"/>

            <param key="port" value="5445"/>

          </connector>

          <connector name="g2esb-2">

            <factory-class>org.hornetq.core.remoting.impl.netty.NettyConnectorFactory</factory-class>

            <param key="host" value="g2esb-2.aws.glu.com"/>

            <param key="port" value="5445"/>

          </connector>

            <connector name="g2esb-3">

            <factory-class>org.hornetq.core.remoting.impl.netty.NettyConnectorFactory</factory-class>

            <param key="host" value="g2esb-3.aws.glu.com"/>

            <param key="port" value="5445"/>

          </connector>

        </connectors>

       

        <!-- Acceptors -->

        <acceptors>

          <acceptor name="netty-acceptor">

            <factory-class>org.hornetq.core.remoting.impl.netty.NettyAcceptorFactory</factory-class>

            <param key="host" value="0.0.0.0"/>

            <param key="port" value="5445"/>

          </acceptor>

        </acceptors>

       

        <cluster-connections>

          <cluster-connection name="hornetq-cluster">

            <address>jms</address>

            <connector-ref>netty-connector</connector-ref>

            <retry-interval>500</retry-interval>

            <use-duplicate-detection>true</use-duplicate-detection>

            <forward-when-no-consumers>false</forward-when-no-consumers>

            <max-hops>1</max-hops>

            <static-connectors>

              <connector-ref>g2esb-2</connector-ref>

              <connector-ref>g2esb-3</connector-ref>

            </static-connectors>

          </cluster-connection>

        </cluster-connections>

       

        <!-- Other config -->

        <security-settings>

          <!--security for example queue-->

          <security-setting match="jms.#">

            <permission type="createDurableQueue" roles="admin"/>

            <permission type="deleteDurableQueue" roles="admin"/>

            <permission type="createNonDurableQueue" roles="client"/>

            <permission type="deleteNonDurableQueue" roles="client"/>

            <permission type="consume" roles="client"/>

            <permission type="send" roles="client"/>

          </security-setting>

        </security-settings>

       

        <address-settings>

          <address-setting match="jms.#">

            <last-value-queue>false</last-value-queue>

            <!-- 100m -->

            <max-size-bytes>104857600</max-size-bytes>

            <!-- 10m -->

            <page-size-bytes>10485760</page-size-bytes>

            <redistribution-delay>300</redistribution-delay>

            <send-to-dla-on-no-route>false</send-to-dla-on-no-route>

            <address-full-policy>PAGE</address-full-policy>

          </address-setting>

        </address-settings>

       

      </configuration>

       

      hornetq-configuration.xml for server 1

      -------------------------------------------------------------------------

      <configuration xmlns="urn:hornetq" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:hornetq /schema/hornetq-configuration.xsd">

       

        <clustered>true</clustered>

        <cluster-password>P@ssw0rd!</cluster-password>

        <bindings-directory>/usr/local/hornetqdata/bindings</bindings-directory>

        <journal-directory>/usr/local/hornetqdata/journal</journal-directory>

        <large-messages-directory>/usr/local/hornetqdata/largemessages</large-messages-directory>

        <paging-directory>/usr/local/hornetqdata/paging</paging-directory>

        <journal-type>ASYNCIO</journal-type>

       

        <!-- Management Settings -->

        <jmx-management-enabled>true</jmx-management-enabled>

        <message-counter-enabled>true</message-counter-enabled>

        <message-counter-max-day-history>7</message-counter-max-day-history>

        <message-counter-sample-period>10000</message-counter-sample-period>

       

        <!-- Connectors -->

        <connectors>

          <connector name="g2esb-1">

            <factory-class>org.hornetq.core.remoting.impl.netty.NettyConnectorFactory</factory-class>

            <param key="host" value="g2esb-1.aws.glu.com"/>

            <param key="port" value="5445"/>

          </connector>

          <connector name="netty-connector">

            <factory-class>org.hornetq.core.remoting.impl.netty.NettyConnectorFactory</factory-class>

            <param key="host" value="g2esb-2.aws.glu.com"/>

            <param key="port" value="5445"/>

          </connector>

            <connector name="g2esb-3">

            <factory-class>org.hornetq.core.remoting.impl.netty.NettyConnectorFactory</factory-class>

            <param key="host" value="g2esb-3.aws.glu.com"/>

            <param key="port" value="5445"/>

          </connector>

        </connectors>

       

        <!-- Acceptors -->

        <acceptors>

          <acceptor name="netty-acceptor">

            <factory-class>org.hornetq.core.remoting.impl.netty.NettyAcceptorFactory</factory-class>

            <param key="host" value="0.0.0.0"/>

            <param key="port" value="5445"/>

          </acceptor>

        </acceptors>

       

        <cluster-connections>

          <cluster-connection name="hornetq-cluster">

            <address>jms</address>

            <connector-ref>netty-connector</connector-ref>

            <retry-interval>500</retry-interval>

            <use-duplicate-detection>true</use-duplicate-detection>

            <forward-when-no-consumers>false</forward-when-no-consumers>

            <max-hops>1</max-hops>

            <static-connectors>

              <connector-ref>g2esb-1</connector-ref>

              <connector-ref>g2esb-3</connector-ref>

            </static-connectors>

          </cluster-connection>

        </cluster-connections>

       

        <!-- Other config -->

        <security-settings>

          <!--security for example queue-->

          <security-setting match="jms.#">

            <permission type="createDurableQueue" roles="admin"/>

            <permission type="deleteDurableQueue" roles="admin"/>

            <permission type="createNonDurableQueue" roles="client"/>

            <permission type="deleteNonDurableQueue" roles="client"/>

            <permission type="consume" roles="client"/>

            <permission type="send" roles="client"/>

          </security-setting>

        </security-settings>

       

        <address-settings>

          <address-setting match="jms.#">

            <last-value-queue>false</last-value-queue>

            <!-- 100m -->

            <max-size-bytes>104857600</max-size-bytes>

            <!-- 10m -->

            <page-size-bytes>10485760</page-size-bytes>

            <redistribution-delay>300</redistribution-delay>

            <send-to-dla-on-no-route>false</send-to-dla-on-no-route>

            <address-full-policy>PAGE</address-full-policy>

          </address-setting>

        </address-settings>

       

      </configuration>

       

      hornetq-configuration.xml for server 2

      -------------------------------------------------------------------------

      <configuration xmlns="urn:hornetq" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:hornetq /schema/hornetq-configuration.xsd">

       

        <clustered>true</clustered>

        <cluster-password>P@ssw0rd!</cluster-password>

        <bindings-directory>/usr/local/hornetqdata/bindings</bindings-directory>

        <journal-directory>/usr/local/hornetqdata/journal</journal-directory>

        <large-messages-directory>/usr/local/hornetqdata/largemessages</large-messages-directory>

        <paging-directory>/usr/local/hornetqdata/paging</paging-directory>

        <journal-type>ASYNCIO</journal-type>

       

        <!-- Management Settings -->

        <jmx-management-enabled>true</jmx-management-enabled>

        <message-counter-enabled>true</message-counter-enabled>

        <message-counter-max-day-history>7</message-counter-max-day-history>

        <message-counter-sample-period>10000</message-counter-sample-period>

       

        <!-- Connectors -->

        <connectors>

          <connector name="g2esb-1">

            <factory-class>org.hornetq.core.remoting.impl.netty.NettyConnectorFactory</factory-class>

            <param key="host" value="g2esb-1.aws.glu.com"/>

            <param key="port" value="5445"/>

          </connector>

          <connector name="g2esb-2">

            <factory-class>org.hornetq.core.remoting.impl.netty.NettyConnectorFactory</factory-class>

            <param key="host" value="g2esb-2.aws.glu.com"/>

            <param key="port" value="5445"/>

          </connector>

            <connector name="netty-connector">

            <factory-class>org.hornetq.core.remoting.impl.netty.NettyConnectorFactory</factory-class>

            <param key="host" value="g2esb-3.aws.glu.com"/>

            <param key="port" value="5445"/>

          </connector>

        </connectors>

       

        <!-- Acceptors -->

        <acceptors>

          <acceptor name="netty-acceptor">

            <factory-class>org.hornetq.core.remoting.impl.netty.NettyAcceptorFactory</factory-class>

            <param key="host" value="0.0.0.0"/>

            <param key="port" value="5445"/>

          </acceptor>

        </acceptors>

       

        <cluster-connections>

          <cluster-connection name="hornetq-cluster">

            <address>jms</address>

            <connector-ref>netty-connector</connector-ref>

            <retry-interval>500</retry-interval>

            <use-duplicate-detection>true</use-duplicate-detection>

            <forward-when-no-consumers>false</forward-when-no-consumers>

            <max-hops>1</max-hops>

            <static-connectors>

              <connector-ref>g2esb-1</connector-ref>

              <connector-ref>g2esb-2</connector-ref>

            </static-connectors>

          </cluster-connection>

        </cluster-connections>

       

        <!-- Other config -->

        <security-settings>

          <!--security for example queue-->

          <security-setting match="jms.#">

            <permission type="createDurableQueue" roles="admin"/>

            <permission type="deleteDurableQueue" roles="admin"/>

            <permission type="createNonDurableQueue" roles="client"/>

            <permission type="deleteNonDurableQueue" roles="client"/>

            <permission type="consume" roles="client"/>

            <permission type="send" roles="client"/>

          </security-setting>

        </security-settings>

       

        <address-settings>

          <address-setting match="jms.#">

            <last-value-queue>false</last-value-queue>

            <!-- 100m -->

            <max-size-bytes>104857600</max-size-bytes>

            <!-- 10m -->

            <page-size-bytes>10485760</page-size-bytes>

            <redistribution-delay>300</redistribution-delay>

            <send-to-dla-on-no-route>false</send-to-dla-on-no-route>

            <address-full-policy>PAGE</address-full-policy>

          </address-setting>

        </address-settings>

       

      </configuration>

       

      Thank you for any help,

      David

        • 1. Re: Cluster never attempts core-bridge and many messages are never delivered
          clebert.suconic

          EC2 doesn't support UDP multicast; are you using the file-based discovery from JGroups?

           

          With 2.2 you would have to set up a static cluster and set reconnect-attempts to -1 (which is the default). With 2.3 you have to make sure you don't use UDP multicast on EC2 and instead use the file-based mechanism from JGroups.  I can provide more details tomorrow.
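
          In the meantime, for reference, a file-based JGroups stack on EC2 looks roughly like the following. This is a sketch, not a drop-in config: the shared directory, port and timeouts are illustrative, and the broadcast/discovery groups in hornetq-configuration.xml then point at this stack file and a channel name.

          <!-- jgroups-file-ping.xml (illustrative): TCP transport plus FILE_PING
               discovery through a directory shared by all nodes, no UDP multicast -->
          <config xmlns="urn:org:jgroups">
            <TCP bind_port="7800"/>
            <!-- each node writes its address under this shared location (NFS or similar) -->
            <FILE_PING location="/usr/local/hornetqdata/jgroups-discovery"/>
            <MERGE2 min_interval="10000" max_interval="30000"/>
            <FD_SOCK/>
            <FD timeout="3000" max_tries="3"/>
            <VERIFY_SUSPECT timeout="1500"/>
            <pbcast.NAKACK use_mcast_xmit="false"/>
            <UNICAST2/>
            <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000" max_bytes="400000"/>
            <pbcast.GMS print_local_addr="true" join_timeout="3000"/>
            <FRAG2 frag_size="60000"/>
          </config>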

          • 2. Re: Cluster never attempts core-bridge and many messages are never delivered
            david.berkman

            If you look at the included configuration you will see that we're using a static cluster, and that we have reconnect attempts set to -1. With 2.3, to repeat myself, we also tried JGroups (having read the documentation for both HornetQ and the Amazon cloud, we never expected a multicast-dependent UDP setup to work). JGroups did not work either, but that's a more complex setup; once again, we'd be happy just getting the static cluster working and viable. If you are proposing to provide details that are already in the documentation, please don't bother. We have the documentation and we've read it. From what I can gather, our configuration should be fine. I fully believe we've done something wrong, but I need to know what it is, or the specific section of the documentation we've contravened. A simple rehash of the docs will not lead me to greater enlightenment. Telling us how to set our debug level, and what to look for, would be helpful as well.

             

            Sorry to be so direct, but talking about the problems with UDP just indicates that you did not bother reading the description of our problem, or our configuration.

            • 3. Re: Cluster never attempts core-bridge and many messages are never delivered
              clebert.suconic

              It was Sunday night and I was reading it on an iPad... of course I didn't read the whole thing. I was just trying to identify some simple scenarios.

               

               

              A question about your post:

               

              I'm using what I think is a simple enough static setup (we operate in the EC2 cloud, so no discovery before 2.3.0-Alpha), but I see no attempt to create the core bridges in the logs, and I subsequently get many messages that simply fail to arrive at their queue listeners as well, with no errors reported at all. I have also tried using the jgroups-based cluster under 2.3.0-Alpha, with the same results. No attempt to create core bridges, no warnings or errors, and many messages fail to be delivered (once again with no errors). With a single queue everything works fine.

               

              Did you mean with a single server, or literally that with a single queue everything works fine?

               

              To be honest, I'm at a loss as to what's going on... this should work fine.

               

              One thing you could try is setting the acceptor to a real IP instead of 0.0.0.0. There's some code in the topology that distributes addresses, and I'm not sure whether that makes any difference here, but it's worth ruling out.
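
              For example, on server 0 that would be something along these lines (mirroring your existing acceptor, just with the node's own hostname instead of the wildcard):

              <acceptor name="netty-acceptor">
                <factory-class>org.hornetq.core.remoting.impl.netty.NettyAcceptorFactory</factory-class>
                <!-- bind to the node's real address rather than all interfaces -->
                <param key="host" value="g2esb-1.aws.glu.com"/>
                <param key="port" value="5445"/>
              </acceptor>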

               

              You could set org.hornetq.core.client.impl.Topology to TRACE and verify how the servers are identifying each other. I could also take a look at the logs if you upload them.
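
              For example, assuming the stock logging.properties (java.util.logging) shipped with the standalone server, something along these lines should do it; TRACE maps to FINEST there. If you have switched to log4j, the equivalent would be log4j.logger.org.hornetq.core.client.impl.Topology=TRACE.

              # illustrative additions to the standalone logging.properties
              org.hornetq.core.client.impl.Topology.level=FINEST
              # the handlers also need to let the finer level through
              java.util.logging.ConsoleHandler.level=FINEST
              java.util.logging.FileHandler.level=FINEST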

               

               

               

              I need some more data to analyze what's going on. The logs would probably help you as well; try setting the IP as I described. If you can't figure out the logs, attach them here, and if you do figure it out, let me know how you fixed it for future reference.

              • 4. Re: Cluster never attempts core-bridge and many messages are never delivered
                david.berkman

                Thank you, that was reasoned, insightful and helpful.

                 

                I meant that with a single server everything works well, and all messages are delivered to the listeners. We happen to be running a single queue and a single topic on that server, but I suspect that, within reason, we could run multiple queues and topics without seeing any erratic behavior. When we switch to a cluster, it all goes awry.
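
                To make the failure concrete, a minimal cross-node check would look roughly like the sketch below (core JMS client, the queue name from our config; hostnames and counts are illustrative, and this is not our actual test code). Send via node 1, consume via node 2, and count what arrives; with a healthy cluster all of the messages should show up.

                import java.util.HashMap;
                import java.util.Map;

                import javax.jms.*;

                import org.hornetq.api.core.TransportConfiguration;
                import org.hornetq.api.jms.HornetQJMSClient;
                import org.hornetq.api.jms.JMSFactoryType;
                import org.hornetq.core.remoting.impl.netty.NettyConnectorFactory;

                // Rough sketch: send 100 messages via g2esb-1 and consume via g2esb-2.
                public class CrossNodeCheck {

                  private static ConnectionFactory factoryFor(String host) {
                    Map<String, Object> params = new HashMap<String, Object>();
                    params.put("host", host);
                    params.put("port", 5445);
                    TransportConfiguration tc =
                        new TransportConfiguration(NettyConnectorFactory.class.getName(), params);
                    return HornetQJMSClient.createConnectionFactoryWithoutHA(JMSFactoryType.CF, tc);
                  }

                  public static void main(String[] args) throws Exception {
                    Queue queue = HornetQJMSClient.createQueue("volatile.epoxy.game.protocol.requestQueue");

                    // attach the consumer to node 2 first, so the cluster knows where the demand is
                    Connection consumerConn = factoryFor("g2esb-2.aws.glu.com").createConnection();
                    Session consumerSession = consumerConn.createSession(false, Session.AUTO_ACKNOWLEDGE);
                    MessageConsumer consumer = consumerSession.createConsumer(queue);
                    consumerConn.start();

                    // send through node 1
                    Connection producerConn = factoryFor("g2esb-1.aws.glu.com").createConnection();
                    Session producerSession = producerConn.createSession(false, Session.AUTO_ACKNOWLEDGE);
                    MessageProducer producer = producerSession.createProducer(queue);
                    for (int i = 0; i < 100; i++) {
                      producer.send(producerSession.createTextMessage("msg-" + i));
                    }

                    // drain whatever makes it across and report the count
                    int received = 0;
                    while (consumer.receive(5000) != null) {
                      received++;
                    }
                    System.out.println("received " + received + " of 100");

                    producerConn.close();
                    consumerConn.close();
                  }
                }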

                 

                We are also at a loss, which is why I posted here. It looks to me like it should work fine as well. We have a development environment with a two-machine cluster using very similar config, and I can clearly see the cluster bridges being formed in the info logging. I can shut down one machine, see the bridge dropped on the other, then restart that machine and see both servers reform their bridges. I see none of that on the production setup. All ports are open between these machines, for both TCP and UDP, or so I am told. Worse, I upgraded development to 2.3.0-Alpha and matched its config with production to see what would happen, and stopped seeing the core bridges in the info log. I kept 2.3.0-Alpha but switched back to the old config... and still don't see the core bridges. Maybe I should swap back to 2.2.14, but something else must be going on. I get no error messages.

                 

                I'll switch the Topology to trace as you say, and I'll capture the output with both real IPs and 0.0.0.0 and post it back.

                 

                Thank you

                • 5. Re: Cluster never attempts core-bridge and many messages are never delivered
                  jbertram

                  You didn't happen to copy one server to make the other, did you?  I've seen clusters fail to form like this when one server (including its journal) is copied to create another server instance, since the journal contains a special, random ID which should be unique within the cluster.

                  • 6. Re: Cluster never attempts core-bridge and many messages are never delivered
                    david.berkman

                    Me, no... my ops team, yes, they did. As soon as I cleared out the old journal and forced reconstruction, the core bridges came right up.

                     

                    Thank you, Clebert, and a big gold star goes to Justin. I struggled with this for hours and days; so happy now.

                     

                     

                     

                    Thank you all. Problem solved. Do *not* template the journal directory when constructing new servers.
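
                    Concretely, for our layout that means a newly provisioned node should start with these data directories empty (at minimum the journal, where the node ID lives; we now keep all four out of the machine template) rather than copied from an existing server:

                    /usr/local/hornetqdata/bindings
                    /usr/local/hornetqdata/journal
                    /usr/local/hornetqdata/largemessages
                    /usr/local/hornetqdata/paging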

                    • 7. Re: Cluster never attempts core-bridge and many messages are never delivered
                      clebert.suconic

                      ++Justin

                       

                       

                       

                      We did that to avoid multiple servers having the same ID, so each server has its own journal. But if the user copies the data, this can be an issue.

                       

                       

                      It would be great to identify a way to detect this and throw a big warning in the logs.

                      • 8. Re: Cluster never attempts core-bridge and many messages are never delivered
                        jbertram

                        Could we have some kind of sanity check where the broadcast group sends out its ID and each node compares it to its own?

                        • 9. Re: Cluster never attempts core-bridge and many messages are never delivered
                          david.berkman

                          Put a big-ass warning in the docs.

                          Better yet, don't put the unique ID in the journal file; put it in a .pid file. At least ops teams are likely to recognize that as something that should not be copied.

                          Also, if a cluster machine sees its own unique ID on a different host, emit a clear error message with a hint about how to correct it.

                          Better yet, construct the unique ID, if possible, on server start-up and hold it in memory. Does it need to persist from restart to restart? If so, maybe fall back to a .lastpid file. Something a little more explicit.

                           

                          Love HornetQ. Fast, powerful, flexible, easy to embed, complete configuration from both file and code. This issue sucked, but overall we're still rocking it.

                          • 10. Re: Cluster never attempts core-bridge and many messages are never delivered
                            clebert.suconic

                            The issue is not actually in the journal itself. There's a file used for locking; that's where the ID is located. Copying the journal is fine. That file needs to be there before restart.

                             

                            We could write the local host name and the directory name somewhere; if either of those changed, we would replace the ID and give a warning once! And also put a big warning in the docs...

                             

                             

                            @Justin... can you add the warning somewhere please?

                            • 11. Re: Cluster never attempts core-bridge and many messages are never delivered
                              clebert.suconic

                              @Justin... I meant in the docs.

                              • 12. Re: Cluster never attempts core-bridge and many messages are never delivered
                                jbertram

                                I thought we had a warning in the docs already.  I'll try to find it and make it more prominent or create one if it doesn't exist.

                                • 13. Re: Cluster never attempts core-bridge and many messages are never delivered
                                  clebert.suconic

                                  I also thought we had it... but based on what David said, it seems like they dug through the docs and didn't see it there.

                                  • 14. Re: Cluster never attempts core-bridge and many messages are never delivered
                                    david.berkman

                                    Being human, I may have missed it, but even taking another look through, I see no proper warning about this. Actually, the whole clustering section of the docs deals much more with replication and failover issues than with clustering per se. I would rename that section to Replication & Failover and create a Clustering section that deals just with the various cluster setups and gotchas such as the unique journal ID.
