2 Replies Latest reply on Oct 31, 2014 10:40 AM by jbertram

    Co-located failover/failback issues with HornetQ 2.4.1 and Wildfly 8.1

    dave7043

      Hello,

       

      I am in the process of testing a WildFly 8.1 HA cluster (domain mode) with HornetQ 2.4.1 configured to run replicated in a co-located manner (i.e., node1 backing up node2, and vice versa).

      It runs OK in normal operation, but I am seeing some strange behavior after HornetQ has failed over, and again when failing back.

       

      I apologize for combining multiple questions into one post, but I believe (hope) that they may have a related root cause.

       

      Configuration

      --------------------------

      OS: Red Hat EL 6.5 x64

      Cluster: 3 machines (1 domain controller/load balancer, 2 slave nodes running the servers)

      Java: Oracle JDK 1.7_x64_u51

       

      JMS flow types:  All internal using MDBs, _no_ remote messaging
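
      For context, the consumers are all MDBs of roughly this shape (a simplified sketch; the class name is a placeholder and the real processing is internal). They are driven off java:/ConnectionFactory, i.e. the hornetq-ra pooled connection factory shown in the configuration further down:

      import javax.ejb.ActivationConfigProperty;
      import javax.ejb.MessageDriven;
      import javax.jms.Message;
      import javax.jms.MessageListener;

      // Simplified sketch of the internal MDB consumers (names are placeholders).
      @MessageDriven(activationConfig = {
              @ActivationConfigProperty(propertyName = "destinationType", propertyValue = "javax.jms.Queue"),
              @ActivationConfigProperty(propertyName = "destination", propertyValue = "java:/queue/product/stage1"),
              @ActivationConfigProperty(propertyName = "acknowledgeMode", propertyValue = "Auto-acknowledge")
      })
      public class Stage1Processor implements MessageListener
      {
          @Override
          public void onMessage(Message message)
          {
              // Real processing happens here; each stage forwards its result to the next queue.
          }
      }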


      Core Behavior

      --------------------------

      In normal operation, JMS messages are sent/received normally and appear to be distributed evenly.

      After a hard failover (kill -9 on all WildFly processes of a slave node), the backup JMS server takes over and successfully processes any pending/in-process messages from the dead node.


      Problem 1

      However, at this point any new JMS message sent to a queue has only a ~50% (non-sequential) chance of being consumed by the remaining node.  The missing messages appear to be queued correctly, but they never get consumed, and I can't find them anywhere on disk or in memory.

       

      This seems very likely to be a configuration issue or a JNDI issue with the consumers, but I have been unable to find any current documentation on configuring a co-located HA cluster in WildFly 8.1.  The configuration I've put together so far is based on the JBoss EAP 6 clustering reference (http://www.redhat.com/en/resources/jboss-eap-6-clustering) and the old clustering docs last provided with HornetQ 2.2.2 (http://docs.jboss.org/hornetq/2.2.2.Final/user-manual/en/html/clusters.html).


      Question 1

      Any hints for finding out where those messages that never get consumed are going/waiting?
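
      The way I've been trying to answer that myself is with a plain JMS QueueBrowser run inside each node through the in-VM factory, along these lines (a simplified sketch; the JNDI names match the configuration below, everything else is illustrative):

      import java.util.Enumeration;

      import javax.annotation.Resource;
      import javax.ejb.Stateless;
      import javax.jms.Connection;
      import javax.jms.ConnectionFactory;
      import javax.jms.Message;
      import javax.jms.Queue;
      import javax.jms.QueueBrowser;
      import javax.jms.Session;

      @Stateless
      public class QueueInspector
      {
          @Resource(lookup = "java:/InVmConnectionFactory")
          private ConnectionFactory connectionFactory;

          @Resource(lookup = "java:/queue/product/stage1")
          private Queue stage1Queue;

          // Browses the local node's view of the queue and dumps what is sitting there.
          // Running this on both nodes shows which server (if any) is actually holding
          // the messages that never get consumed.
          public void dumpPending() throws Exception
          {
              Connection connection = connectionFactory.createConnection();
              try
              {
                  Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
                  QueueBrowser browser = session.createBrowser(stage1Queue);
                  connection.start();

                  int count = 0;
                  Enumeration<?> messages = browser.getEnumeration();
                  while (messages.hasMoreElements())
                  {
                      Message message = (Message) messages.nextElement();
                      System.out.println("pending: " + message.getJMSMessageID());
                      count++;
                  }
                  System.out.println("total pending on this node: " + count);
              }
              finally
              {
                  connection.close();
              }
          }
      }

      As far as I understand, a QueueBrowser only sees the local server's portion of the clustered queue, which is exactly what I want here: if the "missing" messages show up in the browser on one node but never reach an MDB, that points at the consumers/redistribution rather than message loss.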

       

      The HornetQ configuration is:

      <subsystem xmlns="urn:jboss:domain:messaging:2.0">
          <hornetq-server name="live">
              <persistence-enabled>true</persistence-enabled>
              <cluster-password>${jboss.messaging.cluster.password:wildflycluster123}</cluster-password>
      
              <statistics-enabled>true</statistics-enabled>
              <message-counter-sample-period>1000</message-counter-sample-period>
              <message-counter-max-day-history>3</message-counter-max-day-history>
      
      
              <backup>false</backup>
              <allow-failback>true</allow-failback>
              <failover-on-shutdown>false</failover-on-shutdown>
              <shared-store>false</shared-store>
              <journal-type>NIO</journal-type>
              <journal-file-size>102400</journal-file-size>
              <check-for-live-server>true</check-for-live-server>
              <backup-group-name>${hornetq.live.group.name}</backup-group-name>
              <paging-directory path="data/jms/live/paging"/>
              <bindings-directory path="data/jms/live/bindings"/>
              <journal-directory path="data/jms/live/journal"/>
              <large-messages-directory path="data/jms/live/large-messages"/>
      
              <connectors>
                  <http-connector name="http-connector" socket-binding="http">
                      <param key="http-upgrade-endpoint" value="http-acceptor"/>
                  </http-connector>
                  <http-connector name="http-connector-throughput" socket-binding="http">
                      <param key="http-upgrade-endpoint" value="http-acceptor-throughput"/>
                      <param key="batch-delay" value="50"/>
                  </http-connector>
                  <in-vm-connector name="in-vm" server-id="1"/>
              </connectors>
      
              <acceptors>
                  <http-acceptor http-listener="default" name="http-acceptor"/>
                  <http-acceptor http-listener="default" name="http-acceptor-throughput">
                      <param key="batch-delay" value="50"/>
                      <param key="direct-deliver" value="false"/>
                  </http-acceptor>
                  <in-vm-acceptor name="in-vm" server-id="1"/>
              </acceptors>
      
              <broadcast-groups>
                  <broadcast-group name="bg-group1">
                      <socket-binding>messaging-group</socket-binding>
                      <connector-ref>
                          http-connector
                      </connector-ref>
                  </broadcast-group>
              </broadcast-groups>
      
              <discovery-groups>
                  <discovery-group name="dg-group1">
                      <socket-binding>messaging-group</socket-binding>
                  </discovery-group>
              </discovery-groups>
      
              <cluster-connections>
                  <cluster-connection name="my-cluster">
                      <address>jms</address>
                      <connector-ref>http-connector</connector-ref>
                      <discovery-group-ref discovery-group-name="dg-group1"/>
                  </cluster-connection>
              </cluster-connections>
      
              <security-settings>
                  <security-setting match="#">
                      <permission type="send" roles="guest"/>
                      <permission type="consume" roles="guest"/>
                      <permission type="createNonDurableQueue" roles="guest"/>
                      <permission type="deleteNonDurableQueue" roles="guest"/>
                  </security-setting>
              </security-settings>
      
              <address-settings>
                  <address-setting match="#">
                      <dead-letter-address>jms.queue.DLQ</dead-letter-address>
                      <expiry-address>jms.queue.ExpiryQueue</expiry-address>
                      <max-size-bytes>10485760</max-size-bytes>
                      <page-size-bytes>2097152</page-size-bytes>
                      <message-counter-history-day-limit>10</message-counter-history-day-limit>
                      <redistribution-delay>1000</redistribution-delay>
                  </address-setting>
              </address-settings>
      
              <jms-connection-factories>
                  <connection-factory name="InVmConnectionFactory">
                      <connectors>
                          <connector-ref connector-name="in-vm"/>
                      </connectors>
                      <entries>
                          <entry name="java:/InVmConnectionFactory"/>
                      </entries>
                  </connection-factory>
                  <connection-factory name="RemoteConnectionFactory">
                      <connectors>
                          <connector-ref connector-name="http-connector"/>
                      </connectors>
                      <entries>
                          <entry name="java:jboss/exported/jms/RemoteConnectionFactory"/>
                      </entries>
                  </connection-factory>
                  <pooled-connection-factory name="hornetq-ra">
                      <transaction mode="xa"/>
                      <connectors>
                          <connector-ref connector-name="in-vm"/>
                      </connectors>
                      <entries>
                          <entry name="java:jboss/DefaultJMSConnectionFactory"/>
                          <entry name="java:/ConnectionFactory"/> <!-- NOTE: All MDBs are driven off of this -->
                          <entry name="java:/JmsXA"/>
                      </entries>
                      <ha>true</ha>
                      <block-on-acknowledge>true</block-on-acknowledge>
                      <reconnect-attempts>-1</reconnect-attempts>
                  </pooled-connection-factory>
              </jms-connection-factories>
      
              <jms-destinations>
                  <jms-queue name="ExpiryQueue">
                      <entry name="java:/jms/queue/ExpiryQueue"/>
                  </jms-queue>
                  <jms-queue name="DLQ">
                      <entry name="java:/jms/queue/DLQ"/>
                  </jms-queue>
                  <jms-queue name="testStage1">
                      <entry name="java:/queue/product/stage1"/>
                  </jms-queue>
                  <jms-queue name="testStage2">
                      <entry name="java:/queue/product/stage2"/>
                  </jms-queue>
              </jms-destinations>
          </hornetq-server>
      
          <hornetq-server name="backup1">
              <persistence-enabled>true</persistence-enabled>
              <cluster-password>${jboss.messaging.cluster.password:wildflycluster123}</cluster-password>
              <statistics-enabled>true</statistics-enabled>
              <message-counter-sample-period>1000</message-counter-sample-period>
              <message-counter-max-day-history>3</message-counter-max-day-history>
              <backup>true</backup>
              <allow-failback>true</allow-failback>
              <shared-store>false</shared-store>
              <journal-type>NIO</journal-type>
              <journal-file-size>102400</journal-file-size>
              <backup-group-name>${hornetq.backup.group.name}</backup-group-name>
      
              <paging-directory path="data/jms/backup1/paging"/>
              <bindings-directory path="data/jms/backup1/bindings"/>
              <journal-directory path="data/jms/backup1/journal"/>
              <large-messages-directory path="data/jms/backup1/large-messages"/>
      
              <connectors>
                  <http-connector name="http-backup-connector" socket-binding="httpBackup">
                      <param key="http-upgrade-endpoint" value="http-backup-acceptor"/>
                  </http-connector>
                  <http-connector name="http-backup-connector-throughput" socket-binding="httpBackup">
                      <param key="http-upgrade-endpoint" value="http-backup-acceptor-throughput"/>
                      <param key="batch-delay" value="50"/>
                  </http-connector>
                  <in-vm-connector name="in-vm" server-id="2"/>
              </connectors>
      
              <acceptors>
                  <http-acceptor http-listener="default" name="http-backup-acceptor"/>
                  <http-acceptor http-listener="default" name="http-backup-acceptor-throughput">
                      <param key="batch-delay" value="50"/>
                      <param key="direct-deliver" value="false"/>
                  </http-acceptor>
                  <in-vm-acceptor name="in-vm" server-id="2"/>
              </acceptors>
      
              <broadcast-groups>
                  <broadcast-group name="bg-group1">
                      <socket-binding>messaging-group</socket-binding>
                      <connector-ref>
                          http-backup-connector
                      </connector-ref>
                  </broadcast-group>
              </broadcast-groups>
      
              <discovery-groups>
                  <discovery-group name="dg-group1">
                      <socket-binding>messaging-group</socket-binding>
                  </discovery-group>
              </discovery-groups>
      
              <cluster-connections>
                  <cluster-connection name="my-cluster">
                      <address>jms</address>
                      <connector-ref>http-backup-connector</connector-ref>
                      <discovery-group-ref discovery-group-name="dg-group1"/>
                  </cluster-connection>
              </cluster-connections>
      
              <security-settings>
                  <security-setting match="#">
                      <permission type="send" roles="guest"/>
                      <permission type="consume" roles="guest"/>
                      <permission type="createNonDurableQueue" roles="guest"/>
                      <permission type="deleteNonDurableQueue" roles="guest"/>
                  </security-setting>
              </security-settings>
      
              <address-settings>
                  <address-setting match="#">
                      <max-size-bytes>10485760</max-size-bytes>
                      <page-size-bytes>2097152</page-size-bytes>
                      <message-counter-history-day-limit>10</message-counter-history-day-limit>
                      <redistribution-delay>1000</redistribution-delay>
                  </address-setting>
              </address-settings>
      
              <!-- NOTE: I've tested with and _without_ this block and don't see any difference in behavior -->
              <jms-connection-factories>
                  <connection-factory name="InVmConnectionFactory">
                      <connectors>
                          <connector-ref connector-name="in-vm"/>
                      </connectors>
                      <entries>
                          <entry name="java:/InVmConnectionFactoryBak"/>
                      </entries>
                  </connection-factory>
                  <connection-factory name="RemoteConnectionFactory">
                      <connectors>
                          <connector-ref connector-name="http-backup-connector"/>
                      </connectors>
                      <entries>
                          <entry name="java:jboss/exported/jms/RemoteConnectionFactoryBak"/>
                      </entries>
                  </connection-factory>
                  <pooled-connection-factory name="hornetq-ra-bak">
                      <transaction mode="xa"/>
                      <connectors>
                          <connector-ref connector-name="in-vm"/>
                      </connectors>
                      <entries>
                          <entry name="java:/ConnectionFactoryBak"/>
                          <entry name="java:/JmsXABak"/>
                      </entries>
                      <ha>true</ha>
                      <block-on-acknowledge>true</block-on-acknowledge>
                      <reconnect-attempts>-1</reconnect-attempts>
                  </pooled-connection-factory>
      
              </jms-connection-factories>
          </hornetq-server>
      </subsystem>
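
      For completeness: the two group-name properties above (hornetq.live.group.name and hornetq.backup.group.name) are set per server in each slave's host.xml. The intent, if I've understood backup-group-name correctly, is that a backup only pairs with a live server that has the same group name, so each node's backup uses the other node's live group name and the backups pair up crosswise. The property values and the server-group name below are illustrative, not the real ones:

      <!-- host.xml on slave 1 (illustrative values) -->
      <server name="slave-node1" group="main-server-group">
          <system-properties>
              <property name="hornetq.live.group.name" value="group-node1"/>
              <property name="hornetq.backup.group.name" value="group-node2"/>
          </system-properties>
      </server>

      <!-- host.xml on slave 2: the mirror image -->
      <server name="slave-node2" group="main-server-group">
          <system-properties>
              <property name="hornetq.live.group.name" value="group-node2"/>
              <property name="hornetq.backup.group.name" value="group-node1"/>
          </system-properties>
      </server>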
      
      

       

       

      Problem 2

      When failing back, two distinct and repeatable behaviors are seen.

      NOTE: The full server logs for both Type 1 and Type 2 failures are attached. Node 2 is always the node that is killed and restarted.

       

      Type 1

      The restarted node fails to detect that the backup node is running and attempts to start normally.  This results in the 'HQ212034: There are more than one servers on the network broadcasting the same node id. You will see this message exactly once (per node) if a node is restarted, in which case it can be safely ignored. But if it is logged continuously it means you really do have more than one node on the same network active concurrently with the same node id. This could occur if you have a backup node active at the same time as its live node.' message being printed out every second on each node.

       

      I've traced this issue down to a call in 'org.hornetq.core.server.impl.HornetQServerImpl.SharedNothingLiveActivation#isNodeIdUsed':

      try
      {
          locator.setReconnectAttempts(0);
          try
          {
              locator.addClusterTopologyListener(listener);
              factory = locator.connectNoWarnings();
          }
          catch (Exception notConnected)
          {
              HornetQServerLogger.LOGGER.info("=CUSTOM DEBUG= isNodeIdUsed: Not connected exception!", notConnected);
              return false;
          }
      
          listener.latch.await(20, TimeUnit.SECONDS);
          HornetQServerLogger.LOGGER.info("=CUSTOM DEBUG= isNodeIdUsed: is node present = " + listener.isNodePresent);
          return listener.isNodePresent;
      }
      
      

       

      Each time a Type 1 failure occurs, the notConnected exception is caught in the catch block above, which results in a return value of false and in the JMS server trying to start normally.

       

      The debug output from the LOGGER.info call in that catch block is:

      2014-10-16 11:51:32,864 INFO  [org.hornetq.core.server] (ServerService Thread Pool -- 58) =CUSTOM DEBUG= isNodeIdUsed: Not connected exception!: HornetQNotConnectedException[errorType=NOT_CONNECTED message=HQ119007: Cannot connect to server(s). Tried with all available servers.]
        at org.hornetq.core.client.impl.ServerLocatorImpl.createSessionFactory(ServerLocatorImpl.java:905) [hornetq-core-client-2.4.1.Final.jar:]
        at org.hornetq.core.client.impl.ServerLocatorImpl.connect(ServerLocatorImpl.java:669) [hornetq-core-client-2.4.1.Final.jar:]
        at org.hornetq.core.client.impl.ServerLocatorImpl.connect(ServerLocatorImpl.java:653) [hornetq-core-client-2.4.1.Final.jar:]
        at org.hornetq.core.server.impl.HornetQServerImpl$SharedNothingLiveActivation.isNodeIdUsed(HornetQServerImpl.java:2757) [:]
        at org.hornetq.core.server.impl.HornetQServerImpl$SharedNothingLiveActivation.run(HornetQServerImpl.java:2692) [:]
        at org.hornetq.core.server.impl.HornetQServerImpl.start(HornetQServerImpl.java:450) [:]
        at org.hornetq.jms.server.impl.JMSServerManagerImpl.start(JMSServerManagerImpl.java:485) [hornetq-jms-server-2.4.1.Final.jar:]
        at org.jboss.as.messaging.jms.JMSService.doStart(JMSService.java:169) [wildfly-messaging-8.1.0.Final.jar:8.1.0.Final]
        at org.jboss.as.messaging.jms.JMSService.access$000(JMSService.java:63) [wildfly-messaging-8.1.0.Final.jar:8.1.0.Final]
        at org.jboss.as.messaging.jms.JMSService$1.run(JMSService.java:96) [wildfly-messaging-8.1.0.Final.jar:8.1.0.Final]
        at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) [rt.jar:1.7.0_51]
        at java.util.concurrent.FutureTask.run(Unknown Source) [rt.jar:1.7.0_51]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [rt.jar:1.7.0_51]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [rt.jar:1.7.0_51]
        at java.lang.Thread.run(Unknown Source) [rt.jar:1.7.0_51]
        at org.jboss.threads.JBossThread.run(JBossThread.java:122) [jboss-threads-2.1.1.Final.jar:2.1.1.Final]
      
      

       

      Question 2

      Has anyone seen this behavior before?  I tried increasing the reconnect attempts to 5, which makes the problem less likely to occur, but it still happens quite often. Any ideas on further debugging?

       

      Type 2

      If the Type 1 failure does not occur, then the running backup server is detected correctly, and the activation object in org.hornetq.core.server.impl.HornetQServerImpl#start is set to:

       

      activation = new SharedNothingBackupActivation(wasLive);   // wasLive is equal to true.

       

      The startup makes it to the final line in org.hornetq.core.server.impl.HornetQServerImpl.SharedNothingBackupActivation.EndpointConnector#run:

      clusterManager.announceReplicatingBackupToLive(pingChannel, attemptFailBack);  // attemptFailBack is equal to true.
      
      

       

      I don't see anything else past that point.  The recovering server says it is waiting for the live to fail before it gets active, and I don't see any output from the failed-over server indicating that it received any failback messages.

       

      Excerpt from node 1 log (surviving node):

      -----------------------------------
      =CUSTOM DEBUG= Killing Slave 2
      -----------------------------------
      
      2014-10-15 17:11:32,527 WARN  [org.hornetq.core.server] (Thread-0 (HornetQ-client-global-threads-1247513532)) HQ222095: Connection failed with failedOver=false: HornetQNotConnectedException[errorType=NOT_CONNECTED message=HQ119006: Channel disconnected]
        at org.hornetq.core.client.impl.ClientSessionFactoryImpl.connectionDestroyed(ClientSessionFactoryImpl.java:427) [hornetq-core-client-2.4.1.Final.jar:]
        at org.hornetq.core.remoting.impl.netty.NettyConnector$Listener$1.run(NettyConnector.java:1055) [hornetq-core-client-2.4.1.Final.jar:]
        at org.hornetq.utils.OrderedExecutorFactory$OrderedExecutor$1.run(OrderedExecutorFactory.java:104) [hornetq-core-client-2.4.1.Final.jar:]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [rt.jar:1.7.0_51]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [rt.jar:1.7.0_51]
        at java.lang.Thread.run(Unknown Source) [rt.jar:1.7.0_51]
      
      2014-10-15 17:11:32,641 WARN  [org.hornetq.core.client] (Thread-3 (HornetQ-client-global-threads-1247513532)) HQ212004: Failed to connect to server.
      2014-10-15 17:11:32,777 ERROR [org.hornetq.core.server] (default I/O-4) HQ224018: Failed to create session: HornetQSecurityException[errorType=SECURITY_EXCEPTION message=HQ119031: Unable to validate user: null]
        at org.hornetq.core.security.impl.SecurityStoreImpl.authenticate(SecurityStoreImpl.java:145)
        at org.hornetq.core.server.impl.HornetQServerImpl.createSession(HornetQServerImpl.java:1014)
        at org.hornetq.core.protocol.core.impl.HornetQPacketHandler.handleCreateSession(HornetQPacketHandler.java:150)
        at org.hornetq.core.protocol.core.impl.HornetQPacketHandler.handlePacket(HornetQPacketHandler.java:77)
        at org.hornetq.core.protocol.core.impl.ChannelImpl.handlePacket(ChannelImpl.java:636)
        at org.hornetq.core.protocol.core.impl.RemotingConnectionImpl.doBufferReceived(RemotingConnectionImpl.java:546)
        at org.hornetq.core.protocol.core.impl.RemotingConnectionImpl.bufferReceived(RemotingConnectionImpl.java:522)
        at org.hornetq.core.remoting.server.impl.RemotingServiceImpl$DelegatingBufferHandler.bufferReceived(RemotingServiceImpl.java:635)
        at org.hornetq.core.remoting.impl.netty.HornetQChannelHandler.channelRead(HornetQChannelHandler.java:73)
        at io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:338)
        at io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:324)
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:153)
        at io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:338)
        at io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:324)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:785)
        at org.xnio.netty.transport.AbstractXnioSocketChannel$ReadListener.handleEvent(AbstractXnioSocketChannel.java:435)
        at org.xnio.netty.transport.AbstractXnioSocketChannel$ReadListener.handleEvent(AbstractXnioSocketChannel.java:371)
        at org.xnio.ChannelListeners.invokeChannelListener(ChannelListeners.java:92) [xnio-api-3.2.2.Final.jar:3.2.2.Final]
        at org.xnio.conduits.ReadReadyHandler$ChannelListenerHandler.readReady(ReadReadyHandler.java:66) [xnio-api-3.2.2.Final.jar:3.2.2.Final]
        at org.xnio.nio.NioSocketConduit.handleReady(NioSocketConduit.java:87) [xnio-nio-3.2.2.Final.jar:3.2.2.Final]
        at org.xnio.nio.WorkerThread.run(WorkerThread.java:539) [xnio-nio-3.2.2.Final.jar:3.2.2.Final]
      
      2014-10-15 17:11:40,818 WARN  [org.hornetq.core.client] (Thread-3 (HornetQ-client-global-threads-1247513532)) HQ212005: Tried 5 times to connect. Now giving up on reconnecting it.
      2014-10-15 17:11:40,828 INFO  [org.hornetq.core.server] (HQ119000: Activation for server HornetQServerImpl::serverUUID=null) HQ221037: HornetQServerImpl::serverUUID=52e5bb88-547d-11e4-842d-bd75c1eea38c to become 'live'
      2014-10-15 17:11:40,911 INFO  [org.jboss.messaging] (MSC service thread 1-4) JBAS011615: Registered HTTP upgrade for hornetq-remoting protocol handled by http-backup-acceptor-throughput acceptor
      2014-10-15 17:11:40,912 INFO  [org.jboss.messaging] (MSC service thread 1-4) JBAS011615: Registered HTTP upgrade for hornetq-remoting protocol handled by http-backup-acceptor acceptor
      2014-10-15 17:11:40,932 WARN  [org.hornetq.core.client] (hornetq-discovery-group-thread-dg-group1) HQ212034: There are more than one servers on the network broadcasting the same node id. You will see this message exactly once (per node) if a node is restarted, in which case it can be safely ignored. But if it is logged continuously it means you really do have more than one node on the same network active concurrently with the same node id. This could occur if you have a backup node active at the same time as its live node. nodeID=52e5bb88-547d-11e4-842d-bd75c1eea38c
      2014-10-15 17:11:40,934 WARN  [org.hornetq.core.client] (hornetq-discovery-group-thread-dg-group1) HQ212034: There are more than one servers on the network broadcasting the same node id. You will see this message exactly once (per node) if a node is restarted, in which case it can be safely ignored. But if it is logged continuously it means you really do have more than one node on the same network active concurrently with the same node id. This could occur if you have a backup node active at the same time as its live node. nodeID=52e5bb88-547d-11e4-842d-bd75c1eea38c
      2014-10-15 17:11:41,005 INFO  [org.hornetq.core.server] (Thread-14 (HornetQ-server-HornetQServerImpl::serverUUID=null-833505859)) HQ221027: Bridge ClusterConnectionBridge@656718dc [name=sf.my-cluster.572c1739-547d-11e4-a418-fbebf3abb8c6, queue=QueueImpl[name=sf.my-cluster.572c1739-547d-11e4-a418-fbebf3abb8c6, postOffice=PostOfficeImpl [server=HornetQServerImpl::serverUUID=52e5bb88-547d-11e4-842d-bd75c1eea38c]]@53914786 targetConnector=ServerLocatorImpl (identity=(Cluster-connection-bridge::ClusterConnectionBridge@656718dc [name=sf.my-cluster.572c1739-547d-11e4-a418-fbebf3abb8c6, queue=QueueImpl[name=sf.my-cluster.572c1739-547d-11e4-a418-fbebf3abb8c6, postOffice=PostOfficeImpl [server=HornetQServerImpl::serverUUID=52e5bb88-547d-11e4-842d-bd75c1eea38c]]@53914786 targetConnector=ServerLocatorImpl [initialConnectors=[TransportConfiguration(name=http-connector, factory=org-hornetq-core-remoting-impl-netty-NettyConnectorFactory) ?port=8080&host=10-128-57-166&http-upgrade-endpoint=http-acceptor&http-upgrade-enabled=true], discoveryGroupConfiguration=null]]::ClusterConnectionImpl@628677523[nodeUUID=52e5bb88-547d-11e4-842d-bd75c1eea38c, connector=TransportConfiguration(name=http-backup-connector, factory=org-hornetq-core-remoting-impl-netty-NettyConnectorFactory) ?port=8081&host=10-128-57-166&http-upgrade-endpoint=http-backup-acceptor&http-upgrade-enabled=true, address=jms, server=HornetQServerImpl::serverUUID=52e5bb88-547d-11e4-842d-bd75c1eea38c])) [initialConnectors=[TransportConfiguration(name=http-connector, factory=org-hornetq-core-remoting-impl-netty-NettyConnectorFactory) ?port=8080&host=10-128-57-166&http-upgrade-endpoint=http-acceptor&http-upgrade-enabled=true], discoveryGroupConfiguration=null]] is connected
      2014-10-15 17:12:14,768 WARN  [org.hornetq.core.client] (hornetq-failure-check-thread) HQ212037: Connection failure has been detected: HQ119014: Did not receive data from /10.128.57.235:32987. It is likely the client has exited or crashed without closing its connection, or the network between the server and client has failed. You also might have configured connection-ttl and client-failure-check-period incorrectly. Please check user manual for more information. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
      2014-10-15 17:12:14,770 WARN  [org.hornetq.core.server] (hornetq-failure-check-thread) HQ222061: Client connection failed, clearing up resources for session 5c96a8d4-547d-11e4-842d-bd75c1eea38c
      2014-10-15 17:12:14,777 WARN  [org.hornetq.core.server] (hornetq-failure-check-thread) HQ222107: Cleared up resources for session 5c96a8d4-547d-11e4-842d-bd75c1eea38c
      
      -----------------------------------
      =CUSTOM DEBUG= Restarting Slave 2
      -----------------------------------
      
      2014-10-15 17:13:13,152 INFO  [org.hornetq.core.server] (Thread-82) HQ221025: Replication: sending JournalFileImpl: (hornetq-data-4.hq id = 4, recordID = 4) (size=102,400) to backup. NIOSequentialFile /home/product/wildflyCluster/Node/ProductServer/EmbeddedServer/domain/servers/slave-node1/data/data/jms/live/journal/hornetq-data-4.hq
      2014-10-15 17:13:13,156 INFO  [org.hornetq.core.server] (Thread-82) HQ221025: Replication: sending JournalFileImpl: (hornetq-bindings-7.bindings id = 1, recordID = 1) (size=1,048,576) to backup. NIOSequentialFile /home/product/wildflyCluster/Node/ProductServer/EmbeddedServer/domain/servers/slave-node1/data/data/jms/live/bindings/hornetq-bindings-7.bindings
      2014-10-15 17:13:13,161 INFO  [org.hornetq.core.server] (Thread-82) HQ221025: Replication: sending JournalFileImpl: (hornetq-bindings-5.bindings id = 5, recordID = 5) (size=1,048,576) to backup. NIOSequentialFile /home/product/wildflyCluster/Node/ProductServer/EmbeddedServer/domain/servers/slave-node1/data/data/jms/live/bindings/hornetq-bindings-5.bindings
      
      == No messages after this point, and the server was stopped ~20 min later ==
      
      

       

       

      Excerpt from node 2 log (node that was killed and restarted):

      == Node was killed, and this is where the HornetQ logging picks up on restart ==
      
      2014-10-15 17:13:11,852 INFO  [org.hornetq.core.server] (ServerService Thread Pool -- 58) HQ221000: live server is starting with configuration HornetQ Configuration (clustered=true,backup=false,sharedStore=false,journalDirectory=/home/product/wildflyCluster/Node/ProductServer/EmbeddedServer/domain/servers/slave-node2/data/data/jms/live/journal,bindingsDirectory=/home/product/wildflyCluster/Node/ProductServer/EmbeddedServer/domain/servers/slave-node2/data/data/jms/live/bindings,largeMessagesDirectory=/home/product/wildflyCluster/Node/ProductServer/EmbeddedServer/domain/servers/slave-node2/data/data/jms/live/large-messages,pagingDirectory=/home/product/wildflyCluster/Node/ProductServer/EmbeddedServer/domain/servers/slave-node2/data/data/jms/live/paging)
      
      2014-10-15 17:13:11,858 INFO  [org.hornetq.core.server] (ServerService Thread Pool -- 58) =CUSTOM DEBUG= isNodeIdUsed: Starting check, nodeId=52e5bb88-547d-11e4-842d-bd75c1eea38c
      
      2014-10-15 17:13:11,872 INFO  [org.hornetq.core.server] (ServerService Thread Pool -- 59) HQ221000: backup server is starting with configuration HornetQ Configuration (clustered=true,backup=true,sharedStore=false,journalDirectory=/home/product/wildflyCluster/Node/ProductServer/EmbeddedServer/domain/servers/slave-node2/data/data/jms/backup1/journal,bindingsDirectory=/home/product/wildflyCluster/Node/ProductServer/EmbeddedServer/domain/servers/slave-node2/data/data/jms/backup1/bindings,largeMessagesDirectory=/home/product/wildflyCluster/Node/ProductServer/EmbeddedServer/domain/servers/slave-node2/data/data/jms/backup1/large-messages,pagingDirectory=/home/product/wildflyCluster/Node/ProductServer/EmbeddedServer/domain/servers/slave-node2/data/data/jms/backup1/paging)
      2014-10-15 17:13:11,953 WARN  [org.hornetq.core.server] (HQ119000: Activation for server HornetQServerImpl::serverUUID=null) HQ222162: Moving data directory /home/product/wildflyCluster/Node/ProductServer/EmbeddedServer/domain/servers/slave-node2/data/data/jms/backup1/bindings to /home/product/wildflyCluster/Node/ProductServer/EmbeddedServer/domain/servers/slave-node2/data/data/jms/backup1/bindings2
      2014-10-15 17:13:11,954 WARN  [org.hornetq.core.server] (HQ119000: Activation for server HornetQServerImpl::serverUUID=null) HQ222162: Moving data directory /home/product/wildflyCluster/Node/ProductServer/EmbeddedServer/domain/servers/slave-node2/data/data/jms/backup1/journal to /home/product/wildflyCluster/Node/ProductServer/EmbeddedServer/domain/servers/slave-node2/data/data/jms/backup1/journal2
      2014-10-15 17:13:11,955 WARN  [org.hornetq.core.server] (HQ119000: Activation for server HornetQServerImpl::serverUUID=null) HQ222162: Moving data directory /home/product/wildflyCluster/Node/ProductServer/EmbeddedServer/domain/servers/slave-node2/data/data/jms/backup1/paging to /home/product/wildflyCluster/Node/ProductServer/EmbeddedServer/domain/servers/slave-node2/data/data/jms/backup1/paging2
      2014-10-15 17:13:11,955 WARN  [org.hornetq.core.server] (HQ119000: Activation for server HornetQServerImpl::serverUUID=null) HQ222162: Moving data directory /home/product/wildflyCluster/Node/ProductServer/EmbeddedServer/domain/servers/slave-node2/data/data/jms/backup1/large-messages to /home/product/wildflyCluster/Node/ProductServer/EmbeddedServer/domain/servers/slave-node2/data/data/jms/backup1/large-messages2
      2014-10-15 17:13:12,008 INFO  [org.jboss.as.jacorb] (MSC service thread 1-3) JBAS016330: CORBA ORB Service started
      2014-10-15 17:13:12,039 INFO  [org.hornetq.core.server] (HQ119000: Activation for server HornetQServerImpl::serverUUID=null) HQ221013: Using NIO Journal
      2014-10-15 17:13:12,091 INFO  [io.netty.util.internal.PlatformDependent] (HQ119000: Activation for server HornetQServerImpl::serverUUID=null) Your platform does not provide complete low-level API for accessing direct buffers reliably. Unless explicitly requested, heap buffer will always be preferred to avoid potential system unstability.
      2014-10-15 17:13:12,443 INFO  [org.hornetq.core.server] (HQ119000: Activation for server HornetQServerImpl::serverUUID=null) HQ221043: Adding protocol support CORE
      2014-10-15 17:13:12,456 INFO  [org.hornetq.core.server] (HQ119000: Activation for server HornetQServerImpl::serverUUID=null) HQ221043: Adding protocol support AMQP
      2014-10-15 17:13:12,460 INFO  [org.hornetq.core.server] (HQ119000: Activation for server HornetQServerImpl::serverUUID=null) HQ221043: Adding protocol support STOMP
      2014-10-15 17:13:12,464 INFO  [org.jboss.as.jacorb] (MSC service thread 1-2) JBAS016328: CORBA Naming Service started
      2014-10-15 17:13:12,755 INFO  [org.wildfly.extension.undertow] (MSC service thread 1-6) JBAS017519: Undertow HTTPS listener https listening on product-dev04/10.128.57.235:8443
      2014-10-15 17:13:12,764 INFO  [org.jboss.modcluster] (ServerService Thread Pool -- 59) MODCLUSTER000001: Initializing mod_cluster version 1.3.0.Final
      2014-10-15 17:13:12,799 INFO  [org.jboss.modcluster] (ServerService Thread Pool -- 59) MODCLUSTER000032: Listening to proxy advertisements on /224.0.1.105:23364
      2014-10-15 17:13:12,992 INFO  [org.hornetq.core.server] (HQ119000: Activation for server HornetQServerImpl::serverUUID=null) HQ221109: HornetQ Backup Server version 2.4.1.Final (Fast Hornet, 124) [null] started, waiting live to fail before it gets active
      
      2014-10-15 17:13:13,024 INFO  [org.hornetq.core.server] (ServerService Thread Pool -- 58) =CUSTOM DEBUG= isNodeIdUsed: is node present = true
      
      2014-10-15 17:13:13,065 WARN  [org.hornetq.core.server] (HQ119000: Activation for server HornetQServerImpl::serverUUID=null) HQ222162: Moving data directory /home/product/wildflyCluster/Node/ProductServer/EmbeddedServer/domain/servers/slave-node2/data/data/jms/live/bindings to /home/product/wildflyCluster/Node/ProductServer/EmbeddedServer/domain/servers/slave-node2/data/data/jms/live/bindings1
      2014-10-15 17:13:13,066 WARN  [org.hornetq.core.server] (HQ119000: Activation for server HornetQServerImpl::serverUUID=null) HQ222162: Moving data directory /home/product/wildflyCluster/Node/ProductServer/EmbeddedServer/domain/servers/slave-node2/data/data/jms/live/journal to /home/product/wildflyCluster/Node/ProductServer/EmbeddedServer/domain/servers/slave-node2/data/data/jms/live/journal1
      2014-10-15 17:13:13,067 WARN  [org.hornetq.core.server] (HQ119000: Activation for server HornetQServerImpl::serverUUID=null) HQ222162: Moving data directory /home/product/wildflyCluster/Node/ProductServer/EmbeddedServer/domain/servers/slave-node2/data/data/jms/live/paging to /home/product/wildflyCluster/Node/ProductServer/EmbeddedServer/domain/servers/slave-node2/data/data/jms/live/paging1
      2014-10-15 17:13:13,068 WARN  [org.hornetq.core.server] (HQ119000: Activation for server HornetQServerImpl::serverUUID=null) HQ222162: Moving data directory /home/product/wildflyCluster/Node/ProductServer/EmbeddedServer/domain/servers/slave-node2/data/data/jms/live/large-messages to /home/product/wildflyCluster/Node/ProductServer/EmbeddedServer/domain/servers/slave-node2/data/data/jms/live/large-messages1
      2014-10-15 17:13:13,069 INFO  [org.hornetq.core.server] (HQ119000: Activation for server HornetQServerImpl::serverUUID=null) HQ221013: Using NIO Journal
      2014-10-15 17:13:13,070 INFO  [org.hornetq.core.server] (HQ119000: Activation for server HornetQServerImpl::serverUUID=null) HQ221043: Adding protocol support CORE
      2014-10-15 17:13:13,073 INFO  [org.hornetq.core.server] (HQ119000: Activation for server HornetQServerImpl::serverUUID=null) HQ221043: Adding protocol support AMQP
      2014-10-15 17:13:13,074 INFO  [org.hornetq.core.server] (HQ119000: Activation for server HornetQServerImpl::serverUUID=null) HQ221043: Adding protocol support STOMP
      2014-10-15 17:13:13,109 INFO  [org.jboss.ws.common.management] (MSC service thread 1-6) JBWS022052: Starting JBoss Web Services - Stack CXF Server 4.2.4.Final
      2014-10-15 17:13:13,221 INFO  [org.jboss.as] (Controller Boot Thread) JBAS015874: WildFly 8.1.0.Final "Kenny" started in 6794ms - Started 220 of 369 services (181 services are lazy, passive or on-demand)
      2014-10-15 17:13:13,492 INFO  [org.hornetq.core.server] (Thread-2 (HornetQ-client-netty-threads-1625644531)) HQ221024: Backup server HornetQServerImpl::serverUUID=572c1739-547d-11e4-a418-fbebf3abb8c6 is synchronized with live-server.
      2014-10-15 17:13:14,835 INFO  [org.hornetq.core.server] (HQ119000: Activation for server HornetQServerImpl::serverUUID=null) HQ221109: HornetQ Backup Server version 2.4.1.Final (Fast Hornet, 124) [null] started, waiting live to fail before it gets active
      2014-10-15 17:13:14,881 INFO  [org.hornetq.core.server] (Thread-1 (HornetQ-server-HornetQServerImpl::serverUUID=null-1236870381)) HQ221031: backup announced
      
      == No messages after this point, and the server was stopped ~20 min later ==
      
      

       

      Question 3

      I'm a bit stumped on how to proceed with this one. Is there development or architecture documentation that describes this failover/failback process from a message standpoint? My assumption was that the recovering live server would wait for the backup to fail back and then become active before continuing, but from what I can see in the code, this does not appear to be the case.
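
      One thing I still plan to try is turning on TRACE logging for the HornetQ failback/replication code in the profile's logging subsystem, along these lines (the categories are just my guess at which packages are relevant, and the console/file handler levels also have to allow TRACE for anything extra to show up):

      <logger category="org.hornetq.core.server.impl">
          <level name="TRACE"/>
      </logger>
      <logger category="org.hornetq.core.replication">
          <level name="TRACE"/>
      </logger>
      <logger category="org.hornetq.core.server.cluster">
          <level name="TRACE"/>
      </logger>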

       

      If you've made it this far, thank you for taking the time to look it over, and any advice or hints you could provide would be GREATLY appreciated.

       

      Regards,

      Dave

        • 1. Re: Co-located failover/failback issues with HornetQ 2.4.1 and Wildfly 8.1
          dave7043

          A quick update on this.

           

          After moving to a private network with only the cluster machines active:

             Problem 1 (missing messages after failover) was never solved.

             Problem 2, Type 1 was solved; it appears to have been some sort of UDP issue.

             Problem 2, Type 2 was not solved either.

           

          I validated the above tests under EAP 6.1.1 and it was working correctly.  For WildFly 8.1, I've moved on to using ActiveMQ 5.10 and removed HornetQ for now.

          • 2. Re: Co-located failover/failback issues with HornetQ 2.4.1 and Wildfly 8.1
            jbertram

            I validated the above tests under EAP 6.1.1 and it was working correctly.

            Do you mean that you didn't see any of the problems you've reported here when you used EAP 6.1.1?

             

            In general, it's best to keep your problem descriptions as concise and simple as possible, and always provide a reproducible test case if possible.  For the time being, I recommend you focus on one single problem you're observing and go from there.