12 Replies Latest reply on Nov 9, 2017 7:15 AM by mnovak

Failover with Shared Store not working when Live Server killed

bpogace Oct 26, 2017 8:17 AM

Dear all,

I am running failover tests with two servers (live and backup, WildFly 10.1.0) that share a file system in a development environment. The test is simple:

- Access the application on the live server to produce/consume JMS messages on a queue
- Shut down or kill the live server
- Access the application on the backup server to check the number of messages in the queue

The problem is that if I shut down the live server, I see the correct number of messages when accessing the backup. But if I kill the live server (sudo kill -9), the messages that were just produced/consumed are not reflected on the backup server. According to the documentation, this is one of the main disadvantages of data replication, which is why I configured a shared store solution instead.

The following is the Live server configuration for messaging:

<subsystem xmlns="urn:jboss:domain:messaging-activemq:1.0">
           <server name="default">
                <security enabled="false"/>
                <cluster user="${jboss.messaging.cluster.user:userjms}" password="${jboss.messaging.cluster.password:secret}"/>
                <shared-store-master failover-on-server-shutdown="true"/>
                <bindings-directory path="/shared/sharedstore/bindings"/>
                <journal-directory path="/shared/sharedstore/journal"/>
                <large-messages-directory path="/shared/sharedstore/largemessages"/>
                <paging-directory path="/shared/sharedstore/paging"/>
                <security-setting name="#">
                    <role name="guest" send="true" consume="true" create-non-durable-queue="true" delete-non-durable-queue="true"/>
                    <role name="jms" send="true" consume="true"/>
                </security-setting>
                <address-setting name="#" dead-letter-address="jms.queue.DLQ" expiry-address="jms.queue.ExpiryQueue" max-size-bytes="10485760" page-size-bytes="2097152" message-counter-history-day-limit="10" redistribution-delay="0"/>
                <remote-connector name="netty" socket-binding="messaging"/>
                <remote-connector name="netty-throughput" socket-binding="messaging-throughput">
                    <param name="batch-delay" value="50"/>
                </remote-connector>
                <in-vm-connector name="in-vm" server-id="0"/>
                <remote-acceptor name="netty" socket-binding="messaging"/>
                <remote-acceptor name="netty-throughput" socket-binding="messaging-throughput">
                    <param name="batch-delay" value="50"/>
                    <param name="direct-deliver" value="false"/>
                </remote-acceptor>
                <in-vm-acceptor name="in-vm" server-id="0"/>
                <broadcast-group name="bg-group1" jgroups-channel="activemq-cluster" connectors="netty"/>
                <discovery-group name="dg-group1" jgroups-channel="activemq-cluster" refresh-timeout="1000"/>
                <cluster-connection name="my-cluster" address="jms" connector-name="netty" discovery-group="dg-group1"/>
                <jms-queue name="ExpiryQueue" entries="java:/jms/queue/ExpiryQueue"/>
                <jms-queue name="DLQ" entries="java:/jms/queue/DLQ"/>
                <jms-queue name="RunCommandQueue" entries="java:/jms/RunCommandQueue java:jboss/exported/jms/RunCommandQueue"/>
                <jms-queue name="ServiceCommandQueue" entries="java:/jms/ServiceCommandQueue java:jboss/exported/jms/ServiceCommandQueue"/>
                <jms-queue name="EventQueue" entries="java:/jms/EventQueue java:jboss/exported/jms/EventQueue"/>
                <connection-factory name="InVmConnectionFactory" entries="java:/ConnectionFactory" connectors="in-vm"/>
                <connection-factory name="RemoteConnectionFactory" entries="java:jboss/exported/jms/RemoteConnectionFactory" connectors="netty" block-on-acknowledge="true" reconnect-attempts="-1"/>
                <connection-factory name="EM2Factory" entries="java:/jms/EM2Factory" connectors="in-vm netty" ha="true"/>
                <pooled-connection-factory name="activemq-ra" entries="java:/JmsXA java:jboss/DefaultJMSConnectionFactory" connectors="netty" ha="true" reconnect-attempts="-1" transaction="xa"/>
            </server>
        </subsystem>

Please let me know if any configuration is missing. I look forward to your feedback.
Thank you and best regards

1. Re: Failover with Shared Store not working when Live Server killed

mnovak Oct 27, 2017 4:03 AM (in response to bpogace)

Which connection factory are you using in your client? EM2Factory or RemoteConnectionFactory?

If it is EM2Factory, can you set block-on-acknowledge="true" on it?
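
For reference, a minimal sketch of that change, based on the EM2Factory definition posted above. Only the block-on-acknowledge attribute is new; it is the same attribute that RemoteConnectionFactory already carries in that configuration:

```xml
<!-- Sketch: EM2Factory as posted for the live server, with block-on-acknowledge added -->
<connection-factory name="EM2Factory"
                    entries="java:/jms/EM2Factory"
                    connectors="in-vm netty"
                    ha="true"
                    block-on-acknowledge="true"/>
```

The same attribute would presumably need to be set on the backup server's EM2Factory as well, so that clients behave the same way after failover.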
Thanks,
Mirek
2. Re: Failover with Shared Store not working when Live Server killed

bpogace Oct 27, 2017 6:23 AM (in response to mnovak)

Hi Mirek,

Thank you for your quick reply.

Yes, EM2Factory is the connection factory we are using. I tried the configuration you suggested, but the messages were still not reflected on the backup server.

I don't understand: aren't produced/consumed messages supposed to be saved in the shared store?

Best regards,
Besian
3. Re: Failover with Shared Store not working when Live Server killed

mnovak Oct 27, 2017 7:08 AM (in response to bpogace)

Hi Besian,

Long story short, XA transactions should be used to guarantee that all messages survive failover. But don't lose hope :-) I'll try to elaborate on how it works, and maybe you will find where the problem is.

From the producer's point of view, it only makes sense to use Session.SESSION_TRANSACTED mode when sending messages. This is because producer.send(message) does not wait for a response from the server confirming that the message was sent successfully; messages are sent asynchronously. Only with a transacted session is session.commit() a blocking call, which blocks until the server returns a success or failure response. If session.commit() fails (which happens during failover), the producer just resends all messages from the failed transaction. However, there is a glitch! If the server is killed at the moment the commit is written to the journal and the producer does not get a response to the session.commit() call, a JMSException is thrown with a message saying that the commit may or may not have succeeded. If you then send the same messages again, there will be duplicates in the queue. To avoid duplicates, set the "_AMQ_DUP_ID" property on each message to a unique value, and the backup will filter out all duplicated messages. The next session.commit() call will then throw TransactionRolledBackException, which indicates that you sent duplicates and should continue with the next messages.

I hope this will help :-)

Thanks,
Mirek

4. Re: Failover with Shared Store not working when Live Server killed

bpogace Oct 27, 2017 11:58 AM (in response to mnovak)

Hi Mirek,

Thank you for the answer but I don't think my problem is related to what you have described.

Moreover, after a few more tests, I noticed another interesting behavior. To be clearer, I will call the instance with the shared-store-master configuration Server One, and the one with the shared-store-slave configuration Server Two. Here are a few examples of how the cluster behaves:

1) If I shut down Server One, which is the live server, Server Two becomes the live one and the messages that have been produced/consumed are correctly reflected on it.

2) If I kill Server One, which is the live server, Server Two becomes the live one but the messages that have been produced/consumed on the queues are not reflected, so the queue(s) contain a different number of messages.
3) Following the situation in 2) and keeping Server One down, if Server Two is restarted, the messages that were produced/consumed while Server One was live appear and can be consumed by the running application (this is not acceptable for HA, since there is a moment when both servers are down).
4) The same happens with the roles reversed: killing Server Two automatically makes Server One live, but the messages are not reflected unless Server One is restarted.

From this I deduce that the messages are stored correctly but are not loaded when the backup server automatically becomes live.

For completeness I am attaching the messaging configuration for Server Two:

<subsystem xmlns="urn:jboss:domain:messaging-activemq:1.0">
            <server name="backup">
                <security enabled="false"/>
                <cluster user="${jboss.messaging.cluster.user:userjms}" password="${jboss.messaging.cluster.password:secret}"/>
                <shared-store-slave failover-on-server-shutdown="true"/>
                <bindings-directory path="/shared/sharedstore/bindings"/>
                <journal-directory path="/shared/sharedstore/journal"/>
                <large-messages-directory path="/shared/sharedstore/largemessages"/>
                <paging-directory path="/shared/sharedstore/paging"/>
                <security-setting name="#">
                    <role name="guest" send="true" consume="true" create-non-durable-queue="true" delete-non-durable-queue="true"/>
                    <role name="jms" send="true" consume="true"/>
                </security-setting>
                <address-setting name="#" dead-letter-address="jms.queue.DLQ" expiry-address="jms.queue.ExpiryQueue" max-size-bytes="10485760" page-size-bytes="2097152" message-counter-history-day-limit="10" redistribution-delay="0"/>
                <remote-connector name="netty" socket-binding="messaging"/>
                <remote-connector name="netty-throughput" socket-binding="messaging-throughput">
                    <param name="batch-delay" value="50"/>
                </remote-connector>
                <in-vm-connector name="in-vm" server-id="0"/>
                <remote-acceptor name="netty" socket-binding="messaging"/>
                <remote-acceptor name="netty-throughput" socket-binding="messaging-throughput">
                    <param name="batch-delay" value="50"/>
                    <param name="direct-deliver" value="false"/>
                </remote-acceptor>
                <in-vm-acceptor name="in-vm" server-id="0"/>
                <broadcast-group name="bg-group1" jgroups-channel="activemq-cluster" connectors="netty"/>
                <discovery-group name="dg-group1" jgroups-channel="activemq-cluster" refresh-timeout="10000"/>
                <cluster-connection name="my-cluster" address="jms" connector-name="netty" discovery-group="dg-group1"/>
                <jms-queue name="ExpiryQueue" entries="java:/jms/queue/ExpiryQueue"/>
                <jms-queue name="DLQ" entries="java:/jms/queue/DLQ"/>
                <jms-queue name="RunCommandQueue" entries="java:/jms/RunCommandQueue java:jboss/exported/jms/RunCommandQueue"/>
                <jms-queue name="ServiceCommandQueue" entries="java:/jms/ServiceCommandQueue java:jboss/exported/jms/ServiceCommandQueue"/>
                <jms-queue name="EventQueue" entries="java:/jms/EventQueue java:jboss/exported/jms/EventQueue"/>
                <connection-factory name="InVmConnectionFactory" entries="java:/ConnectionFactory" connectors="in-vm"/>
                <connection-factory name="RemoteConnectionFactory" entries="java:jboss/exported/jms/RemoteConnectionFactory" connectors="netty" block-on-acknowledge="true" reconnect-attempts="-1"/>
                <connection-factory name="EM2Factory" entries="java:/jms/EM2Factory" connectors="in-vm netty" ha="true" reconnect-attempts="-1"/>
                <pooled-connection-factory name="activemq-ra" entries="java:/JmsXA java:jboss/DefaultJMSConnectionFactory" connectors="netty" ha="true" transaction="xa"/>
            </server>
        </subsystem>

Can you think of any reason why this occurs?

Thank you for the help, and sorry if I misled you with my previous posts.

Best regards,
Besian

5. Re: Failover with Shared Store not working when Live Server killed

mnovak Oct 31, 2017 3:31 AM (in response to bpogace)

I'm not sure how this can happen. Artemis HA is much more stable in WF11, and WF10.1 contains a number of issues, but they were generally in HA with a replicated journal and not with a shared store.

What are the NFS mount options?
6. Re: Failover with Shared Store not working when Live Server killed

mnovak Nov 3, 2017 2:46 AM (in response to mnovak)

Did you manage to check the NFS options?
7. Re: Failover with Shared Store not working when Live Server killed

bpogace Nov 6, 2017 5:18 AM (in response to mnovak)

Hi Mirek

Sorry for the late reply (due to last week being mainly holiday)

I'm working in a development environment and have set up a shared folder with a simple NFS mount for the two Ubuntu 16.04 servers, using the options (rw, sync, no_subtree_check) for the folder in the /etc/exports file.
Here is a guide that explains something similar to what I've done: https://www.digitalocean.com/community/tutorials/how-to-set-up-an-nfs-mount-on-ubuntu-16-04

Best regards,
Besian
8. Re: Failover with Shared Store not working when Live Server killed

mnovak Nov 6, 2017 8:42 AM (in response to bpogace)

Thanks, could you try to replicate the issue with the following mount options?

rw,nosuid,nodev,relatime,sync,vers=4.0,soft,noac,nosharecache,proto=tcp,timeo=50,retrans=2,lookupcache=none,local_lock=none

This is a configuration that I know works. If you still see the issue with those mount options, then the problem is somewhere else.

Mirek
9. Re: Failover with Shared Store not working when Live Server killed

bpogace Nov 7, 2017 4:05 AM (in response to mnovak)

Hi Mirek,

Tried the configuration but the behavior didn't change.
Any other ideas?

Best Regards,
Besian

10. Re: Failover with Shared Store not working when Live Server killed

mnovak Nov 7, 2017 8:52 AM (in response to bpogace)

OK, could you enable trace logging for Artemis and try to send messages with the message property _AMQ_DUP_ID set to some unique value? To investigate the issue, we need to identify the messages that got "lost" so we can track what happened to them before failover. To enable trace logging, configure the logging subsystem like this:

<subsystem xmlns="urn:jboss:domain:logging:3.0">
      <periodic-rotating-file-handler name="FILE" autoflush="true">
        <formatter>
          <pattern-formatter pattern="%d{HH:mm:ss,SSS} %-5p [%c] (%t) %s%E%n"/>
        </formatter>
        <file relative-to="jboss.server.log.dir" path="server.log"/>
        <suffix value=".yyyy-MM-dd"/>
        <append value="true"/>
        <level name="INFO"/>
      </periodic-rotating-file-handler>
      <logger category="com.arjuna">
        <level name="WARN"/>
      </logger>
      <logger category="org.jboss.as.config">
        <level name="DEBUG"/>
      </logger>
      <logger category="sun.rmi">
        <level name="WARN"/>
      </logger>
      <root-logger>
        <level name="INFO"/>
        <handlers>
          <handler name="FILE"/>
          <handler name="FILE-TRACE"/>
          <handler name="CONSOLE"/>
        </handlers>
      </root-logger>
      <formatter name="PATTERN">
        <pattern-formatter pattern="%d{yyyy-MM-dd HH:mm:ss,SSS} %-5p [%c] (%t) %s%e%n"/>
      </formatter>
      <formatter name="COLOR-PATTERN">
        <pattern-formatter pattern="%K{level}%d{HH:mm:ss,SSS} %-5p [%c] (%t) %s%e%n"/>
      </formatter>
      <size-rotating-file-handler name="FILE-TRACE" autoflush="true">
        <formatter>
          <pattern-formatter pattern="%d{HH:mm:ss,SSS} %-5p [%c] (%t) %s%E%n"/>
        </formatter>
        <level name="TRACE"/>
        <rotate-size value="500M"/>
        <max-backup-index value="50"/>
        <file relative-to="jboss.server.log.dir" path="server-trace.log"/>
        <append value="true"/>
      </size-rotating-file-handler>
      <logger category="org.apache.activemq">
        <level name="TRACE"/>
      </logger>   
      <console-handler name="CONSOLE">
        <level name="INFO"/>
        <formatter>
          <pattern-formatter pattern="%d{HH:mm:ss,SSS} %-5p [%c] (%t) %s%E%n"/>
        </formatter>
      </console-handler>
    </subsystem>

I hope this version of the subsystem is not too new for your server. The trace logs will be written to server-trace.log in the $JBOSS_HOME/standalone/log directory.

11. Re: Failover with Shared Store not working when Live Server killed

bpogace Nov 9, 2017 6:43 AM (in response to mnovak)
Hi Mirek,

I found a solution to the problem, and it seems to be related to the journal type being used (AIO/NIO).

As you can see from the configurations of both servers, the journal-type is not specified, so by default they try to use AIO, which is not available on the Linux machines where the servers are running. This message appears in both server logs:
2017-11-08 17:24:11,562 INFO [org.wildfly.extension.messaging-activemq] (MSC service thread 1-1) WFLYMSGAMQ0075: AIO wasn't located on this platform, it will fall back to using pure Java NIO. Your platform is Linux, install LibAIO to enable the AIO journal and achieve optimal performance.

The message suggests that NIO is being used on both machines. But the shared-store problem was resolved only when I reconfigured both servers and explicitly specified NIO as the journal-type.
I think this message may be misleading.
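
A sketch of what specifying the journal type explicitly might look like in the messaging-activemq subsystem. It assumes that the journal-type attribute is written as the type attribute of a <journal> element under <server>, and that this element can sit next to the existing shared-store settings; check this against the subsystem schema of your WildFly version before relying on it:

```xml
<server name="default">
    <!-- Assumption: explicit journal type instead of the default (AIO with fallback to NIO) -->
    <journal type="NIO"/>
    <shared-store-master failover-on-server-shutdown="true"/>
    <journal-directory path="/shared/sharedstore/journal"/>
    <!-- remaining elements unchanged from the configuration posted earlier -->
</server>
```

The same setting would go into the backup server's configuration (the server named "backup" with shared-store-slave), since both servers read the same journal.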

Anyway, thank you very much for the support. It's been very helpful.
Best Regards,
Besian
12. Re: Failover with Shared Store not working when Live Server killed

mnovak Nov 9, 2017 7:15 AM (in response to bpogace)

Hi Besian,

Artemis tries to use AIO by default, but if it is not present it falls back to the NIO journal-type. So I doubt that setting the NIO journal-type explicitly is what fixed it. You might just have been lucky not to hit the issue.

Thanks,
Mirek