8 Replies Latest reply on Jul 29, 2016 12:14 PM by Wayne Wang

    Backup wildfly 10 instance is unable to process requests (data-replication) after the master is restarted (not active)

    Wayne Wang Apprentice

      Hi

       

      I am testing a configuration of HA for messaging with data replication.

       

      I also set up HA singleton for the whole application so that at any time there is one active wildfly instance.

       

      Server 1 is set up as the replication master and server 2 as the replication slave. I start server 1 and server 2 in that order, and I am able to send messages through the web server (port 80), with server 1 responding to the requests. When I shut down server 1, server 2 becomes active and continues to process the messages that were not completed by server 1.

       

      However, when I restart server 1, server 2 starts to have problems. For example, if I send requests through the web server (server 2 will receive them since it is the active instance), server 2 is not able to process the messages.

       

       

      The following are some error messages:

       

      2016-07-26 12:40:10,548 WARN  [org.apache.activemq.artemis.core.server] (Thread-343) AMQ222072: Timed out flushing channel on InVMConnection

      2016-07-26 12:40:16,165 WARN  [org.apache.activemq.artemis.service.extensions.xa.recovery] (Periodic Recovery) AMQ122015: Can not connect to XARecoveryConfig [transportConfiguration=[TransportConfiguration(name=, factory=org-apache-activemq-artemis-core-remoting-impl-invm-InVMConnectorFactory) ?serverId=0], discoveryConfiguration=null, username=null, password=****, JNDI_NAME=java:/JmsXA] on auto-generated resource recovery: ActiveMQNotConnectedException[errorType=NOT_CONNECTED message=AMQ119007: Cannot connect to server(s). Tried with all available servers.]

        at org.apache.activemq.artemis.core.client.impl.ServerLocatorImpl.createSessionFactory(ServerLocatorImpl.java:777)

        at org.apache.activemq.artemis.service.extensions.xa.recovery.ActiveMQXAResourceWrapper.connect(ActiveMQXAResourceWrapper.java:314)

        at org.apache.activemq.artemis.service.extensions.xa.recovery.ActiveMQXAResourceWrapper.getDelegate(ActiveMQXAResourceWrapper.java:239)

        at org.apache.activemq.artemis.service.extensions.xa.recovery.ActiveMQXAResourceWrapper.recover(ActiveMQXAResourceWrapper.java:69)

        at org.apache.activemq.artemis.service.extensions.xa.ActiveMQXAResourceWrapperImpl.recover(ActiveMQXAResourceWrapperImpl.java:106)

        at com.arjuna.ats.internal.jta.recovery.arjunacore.XARecoveryModule.xaRecoveryFirstPass(XARecoveryModule.java:550)

        at com.arjuna.ats.internal.jta.recovery.arjunacore.XARecoveryModule.periodicWorkFirstPass(XARecoveryModule.java:190)

        at com.arjuna.ats.internal.arjuna.recovery.PeriodicRecovery.doWorkInternal(PeriodicRecovery.java:747)

        at com.arjuna.ats.internal.arjuna.recovery.PeriodicRecovery.run(PeriodicRecovery.java:375)

       

       

      2016-07-26 12:40:16,228 WARN  [org.apache.activemq.artemis.service.extensions.xa.recovery] (Periodic Recovery) AMQ122008: XA Recovery can not connect to any broker on recovery [XARecoveryConfig [transportConfiguration=[TransportConfiguration(name=, factory=org-apache-activemq-artemis-core-remoting-impl-invm-InVMConnectorFactory) ?serverId=0], discoveryConfiguration=null, username=null, password=****, JNDI_NAME=java:/JmsXA]]

      2016-07-26 12:40:16,238 WARN  [com.arjuna.ats.jta] (Periodic Recovery) ARJUNA016027: Local XARecoveryModule.xaRecovery got XA exception XAException.XAER_RMFAIL: javax.transaction.xa.XAException: Error trying to connect to any providers for xa recovery

        at org.apache.activemq.artemis.service.extensions.xa.recovery.ActiveMQXAResourceWrapper.getDelegate(ActiveMQXAResourceWrapper.java:260)

        at org.apache.activemq.artemis.service.extensions.xa.recovery.ActiveMQXAResourceWrapper.recover(ActiveMQXAResourceWrapper.java:69)

        at org.apache.activemq.artemis.service.extensions.xa.ActiveMQXAResourceWrapperImpl.recover(ActiveMQXAResourceWrapperImpl.java:106)

        at com.arjuna.ats.internal.jta.recovery.arjunacore.XARecoveryModule.xaRecoveryFirstPass(XARecoveryModule.java:550)

        at com.arjuna.ats.internal.jta.recovery.arjunacore.XARecoveryModule.periodicWorkFirstPass(XARecoveryModule.java:190)

        at com.arjuna.ats.internal.arjuna.recovery.PeriodicRecovery.doWorkInternal(PeriodicRecovery.java:747)

        at com.arjuna.ats.internal.arjuna.recovery.PeriodicRecovery.run(PeriodicRecovery.java:375)

      Caused by: ActiveMQNotConnectedException[errorType=NOT_CONNECTED message=null]

        at org.apache.activemq.artemis.service.extensions.xa.recovery.ActiveMQXAResourceWrapper.connect(ActiveMQXAResourceWrapper.java:351)

        at org.apache.activemq.artemis.service.extensions.xa.recovery.ActiveMQXAResourceWrapper.getDelegate(ActiveMQXAResourceWrapper.java:239)

        ... 6 more

       

       

      In another log file, I can see that there is an issue with sending a message to a queue:

       

      2016-07-26 12:03:18,111 (ERROR) [] [] sms.InboundSmsProducer: Could not send message to inbound queue: javax.jms.JMSException: Failed to create session factory

        at org.apache.activemq.artemis.jms.client.ActiveMQConnectionFactory.createConnectionInternal(ActiveMQConnectionFactory.java:727)

       

       

       

       

      The following is part of standalone-full-ha.xml for the master:

              <subsystem xmlns="urn:jboss:domain:messaging-activemq:1.0">

                  <server name="default">

                      <cluster password="password"/>

                      <security-setting name="#">

                          <role name="guest" delete-non-durable-queue="true" create-non-durable-queue="true" consume="true" send="true"/>

                      </security-setting>

                      <address-setting name="#" redistribution-delay="1000" message-counter-history-day-limit="10" page-size-bytes="2097152" max-size-bytes="10485760" expiry-address="jms.queue.ExpiryQueue" dead-letter-address="jms.queue.DLQ"/>

                      <http-connector name="http-connector" endpoint="http-acceptor" socket-binding="http"/> 

                      <http-connector name="http-connector-throughput" endpoint="http-acceptor-throughput" socket-binding="http"> 

                          <param name="batch-delay" value="50"/> 

                      </http-connector> 

                      <in-vm-connector name="in-vm" server-id="0"/> 

                      <http-acceptor name="http-acceptor" http-listener="default"/> 

                      <http-acceptor name="http-acceptor-throughput" http-listener="default"> 

                          <param name="batch-delay" value="50"/> 

                          <param name="direct-deliver" value="false"/> 

                      </http-acceptor> 

                      <in-vm-acceptor name="in-vm" server-id="0"/> 

        <broadcast-group name="bg-group1" connectors="http-connector" jgroups-channel="activemq-cluster" jgroups-stack="udp"/>

        <discovery-group name="dg-group1" jgroups-channel="activemq-cluster" jgroups-stack="udp"/>

                      <cluster-connection name="my-cluster" discovery-group="dg-group1" connector-name="http-connector" address="jms"/>

                      <jms-queue name="ExpiryQueue" entries="java:/jms/queue/ExpiryQueue"/>

                      <jms-queue name="DLQ" entries="java:/jms/queue/DLQ"/>

                      <replication-master check-for-live-server="true"/>

                      <connection-factory name="InVmConnectionFactory" entries="java:/ConnectionFactory" connectors="in-vm"/>

        <connection-factory name="RemoteConnectionFactory" reconnect-attempts="-1" block-on-acknowledge="true" ha="true" entries="java:jboss/exported/jms/RemoteConnectionFactory" connectors="http-connector"/>

                      <pooled-connection-factory name="activemq-ra" transaction="xa" consumer-window-size="0" reconnect-attempts="-1" block-on-acknowledge="true" ha="true" entries="java:/JmsXA java:jboss/DefaultJMSConnectionFactory" connectors="in-vm"/>

        </server>


      The following is part of standalone-full-ha.xml for the slave:


              <subsystem xmlns="urn:jboss:domain:messaging-activemq:1.0">

                  <server name="default">

                      <cluster password="password"/>

                      <security-setting name="#">

                          <role name="guest" delete-non-durable-queue="true" create-non-durable-queue="true" consume="true" send="true"/>

                      </security-setting>

                      <address-setting name="#" redistribution-delay="1000" message-counter-history-day-limit="10" page-size-bytes="2097152" max-size-bytes="10485760" expiry-address="jms.queue.ExpiryQueue" dead-letter-address="jms.queue.DLQ"/>

                      <http-connector name="http-connector" endpoint="http-acceptor" socket-binding="http"/> 

                      <http-connector name="http-connector-throughput" endpoint="http-acceptor-throughput" socket-binding="http"> 

                          <param name="batch-delay" value="50"/> 

                      </http-connector> 

                      <in-vm-connector name="in-vm" server-id="0"/> 

                      <http-acceptor name="http-acceptor" http-listener="default"/> 

                      <http-acceptor name="http-acceptor-throughput" http-listener="default"> 

                          <param name="batch-delay" value="50"/> 

                          <param name="direct-deliver" value="false"/> 

                      </http-acceptor> 

                      <in-vm-acceptor name="in-vm" server-id="0"/> 

        <broadcast-group name="bg-group1" connectors="http-connector" jgroups-channel="activemq-cluster" jgroups-stack="udp"/>

        <discovery-group name="dg-group1" jgroups-channel="activemq-cluster" jgroups-stack="udp"/>

                      <cluster-connection name="my-cluster" discovery-group="dg-group1" connector-name="http-connector" address="jms"/>

                      <jms-queue name="ExpiryQueue" entries="java:/jms/queue/ExpiryQueue"/>

                      <jms-queue name="DLQ" entries="java:/jms/queue/DLQ"/>

                      <replication-slave allow-failback="true"/>

                      <connection-factory name="InVmConnectionFactory" entries="java:/ConnectionFactory" connectors="in-vm"/>

        <connection-factory name="RemoteConnectionFactory" reconnect-attempts="-1" block-on-acknowledge="true" ha="true" entries="java:jboss/exported/jms/RemoteConnectionFactory" connectors="http-connector"/>

                      <pooled-connection-factory name="activemq-ra" transaction="xa" consumer-window-size="0" reconnect-attempts="-1" block-on-acknowledge="true" ha="true" entries="java:/JmsXA java:jboss/DefaultJMSConnectionFactory" connectors="in-vm"/>

                  </server>

              </subsystem>

        • 1. Re: Backup wildfly 10 instance is unable to process requests (data-replication) after the master is restarted (not active)
          Miroslav Novak Master

          Hi Wayne,

           

          this is a tricky scenario. The problem seems to be that once server 1 is started again and the Artemis master activates, the slave in server 2 becomes passive again (it basically stops and waits for the master to die again).

           

          The HA singleton in server 2 activates when server 1 stops. It is using the connection factory defined in:

          <pooled-connection-factory name="activemq-ra" transaction="xa" ... connectors="in-vm"/>
          
          

           

          which has an in-vm connector. This is a problem because with an in-vm connector it is not possible for server 2 to connect (fail back) to the Artemis server in server 1.

           

          I can see 2 solutions for you:

           

          a) Replace the "in-vm" connector with the "http-connector" in the pooled connection factory:

          <pooled-connection-factory name="activemq-ra" transaction="xa" ... connectors="http-connector"/>
          
          

          With this configuration the pooled-connection-factory in server 2 will be able to "fail back" to the master on server 1.
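          For reference, applying suggestion (a) to the pooled-connection-factory shown in the configurations above would look like this (all other attributes kept exactly as in the original; only connectors changes):

          ```xml
          <pooled-connection-factory name="activemq-ra" transaction="xa" consumer-window-size="0"
                  reconnect-attempts="-1" block-on-acknowledge="true" ha="true"
                  entries="java:/JmsXA java:jboss/DefaultJMSConnectionFactory"
                  connectors="http-connector"/>
          ```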

           

          b) Set up a collocated HA topology as described in https://developer.jboss.org/message/960323?et=watches.email.thread#960323 (it's the 3rd comment). It uses a shared store, which is much more stable in WF10, but you can change it to use a replicated journal very easily. Just replace the "shared-store-..." tags with "replication-..." and name each live-backup group differently.

           

          Cheers,

          Mirek

          • 2. Re: Backup wildfly 10 instance is unable to process requests (data-replication) after the master is restarted (not active)
            Wayne Wang Apprentice

            Hi Mirek,

             

            By default, when the replication-master is live again, the replication-slave will fail back (the default value of allow-failback is true), and that may explain why wildfly #2 (the replication-slave) could not create the pooled connection factory after wildfly #1 was restarted, when the factory was configured with the in-vm connector.

             

            I tried setting allow-failback="false" on the replication-slave, and this seemed to fix the issue.
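            In the slave's standalone-full-ha.xml this is a one-attribute change on the ha-policy element shown earlier:

            ```xml
            <replication-slave allow-failback="false"/>
            ```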

             

            The following is the test procedure.

             

            (1) make the whole application an HA singleton with singleton-deployment.xml in META-INF/ in both wildfly #1 and wildfly #2

             

            (2) start up wildfly #1 (the replication-master is set up on this instance), then start up wildfly #2 (the replication-slave is set up on this instance). Wildfly #1 is now the HA singleton provider

            (3) send 500 requests through web server, wildfly #1 is now creating messages and processing messages

            (4) shut down wildfly #1 in the middle of creating and processing messages

            (5) observe that wildfly #2 starts to pick up the job of processing messages. I can confirm that wildfly #1 only processed part of the messages and wildfly #2 processed the remaining messages, achieving HA for the messaging function

             

            (6) re-start wildfly #1, now I do not see any error messages from wildfly #2 (currently HA singleton provider)

            (7) send 500 requests through the web server; wildfly #2 is now creating and processing messages

            (8) shut down wildfly #2 in the middle of creating and processing messages

            (9) observe that wildfly #1 starts to pick up the job of processing messages. I can confirm that wildfly #2 only processed part of the messages and wildfly #1 processed the remaining messages, achieving HA for the messaging function
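            For step (1) of the procedure above, the descriptor that marks a deployment as an HA singleton in WildFly 10 is, to my knowledge, a minimal XML file at META-INF/singleton-deployment.xml (an optional policy attribute can select a singleton policy; shown here without one):

            ```xml
            <?xml version="1.0" encoding="UTF-8"?>
            <singleton-deployment xmlns="urn:jboss:singleton-deployment:1.0"/>
            ```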

             

            Question:

            (1) Now that I set the allow-failback="false" on replication-slave (set on wildfly instance #2), when the wildfly instance #1 (replication-master is configured) is restarted, do we have two live activemq servers running?

             

            (2) If I do not set the allow-failback="false", but set the connectors="http-connector" in replication-slave (wildfly instance #2), will the replication-master have all the messages that are in replication-slave? Any message processing will now go through the replication-master?

             

            (3) I read activemq-artemis-1.1.0.pdf, and there is a section describing a "split brain scenario" where there may be an issue with the data replication approach. Specifically, if the backup server (replication-slave) temporarily loses its connection to its live server, it will become active even though the live server has not really stopped.

             

            In this case, activemq implements an algorithm to decide whether to make the backup server active: if the cluster has multiple activemq instances, activemq will check whether the backup server can connect to more than half of the servers in the cluster, and it will become active (meaning the live server may really have stopped) if this is the case.

             

            In a scenario with two wildfly instances (one live/master, one backup/slave), this algorithm may never work?

             

            Thanks,

             

            Wayne

            • 3. Re: Backup wildfly 10 instance is unable to process requests (data-replication) after the master is restarted (not active)
              Miroslav Novak Master

              Setting allow-failback to false had not occurred to me. Nice! :-)


              To your questions:

              (1) Now that I set the allow-failback="false" on replication-slave (set on wildfly instance #2), when the wildfly instance #1 (replication-master is configured) is restarted, do we have two live activemq servers running?

               

              - No. The original master in wildfly instance #1 will wait until the slave in wildfly instance #2 dies. This is because you correctly set <replication-master check-for-live-server="true"/> in the configuration of the master in wildfly instance #1. If it were set to false, the master would start no matter whether the slave is active or not. Then the master and slave would be running at the same time, and that would be a disaster.

               

              (2) If I do not set the allow-failback="false", but set the connectors="http-connector" in replication-slave (wildfly instance #2), will the replication-master have all the messages that are in replication-slave? Any message processing will now go through the replication-master?

               

              - Yes, the master will have all messages and all processing will go through the replication master.

              Before the master activates, it will sync its journal with the slave server. Basically the slave sends its journal to the master before the master activates; once the master activates, all messages are on the master. The slave will then restart itself and sync with the master so another failover can occur. There is a max-saved-replicated-journal attribute in the slave configuration which is set to 2 by default. I suggest setting it to -1 so there can be an infinite number of failover -> failback cycles, because each failover -> failback creates one more journal directory on the slave.
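              If I remember the subsystem schema correctly, this is exposed as the max-saved-replicated-journal-size attribute on the replication-slave element (please verify the exact attribute name against your WildFly version's schema):

              ```xml
              <replication-slave allow-failback="true" max-saved-replicated-journal-size="-1"/>
              ```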

               

              (3) I read activemq-artemis-1.1.0.pdf, and there is a section describing a "split brain scenario" where there may be an issue with the data replication approach. Specifically, if the backup server (replication-slave) temporarily loses its connection to its live server, it will become active even though the live server has not really stopped.

               

              - Well, this is a problem with replication generally. There is not much that can be done here. You can try disconnecting the network between master and slave for a minute and you will see that the slave activates. If there is just one master/slave pair, there is no way for the slave to figure out that the live server is still running when the network between them breaks.

               

              In this case, activemq implements an algorithm to decide whether to make the backup server active: if the cluster has multiple activemq instances, activemq will check whether the backup server can connect to more than half of the servers in the cluster, and it will become active (meaning the live server may really have stopped) if this is the case.

              In a scenario with two wildfly instances (one live/master, one backup/slave), this algorithm may never work?

               

              - Yes, but you can add more Artemis masters to the cluster so the backup can see them. It's quite ugly, but if you are really worried about split brain, more servers are needed.

              • 4. Re: Backup wildfly 10 instance is unable to process requests (data-replication) after the master is restarted (not active)
                Wayne Wang Apprentice

                Hi Mirek,

                 

                Thank you very much for answering all the questions that I have asked. I feel more confident about setting allow-failback="false" in our specific scenario (HA singleton for the whole application).

                 

                Just a general question on the options for setting up HA for messaging in activemq-artemis.

                 

                There can be the following options. I am not clear about the use case of the last option.

                (1) shared-store:

                configure one shared-store-master and one or more shared-store-slave servers in a standalone-full-ha.xml. Do this for separate wildfly instances.

                This is essentially a co-located configuration since two or more activemq servers are set up in one wildfly instance

                 

                (2) replication:

                configure one replication-master in the standalone-full-ha.xml of one wildfly instance, and configure one replication-slave in the standalone-full-ha.xml of each backup wildfly instance

                 

                (3) shared-store-colocated:

                configure one master and one slave in the same standalone-full-ha.xml

                The difference between this approach and approach (1) is that you can only configure one slave. I never tried this approach.

                 

                (4) replication-colocated?

                what is the use case of this approach?

                If we do data replication, why do we need a master and a slave in the same wildfly instance?

                 

                Thanks,

                 

                Wayne

                • 5. Re: Backup wildfly 10 instance is unable to process requests (data-replication) after the master is restarted (not active)
                  Miroslav Novak Master

                  (1) shared-store:

                  configure one shared-store-master and one or more shared-store-slave servers in a standalone-full-ha.xml. Do this for separate wildfly instances.

                  This is essentially a co-located configuration since two or more activemq servers are set up in one wildfly instance

                   

                  - Correct. The recommended way to configure a co-located topology is to have 2 master-slave pairs in 2 WF10 instances, like WF1(master1/slave2) <--> WF2(master2/slave1). If more servers are needed, configure 2 more WF10 instances, like WF3(master3/slave4) <--> WF4(master4/slave3), and so on. No matter what is written in the Artemis documentation, have just one slave per master, especially for the replicated journal, as I believe no one has really tested having more slaves. The EAP 7 documentation also mentions using only one slave per master.
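                  As a sketch of what one such WF10 instance (WF1 holding master1 and slave2) could look like with a shared store: the directory paths and the second server name below are made up for illustration, and the connectors/acceptors are elided since they match the configurations earlier in the thread.

                  ```xml
                  <subsystem xmlns="urn:jboss:domain:messaging-activemq:1.0">
                      <!-- live server of this instance: master1 -->
                      <server name="default">
                          <shared-store-master failover-on-server-shutdown="true"/>
                          <bindings-directory path="/shared/pair1/bindings"/>
                          <journal-directory path="/shared/pair1/journal"/>
                          <large-messages-directory path="/shared/pair1/largemessages"/>
                          <paging-directory path="/shared/pair1/paging"/>
                          <!-- connectors, acceptors, cluster-connection as shown above -->
                      </server>
                      <!-- backup for the other instance's live server: slave2 backs up
                           master2, so it points at master2's shared directories -->
                      <server name="backup">
                          <shared-store-slave allow-failback="true"/>
                          <bindings-directory path="/shared/pair2/bindings"/>
                          <journal-directory path="/shared/pair2/journal"/>
                          <large-messages-directory path="/shared/pair2/largemessages"/>
                          <paging-directory path="/shared/pair2/paging"/>
                      </server>
                  </subsystem>
                  ```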

                   

                  Another thing: the term "colocated" as I use it means that we manually configure a slave server alongside some master (from another master-slave pair) in one WF10 instance. Such configuration is really tedious and annoying, so the Artemis developers tried to make it simpler, and that is why the new shared-store-colocated and replication-colocated attributes were added. I'll explain below.

                   

                  (3) shared-store-colocated:

                  configure one master and one slave in the same standalone-full-ha.xml

                  Difference between this approach and approach (1) is that you can only configure one slave. I never tried this approach.

                   

                  (4) replication-colocated?

                  what is the use case of this approach?

                  If we do data replication, why do we need a master and a slave in the same wildfly instance?

                   

                  - shared-store-colocated and replication-colocated are new options in Artemis. They allow a master to ask another master in the cluster to create a (colocated) slave for it. The idea is to start a few WF10 instances where the Artemis masters form a cluster. Then, if replication/shared-store-colocated is configured, each master will create a slave for another master in the cluster. The administrator does not have to configure it manually.

                  This is a pretty new feature and I know there are issues. For EAP 7 it is not supported at this moment.
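                  As a rough sketch of the idea (the attribute names here are recalled from the messaging-activemq schema and worth double-checking against your version), a server configured this way asks the cluster to host its backup automatically:

                  ```xml
                  <server name="default">
                      <!-- ask another master in the cluster to host a colocated
                           backup for this server -->
                      <replication-colocated request-backup="true" max-backups="1"
                              backup-request-retries="-1" backup-port-offset="100">
                          <master check-for-live-server="true"/>
                          <slave allow-failback="true"/>
                      </replication-colocated>
                      <!-- connectors, acceptors, cluster-connection as in the earlier configs -->
                  </server>
                  ```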

                   

                  Also, you're correct that it does not make sense to have a master-slave pair in one WF10 instance. If WF10 crashes, then that master-slave pair is dead as well. The slave must always be in another JVM (on another machine).

                  • 6. Re: Backup wildfly 10 instance is unable to process requests (data-replication) after the master is restarted (not active)
                    Wayne Wang Apprentice

                    Hi Mirek,

                     

                    Based on what you have described, the reliable way to set up HA for messaging is shared-store, which I very much agree with. Data replication has many potential issues.

                     

                    You have pointed out that the recommended configuration is a pair of wildfly instances, with each instance configured to hold a master/slave pair:

                     

                    (1) for two wildfly instances: it is WF1 (master1/slave2) and WF2 (master2/slave1)

                    (2) for four wildfly instances: you will add WF3 (master3/slave4) and WF4 (master4/slave3)

                    and so on

                     

                    The above configuration is for a real cluster environment where all wildfly instances are active, but the activemq servers are only active in one of each pair. So if WF1 is down, WF2 will pick up the messages in WF1. This also holds true for WF3 and WF4, etc.

                     

                    Would it not be a problem if both WF1 and WF2 are down while message processing is not completed? I do not see how WF3/WF4 can access the un-processed messages from WF1/WF2.

                     

                    My scenario is a bit different since I only allow one wildfly instance to be active. In my previous test,

                    When I set up two wildfly instances, I will have WF1 (master1/slave2) and WF2 (master2/slave1)

                    When I set up three wildfly instances, I could set it up as:

                     

                    WF1 (master1/slave2)

                    WF2 (master2/slave1)

                    WF3 (master3/slave2)


                    If I start up WF1, WF2, WF3 in that order, then WF1 will be the HA singleton provider, and only the activemq of master1 is active.

                    If I shut down WF1 while it is in the middle of message processing, then WF2 will be the HA singleton provider, and it will pick up the un-processed messages left by WF1.

                    If I restart WF1, and shut down WF2 while it is in the middle of message processing, then WF3 will be the HA singleton provider, and it will pick up the un-processed messages left by WF2

                     

                    This is fine.

                     

                    If I restart WF2, and shut down WF3 while it is in the middle of message processing, then WF1 will be the HA singleton provider, and it probably can NOT pick up the un-processed messages left by WF3 since it cannot access the folder where the activemq journal of master3 is located. Is that right?

                     

                    In my previous test, I actually set up a scenario of three wildfly instances, with each instance having one shared-store-master pointing to a folder as the messaging server for that wildfly instance, and two shared-store-slave servers pointing to the messaging servers of the other two wildfly instances.

                     

                    WF1 (master1/slave2/slave3)

                    WF2 (master2/slave1/slave3)

                    WF3 (master3/slave1/slave2)

                     

                    In this setup, I was able to do the following:

                    If I start up WF1, WF2, WF3 in that order, then WF1 will be the HA singleton provider, and only the activemq of master1 is active.

                    If I shut down WF1 while it is in the middle of message processing, then WF2 will be the HA singleton provider, and it picks up the un-processed messages left by WF1.

                    If I restart WF1, and shut down WF2 while it is in the middle of message processing, then WF3 will be the HA singleton provider, and it picks up the un-processed messages left by WF2.

                    If I restart WF2, and shut down WF3 while it is in the middle of message processing, then WF1 will be the HA singleton provider, and it picks up the un-processed messages left by WF3.

                    Then I can restart WF3.


                    The whole procedure of server maintenance is now complete.

                     

                    We may eventually go with two wildfly instances, and then WF1 (master1/slave2) and WF2 (master2/slave1) is sufficient. However, if we need to configure three or more WF instances, then we need to know how to make it work in our scenario (HA singleton for the whole application).

                     

                    Thanks,

                     

                    Wayne

                    • 7. Re: Backup wildfly 10 instance is unable to process requests (data-replication) after the master is restarted (not active)
                      Miroslav Novak Master

                      Would it not be a problem if both WF1 and WF2 are down while message processing is not completed ? I do not see how WF3 / WF4 can access the un-processed messages from WF1/WF2.

                       

                      Correct, WF3 and WF4 cannot access the un-processed messages. The only way to do maintenance here is to shut down only one of the WF servers -> update config -> start it, and never shut down more than one WF10 instance at once.

                       

                      If I restart WF2, and shut down WF3 while it is in the middle of message processing, then WF1 will be the HA singleton provider, and it probably can NOT pick up the un-processed messages left by WF3 since it cannot access the folder where the activemq journal of master3 is located. Is that right?

                      Correct, master3 has no backup, so all messages on master3 cannot be processed.

                       

                      WF1 (master1/slave2/slave3)

                      WF2 (master2/slave1/slave3)

                      WF3 (master3/slave1/slave2)

                       

                      In this setup, I was able to do the following:

                      If I start up WF1, WF2, WF3 in that order, then WF1 will be HA singleton provider, also only the activemq of master1 is active. ...

                       

                      I believe that with shared store it is safe to have more slaves per master, as there is just one journal and only the server holding the journal file lock can access it. If 3 WF10 servers are needed, then the topology as you described is ok. To be sure that it will work when something bad happens, you can try to kill (kill -9 pid_of_wf10) one or two of the WF10 instances and check that all messages can still be processed.

                       

                      If you need more than 3 instances, then I think there will be a problem with how to add them. In that case you need to configure WF4 with master4 and add slave4 to WF1, 2, and 3, so the number of slaves in each WF instance keeps increasing. For this reason it might be easier to scale up by adding pairs of new instances (WF3 and WF4) as I described above. Both approaches have their benefits and drawbacks.

                       

                      Thanks,

                      Mirek