12 Replies Latest reply on Aug 1, 2018 9:23 AM by bpogace

Cluster with singleton deployment - backup server going on Timeout

bpogace Jul 19, 2018 11:13 AM

Hi everyone,

I'm having an issue with a singleton deployment across two standalone servers. The intent is that while one server runs as the elected singleton provider, the second stays up in a pending (pre-deployment) state.
The problem occurs after 300 seconds, when the second (pending) server logs this error:

16:43:28,949 ERROR [org.jboss.as.controller.management-operation] (Controller Boot Thread) WFLYCTL0348: Timeout after [300] seconds waiting for service container stability. Operation will roll back. Step that first updated the service container was 'add' at address '[
    ("core-service" => "management"),
    ("management-interface" => "http-interface")
]'

followed 5 seconds later by several TimeoutExceptions (WFLYCTL0190):

the first two by "operation" boottime-controller-initializer-step
the others are related to each of the standard-sockets socket-bindings.
the last ones are related to each of the interfaces

The server then becomes unresponsive, and when a failback is attempted, both servers go down.

This is surely related to the singleton deployment, and the 300-second timeout is a default value, but I wonder (and cannot easily find) whether there is a configuration setting or some other way to keep the second server from timing out.
Best regards,
Besian

1. Re: Cluster with singleton deployment - backup server going on Timeout

pferraro Jul 19, 2018 12:38 PM (in response to bpogace)

Which version of WildFly is this?
2. Re: Cluster with singleton deployment - backup server going on Timeout

bpogace Jul 20, 2018 4:18 AM (in response to pferraro)

Hi Paul,
Sorry, I put it only in the tags... it's 10.1.0.Final
3. Re: Cluster with singleton deployment - backup server going on Timeout

pferraro Jul 20, 2018 6:53 PM (in response to bpogace)

Can you reproduce the issue with the latest release, i.e. 13.0.0.Final?
There were a number of critical fixes to singleton deployments in WF11.
4. Re: Cluster with singleton deployment - backup server going on Timeout

bpogace Jul 25, 2018 6:38 AM (in response to pferraro)

Hi Paul

I can confirm this also occurs on WF 13.0.0.Final.
What's more, this is the only time I have seen a "FATAL" log entry in WildFly:
> FATAL [org.jboss.as.server] (Controller Boot Thread) WFLYSRV0056: Server boot has failed in an unrecoverable manner; exiting. See previous messages for details.
Any ideas on this?
Best regards,
Besian

5. Re: Cluster with singleton deployment - backup server going on Timeout

pferraro Jul 28, 2018 4:45 PM (in response to bpogace)

bpogace I'm not able to reproduce the issue against WF master.

Here's a snippet from the log of a server with a singleton deployment (test.war) that initially started as a backup and then became primary after I killed the server hosting the primary singleton deployment about 13 minutes later.

16:18:19,504 INFO  [org.jboss.as.server.deployment] (MSC service thread 1-7) WFLYSRV0027: Starting deployment of "test.war" (runtime-name: "test.war")
16:18:19,578 INFO  [org.wildfly.extension.clustering.singleton] (MSC service thread 1-5) WFLYCLSNG0001: Singleton deployment detected. Deployment will reset using default policy.
16:18:19,581 INFO  [org.jboss.as.server.deployment] (MSC service thread 1-5) WFLYSRV0028: Stopped deployment test.war (runtime-name: test.war) in 2ms
16:18:19,582 INFO  [org.jboss.as.server.deployment] (MSC service thread 1-2) WFLYSRV0027: Starting deployment of "test.war" (runtime-name: "test.war")
16:18:19,606 INFO  [org.jboss.as.server] (DeploymentScanner-threads - 1) WFLYSRV0010: Deployed "test.war" (runtime-name : "test.war")
16:31:18,853 INFO  [org.wildfly.clustering.server] (LegacyDistributedSingletonService - 1) WFLYCLSV0003: node2 elected as the singleton provider of the jboss.deployment.unit."test.war".installer service
16:31:18,856 INFO  [org.wildfly.clustering.server] (ChannelCommandDispatcherFactory - 2) WFLYCLSV0001: This node will now operate as the singleton provider of the jboss.deployment.unit."test.war".installer service
16:31:18,918 INFO  [org.infinispan.CLUSTER] (thread-14,ejb,node2) ISPN000094: Received new cluster view for channel ejb: [node2|2] (1) [node2]
16:31:18,922 INFO  [org.infinispan.CLUSTER] (thread-14,ejb,node2) ISPN100001: Node node1 left the cluster
16:31:18,935 INFO  [org.infinispan.CLUSTER] (thread-14,ejb,node2) ISPN000094: Received new cluster view for channel ejb: [node2|2] (1) [node2]
16:31:18,936 INFO  [org.infinispan.CLUSTER] (thread-14,ejb,node2) ISPN100001: Node node1 left the cluster
16:31:18,949 INFO  [org.infinispan.CLUSTER] (thread-14,ejb,node2) ISPN000094: Received new cluster view for channel ejb: [node2|2] (1) [node2]
16:31:18,950 INFO  [org.infinispan.CLUSTER] (thread-14,ejb,node2) ISPN100001: Node node1 left the cluster
16:31:18,957 INFO  [org.infinispan.CLUSTER] (thread-14,ejb,node2) ISPN000094: Received new cluster view for channel ejb: [node2|2] (1) [node2]
16:31:18,958 INFO  [org.infinispan.CLUSTER] (thread-14,ejb,node2) ISPN100001: Node node1 left the cluster
16:31:18,971 INFO  [org.infinispan.CLUSTER] (stateTransferExecutor-thread--p19-t15) [Context=client-mappings] ISPN100007: After merge (or coordinator change), recovered members [node2] with topology id 7
16:31:18,972 INFO  [org.infinispan.CLUSTER] (stateTransferExecutor-thread--p17-t19) [Context=default] ISPN100007: After merge (or coordinator change), recovered members [node2] with topology id 7
16:31:19,048 INFO  [org.wildfly.extension.undertow] (ServerService Thread Pool -- 79) WFLYUT0021: Registered web context: '/test' for server 'default-server'

Perhaps you can say more about your deployment? Is this a compound deployment (i.e. ear)? or simple deployment (e.g. war, jar, etc)?

If using a compound deployment, where is your singleton deployment descriptor located?

Do you see something akin to line #2 in your log?

6. Re: Cluster with singleton deployment - backup server going on Timeout

bpogace Jul 30, 2018 4:03 AM (in response to pferraro)

Hi Paul,
Thanks for the reply.

I guess this is WF13 since I cannot find the line in the 10.1.0 version, but anyway, here are the logs you were asking for:

09:44:43,859 INFO  [org.wildfly.extension.clustering.singleton] (MSC service thread 1-5) WFLYCLSNG0001: Singleton deployment detected. Deployment will reset using default policy.
09:44:43,933 INFO  [org.jboss.as.server.deployment] (MSC service thread 1-6) WFLYSRV0028: Stopped deployment test.war (runtime-name: test.war) in 73ms
09:44:43,934 INFO  [org.jboss.as.server.deployment] (MSC service thread 1-3) WFLYSRV0027: Starting deployment of "test.war" (runtime-name: "test.war")
09:44:44,143 INFO  [org.jboss.as.clustering.infinispan] (ServerService Thread Pool -- 74) WFLYCLINF0002: Started default cache from server container
09:44:44,143 INFO  [org.jboss.as.clustering.infinispan] (ServerService Thread Pool -- 79) WFLYCLINF0002: Started client-mappings cache from ejb container
09:44:44,192 INFO  [org.wildfly.clustering.server] (DistributedSingletonService - 1) WFLYCLSV0003: controller elected as the singleton provider of the jboss.deployment.unit."test.war".installer service
09:44:44,194 INFO  [org.wildfly.clustering.server] (DistributedSingletonService - 1) WFLYCLSV0001: This node will now operate as the singleton provider of the jboss.deployment.unit."test.war".installer service
09:44:44,195 INFO  [org.jboss.as.server] (ServerService Thread Pool -- 43) WFLYSRV0010: Deployed "test.war" (runtime-name : "test.war")

My singleton deployment descriptor is a jboss-all.xml file located in test.war/WEB-INF/ with the following content:

<?xml version="1.0" encoding="UTF-8"?>
<jboss xmlns="urn:jboss:1.0">
    <singleton-deployment xmlns="urn:jboss:singleton-deployment:1.0"/>
</jboss>

At this point, is it ok if I ask for your working configuration (standalone-full-ha.xml file example for both master and slave)? It could be something that we might be missing there.

Best regards,

Besian

7. Re: Cluster with singleton deployment - backup server going on Timeout

pferraro Jul 30, 2018 8:28 AM (in response to bpogace)
bpogace wrote:
> I guess this is WF13 since I cannot find the line in the 10.1.0 version, but anyway, here are the logs you were asking for:
You did say that you had the same problem on WF13? The critical log message that you should look for is this:

16:18:19,578 INFO [org.wildfly.extension.clustering.singleton] (MSC service thread 1-5) WFLYCLSNG0001: Singleton deployment detected. Deployment will reset using default policy.

Starting with WF11, the singleton deployment process was refactored to interact with the deployment chain in a different way (since the rollback of the deployment process proved problematic). Rather than halting the deployment chain in the middle (like in WF8-10.x), and only proceeding when elected primary, the entire deployment chain is either initiated or not, depending on whether elected primary.
Can you paste the logs from your WF13 run?

> My singleton deployment descriptor is a jboss-all.xml file located in test.war/WEB-INF/ with the following content:
> <?xml version="1.0" encoding="UTF-8"?>
> <jboss xmlns="urn:jboss:1.0">
>     <singleton-deployment xmlns="urn:jboss:singleton-deployment:1.0"/>
> </jboss>

That works just as well.
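
For readers comparing the two options: the alternative to embedding the element in jboss-all.xml is presumably a standalone descriptor. The fragment below is a minimal sketch, assuming the file is named singleton-deployment.xml and sits in the archive's META-INF directory, as the WildFly High Availability Guide describes; this file was not posted in the thread.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Assumed location: META-INF/singleton-deployment.xml (not shown in this thread) -->
<!-- With no policy attribute, the default singleton policy of the server is used -->
<singleton-deployment xmlns="urn:jboss:singleton-deployment:1.0"/>
```

Either form should produce the same WFLYCLSNG0001 "Singleton deployment detected" log line at deployment time.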

> At this point, is it ok if I ask for your working configuration (standalone-full-ha.xml file example for both master and slave)? It could be something that we might be missing there.

I used the default standalone-ha.xml.
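
As a rough reference, the singleton-related part of a stock standalone-ha.xml looks approximately like the fragment below. This is a sketch of the shipped defaults as I understand them, not a copy of the file used in this test; the namespace version in particular is an assumption and may differ between releases.

```xml
<!-- Sketch of the default singleton subsystem; verify against your own standalone-ha.xml -->
<subsystem xmlns="urn:jboss:domain:singleton:1.0">
    <!-- "default" is the policy referred to by "Deployment will reset using default policy" -->
    <singleton-policies default="default">
        <!-- Election runs over the cluster backing the "server" cache container -->
        <singleton-policy name="default" cache-container="server">
            <simple-election-policy/>
        </singleton-policy>
    </singleton-policies>
</subsystem>
```

If both servers carry this subsystem unchanged, the singleton configuration itself is unlikely to be where the two setups differ.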
8. Re: Cluster with singleton deployment - backup server going on Timeout

bpogace Jul 30, 2018 8:38 AM (in response to pferraro)

Hi again Paul,

> Can you paste the logs from your WF13 run?
Yes, the logs I posted are from WF13. The logging is different in WF10 and yes, the problem occurs in both versions.
> I used the default standalone-ha.xml.
I think that should do fine for the configuration parts I want to check, so if it's not a problem could you please share?

Thanks,
Besian
9. Re: Cluster with singleton deployment - backup server going on Timeout

pferraro Jul 30, 2018 4:22 PM (in response to bpogace)

bpogace wrote:
>> Can you paste the logs from your WF13 run?
> Yes, the logs I posted are from WF13. The logging is different in WF10 and yes, the problem occurs in both versions.
Can you attach the logs from WF13 of the server that times out?

>> I used the default standalone-ha.xml.
> I think that should do fine for the configuration parts I want to check, so if it's not a problem could you please share?

I've tried a simple war containing your jboss-all.xml with both the default standalone-ha.xml and default standalone-full-ha.xml, and do not see any failure waiting for server stability on WF master.
10. Re: Cluster with singleton deployment - backup server going on Timeout

bpogace Aug 1, 2018 8:32 AM (in response to pferraro)
Hello Paul,

I did a few tests on both WF10 and WF13.
WF10:
I managed to find a configuration where the slave server doesn't time out, but the master still has the issue, and I cannot figure out why since both have a very similar configuration. I am attaching both of them for you to kindly check, thanks.

WF13:
It seems that in WF13 this issue is not present, but there is a different one (which is why I am still working with WF10 as well): when the backup becomes the singleton provider and the master boots up again, the latter will be in a waiting state for the singleton (artemis debug message "failed to lock position: 1"), but unlike the slave in this state, it won't reply to http requests with code 404 (which is my desired behavior). Maybe I should write a separate post about this.

I'd gladly hear more of your thoughts on this.

Best regards,
Besian

srv1-standalone-full-ha.xml.zip 6.1 KB

srv2-standalone-full-ha.xml.zip 6.1 KB
11. Re: Cluster with singleton deployment - backup server going on Timeout

pferraro Aug 1, 2018 8:48 AM (in response to bpogace)

bpogace wrote:
> WF10:
> I managed to find a configuration where the slave server doesn't time out, but the master still has the issue, and I cannot figure out why since both have a very similar configuration. I am attaching both of them for you to kindly check, thanks.
This is a known issue with WF10. I don't recommend using singleton deployments on WF10.

> WF13:
> It seems that in WF13 this issue is not present, but there is a different one (which is why I am still working with WF10 as well): when the backup becomes the singleton provider and the master boots up again, the latter will be in a waiting state for the singleton (artemis debug message "failed to lock position: 1"), but unlike the slave in this state, it won't reply to http requests with code 404 (which is my desired behavior). Maybe I should write a separate post about this.
I'm not sure I completely understand. Open a separate thread and we'll discuss there.
12. Re: Cluster with singleton deployment - backup server going on Timeout

bpogace Aug 1, 2018 9:23 AM (in response to pferraro)

Here is the issue created: [WF13] Different behavior in http-response between Master and Slave in a shared-store messaging configuration