8 Replies Latest reply on Jun 26, 2019 2:01 PM by Claudio Weiler

    Wildfly HA in Openshift

    Claudio Weiler Newbie

      Hi,

       

      I'm trying to enable HA in Openshift, but failing miserably...

       

      So far, I got tips from various sites:

      - Getting Started with JBoss EAP for OpenShift Container Platform - Red Hat Customer Portal

      - High Availability Servlets with EAP 7 and OpenShift - Red Hat Developer Blog

      - haexample/standalone-openshift.xml at master · markeastman/haexample · GitHub

      - GitHub - jgroups-extras/jgroups-kubernetes: JGroups discovery protocol for Kubernetes

       

      Foreground:

       

      - standalone-full-ha.xml

      /subsystem=jgroups/channel=ee:write-attribute(name=stack,value=tcp)
      /subsystem=jgroups/stack=tcp/protocol=MPING:remove()
      /subsystem=jgroups/stack=tcp/protocol=kubernetes.KUBE_PING:add(add-index=0)

       

      - environment variables

      JGROUPS_PING_PROTOCOL=kubernetes.KUBE_PING
      KUBERNETES_NAMESPACE=project
      KUBERNETES_LABELS=app=application

       

      - exposed ports

      7600 and 8888

       

      In the logs I can see that KUBE_PING is querying the OpenShift API:

      FINE  [org.jgroups.protocols.kubernetes.stream.BaseStreamProvider] (thread-4,null,null) InsecureStreamProvider opening connection: url [https://172.30.0.1:443/api/v1/namespaces/project/pods?labelSelector=app%3Dapplication

       

      Running curl with this same request from a pod's terminal returns all running pods.
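      For reference, the check from inside the pod looks something like this - a sketch assuming the default service-account token mount, with the API address taken from the log line above:

```shell
# Query the same endpoint KUBE_PING uses, authenticating with the pod's service-account token
TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
curl -sk -H "Authorization: Bearer $TOKEN" \
  "https://172.30.0.1:443/api/v1/namespaces/project/pods?labelSelector=app%3Dapplication"
```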

       

      But no matter how many pods are running, none of them register with each other in the cluster.

       

       

      Any help?

        • 1. Re: Wildfly HA in Openshift
          Bela Ban Master

          Have you checked that you have proper authorization to do this? E.g.

          cat <<EOF | kubectl apply -f -
          kind: ClusterRole
          apiVersion: rbac.authorization.k8s.io/v1
          metadata:
            name: jgroups-kubeping-pod-reader
          rules:
          - apiGroups: [""]
            resources: ["pods"]
            verbs: ["get", "list"]
          ---
          apiVersion: rbac.authorization.k8s.io/v1beta1
          kind: ClusterRoleBinding
          metadata:
            name: jgroups-kubeping-api-access
          roleRef:
            apiGroup: rbac.authorization.k8s.io
            kind: ClusterRole
            name: jgroups-kubeping-pod-reader
          subjects:
          - kind: ServiceAccount
            name: jgroups-kubeping-service-account
            namespace: $TARGET_NAMESPACE
          EOF

          • 2. Re: Wildfly HA in Openshift
            Claudio Weiler Newbie

            Hi belaban, yeah, permissions are ok, thanks.

             

            But I've made progress on my issue. Now the cluster nodes are communicating with each other. The problem was that port 7600 was being opened, by default, on the 127.0.0.1 interface, so I solved this with:

             

            /interface=kubernetes:add(nic=eth0)
            /socket-binding-group=standard-sockets/socket-binding=jgroups-tcp/:write-attribute(name=interface,value=kubernetes)
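            After applying this, you can confirm from the pod's terminal that 7600 is now bound on the eth0 address instead of loopback - a quick check (the exact tool available in the image may vary):

```shell
# Should list 7600 listening on the pod IP, not 127.0.0.1
ss -ltn | grep 7600
```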

             

            Despite that, I can't get session failover to work.

             

            I added the tag to web.xml:

             

            <distributable/>

             

            No luck with that. Load balancing is working, I get different pods for different sessions, but when I kill a pod, the session is lost.

            • 3. Re: Wildfly HA in Openshift
              Paul Ferraro Master

              You will want to modify the interface used for all socket bindings used by the protocol stack.  The stack mentioned in one of the links also uses the jgroups-tcp-fd socket binding to configure the socket used by FD_SOCK.  If FD_SOCK can't communicate with the other nodes, it will assume they have crashed - which explains why your application's sessions don't fail over.

              e.g.

              /socket-binding-group=standard-sockets/socket-binding=jgroups-tcp-fd:write-attribute(name=interface,value=kubernetes)  
              • 4. Re: Wildfly HA in Openshift
                Claudio Weiler Newbie

                Hi pferraro, thanks for helping.

                 

                I added your suggestion to my script.

                 

                I managed to increase the log verbosity and found the problem:

                Quote from the Servlet 3.0 specification (chapter 8.2.3 "Assembling the descriptor from web.xml, web-fragment.xml and annotations"):

                "viii. The web.xml resulting from the merge is considered <distributable> only if all its web fragments are marked as <distributable> as well."

                And, we DO have web fragments. :facepalm:

                 

                So, these are our final configs to enable HA:

                 

                1. Grant permissions

                oc policy add-role-to-user view system:serviceaccount:{namespace}:default -n {namespace}

                 

                2. Config Wildfly

                /subsystem=messaging-activemq/server=default/cluster-connection=my-cluster:write-attribute(name=reconnect-attempts,value=10)
                
                /interface=kubernetes:add(nic=eth0)
                /socket-binding-group=standard-sockets/socket-binding=jgroups-tcp/:write-attribute(name=interface,value=kubernetes)
                /socket-binding-group=standard-sockets/socket-binding=jgroups-tcp-fd/:write-attribute(name=interface,value=kubernetes)
                
                /subsystem=jgroups/channel=ee:write-attribute(name=stack,value=tcp)
                /subsystem=jgroups/stack=tcp/protocol=MPING:remove()
                /subsystem=jgroups/stack=tcp/protocol=kubernetes.KUBE_PING:add(add-index=0)

                 

                3. Config Openshift app deployment

                spec:
                  template:
                    metadata:
                      labels:
                        app: applicationname
                    spec:
                      containers:
                        env:
                        - name: JGROUPS_PING_PROTOCOL
                          value: kubernetes.KUBE_PING
                        - name: KUBERNETES_NAMESPACE
                          value: namespace
                        - name: KUBERNETES_LABELS
                          value: app=applicationname

                 

                4. Config application

                Add the <distributable/> tag to each and every web.xml and web-fragment.xml
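                For completeness, a minimal web-fragment.xml marked distributable might look like this (the fragment name is just an example):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<web-fragment xmlns="http://xmlns.jcp.org/xml/ns/javaee"
              version="3.1">
    <name>example_fragment</name>
    <!-- every fragment must carry this, or the merged web.xml is not distributable -->
    <distributable/>
</web-fragment>
```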

                 

                Note: no exposed port mappings were necessary

                 

                Finally, session failover is working (tested on WildFly 17 / OKD 3.11).

                 

                One last issue. When a session is transferred to another pod, it takes a while to respond, about 15s TTFB, but only on the first request; other sessions respond immediately after the first. Is this expected and OK, or can I improve it?

                • 5. Re: Wildfly HA in Openshift
                  Paul Ferraro Master

                  cweiler  wrote:

                  Quote from the Servlet 3.0 specification (chapter 8.2.3 "Assembling the descriptor from web.xml, web-fragment.xml and annotations"):

                  "viii. The web.xml resulting from the merge is considered <distributable> only if all its web fragments are marked as <distributable> as well."

                  And, we DO have web fragments. :facepalm:

                   

                   

                   

                  I'll make sure we clarify this better in our documentation, as you are not the first person to be tripped up by this particular idiosyncrasy.

                   

                   

                  [Finally], session failover is working (tested on WildFly 17 / OKD 3.11).

                  Congrats!

                   

                  One last issue. When a session is transferred to another pod, it takes a while to respond, about 15s TTFB, but only on the first request; other sessions respond immediately after the first. Is this expected and OK, or can I improve it?

                  What do you mean exactly by "when the session is transferred to other pods [it] takes a while to respond"?

                  Sessions are replicated to the other nodes on each request - so it's unclear to me what event preceded the request that took 15s to respond.

                  Are you talking about the initial request for a given session?  Or following some kind of failover (i.e. scaling down or killing a pod)?  Or do you mean after a new pod is started (i.e. scaling up)?

                  • 6. Re: Wildfly HA in Openshift
                    Claudio Weiler Newbie

                    pferraro  wrote:

                     

                     

                    What do you mean exactly by "when the session is transferred to other pods [it] takes a while to respond"?

                    Sessions are replicated to the other nodes on each request - so it's unclear to me what event preceded the request that took 15s to respond.

                    Are you talking about the initial request for a given session? Or following some kind of failover (i.e. scaling down or killing a pod)?  Or do you mean after a new pod is started (i.e. scaling up)?

                     

                     

                    Ok, let me try to explain this.

                     

                    First, some details - I tested 2 scenarios:

                    • Start 1 pod, then request a new deploy (1 pod to 1 pod session failover) (note: readiness probe configured);
                    • Start 2 pods, then scale down to 1.

                     

                    Steps to reproduce, both scenarios:

                    1. start 2 different sessions to the same pod;

                    2. kill this pod;

                    3. refresh session 1, about 15s TTFB;

                    4. wait for session 1 to become responsive;

                    5. refresh session 2, instantly.
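                    For anyone wanting to reproduce this, the two-session test can be sketched with curl and separate cookie jars (the route URL is hypothetical):

```shell
# Open two independent sessions against the same route and record TTFB
curl -c /tmp/s1.txt -o /dev/null -s -w "TTFB: %{time_starttransfer}s\n" http://app.example.com/
curl -c /tmp/s2.txt -o /dev/null -s -w "TTFB: %{time_starttransfer}s\n" http://app.example.com/
# kill the pod serving both sessions, then refresh each session:
curl -b /tmp/s1.txt -o /dev/null -s -w "TTFB: %{time_starttransfer}s\n" http://app.example.com/
curl -b /tmp/s2.txt -o /dev/null -s -w "TTFB: %{time_starttransfer}s\n" http://app.example.com/
```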

                    • 7. Re: Wildfly HA in Openshift
                      Paul Ferraro Master

                      cweiler I suspect the load balancer is trying to send the refresh request to the same node, which is no longer in the cluster.  It likely retries a number of times before selecting an alternate load balancing target.  This should be evident from the load balancer's logs.  You may want to try reducing the number of retries in haproxy, as well as reducing the connect timeout.

                       

                      I doubt the issue is with WF itself - as both failure types (shutdown (i.e. scale down) and killing the pod) should be immediately detected by the FD_SOCK protocol.

                      • 8. Re: Wildfly HA in Openshift
                        Claudio Weiler Newbie

                        Thanks pferraro, I found this in the playbooks:

                         

                        #file haproxy.cfg.j2
                        defaults
                            mode                    http
                            log                    global
                            option                  httplog
                            option                  dontlognull
                        #    option http-server-close
                            option forwardfor      except 127.0.0.0/8
                            option                  redispatch
                            retries                3
                            timeout http-request    10s
                            timeout queue          1m
                            timeout connect        10s
                            timeout client          300s
                            timeout server          300s
                            timeout http-keep-alive 10s
                            timeout check          10s
                            maxconn                {{ openshift_loadbalancer_default_maxconn | default(20000) }}
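                        Following pferraro's suggestion, a possible adjustment to those defaults would be along these lines - illustrative values, not tested:

```
defaults
    retries                 1
    timeout connect         2s
    option                  redispatch
```

                        The idea being that a failed refresh would get redispatched to a surviving pod sooner, instead of burning through 3 retries at 10s each.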

                         

                        I doubt the issue is with WF itself - as both failure types (shutdown (i.e. scale down) and killing the pod) should be immediately detected by the FD_SOCK protocol.

                        Agree! I'll research it on the net.

                         

                         

                        Thank you!