-
1. Re: Wildfly HA in Openshift
belaban Jun 19, 2019 10:26 AM (in response to cweiler)
Have you checked that you have proper authorization to do this? E.g.
cat <<EOF | kubectl apply -f -
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: jgroups-kubeping-pod-reader
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: jgroups-kubeping-api-access
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: jgroups-kubeping-pod-reader
subjects:
- kind: ServiceAccount
  name: jgroups-kubeping-service-account
  namespace: $TARGET_NAMESPACE
EOF
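Note that the binding above points at a jgroups-kubeping-service-account; if you don't already have one, a minimal sketch of creating it (name and namespace assumed from the binding above) would be:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
  name: jgroups-kubeping-service-account
  namespace: $TARGET_NAMESPACE
EOF
The deployment's pod template would then reference it with serviceAccountName: jgroups-kubeping-service-account so that KUBE_PING's API calls run under that account.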
-
2. Re: Wildfly HA in Openshift
cweiler Jun 21, 2019 11:33 AM (in response to belaban)
Hi belaban, yeah, permissions are OK, thanks.
But I've made progress on my issue. The cluster nodes are now communicating with each other. The problem was that port 7600 was being opened, by default, on the 127.0.0.1 interface, so I solved this with:
/interface=kubernetes:add(nic=eth0)
/socket-binding-group=standard-sockets/socket-binding=jgroups-tcp/:write-attribute(name=interface,value=kubernetes)
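A quick way to confirm the binding changed (the pod name below is a placeholder, and this assumes the image ships the ss utility):
oc rsh <pod-name> ss -ltn | grep 7600
# should now show 7600 listening on the eth0 address instead of 127.0.0.1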
Despite that, I can't get session failover to work.
I added this tag in web.xml:
<distributable/>
No luck with that. Load balancing is working, I get different pods for different sessions, but when I kill a pod, the session is lost.
-
3. Re: Wildfly HA in Openshift
pferraro Jun 24, 2019 10:08 PM (in response to cweiler)
You will want to modify the interface used by all socket bindings used by the protocol stack. The stack mentioned in one of the links also uses the jgroups-tcp-fd socket binding to configure the socket used by FD_SOCK. If FD_SOCK can't communicate with the other nodes, it will assume they have crashed - which explains why your application's sessions don't fail over.
e.g.
/socket-binding-group=standard-sockets/socket-binding=jgroups-tcp-fd:write-attribute(name=interface,value=kubernetes)
-
4. Re: Wildfly HA in Openshift
cweiler Jun 25, 2019 4:45 PM (in response to pferraro)
Hi pferraro, thanks for helping.
I added your suggestion to my script.
I managed to enable verbose logging and found the problem:
Quote from the Servlet 3.0 specification (chapter 8.2.3 "Assembling the descriptor from web.xml, web-fragment.xml and annotations"):
"viii. The web.xml resulting from the merge is considered <distributable> only if all its web fragments are marked as <distributable> as well."
And, we DO have web fragments. :facepalm:
So, these are our final configs to enable HA:
1. Grant permissions
oc policy add-role-to-user view system:serviceaccount:{namespace}:default -n {namespace}
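To double-check the grant, something like this should list the default service account among the accounts allowed to list pods (assuming the OpenShift 3.x oc client):
oc policy who-can list pods -n {namespace}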
2. Config Wildfly
/subsystem=messaging-activemq/server=default/cluster-connection=my-cluster:write-attribute(name=reconnect-attempts,value=10)
/interface=kubernetes:add(nic=eth0)
/socket-binding-group=standard-sockets/socket-binding=jgroups-tcp/:write-attribute(name=interface,value=kubernetes)
/socket-binding-group=standard-sockets/socket-binding=jgroups-tcp-fd/:write-attribute(name=interface,value=kubernetes)
/subsystem=jgroups/channel=ee:write-attribute(name=stack,value=tcp)
/subsystem=jgroups/stack=tcp/protocol=MPING:remove()
/subsystem=jgroups/stack=tcp/protocol=kubernetes.KUBE_PING:add(add-index=0)
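One way (not from the original post) to apply these at image build time is an offline jboss-cli script; a minimal sketch, assuming the server runs from standalone-full-ha.xml and the file name is illustrative:
# ha-config.cli - illustrative wrapper around the commands in step 2
embed-server --server-config=standalone-full-ha.xml --std-out=echo
batch
# ... paste the CLI commands from step 2 here ...
run-batch
stop-embedded-server
It can then be run during the image build with $JBOSS_HOME/bin/jboss-cli.sh --file=ha-config.cli.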
3. Config Openshift app deployment
spec:
  template:
    metadata:
      labels:
        app: applicationname
    spec:
      containers:
      - env:
        - name: JGROUPS_PING_PROTOCOL
          value: kubernetes.KUBE_PING
        - name: KUBERNETES_NAMESPACE
          value: namespace
        - name: KUBERNETES_LABELS
          value: app=applicationname
4. Config application
Add the <distributable/> tag to each and every web.xml and web-fragment.xml (sketch below)
Note: no exposed port mapping is necessary.
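For reference, a minimal web-fragment.xml carrying the tag could look like this (the fragment name and Servlet 3.0 header are illustrative; keep your fragment's existing header):
<?xml version="1.0" encoding="UTF-8"?>
<web-fragment xmlns="http://java.sun.com/xml/ns/javaee"
              xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
              xsi:schemaLocation="http://java.sun.com/xml/ns/javaee http://java.sun.com/xml/ns/javaee/web-fragment_3_0.xsd"
              version="3.0">
  <name>my_fragment</name>
  <!-- must be present in web.xml and in every fragment for the merged descriptor to be distributable -->
  <distributable/>
</web-fragment>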
Finally, session failover is working (tested on WildFly 17 / OKD 3.11).
One last issue. When a session is transferred to another pod it takes a while to respond, roughly 15s TTFB, but only on the first request; other sessions are immediate after the first. Is this expected and OK, or can I improve it?
-
5. Re: Wildfly HA in Openshift
pferraro Jun 26, 2019 8:23 AM (in response to cweiler)
cweiler wrote:
Quote from the Servlet 3.0 specification (chapter 8.2.3 "Assembling the descriptor from web.xml, web-fragment.xml and annotations"):
"viii. The web.xml resulting from the merge is considered <distributable> only if all its web fragments are marked as <distributable> as well."
And, we DO have web fragments. :facepalm:
I'll make sure we clarify this better in our documentation, as you are the first person to be tripped up by this particular idiosyncrasy.
Finally, session failover is working (tested on WildFly 17 / OKD 3.11).
Congrats!
One last issue. When a session is transferred to another pod it takes a while to respond, roughly 15s TTFB, but only on the first request; other sessions are immediate after the first. Is this expected and OK, or can I improve it?
What do you mean exactly by "when a session is transferred to another pod it takes a while to respond"?
Sessions are replicated to the other nodes on each request - so it's unclear to me what event preceded the request that took 15s to respond.
Are you talking about the initial request for a given session? Or about a request following some kind of failover (i.e. scaling down or killing a pod)? Or do you mean after a new pod is started (i.e. scaling up)?
-
6. Re: Wildfly HA in Openshift
cweiler Jun 26, 2019 11:28 AM (in response to pferraro)
pferraro wrote:
What do you mean exactly by "when a session is transferred to another pod it takes a while to respond"?
Sessions are replicated to the other nodes on each request - so it's unclear to me what event preceded the request that took 15s to respond.
Are you talking about the initial request for a given session? Or about a request following some kind of failover (i.e. scaling down or killing a pod)? Or do you mean after a new pod is started (i.e. scaling up)?
Ok, let me try to explain this.
First, some details. I tested 2 scenarios:
- Start 1 pod, then request a new deployment (1-pod-to-1-pod session failover) (note: readiness probe configured);
- Start 2 pods, then scale down to 1.
Steps to reproduce, both scenarios:
1. start 2 different sessions on the same pod;
2. kill this pod;
3. refresh session 1, about 15s TTFB;
4. wait for session 1 to become responsive;
5. refresh session 2, it responds instantly.
-
7. Re: Wildfly HA in Openshift
pferraro Jun 26, 2019 12:52 PM (in response to cweiler)
cweiler, I suspect the load balancer is trying to send the refresh request to the same node, which is no longer in the cluster. It likely retries a number of times before selecting an alternate load-balancing target. This should be evident from the logs of the load balancer. You may want to try reducing the number of retries in haproxy as well as the connect timeout.
I doubt the issue is with WF itself - as both failure types (shutdown (i.e. scale down) and killing the pod) should be immediately detected by the FD_SOCK protocol.
-
8. Re: Wildfly HA in Openshift
cweiler Jun 26, 2019 2:01 PM (in response to pferraro)
Thanks pferraro, I found this in the playbooks:
# file haproxy.cfg.j2
defaults
    mode http
    log global
    option httplog
    option dontlognull
    # option http-server-close
    option forwardfor except 127.0.0.0/8
    option redispatch
    retries 3
    timeout http-request 10s
    timeout queue 1m
    timeout connect 10s
    timeout client 300s
    timeout server 300s
    timeout http-keep-alive 10s
    timeout check 10s
    maxconn {{ openshift_loadbalancer_default_maxconn | default(20000) }}
I doubt the issue is with WF itself - as both failure types (shutdown (i.e. scale down) and killing the pod) should be immediately detected by the FD_SOCK protocol.
Agreed! I'll research it further online; a sketch of the kind of tweak you suggest is below.
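For illustration only (values are guesses, not tested here), the direction you suggest would be lowering retries and the connect timeout in the haproxy defaults:
defaults
    # ... other settings as above ...
    retries 1
    timeout connect 2s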
Thank you!