8 Replies Latest reply on Jun 26, 2019 2:01 PM by Claudio Weiler

    Wildfly HA in Openshift

    Claudio Weiler Newbie

      Hi,

       

      I'm trying to enable HA in Openshift, but failing miserably...

       

      So far, I got tips from various sites:

      - Getting Started with JBoss EAP for OpenShift Container Platform - Red Hat Customer Portal

      - High Availability Servlets with EAP 7 and OpenShift - Red Hat Developer Blog

      - haexample/standalone-openshift.xml at master · markeastman/haexample · GitHub

      - GitHub - jgroups-extras/jgroups-kubernetes: JGroups discovery protocol for Kubernetes

       

      Foreground:

       

      - standalone-full-ha.xml

      /subsystem=jgroups/channel=ee:write-attribute(name=stack,value=tcp)
      /subsystem=jgroups/stack=tcp/protocol=MPING:remove()
      /subsystem=jgroups/stack=tcp/protocol=kubernetes.KUBE_PING:add(add-index=0)

       

      - environment variables

      JGROUPS_PING_PROTOCOL=kubernetes.KUBE_PING
      KUBERNETES_NAMESPACE=project
      KUBERNETES_LABELS=app=application

       

      - exposed ports

      7600 and 8888

       

      In the logs I can see that KUBE_PING is querying the OpenShift API:

      FINE  [org.jgroups.protocols.kubernetes.stream.BaseStreamProvider] (thread-4,null,null) InsecureStreamProvider opening connection: url [https://172.30.0.1:443/api/v1/namespaces/project/pods?labelSelector=app%3Dapplication

       

      Running curl with this same request from a pod's terminal returns all running pods.
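      For reference, the check from inside the pod looks something like this - a sketch assuming the default service-account token mount, with the API address taken from the log line above:

```shell
# Query the same endpoint KUBE_PING uses, authenticating with the pod's service-account token
TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
curl -sk -H "Authorization: Bearer $TOKEN" \
  "https://172.30.0.1:443/api/v1/namespaces/project/pods?labelSelector=app%3Dapplication"
```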

       

      But no matter how many pods are running, none of them register with each other in the cluster.

       

       

      Any help?

        • 1. Re: Wildfly HA in Openshift
          Bela Ban Master

          Have you checked that you have proper authorization to do this? E.g.

          cat <<EOF | kubectl apply -f -
          kind: ClusterRole
          apiVersion: rbac.authorization.k8s.io/v1
          metadata:
            name: jgroups-kubeping-pod-reader
          rules:
          - apiGroups: [""]
            resources: ["pods"]
            verbs: ["get", "list"]
          ---
          apiVersion: rbac.authorization.k8s.io/v1beta1
          kind: ClusterRoleBinding
          metadata:
            name: jgroups-kubeping-api-access
          roleRef:
            apiGroup: rbac.authorization.k8s.io
            kind: ClusterRole
            name: jgroups-kubeping-pod-reader
          subjects:
          - kind: ServiceAccount
            name: jgroups-kubeping-service-account
            namespace: $TARGET_NAMESPACE
          EOF

          • 2. Re: Wildfly HA in Openshift
            Claudio Weiler Newbie

            Hi belaban, yeah, permissions are ok, thanks.

             

            But I've made progress on my issue. Now the cluster nodes are communicating with each other. The problem was that port 7600 was being opened, by default, on the 127.0.0.1 interface, so I solved this with:

             

            /interface=kubernetes:add(nic=eth0)
            /socket-binding-group=standard-sockets/socket-binding=jgroups-tcp/:write-attribute(name=interface,value=kubernetes)
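            After applying this, you can confirm from the pod's terminal that 7600 is now bound on the eth0 address instead of loopback - a quick check (the exact tool available in the image may vary):

```shell
# Should list 7600 listening on the pod IP, not 127.0.0.1
ss -ltn | grep 7600
```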

             

            Despite that, I can't get session failover to work.

             

            I added the tag to web.xml:

             

            <distributable/>

             

            No luck with that. Load balancing is working, I get different pods for different sessions, but when I kill a pod, the session is lost.

            • 3. Re: Wildfly HA in Openshift
              Paul Ferraro Master

              You will want to modify the interface used for all socket bindings used by the protocol stack.  The stack mentioned in one of the links also uses the jgroups-tcp-fd socket binding to configure the socket used by FD_SOCK.  If FD_SOCK can't communicate with the other nodes, it will assume they have crashed - which explains why your application's sessions don't fail over.

              e.g.

              /socket-binding-group=standard-sockets/socket-binding=jgroups-tcp-fd:write-attribute(name=interface,value=kubernetes)  
              • 4. Re: Wildfly HA in Openshift
                Claudio Weiler Newbie

                Hi pferraro, thanks for helping.

                 

                I added your suggestion to my script.

                 

                I managed to increase the log verbosity and found the problem:

                Quote from the Servlet 3.0 specification (chapter 8.2.3 "Assembling the descriptor from web.xml, web-fragment.xml and annotations"):

                "viii. The web.xml resulting from the merge is considered <distributable> only if all its web fragments are marked as <distributable> as well."

                And, we DO have web fragments. :facepalm:

                 

                So, these are our final configs to enable HA:

                 

                1. Grant permissions

                oc policy add-role-to-user view system:serviceaccount:{namespace}:default -n {namespace}

                 

                2. Config Wildfly

                /subsystem=messaging-activemq/server=default/cluster-connection=my-cluster:write-attribute(name=reconnect-attempts,value=10)
                
                /interface=kubernetes:add(nic=eth0)
                /socket-binding-group=standard-sockets/socket-binding=jgroups-tcp/:write-attribute(name=interface,value=kubernetes)
                /socket-binding-group=standard-sockets/socket-binding=jgroups-tcp-fd/:write-attribute(name=interface,value=kubernetes)
                
                /subsystem=jgroups/channel=ee:write-attribute(name=stack,value=tcp)
                /subsystem=jgroups/stack=tcp/protocol=MPING:remove()
                /subsystem=jgroups/stack=tcp/protocol=kubernetes.KUBE_PING:add(add-index=0)

                 

                3. Config Openshift app deployment

                spec:
                  template:
                    metadata:
                      labels:
                        app: applicationname
                    spec:
                      containers:
                        env:
                        - name: JGROUPS_PING_PROTOCOL
                          value: kubernetes.KUBE_PING
                        - name: KUBERNETES_NAMESPACE
                          value: namespace
                        - name: KUBERNETES_LABELS
                          value: app=applicationname

                 

                4. Config application

                Add the <distributable/> tag to each and every web.xml and web-fragment.xml
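                For completeness, a minimal web-fragment.xml marked distributable might look like this (the fragment name is just an example):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<web-fragment xmlns="http://xmlns.jcp.org/xml/ns/javaee"
              version="3.1">
    <name>example_fragment</name>
    <!-- every fragment must carry this, or the merged web.xml is not distributable -->
    <distributable/>
</web-fragment>
```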

                 

                Note: no exposed port mappings were necessary

                 

                Finally, session failover is working (tested on WildFly 17 / OKD 3.11).

                 

                One last issue. When a session is transferred to another pod, it takes a while to respond, about 15s TTFB, but only on the first request; other sessions respond immediately after the first. Is this expected and OK, or can I improve it?

                • 5. Re: Wildfly HA in Openshift
                  Paul Ferraro Master

                  cweiler  wrote:

                  Quote from the Servlet 3.0 specification (chapter 8.2.3 "Assembling the descriptor from web.xml, web-fragment.xml and annotations"):

                  "viii. The web.xml resulting from the merge is considered <distributable> only if all its web fragments are marked as <distributable> as well."

                  And, we DO have web fragments. :facepalm:

                   

                   

                   

                  I'll make sure we clarify this better in our documentation, as you are not the first person to be tripped up by this particular idiosyncrasy.

                   

                   

                  [Finally], session failover is working (tested on WildFly 17 / OKD 3.11).

                  Congrats!

                   

                  One last issue. When a session is transferred to another pod, it takes a while to respond, about 15s TTFB, but only on the first request; other sessions respond immediately after the first. Is this expected and OK, or can I improve it?

                  What do you mean exactly by "when the session is transferred to other pods [it] takes a while to respond"?

                  Sessions are replicated to the other nodes on each request - so it's unclear to me what event preceded the request that took 15s to respond.

                  Are you talking about the initial request for a given session?  Or following some kind of failover (i.e. scaling down or killing a pod)?  Or do you mean after a new pod is started (i.e. scaling up)?

                  • 6. Re: Wildfly HA in Openshift
                    Claudio Weiler Newbie

                    pferraro  wrote:

                     

                     

                    What do you mean exactly by "when the session is transferred to other pods [it] takes a while to respond"?

                    Sessions are replicated to the other nodes on each request - so it's unclear to me what event preceded the request that took 15s to respond.

                    Are you talking about the initial request for a given session? Or following some kind of failover (i.e. scaling down or killing a pod)?  Or do you mean after a new pod is started (i.e. scaling up)?

                     

                     

                    Ok, let me try to explain this.

                     

                    First, some details - I tested 2 scenarios:

                    • Start 1 pod, then request a new deploy (1 pod to 1 pod session failover) (note: readiness probe configured);
                    • Start 2 pods, then scale down to 1.

                     

                    Steps to reproduce, both scenarios:

                    1. start 2 different sessions to the same pod;

                    2. kill this pod;

                    3. refresh session 1, about 15s TTFB;

                    4. wait for session 1 to become responsive;

                    5. refresh session 2, instantly.
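                    For anyone wanting to reproduce this, the two-session test can be sketched with curl and separate cookie jars (the route URL is hypothetical):

```shell
# Open two independent sessions against the same route and record TTFB
curl -c /tmp/s1.txt -o /dev/null -s -w "TTFB: %{time_starttransfer}s\n" http://app.example.com/
curl -c /tmp/s2.txt -o /dev/null -s -w "TTFB: %{time_starttransfer}s\n" http://app.example.com/
# kill the pod serving both sessions, then refresh each session:
curl -b /tmp/s1.txt -o /dev/null -s -w "TTFB: %{time_starttransfer}s\n" http://app.example.com/
curl -b /tmp/s2.txt -o /dev/null -s -w "TTFB: %{time_starttransfer}s\n" http://app.example.com/
```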

                    • 7. Re: Wildfly HA in Openshift
                      Paul Ferraro Master

                      cweiler I suspect the load balancer is trying to send the refresh request to the same node, which is no longer in the cluster.  It likely retries a number of times before selecting an alternate load balancing target.  This should be evident from the load balancer's logs.  You may want to try reducing the number of retries in haproxy, as well as reducing the connect timeout.

                       

                      I doubt the issue is with WF itself - as both failure types (shutdown (i.e. scale down) and killing the pod) should be immediately detected by the FD_SOCK protocol.

                      • 8. Re: Wildfly HA in Openshift
                        Claudio Weiler Newbie

                        Thanks pferraro, I found this in the playbooks:

                         

                        #file haproxy.cfg.j2
                        defaults
                            mode                    http
                            log                    global
                            option                  httplog
                            option                  dontlognull
                        #    option http-server-close
                            option forwardfor      except 127.0.0.0/8
                            option                  redispatch
                            retries                3
                            timeout http-request    10s
                            timeout queue          1m
                            timeout connect        10s
                            timeout client          300s
                            timeout server          300s
                            timeout http-keep-alive 10s
                            timeout check          10s
                            maxconn                {{ openshift_loadbalancer_default_maxconn | default(20000) }}
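                        Following pferraro's suggestion, a possible adjustment to those defaults would be along these lines - illustrative values, not tested:

```
defaults
    retries                 1
    timeout connect         2s
    option                  redispatch
```

                        The idea being that a failed refresh would get redispatched to a surviving pod sooner, instead of burning through 3 retries at 10s each.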

                         

                        I doubt the issue is with WF itself - as both failure types (shutdown (i.e. scale down) and killing the pod) should be immediately detected by the FD_SOCK protocol.

                        Agree! I'll research it on the net.

                         

                         

                        Thank you!