1 Reply Latest reply on Oct 22, 2019 9:29 AM by rhusar

    JBoss HA 3-node cluster issue – 2 coordinators after restart

    ct333083

      We are running Red Hat Single Sign-On 7.3.1.GA (WildFly Core 6.0.12.Final-redhat-00001) on an OpenShift cluster. It is set up to run as a deployment with 3 replicas.

      When containers restart randomly or are forced to restart by the operator, the cluster occasionally (not always) ends up in a state with two JGroups subgroups, each with its own coordinator.

      In the case I am documenting, the following instances are running:

       

      tstsso-rhsso-5d4b857989-x2jlb

      tstsso-rhsso-5d4b857989-nkk2m

      tstsso-rhsso-5d4b857989-vvjjg

       

      All the instances were restarted, which resulted in the creation of new Kubernetes pods with new instance names. The start times of the individual instances varied by a couple of seconds.

      I am attaching logs from those instances as well as the configuration file.

       

      My reading of those logs is that the nodes established 2 coordinators, which formed the following 2 JGroups subgroups:

      -rhsso-5d4b857989-x2jlb, -rhsso-5d4b857989-nkk2m, -rhsso-5d4b857989-vvjjg

      and

      -rhsso-5d4b857989-nkk2m, -rhsso-5d4b857989-vvjjg
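To make the inconsistency concrete, here is a minimal Python sketch of the check I am effectively doing by eye on the logs. It is not part of my deployment and does not call any JGroups API; the function names `distinct_views` and `overlapping_members` are made up for illustration, and the member lists are simply copied from the two subgroups above. The assumption is that a healthy cluster reports exactly one distinct membership view, so more than one distinct view, with members shared between them, indicates the state described here.

```python
from collections import Counter

# Membership lists as I read them from the logs (copied from the two subgroups above).
views = [
    ["-rhsso-5d4b857989-x2jlb", "-rhsso-5d4b857989-nkk2m", "-rhsso-5d4b857989-vvjjg"],
    ["-rhsso-5d4b857989-nkk2m", "-rhsso-5d4b857989-vvjjg"],
]


def distinct_views(views):
    """Return the set of distinct membership views, ignoring member order."""
    return {frozenset(view) for view in views}


def overlapping_members(views):
    """Return members that appear in more than one distinct view, sorted by name."""
    counts = Counter(member for view in distinct_views(views) for member in view)
    return sorted(member for member, seen in counts.items() if seen > 1)


if __name__ == "__main__":
    # Assumption: a healthy cluster would yield exactly one distinct view.
    print("distinct views:", len(distinct_views(views)))
    print("members in more than one view:", overlapping_members(views))
```

Run against the lists above, this reports two distinct views, with nkk2m and vvjjg appearing in both.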

       

      Once the cluster gets into such a state, it never recovers from it.

      When a similar state was observed in other environments, users reported issues with Red Hat SSO. I believe the same applies to this cluster; I just don't have any clients using it. As noted above, when a pod is deleted and re-created, its name changes. I am not sure whether that is the source of the problem, so I am mentioning it just in case.

       

      I would like to understand whether the issue is related to my configuration or whether it is a bug.

      Any help, directions, or suggestions on how to investigate the issue further would be appreciated.