We are running Red Hat Single Sign-On 7.3.1.GA (WildFly Core 6.0.12.Final-redhat-00001) on an OpenShift cluster. It is set up to run as a deployment with 3 replicas.
When containers restart randomly or are forced to restart by the operator, the cluster occasionally (not always) ends up in a state where there are two JGroups subgroups, each with its own coordinator.
In the case I am documenting, I have the following instances running:
All the instances were restarted, which resulted in the creation of new Kubernetes pods with new instance names. The start times of the individual instances varied by a couple of seconds.
I am attaching logs from those instances as well as the configuration file.
The way I interpret those logs is that the nodes elected 2 coordinators, which formed the following 2 JGroups subgroups:
-rhsso-5d4b857989-x2jlb, -rhsso-5d4b857989-nkk2m, -rhsso-5d4b857989-vvjjg
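For anyone who wants to check their own logs for the same condition, below is a rough Python sketch of how I read the views. It assumes the view is printed in the usual JGroups form `[coordinator|view_id] (member_count) [member1, member2, ...]`, for example in "Received new cluster view" messages. The function names and the input layout (a dict of pod name to log lines) are my own illustration and not part of any RH-SSO or JGroups tooling.

```python
import re
from collections import defaultdict

# Assumed shape of a JGroups view as printed in the server log:
#   [coordinator|view_id] (member_count) [member1, member2, ...]
VIEW_RE = re.compile(
    r"\[(?P<coord>[^\[\]|]+)\|(?P<view_id>\d+)\]"   # [coordinator|view_id]
    r"\s*\((?P<count>\d+)\)"                        # (member_count)
    r"\s*\[(?P<members>[^\]]*)\]"                   # [member1, member2, ...]
)


def latest_views(log_lines_by_pod):
    """Return {pod: (coordinator, members)} taken from the last view each pod logged."""
    result = {}
    for pod, lines in log_lines_by_pod.items():
        for line in lines:
            match = VIEW_RE.search(line)
            if match:
                members = frozenset(
                    name.strip()
                    for name in match.group("members").split(",")
                    if name.strip()
                )
                # Later lines overwrite earlier ones, so the last view wins.
                result[pod] = (match.group("coord").strip(), members)
    return result


def find_partitions(views):
    """Group pods by the coordinator named in their latest view."""
    groups = defaultdict(set)
    for pod, (coordinator, _members) in views.items():
        groups[coordinator].add(pod)
    return dict(groups)
```

If `find_partitions(latest_views(...))` returns more than one key, the pods disagree about who the coordinator is, which is the split state described above. A healthy 3-replica cluster should produce a single key whose set contains all three pods.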
Once the cluster gets into such a state, it never recovers from it.
When a similar state was observed in other environments, users reported having issues with Red Hat SSO. I believe that is the case with this cluster as well; I just don't have any clients using it. As I noted, when a pod gets deleted and re-created, its name changes. I am not sure whether that is the source of the problem, so I would rather mention it.
I would like to understand whether the issue is related to my configuration or is some sort of bug.
Any help, directions and/or suggestions on how to investigate the issue further are appreciated.
standalone-openshift.xml.zip 6.5 KB