Some initial thoughts on this.
1) If an HASingleton is in the middle of starting its service and a view change event comes in that leads it to stop that service, the thread trying to do the stop should block so the start can complete cleanly, and vice versa. How long it is willing to block before moving on and stopping the service should be a configurable parameter.
2) Some kind of callback mechanism would be useful, so that services that potentially take a long time to start/stop can periodically call back to confirm whether they should continue. For example, a deployer deploying everything in the deploy-hasingleton directory could call back after completing each deployment to check if it should continue.
3) We want to avoid the situation where multiple nodes become coordinator of a one-server group and then have to merge. Perhaps a slightly longer timeout in the PING protocol in our standard protocol stacks would help?
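The blocking behavior described in point 1 could be sketched roughly as below. The class name `SingletonLifecycle` and the `maxBlockMillis` parameter are illustrative assumptions, not existing JBoss code; the point is only that start and stop contend for one lock, with a configurable bound on how long the loser waits:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Sketch: serialize start/stop of a singleton service. A thread asked to
// stop while a start is in flight blocks up to maxBlockMillis (the
// configurable parameter from point 1) before giving up, and vice versa.
public class SingletonLifecycle {
    private final ReentrantLock transition = new ReentrantLock();
    private final long maxBlockMillis;   // configurable wait before moving on
    private volatile boolean running;

    public SingletonLifecycle(long maxBlockMillis) {
        this.maxBlockMillis = maxBlockMillis;
    }

    /** Start the service; waits for any in-flight stop to finish first. */
    public boolean start() throws InterruptedException {
        if (!transition.tryLock(maxBlockMillis, TimeUnit.MILLISECONDS)) {
            return false;                // gave up waiting; caller decides what to do
        }
        try {
            running = true;              // real service start would go here
            return true;
        } finally {
            transition.unlock();
        }
    }

    /** Stop the service; blocks up to maxBlockMillis for a running start. */
    public boolean stop() throws InterruptedException {
        if (!transition.tryLock(maxBlockMillis, TimeUnit.MILLISECONDS)) {
            return false;
        }
        try {
            running = false;             // real service stop would go here
            return true;
        } finally {
            transition.unlock();
        }
    }

    public boolean isRunning() { return running; }
}
```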
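Point 2 amounts to a "should I continue?" callback checked between units of work. A minimal sketch, where `HASingletonDeployer.deployAll` and the `BooleanSupplier`-based check are hypothetical names for illustration, not an existing API:

```java
import java.util.function.BooleanSupplier;

// Sketch of the callback from point 2: after each completed deployment the
// deployer asks whether it should continue (e.g. whether it is still master).
public class HASingletonDeployer {
    /**
     * Deploy each unit in order, checking back after every completed unit.
     * Returns the number of units actually deployed.
     */
    public static int deployAll(String[] units, BooleanSupplier shouldContinue) {
        int deployed = 0;
        for (String unit : units) {
            if (!shouldContinue.getAsBoolean()) {
                break;                   // e.g. master status lost: stop deploying
            }
            // real deployment of 'unit' would happen here
            deployed++;
        }
        return deployed;
    }
}
```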
Certainly the current deployment logic is too tightly coupled to the view change notification. This issue is really closely related to the configurable coordinator selection RFE. If a node has decided it is the master and is doing a deployment, that is a strong vote for it to remain the coordinator when another subgroup joins. Of course this still leaves open the issue of two clusters with singletons joining. Even in this case, though, the resolution is better integration with the coordinator selection protocol itself, so that out-of-sync decisions based on asynchronous view change events don't trigger long-running deployment tasks.
I don't think point 2 can be practically implemented today. If we know the longest singleton deployment takes N minutes, we should be able to configure the cluster layer to guarantee that once a selection is made, there will be no attempt to change it for N minutes. If after this time it turns out the two singleton clusters need to merge, then there is an orderly merge, with a shutdown of one singleton and stability on this decision for another N minutes.
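The N-minute stability guarantee could be modeled as a simple guard that the selection logic consults before acting on a new view. This is a sketch of the idea only; `SelectionStabilityGuard` and its clock-as-parameter style are illustrative assumptions, not part of any JBoss or JGroups API:

```java
// Sketch: once a coordinator selection is made, refuse to reselect until a
// configurable stability window (the "N minutes" above) has elapsed.
public class SelectionStabilityGuard {
    private final long stabilityMillis;          // N minutes, in milliseconds
    private long lastSelectionAt = Long.MIN_VALUE;

    public SelectionStabilityGuard(long stabilityMillis) {
        this.stabilityMillis = stabilityMillis;
    }

    /** Record that a coordinator selection was just made. */
    public synchronized void selectionMade(long nowMillis) {
        lastSelectionAt = nowMillis;
    }

    /** A new selection is allowed only after the stability window expires. */
    public synchronized boolean mayReselect(long nowMillis) {
        return nowMillis - lastSelectionAt >= stabilityMillis;
    }
}
```

Passing the clock in explicitly keeps the guard trivially testable; a real implementation would use the system clock.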
I view the current singleton problem as representative of trying to layer semantics on top of messages emitted from the JGroups stack without due consideration of how the clustered applications will behave. We need to make better decisions about whether a behavior is an inherent extension of the associated JGroups protocol versus a behavior that can reliably be built on an asynchronous event notification that may be out of date as soon as it is received.
This is why many of the cluster service event notifications were originally synchronous. The problem was that this was not done properly, ensuring the event notification was an extension of the protocol stack that could not cause deadlock or other bad side effects.
One thing I have thought about in the past is whether the higher-level clustering features could be more rigorously defined in terms of a state machine extension to the JGroups protocol stack, so that one could understand the cluster behavior and analyze its states, failures, and transitions much more meaningfully than one can today by trying to merge N logs from M clustering layers.
As discussed above, a full solution requires configurable policies that allow more complex interaction between the various nodes deploying the singleton. The JIRA issue is http://jira.jboss.com/jira/browse/JBAS-2499 .
For 4.0.4, work done for http://jira.jboss.com/jira/browse/JBAS-2499 has changed the threading behavior of HASingleton deployments in a way that should eliminate the issue of two threads simultaneously starting/stopping the singleton. The old flow was:
1) The singleton MBean depends on ClusterPartition; thus by the time the singleton is started, the partition is started and the node knows whether it is master.
2) Singleton registers its replicant with DRM.
3) *On the same thread* DRM calls back to singleton, notifying it of its own key change; singleton uses the callback to check that it is the master and starts its service.
4) The partition detects another node; *on the JGroups thread* the singleton is informed by the DRM of a key change; the singleton uses the callback to see it is no longer master and tries to stop its service. The simultaneous start and stop leads to the issue reported in JBAS-73.
The DRM now queues all key change notifications, which are then dispatched to listeners by a single, separate handler thread. Thus the two notifications in steps 3 and 4 above both come from this one handler thread, and it is no longer possible for them to occur simultaneously.
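The serialization described above can be sketched as a queue drained by a single handler thread. This is an illustration of the pattern, not the actual DRM implementation; the class name, the thread name, and modeling notifications as `Runnable`s are all assumptions:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch: callers (the registering thread in step 3, the JGroups thread in
// step 4) enqueue key-change notifications; one handler thread dispatches
// them in arrival order, so a start and a stop can never run concurrently.
public class NotificationDispatcher {
    private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
    private final Thread handler;

    public NotificationDispatcher() {
        handler = new Thread(() -> {
            try {
                while (true) {
                    queue.take().run();   // one notification at a time, in order
                }
            } catch (InterruptedException ignored) {
                // shutdown: fall through and let the thread exit
            }
        }, "async-notification-handler");
        handler.setDaemon(true);
        handler.start();
    }

    /** May be called from any thread; the work runs on the handler thread. */
    public void keyChanged(Runnable notification) {
        queue.add(notification);
    }
}
```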
An obvious implication of this is that if the start of the service in step 3 takes a long time, the process of stopping the service will have to wait until it completes, no matter how long that takes.