9 Replies Latest reply on Dec 9, 2005 6:45 PM by Scott Stark

    JBAS-2539 Deadlock in accessing DistributedReplicantManagerI

    Brian Stansberry Master

      Discussion of http://jira.jboss.com/jira/browse/JBAS-2539

      An example of a deadlock this recently contributed to:

      User simultaneously redeploys an EAR on both nodes of a 2-node cluster. The EAR contains a service that functions like the HASingletonDeployer: it deploys packages in a folder when its node is the master.

      1) Node A is undeploying the EAR, sends a message to Node B telling it that the deployer service is removed.
      2) Node B has already redeployed the service and is busy deploying other packages in the EAR.
      3) Node B receives A's removal message on the JGroups message handler thread and notifies the HASingletonController for its copy of the deployer service. The controller decides the local service is master. It begins deploying packages, all within the sync block in notifyKeyListeners().
      4) Now there are two threads on B simultaneously doing deployments -- the regular scanner thread, and the message handler thread from JGroups.
      5) org.jboss.system.ServiceController prevents concurrent threads from doing deployments, so the JGroups thread blocks until the normal deployments are done.
      6) The normal deployments can't finish, because they can't register themselves with DRM as listeners while the JGroups thread holds the monitor on the listener collection. Deadlock.

      I propose to attack this in 2 stages. As I commented in the JIRA issue, the synchronization in notifyKeyListeners is both serializing the use of the method and coordinating the access to the listener collection. So...

      1) Change the way use of the method is serialized, either by making the method synchronized or by blocking on a mutex. The entire method is basically inside the sync block anyway. Of course, add a fat comment. Then synchronize on the listener collection only long enough to copy out a list of callees.

      2) At some later point, determine if the calls to the method need to be serialized. If not, remove that synchronization.
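
      A minimal sketch of what stage 1 could look like. This is not the actual DRM code: the class ListenerRegistry, the KeyListener interface, the notifyMutex field and the method signatures are all made up for illustration, and it uses current Java syntax for brevity. The point is only the locking shape: one mutex serializes notifyKeyListeners(), and the monitor on the listener collection is held just long enough to copy out the callees.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the DRM listener bookkeeping; names are assumptions.
class ListenerRegistry {

    // Hypothetical callback interface, not the real DRM listener type.
    interface KeyListener {
        void replicantsChanged(String key, List<String> replicants);
    }

    // Monitor on this map is held only for short, non-blocking operations.
    private final Map<String, List<KeyListener>> keyListeners = new HashMap<>();

    // FAT COMMENT: this mutex exists ONLY to serialize calls to
    // notifyKeyListeners(). It must never be used to guard keyListeners,
    // otherwise a slow or blocked callback stops listener registration again.
    private final Object notifyMutex = new Object();

    void registerListener(String key, KeyListener listener) {
        synchronized (keyListeners) {
            keyListeners.computeIfAbsent(key, k -> new ArrayList<>()).add(listener);
        }
    }

    void unregisterListener(String key, KeyListener listener) {
        synchronized (keyListeners) {
            List<KeyListener> current = keyListeners.get(key);
            if (current != null) {
                current.remove(listener);
            }
        }
    }

    void notifyKeyListeners(String key, List<String> replicants) {
        synchronized (notifyMutex) {
            // Copy out the callees while holding the collection monitor...
            List<KeyListener> callees = new ArrayList<>();
            synchronized (keyListeners) {
                List<KeyListener> current = keyListeners.get(key);
                if (current != null) {
                    callees.addAll(current);
                }
            }
            // ...and make the callbacks after releasing it, so other threads
            // can still register or unregister while a callback is running.
            for (KeyListener listener : callees) {
                listener.replicantsChanged(key, replicants);
            }
        }
    }
}
```

      With this shape, a callback that blocks (as the JGroups thread does in step 5) still holds notifyMutex, but the deployment thread only needs the keyListeners monitor to register, so it can proceed. A thread that itself calls notifyKeyListeners() would still wait on notifyMutex, which is what stage 2 would revisit.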

      The first fix will help with the *exact* issue detailed above, although I would expect the user's code to deadlock at some future point as well.