9 Replies Latest reply on Dec 9, 2005 6:45 PM by starksm64

    JBAS-2539 Deadlock in accessing DistributedReplicantManagerI

    brian.stansberry

      Discussion of http://jira.jboss.com/jira/browse/JBAS-2539

      An example of a deadlock this recently contributed to:

      User simultaneously redeploys an ear on both nodes of a 2 node cluster. EAR contains a service that functions like the HASingletonDepoyer -- deploys packages in a folder when it's node is the master.

      1) Node A is undeploying the EAR, sends a message to Node B telling it that the deployer service is removed.
      2) Node B has already redeployed the service and is busy deploying other packages in the EAR.
      3) Node B receive's A's removal message on the JGroups message handler thread and notifies the HASingletonController for its copy of the deployer service. The controller decides the local service is master. It begins deploying packages. Within the sync block in notifyKeyListeners().
      4) Now there are two threads on B simultaneously doing deployments -- the regular scanner thread, and the message handler thread from JGroups.
      5) org.jboss.system.ServiceController prevents concurrent threads doing deployments, so the JGroups thread blocks until the normal deployments are done.
      6) Normal deployments can't finish because they can't register themselves with DRM as listeners, because the JGroups thread holds the monitor on the collection. Deadlock.

      I propose to attack this in 2 stages. As I commented in the JIRA issue, the synchronization in notifyKeyListeners is both serializing the use of the method and coordinating the access to the listener collection. So...

      1) Change the way the use of the method is serialized by making the method synchronized or blocking on a mutex. The entire method is basically inside the sync block anyway. Of course add a fat comment. Then synchronize on the listener collection only long enough to copy out a list of callees.

      2) At some later point, determine if the calls to the method need to be serialized. If not, remove that synchronization.

      The first fix will help with the *exact* issue detailed above, although I would expect their code would deadlock at some future point as well.

        • 1. Re: JBAS-2539 Deadlock in accessing DistributedReplicantMana
          starksm64

          From my "@todo there is still too much synrhonization on keyListeners", I really think the synchronization on the keyListeners is just unnecessary since the change to the use of the ConcurrentReaderHashMap. It can stay on the registerListener/unregisterListener because of the non-atomic step of having to add/remove an ArrayList, but the synchronization in notifyKeyListeners is just uneeded. So what if a listener is added or removed during dispatch?

          You can make a copy as you indicate and that is certainly better than what exists now. All it really does is shorten the period over which list changes may affect the targets of the current notification, and is really not needed in my view.

          The other issue that we have cleaned up in other uses is that we should not be doing any dispatches into arbitrary layers from the jgroups notification thread. The same should be done here. Notifications should be ocurring from another thread so that the jgroups stack is not blocked.

          • 2. Re: JBAS-2539 Deadlock in accessing DistributedReplicantMana
            brian.stansberry

            Are you aware of any intent to use the synchronized block in notifyKeyListeners to serialize calls to the method? Or was it just there to coordinate access to the collection?

            • 3. Re: JBAS-2539 Deadlock in accessing DistributedReplicantMana
              starksm64

              I don't know of that intent, but its a possibility. Such serialization cannot be done from the jgroups callback though.

              • 4. Re: JBAS-2539 Deadlock in accessing DistributedReplicantMana
                belaban

                I have a feeling Ovidiu or Scott fixed this (by removing the synchronization when iterating through the listeners) some time ago, are you looking at the latest code in head ?
                We could use a CopyOnWriteArray or ConcurrentReaderHashmap, then synchronization could go

                • 5. Re: JBAS-2539 Deadlock in accessing DistributedReplicantMana
                  starksm64

                  No, its still there. Some synchronization was removed, but I only removed the minimum required for that particular deadlock. Now its time to completely fix this.

                  • 6. Re: JBAS-2539 Deadlock in accessing DistributedReplicantMana
                    brian.stansberry

                    OK, without objection here's what I plan to do. (This approach is colored by a desire to get fixes in 4.0.4 but limited bandwidth before the release :)

                    1) Punt on the issue of whether or not calls through notifyKeyListeners() need to be serialized. The effect of the current code is to only allow a single thread to flow through the method; I'm going to make the method synchronized to keep that behavior. Add a big fat TODO comment to reconsider this. Personally I doubt calls through the method need to be serialized.

                    2) Remove synchronization on the keyListeners collection in notifyKeyListeners(). This fixes JBAS-2539.

                    3) Have all calls to notifyKeyListeners() that originate from the JGroups thread go through an async handler thread. These calls are easy to isolate, as they come from special purpose methods _add() and _remove().

                    Another option would be to have *all* calls to notifyKeyListeners go through the async handler thread and serialize them via the queue that thread processes, but then I'm adding asynchronous behavior to local calls, and right now don't have the bandwidth to properly consider implications of that.

                    • 7. Re: JBAS-2539 Deadlock in accessing DistributedReplicantMana
                      belaban

                      We already have an async handler somewhere else, maybe there's some common code to be refactored ?
                      E.g. a service that can be used by all subsystems emitting notifications

                      E.g. a notification service, with its own thread pool ?
                      I son't suggest this needs to be done now, but at least we should put a JIRA issue on it.

                      • 8. Re: JBAS-2539 Deadlock in accessing DistributedReplicantMana

                         

                        "bstansberry@jboss.com" wrote:
                        OK, without objection here's what I plan to do. (This approach is colored by a desire to get fixes in 4.0.4 but limited bandwidth before the release :)


                        4) Write a stress test that excercises the problem.
                        That way you won't have to fix the problem again in 6 months time. :-)

                        • 9. Re: JBAS-2539 Deadlock in accessing DistributedReplicantMana
                          starksm64

                          The minimum change I want to see is an asych handoff to a notification thread so that the jgroups stack is not blocked. Its got be trivial to design a key notification listener that turns around and causes a deadlock by modifying a distributed key either directly or indirectly.

                          Mark the issue as critical and until this has been tested the release does not go out.